IIT-Patna Academic Alexa


Published: December 22, 2020


In this post, I will take you through IIT-Patna Academic Alexa, a sentiment-aware, information-retrieval-based chatbot for the academic portal of IIT-P.

Chatbots are one of the best applications of Natural Language Processing for improving human-computer interaction. The idea of the first chatbot was conceived in 1964, and since then the development and usage of chatbots have skyrocketed, owing to advances in computing, deep learning and deployable machine learning. Inspired by many such applications, we built a novel chatbot for retrieving information from IIT-Patna’s academic data.

Motivation

Imagine you have a huge database containing the academic and personal details of the students enrolled in your college or organization. Now suppose you want to retrieve certain data, such as the marks of a student in a particular subject, the credits of a subject, or the number of subjects taught in a particular year. How is this generally done? Simple: open the data as a csv file and search for the values manually. As you may have guessed, this is tedious and infeasible for a large database.

Now, what if we store the database as MySQL relational tables and write appropriate queries for the task? Sounds easy, right? But is everyone computer literate? Can everyone write MySQL queries to retrieve information from the data?

To address this, we set out to build a chatbot that takes the user's natural language question as input, converts it into the appropriate SQL query, and automatically fetches and displays the output. We designed an entire chatbot pipeline for this task.

Chatbot Pipeline

pipeline_diagram

Components of the custom-designed pipeline of IIT-P Academic Alexa:

Input Module
Sentence Classifier
NL to SQL Engine
NLTK Regular Chat
Sentence Similarity Module
Feedback Module
Continual Learning System
Sentiment Analyzer

pipeline_classifier

Sentence Classifier Module

Each query to the chatbot is either a database query or a non-database query.
Database queries are those that ask for information to be retrieved from the database.
Non-database queries are those that make normal conversation, like “Hi”, “How are you?” and “Thank you”.
The Sentence Classifier Module is a deep-learning-based binary classifier that labels each query as “Database” or “Non-Database”.
It is a multi-channel Convolutional Neural Network model coded in TensorFlow.
We prepared and annotated the train and test sets ourselves.
The model is used for real-time inference and is kept lightweight for deployability.
It can adapt to a continual learning (online learning) environment.
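To make the routing role of the classifier concrete, here is a minimal sketch. The keyword-based `classify` function is a made-up placeholder standing in for the trained CNN, and the handler names and keyword set are hypothetical; only the two labels and the two downstream modules come from the pipeline described above.

```python
from typing import Callable, Dict

# Hypothetical stand-in for the trained CNN: a query counts as "database"
# when it contains an academic keyword. It only illustrates the routing.
DB_KEYWORDS = {"marks", "credits", "subject", "subjects", "student", "students"}


def classify(query: str) -> str:
    tokens = {tok.strip("?.,!").lower() for tok in query.split()}
    return "database" if tokens & DB_KEYWORDS else "non-database"


def handle_database(query: str) -> str:
    # Placeholder for the NL to SQL engine.
    return f"[NL to SQL engine] {query}"


def handle_chat(query: str) -> str:
    # Placeholder for the NLTK regular chat.
    return f"[NLTK chat] {query}"


HANDLERS: Dict[str, Callable[[str], str]] = {
    "database": handle_database,
    "non-database": handle_chat,
}


def route(query: str) -> str:
    """Send the query to the module that matches its predicted label."""
    return HANDLERS[classify(query)](query)


print(route("List the marks of all students"))
print(route("Hi, how are you?"))
```

Keeping the label-to-handler mapping in a dictionary means the classifier can be swapped for the real model without touching the routing code.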

Model Specifications

multicnn_diagram

Motivation for using Multi-CNN

multicnn_demonstration

CNNs capture local features of the input.
The sentence classifier's decision in our case depends on local features of the input.
Local features in a textual context can be viewed as n-gram based features.
The inherent n-gram feature modelling of a multi-channel CNN is hence useful.

The sentence classifier is trained on our self-collected and annotated train set and evaluated on the test set. It achieves about 93% accuracy.
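The link between convolution kernels and n-grams can be illustrated without any deep learning library. In the sketch below, each window width plays the role of one CNN channel; the widths 2, 3 and 4 and the example query are assumptions chosen for illustration, not the model's actual configuration.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Slide a window of width n over the tokens, as a width-n kernel would."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "how many subjects are taught in 2019".split()

# One "channel" per window width; the widths used here are assumed.
for width in (2, 3, 4):
    print(width, ngrams(tokens, width))
```

A convolutional channel learns a weight for every position in such a window, so a phrase like "how many" can become a strong signal for a database query wherever it appears in the sentence.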

pipeline_nlsql

NL to SQL Engine

Our module: -

Converts Natural Language Data into SQL (Structured-Query-Language) for information retrieval from the database
Works on self-developed algorithms based on the “Dependency-Tree-Parser” of the natural language query
Works on questions of type: -
- List
- Which
- What
- How Many
- Who
Algorithm proposed here is extendable

To arrive at this approach, we followed these steps:

Carried out literature search on traditional methods for NL to SQL conversion
NL to SQL is an open-ended research problem in the field of lexical semantics and semantic parsing
Decided to use the “Syntax” and “Lexical Semantics” of the natural language data
Hypothesized about using Part-of-Speech-Tagging
Investigated the problem by using “Dependency Tree Parsers”
Developed a basic tree-parsing algorithm for the questions.
Implemented a DFS (Depth-First-Search) approach for certain types of questions

The NL-SQL part of the project was done by me and my classmate Vaibhav.

Sample pseudocode for “list question” conversion is given below

list_question
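As a rough illustration of how a depth-first search over a dependency tree can turn a “list” question into SQL, here is a small sketch. It is not the algorithm from the project: the tree is built by hand rather than produced by a parser, the rule (direct object becomes the column, prepositional object becomes the table) is a simplification I assume for the example, and the `cgpa` column and `students` table are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    text: str
    dep: str  # dependency label, e.g. "dobj" or "pobj"
    children: List["Node"] = field(default_factory=list)


def dfs_find(node: Node, dep: str) -> Optional[Node]:
    """Return the first node with the given dependency label, depth-first."""
    if node.dep == dep:
        return node
    for child in node.children:
        found = dfs_find(child, dep)
        if found is not None:
            return found
    return None


def list_question_to_sql(root: Node) -> str:
    column = dfs_find(root, "dobj")  # what to list
    table = dfs_find(root, "pobj")   # where to list it from
    if column is None or table is None:
        raise ValueError("not a supported list question")
    return f"SELECT {column.text} FROM {table.text};"


# Hand-built dependency tree for "List the cgpa of all students".
tree = Node("List", "ROOT", [
    Node("cgpa", "dobj", [
        Node("the", "det"),
        Node("of", "prep", [
            Node("students", "pobj", [Node("all", "det")]),
        ]),
    ]),
])

print(list_question_to_sql(tree))  # SELECT cgpa FROM students;
```

In a real pipeline the tree would come from the dependency parser, and the mapping from words to actual column and table names would need a schema lookup.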

Implementation Details

Explored different options for libraries like nltk and spaCy
Decided to use spaCy library for obtaining the Dependency Tree of the natural language query
spaCy is
- A Free Open Source Natural Language Processing Library in Python
- Offers modules for NER, POS, Sentence Similarity and Text Classification
Reasons for using spaCy
- Lightweight and easy to use
- Integrates easily with Python
- Provides robust results compared to traditional text processing libraries
spaCy's syntactic parser was reported as the fastest available, with accuracy within 1% of the best available (Choi et al., 2015)

Python SQL Connector

Part of the NL-SQL converter engine
Takes as input the SQL query generated from the natural language data
Establishes a connection with the MySQL server on the local machine (where the database is stored)
Fetches and Processes the returned tuple structure for forming the appropriate output

Implementation Details:

Uses MySQL Connector, a standard database driver provided by MySQL
Step 1: Authorization
Step 2: Creating a cursor pointing to the database
Step 3: Passing the query as a string and processing the returned output
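The three steps can be sketched as follows. To keep the example runnable without a database server, it uses Python's built-in `sqlite3` module in place of the MySQL driver; with MySQL Connector/Python the shape is the same, except that step 1 goes through `mysql.connector.connect(...)` with host and credentials. The `students` table and its rows are made up for illustration.

```python
import sqlite3


def run_query(conn: sqlite3.Connection, sql: str) -> list:
    """Execute a generated SQL string and return the rows as a list of tuples."""
    cursor = conn.cursor()   # Step 2: cursor pointing to the database
    cursor.execute(sql)      # Step 3: pass the query as a string
    rows = cursor.fetchall()
    cursor.close()
    return rows


# Step 1 with MySQL would pass host, user and password to the driver;
# an in-memory SQLite database needs none of that.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, cgpa REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Asha", 8.9), ("Ravi", 7.4)],
)

# Process the returned tuples into a readable answer.
for name, cgpa in run_query(conn, "SELECT name, cgpa FROM students ORDER BY name"):
    print(f"{name}: {cgpa}")
```

Each row comes back as a tuple, so the final formatting step only has to unpack it in the order of the selected columns.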

pipeline_nltk

NLTK Chat

Based on NLTK (Natural Language Toolkit) library
Chat Provision for Regular (Non-Database) queries like Hello, Thank You etc.
Based on simple regex (Regular Expression) matching of queries
Pairs variable coded by the programmer contains predefined regex query -> output mapping
Reflections variable contains pronouns mapping from question to answer
Lightweight and easy to use
Helps make the chatbot more interesting by incorporating daily life conversations
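The pairs and reflections mechanism can be imitated with the standard `re` module alone. The sketch below is not the NLTK implementation, and the patterns and replies are invented for illustration; it only shows how a regex match selects a reply and how captured text has its pronouns swapped before being echoed back.

```python
import re
from typing import Dict, List, Optional, Tuple

# Pronoun swaps applied to text captured from the user's message.
REFLECTIONS: Dict[str, str] = {"i": "you", "my": "your", "am": "are", "you": "me"}

# (regex, response template); {0} is filled with the reflected first group.
PAIRS: List[Tuple[str, str]] = [
    (r"hi|hello|hey", "Hello! How can I help you?"),
    (r"i am (.*)", "Why are you {0}?"),
    (r"thank you|thanks", "You are welcome!"),
]


def reflect(fragment: str) -> str:
    """Swap first and second person words in a captured fragment."""
    return " ".join(REFLECTIONS.get(word, word) for word in fragment.lower().split())


def respond(message: str) -> Optional[str]:
    cleaned = message.strip().rstrip("!?.")
    for pattern, template in PAIRS:
        match = re.fullmatch(pattern, cleaned, flags=re.IGNORECASE)
        if match:
            groups = [reflect(group) for group in match.groups()]
            return template.format(*groups)
    return None  # no pair matched


print(respond("Hello"))
print(respond("I am worried about my marks"))
```

A `None` result marks a message that no predefined pair covers, which is the kind of input the feedback module is meant to record.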

pipeline_feedback

Feedback Module

Practical software systems cannot be infallible. Feedback is the means of striving towards perfection.
Two steps:-
- Sentence Similarity
- Data Collection

Sentence Similarity Module

sentence_similarity

Feedback Data Collection

Data is collected continuously from the users and testers

Sentence Similarity module: -
- Misclassification of input query is tackled
- The specific text followed by the feedback label is recorded
- Feedback label is either database or non-database tag labelled by the user
NL-SQL Module:-
- NL-SQL misconversion is tackled
- The input query is recorded in a csv file and the file is forwarded to the code maintainer
NLTK Chat Module: -
- Tackles regular chat that is not covered by the predefined chat pairs
- The input query is recorded in a csv file and the file is forwarded to the code maintainer
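Recording feedback comes down to appending rows to a csv file. Here is a minimal sketch with the standard `csv` module; the column names, module tags and example queries are assumptions, and an in-memory buffer stands in for the file on disk so the example has no side effects.

```python
import csv
import io
from typing import TextIO

FIELDS = ["module", "query", "label"]  # column names are assumed


def record_feedback(out: TextIO, module: str, query: str, label: str = "") -> None:
    """Append one feedback row; `label` is only filled for classifier feedback."""
    csv.writer(out).writerow([module, query, label])


# An in-memory buffer stands in for the csv file sent to the maintainer.
log = io.StringIO()
csv.writer(log).writerow(FIELDS)
record_feedback(log, "classifier", "credits of CS101", "database")
record_feedback(log, "nl_sql", "list students, sorted by cgpa")
print(log.getvalue())
```

Using `csv.writer` rather than joining strings by hand keeps queries that contain commas intact, since such fields are quoted automatically.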

Continual Learning/ Online Training

Ability of a Machine Learning model to learn continually from a stream of data
Is of prime importance in production environments
User feedback is stored in a separate csv file
Sentence Classifier module learns from the collected feedback data at regular intervals of time
The classifier learns with time and hence becomes better
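The idea of learning from a stream of feedback can be shown with a much simpler model than the CNN. The sketch below swaps in a tiny bag-of-words perceptron purely as a stand-in: it is updated one labelled feedback example at a time, which is the essence of online training. The feedback rows and the number of passes are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


class OnlinePerceptron:
    """Tiny bag-of-words classifier that can be updated one example at a time."""

    def __init__(self) -> None:
        self.weights: Dict[str, float] = defaultdict(float)
        self.bias = 0.0

    def _score(self, text: str) -> float:
        return self.bias + sum(self.weights[w] for w in text.lower().split())

    def predict(self, text: str) -> str:
        return "database" if self._score(text) > 0 else "non-database"

    def update(self, text: str, label: str) -> None:
        """Perceptron rule: adjust the weights only when the prediction is wrong."""
        if self.predict(text) == label:
            return
        step = 1.0 if label == "database" else -1.0
        self.bias += step
        for word in text.lower().split():
            self.weights[word] += step


def learn_from_feedback(model: OnlinePerceptron,
                        feedback: Iterable[Tuple[str, str]],
                        epochs: int = 5) -> None:
    """Replay the collected feedback rows a few times, as a periodic job would."""
    rows = list(feedback)
    for _ in range(epochs):
        for text, label in rows:
            model.update(text, label)


# Hypothetical rows as they might appear in the feedback csv file.
feedback = [
    ("marks of ravi in maths", "database"),
    ("thank you", "non-database"),
    ("credits of physics", "database"),
]

model = OnlinePerceptron()
learn_from_feedback(model, feedback)
print(model.predict("marks of asha"))
print(model.predict("thank you so much"))
```

With a neural classifier the same loop would run a few gradient steps on the new feedback instead of the perceptron update, typically mixed with older data to avoid forgetting.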

pipeline_sentiment

Sentiment Analysis Module

We carry out sentiment analysis of all users using the chatbot platform
Sentiment of students when chatting with the chatbot can give various insights
- Mental state of the student
- Overall satisfaction of group of students with the results/marks obtained

Implementation done using Vader-Sentiment-Analysis

VADER is a rule-based sentiment analysis tool that does lexicon-based scoring
Sentiment (compound) scores lie between -1 and 1
Sentiment score of:
- Between -0.05 and 0.05 is neutral
- -0.05 or lower is negative
- 0.05 or higher is positive
The sentiment scores of all sentences are averaged to determine the overall sentiment of the user

This completes the summary of all the modules used in the chatbot pipeline.
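The thresholding and averaging can be sketched in a few lines. The example works on compound scores that are already computed, so it does not call VADER itself; the cut-offs follow the VADER documentation (0.05 or higher is positive, -0.05 or lower is negative), and the scores in the example are made up.

```python
from statistics import mean
from typing import List


def label(compound: float) -> str:
    """Map a VADER-style compound score in [-1, 1] to a sentiment label."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"


def user_sentiment(scores: List[float]) -> str:
    """Average the per-sentence scores, then label the average."""
    return label(mean(scores)) if scores else "neutral"


# Made-up compound scores for three sentences from one user.
print(user_sentiment([0.6, -0.2, 0.1]))  # positive
```

Averaging before labelling means one strongly negative sentence can outweigh several mildly positive ones, which may or may not be the desired behaviour for a given deployment.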

db_query

This was a team project, and in this blog I have taken you through my part of it. There is much more to it, like the final interface and data analysis, which were done by my classmates at IIT-Patna.

You can find the code for this project on GitHub here.

Stay tuned for more ML and DL content!


Nischal Ashok Kumar