IIT-Patna Academic Alexa

7 minute read

Published:

basic

In this post, I will take you through the Chatbot IIT-Patna Academic Alexa which is a sentiment-aware intelligent information retrieval based system for the academic portal of IIT-P.

Chatbots are one of the best applications of Natural Language Processing in improving human-computer interaction. The idea of the first chatbot was conceived in 1964, and since then, the development and usage of chatbots has sky-rocketed, owing to the advancements in computing, deep learning and deployable machine learning. Inspired from many such applications we build a novel chatbot for information retrieval for IIT-Patna’s academic data.

Motivation

Imagine you have a huge database consisting of the academic and personal details of students enrolled in your college/ organization. Now, we want to retrieve certain data like the marks of a student in a particular subject, or the credits of a subject, the number of subjects taught in a particular year etc. How do we generally do it? Simple! Just open it in a csv file and search for the parameters manually. As you have guessed that this will be a tedious and unfeasible task.

Now, what if we store the database in the form of MySQL relational tables and then write appropriate queries for performing the task? Sounds easy right? But is everyone computer literate? Can everyone write MySQL queries for retrieval of information for the data?

To address this task we set out to come up with a chatbot which takes Natural Language of the users as input and converts it into the appropriate SQL query and automatically fetches and displays the output. We design an entire chatbot pipeline for performing the given task.

Chatbot Pipeline

pipeline_diagram

Components of the Custom-Designed Pipeline of IIT-P Academic Alexa: -

  1. Input Module
  2. Sentence Classifier
  3. NL to SQL Engine
  4. NLTK Regular Chat
  5. Sentence Similarity Module
  6. Feedback Module
  7. Continual Learning System
  8. Sentiment Analyzer

pipeline_classifier

Sentence Classifier Module

  • Each query to the chatbot belongs to either database query or non-database query category.
  • Database queries are those which are addressed for information retrieval from the database.
  • Non-Database queries are those which conduct normal-conversation like Hi, How are you? Thank You etc.
  • Sentence Classifier Module is Deep Learning Based Binary Classifier for classifying as “Database” or “Non-Database” query
  • Multi-Channel Convolutional Neural Network Model coded in Tensorflow
  • Self-Prepared and annotated the train and the test set
  • Model used for real-time inferencing (lightweight for deployability)
  • Capable of adapting to Continual Learning (or Online Learning Environment)

Model Specifications

multicnn_diagram

Motivation of using Multi-CNN

multicnn_demonstration

  • CNNs capture local features of the input
  • Input to sentence classifier in our case depends of local feature mapping
  • Local features in textual context can be visualized as n-gram based features
  • Inherit n-gram feature modelling in multi-cnn is hence useful.

The sentence classifier mentioned here is trained on the self-procured and annotated train set and evaluated on the test set. It acheives about 93% accuracy.

pipeline_nlsql

NL to SQL Engine

Our module: -

  • Converts Natural Language Data into SQL (Structured-Query-Language) for information retrieval from the database
  • Works on self-developed algorithms based on the “Dependency-Tree-Parser” of the natural language query
  • Works on questions of type: -
    • List
    • Which
    • What
    • How Many
    • Who
  • Algorithm proposed here is extendable

For coming up with the idea, the following steps were performed: -

  • Carried out literature search on traditional methods for NL to SQL conversion
  • NL to SQL is an open-ended research problem in the field of lexical semantics and semantic parsing
  • Decided to use the “Syntax” and “Lexical Semantics” of the natural language data
  • Hypothesized about using Part-of-Speech-Tagging
  • Investigated the problem by using “Dependency Tree Parsers”
  • Developed a basic tree-parsing algorithm for the questions.
  • Implemented a DFS (Depth-First-Search) approach for certain type of questions

The NL-SQL part of the project was done by me and my classmate Vaibhav.

Sample pseudocode for “list question” conversion is given below

list_question

Implementation Details

  • Explored different options for libraries like nltk and spaCy
  • Decided to use spaCy library for obtaining the Dependency Tree of the natural language query
  • spaCy is
    • A Free Open Source Natural Language Processing Library in Python
    • Offers modules for NER, POS, Sentence Similarity and Text Classification
  • Reasons for using spaCy
    • Lightweight and easy to use
    • Easily integratable with python
    • Provides robust results as compared to traditional text processing libraries.
  • spaCy offers the fastest syntactic parser in the world of accuracy within 1% of the best available (Choi et al., 2015)

Python SQL Connector

  • Part of the NL-SQL converter engine
  • Takes as input the SQL query generated from the natural language data
  • Establishes connection with the MySQL server in the local machine (where the database is stored)
  • Fetches and Processes the returned tuple structure for forming the appropriate output

Implementation Details:

  • Use mysql connector: a standard database driver provided by MySQL
  • Step 1: Authorization
  • Step 2: Creating a cursor pointing to the database
  • Step 3: Provide the query in form of string and process the returned output

pipeline_nltk

NLTK Chat

  • Based on NLTK (Natural Language Toolkit) library
  • Chat Provision for Regular (Non-Database) queries like Hello, Thank You etc.
  • Based on simple regex (Regular Expression) matching of queries
  • Pairs variable coded by the programmer contains predefined regex query -> output mapping
  • Reflections variable contains pronouns mapping from question to answer
  • Lightweight and easy to use
  • Helps make the chatbot more interesting by incorporating daily life conversations

pipeline_feedback

Feedback Module

  • Practical software systems cannot be infallible. Feedback is the means of striving towards perfection.
  • Two steps:-
    • Sentence Similarity
    • Data Collection

Sentence Similarity Module

sentence_similarity

Feedback Data Collection

Continuous data is collected from the users/ testers

  1. Sentence Similarity module: -
    • Misclassification of input query is tackled
    • The specific text followed by the feedback label is recorded
    • Feedback label is either database or non-database tag labelled by the user
  2. NL-SQL Module:-
    • NL-SQL misconversion is tackled
    • The input query is recorded in a csv file and the file is forwarded to the code maintainer
  3. NLTK Chat Module: -
    • Tackles Regular chat which is not present in the pre-defined chat pair
    • The input query is recorded in a csv file and the file is forwarded to the code maintainer

Continual Learning/ Online Training

  • Ability of a Machine Learning model to learn continually from a stream of data
  • Is of prime importance in production environments
  • User feedback is stored in a separate csv file
  • Sentence Classifier module learns from the collected feedback data at regular intervals of time
  • The classifier learns with time and hence becomes better

pipeline_sentiment

Sentiment Analysis Module

  • We carry out sentiment analysis of all users using the chatbot platform
  • Sentiment of students when chatting with the chatbot can give various insights
    • Mental state of the student
    • Overall satisfaction of group of students with the results/marks obtained

Implementation done using Vader-Sentiment-Analysis

  • VADER is a rule-based sentiment analysis tool which does lexicon based scoring
  • Sentiment scores are between -1 and 1
  • Sentiment score of:
    • 0 is neutral
    • Less than -0.05 is negative
    • Greater than 0.05 is positive
  • Sentiment of all sentences are averaged to determine the overall sentiment of the user

Hence this completes the summary of all the modules used in the chatbot pipeline.

db_query

This was a team project and here in this blog I have taken you through my part of the project. There is much more to this like the final interface, data analysis etc. which was done by my classmates at IIT-Patna.

You can find the code for this project on GitHub here.

Stay tuned for more ML and DL content!