Building a Chatbot using TensorFlow
Sai Nishanth Dilly
Graduate Student
School of Computing and Informatics
University of North Carolina at Charlotte
E-mail: sdilly@uncc.edu
Abstract— The term “Chatterbot” was originally coined by Michael Mauldin in 1994. A chatbot (chatterbot) is an artificial conversational entity: a computer program that carries on a conversation through textual or auditory methods. Chatbots became extremely popular in 2016, and many social-media messaging platforms such as Messenger, WeChat and Skype now host chatbots built by developers. We now have bots available for almost everything: we can chat with a CNN bot to get the news, with a Microsoft bot to get assistance with Microsoft products, or with an Amazon bot to have products delivered right to our home. So how does a chatbot do all this? Some chatbots use sophisticated Natural Language Processing systems, while many simpler chatbots just search for keywords in the input and pull the reply with the best-matching keywords from a database. With the advent of Deep Learning, chatbots can instead use end-to-end Machine Learning: a single model is trained on a single dataset and makes no assumptions about the use case. In this project I build such a chatbot in TensorFlow using Recurrent Neural Networks, trained on data from one particular domain (for example movies, sports, business or science). I use the Movie-Dialogs Corpus compiled at Cornell University, which contains conversations between characters from more than 600 movies, and split it into training and test sets: one set holds the text from one side of each conversation and the other holds the responses from the other side. The next step is to format the data properly for building the model, which is done through tokenization. For the model itself I use TensorFlow’s sequence-to-sequence model with an embedding attention mechanism. After creating the model, we train it and save it as a checkpoint file; training takes a few hours. Finally, we can test the model by giving it input text.
I. INTRODUCTION
Motivation:
TENSORFLOW
TensorFlow is an open-source machine learning library developed by the Google Brain team and released in November 2015. It relies on the construction of dataflow graphs with nodes that represent mathematical operations (ops) and edges that represent tensors (multidimensional arrays represented internally as numpy ndarrays). The dataflow graph is a summary of all the computations that are executed asynchronously and in parallel within a TensorFlow session on a given device (CPUs or GPUs). The library boasts true portability between devices, meaning that the same code can be used to run models in CPU-only or heterogeneous GPU-accelerated environments.
TensorFlow relies on highly optimized C++ for computation and supports APIs in C and C++ in addition to the Python API used to build this ChatBot.
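As a minimal sketch of this dataflow model (using the TensorFlow 1.x Python API that this project targets; the variable names are illustrative, not part of the chatbot code):

import tensorflow as tf

# Building the graph only declares ops; nothing is computed yet.
a = tf.placeholder(tf.float32, shape=[None, 3], name="a")
W = tf.Variable(tf.random_normal([3, 2]), name="W")
y = tf.matmul(a, W)  # an op whose input and output edges are tensors

# Computation happens only when the graph is run inside a session,
# on whatever device (CPU or GPU) TensorFlow places the ops.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={a: [[1.0, 2.0, 3.0]]}))  # a 1 x 2 array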
ChatBot:
Chatbots represent a potential shift in how people interact with data and services online. While there is currently a surge of interest in chatbot design and development, we lack knowledge about why people use chatbots. Chatbots are machine agents that serve as natural language user interfaces to data and service providers. Currently, chatbots are typically designed and developed for mobile messaging applications.
Designing a new interactive technology such as a chatbot requires in-depth knowledge of users’ motivations for using the technology, which allows the designer to overcome challenges regarding its adoption. More general knowledge is also needed to understand human–chatbot relationships. To our knowledge, no studies to date have investigated users’ motivations for interacting with chatbots. As a first step towards bridging this knowledge gap, we perform a study addressing the following research question:
RQ: Why do people use chatbots?
The study contributes new knowledge regarding individuals’ motivations for using chatbots based on an online questionnaire completed by US chatbot users. The questionnaire includes an open question regarding the participants’ main motivations for using chatbots. The findings obtained using this approach can inform future designs intended to improve human–chatbot interactions.
Before we present the findings, we will first describe the relevant background for our study. We then present the method and findings of the study. In the discussion, we address the implications of the study’s findings for the design and development of chatbots.
Models for generating chatbots
Rule-based models make it easy for anyone to create a bot, but it is incredibly difficult to create a bot that answers complex queries. The pattern matching is rather weak, and hence AIML-based bots suffer when they encounter a sentence that doesn’t contain any known pattern. It is also time-consuming and takes a lot of effort to write the rules manually. What if we could build a bot that learns from existing conversations (between humans)? This is where Machine Learning comes in.
Let us call these models that automatically learn from data, Intelligent models. The Intelligent models can be further classified into:
Retrieval-based models
Generative models
Retrieval-based models pick a response from a collection of responses based on the query. They do not generate any new sentences, so we don’t need to worry about grammar. Generative models are quite intelligent: they generate a response word by word based on the query. Because of this, the generated responses are prone to grammatical errors. These models are difficult to train, as they need to learn the proper sentence structure by themselves. However, once trained, generative models outperform retrieval-based models in handling previously unseen queries and create for the user the impression of talking with a human (a toddler, maybe).
Read Deep Learning for Chatbots by Denny Britz, where he talks about the length of conversations, open vs. closed domain dialogs, challenges in generative models such as context-based responses, coherent personality and understanding the user’s intention, and how to evaluate these models.
PROBLEM DEFINITION:
The problem is that extracting explicit features from the reviews is a difficult task, because we need to disambiguate the sense of the reviews before performing sentiment analysis. The main challenge here is to recognize false reviews, which do not mean what they appear to when read. A human being can read a review and understand whether it is false or genuine, but the same is not true of an automated machine or tool. Moreover, after determining the nature of a review, we look at its repetition, i.e., how many times it is repeated, and find which product is trending. So, this is the problem we need to look into.
OBJECTIVE OF THE PROJECT:
At the highest level, the system accomplishes the following tasks:
1. Gather reviews about the product from online websites.
2. Select a set of product features to rate on.
3. Determine the ratings for the selected features based on the sentiment of the sentence in which it appears.
4. Summarize the ratings for the features as the total number of positive and negative points for each review.
5. Draw a graph based on the summarized ratings and depicting the trend of a feature.
Dataset
The bot comes with a script to do the pre-processing for the Cornell Movie-Dialogs Corpus, created by the wonderful Cristian Danescu-Niculescu-Mizil and Lillian Lee at, guess where, Cornell University. This is an extremely well-formatted dataset of dialogues from movies. It has 220,579 conversational exchanges between 10,292 pairs of movie characters, involving 9,035 characters from 617 movies, with 304,713 total utterances.
The corpus is distributed together with the paper “Chameleons in Imagined Conversations: A New Approach to Understanding Coordination of Linguistic Style in Dialogs”, which was featured on Nature.com. It is a fascinating paper that highlights several cognitive biases in conversations and will help you make your chatbot more realistic. I highly recommend that you read it.
The preprocessing is pretty basic. I treat most punctuation marks as separate tokens, normalize all digits to ‘#’, and lowercase everything. I noticed that the dialogs contain a lot of <u> and </u>, as well as [ and ], so I just get rid of those. You’re welcome to experiment with other ways to pre-process your data.
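A minimal sketch of this pre-processing in Python (the helper name and the exact punctuation set are illustrative, not the project’s actual script):

import re

def basic_clean(line):
    # Lowercase, drop corpus markup, normalize digits, split off punctuation.
    line = line.lower()
    line = re.sub(r"</?u>|\[|\]", "", line)          # remove <u>, </u>, [ and ]
    line = re.sub(r"\d", "#", line)                  # normalize every digit to '#'
    line = re.sub(r"([?.!,:;'\"-])", r" \1 ", line)  # punctuation as separate tokens
    return line.split()

print(basic_clean("He said: <u>call me in 30 minutes!</u>"))
# ['he', 'said', ':', 'call', 'me', 'in', '##', 'minutes', '!']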
The model
The chatbot is based on the translate model in the TensorFlow repository, with some modifications to make it work as a chatbot. It is a sequence to sequence model with an attention decoder: the encoder input is a single utterance, and the decoder output is the response to that utterance. The chatbot is built using a wrapper function for the sequence to sequence model with bucketing.
Seq2Seq
The Sequence to Sequence model, introduced in Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, has since become the go-to model for dialogue systems and machine translation. It consists of two RNNs (Recurrent Neural Networks): an encoder and a decoder. The encoder takes a sequence (sentence) as input and processes one symbol (word) at each timestep. Its objective is to convert the sequence of symbols into a fixed-size feature vector that encodes only the important information in the sequence while discarding the unnecessary information. You can visualize the data flow in the encoder along the time axis as the flow of local information from one end of the sequence to the other.
Each hidden state influences the next hidden state and the final hidden state can be seen as the summary of the sequence. This state is called the context or thought vector, as it represents the intention of the sequence. From the context, the decoder generates another sequence, one symbol(word) at a time. Here, at each time step, the decoder is influenced by the context and the previously generated symbols.
There are a few challenges in using this model. The most troubling one is that the model cannot handle variable-length sequences, which is a problem because almost all sequence-to-sequence applications involve variable-length sequences. The next one is the vocabulary size: the decoder has to run a softmax over a large vocabulary of, say, 20,000 words for each word in the output, which slows down training even if your hardware is capable of handling it. Representation of words is also of great importance: how do you represent the words in the sequence? Using one-hot vectors means dealing with large sparse vectors due to the large vocabulary, and one-hot vectors carry no semantic meaning. Let’s look into how we can face these challenges, one by one.
LIMITATIONS:
One of the limitations of the seq2seq framework is that the entire information in the input sentence must be encoded into a fixed-length vector, the context. As the length of the sequence grows, we start losing a considerable amount of information. This is why the basic seq2seq model does not work well when decoding long sequences. The attention mechanism, introduced in the paper Neural Machine Translation by Jointly Learning to Align and Translate, allows the decoder to selectively look at the input sequence while decoding. This takes the pressure off the encoder to encode every piece of useful information from the input.
How does it work? During each timestep in the decoder, instead of using a fixed context (the last hidden state of the encoder), a distinct context vector $c_i$ is used for generating word $y_i$. This context vector $c_i$ is basically the weighted sum of the hidden states of the encoder:
$c_i = \sum_{j=1}^{n} \alpha_{ij} h_j$
where $n$ is the length of the input sequence and $h_j$ is the encoder hidden state at time step $j$. The weights are obtained by a softmax over alignment scores:

$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}$

Here $e_{ij}$ is the alignment model, a function of the decoder’s previous hidden state $s_{i-1}$ and the $j$-th hidden state of the encoder. The alignment model is parameterized as a feedforward neural network that is jointly trained with the rest of the model.
Each hidden state in the encoder encodes information about the local context in that part of the sentence. As data flows from word 0 to word $n$, this local context information gets diluted, which makes it necessary for the decoder to peek at the encoder to recover the local contexts. Different parts of the input sequence contain the information necessary for generating different parts of the output sequence; in other words, each word in the output sequence is aligned to different parts of the input sequence. The alignment model gives us a measure of how well the output at position $i$ matches the inputs around position $j$. Based on this, we take a weighted sum of the input contexts (hidden states) to generate each word in the output sequence.
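A toy numerical sketch of this weighted sum in plain NumPy, with a randomly initialized feedforward scorer standing in for the learned alignment model (all sizes and names here are illustrative):

import numpy as np

np.random.seed(0)
n, hidden = 6, 8                       # input length, hidden state size
H = np.random.randn(n, hidden)         # encoder hidden states h_1..h_n
s_prev = np.random.randn(hidden)       # decoder's previous hidden state s_{i-1}

# Alignment model e_ij = v^T tanh(W [s_{i-1}; h_j]) -- a small feedforward net.
W = np.random.randn(hidden, 2 * hidden)
v = np.random.randn(hidden)
e = np.array([v @ np.tanh(W @ np.concatenate([s_prev, h_j])) for h_j in H])

alpha = np.exp(e) / np.exp(e).sum()    # softmax over alignment scores
c_i = alpha @ H                        # context vector: weighted sum of the h_j

print(alpha.round(2), c_i.shape)       # weights sum to 1; c_i has shape (hidden,)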
Padding
Before training, we work on the dataset to convert the variable length sequences into fixed length sequences, by padding. We use a few special symbols to fill in the sequence.
EOS : End of sentence
PAD : Filler
GO : Start decoding
UNK : Unknown; word not in vocabulary
Consider the following query-response pair.
Q : How are you?
A : I am fine.
Assuming that we would like our sentences (queries and responses) to be of fixed length, 10, this pair will be converted to:
Q : [ PAD, PAD, PAD, PAD, PAD, PAD, “?”, “you”, “are”, “How” ]
A : [ GO, “I”, “am”, “fine”, “.”, EOS, PAD, PAD, PAD, PAD ]
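A minimal sketch of this padding step (an illustrative helper, assuming the query is reversed before padding as shown above):

PAD, GO, EOS = "PAD", "GO", "EOS"

def pad_pair(q_tokens, a_tokens, length=10):
    # Query: reverse the tokens, then left-pad to the fixed length.
    q = [PAD] * (length - len(q_tokens)) + list(reversed(q_tokens))
    # Response: GO + tokens + EOS, then right-pad to the fixed length.
    a = [GO] + a_tokens + [EOS]
    a = a + [PAD] * (length - len(a))
    return q, a

q, a = pad_pair(["How", "are", "you", "?"], ["I", "am", "fine", "."])
print(q)  # ['PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', '?', 'you', 'are', 'How']
print(a)  # ['GO', 'I', 'am', 'fine', '.', 'EOS', 'PAD', 'PAD', 'PAD', 'PAD']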
Bucketing
Introduction of padding did solve the problem of variable length sequences, but consider the case of large sentences. If the largest sentence in our dataset is of length 100, we need to encode all our sentences to be of length 100, in order to not lose any words. Now, what happens to “How are you?” ? There will be 97 PAD symbols in the encoded version of the sentence. This will overshadow the actual information in the sentence.
Bucketing kind of solves this problem, by putting sentences into buckets of different sizes. Consider this list of buckets: [ (5,10), (10,15), (20,25), (40,50) ]. If the length of a query is 4 and the length of its response is 4 (as in our previous example), we put this pair in the bucket (5,10). The query will be padded to length 5 and the response will be padded to length 10. While running the model (training or predicting), we use a different model for each bucket, compatible with the lengths of the query and response. All these models share the same parameters and hence function exactly the same way.
If we are using the bucket (5,10), our sentences will be encoded to :
Q : [ PAD, “?”, “you”, “are”, “How” ]
A : [ GO, “I”, “am”, “fine”, “.”, EOS, PAD, PAD, PAD, PAD ]
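A sketch of how a bucket might be chosen and the pair padded to its sizes (illustrative helper and bucket handling, not the project’s actual code):

BUCKETS = [(5, 10), (10, 15), (20, 25), (40, 50)]
PAD, GO, EOS = "PAD", "GO", "EOS"

def bucket_and_pad(q_tokens, a_tokens):
    # Pick the smallest bucket that fits the query and the decoder
    # sequence (GO + response + EOS), then pad to the bucket's sizes.
    for enc_len, dec_len in BUCKETS:
        if len(q_tokens) <= enc_len and len(a_tokens) + 2 <= dec_len:
            q = [PAD] * (enc_len - len(q_tokens)) + list(reversed(q_tokens))
            a = [GO] + a_tokens + [EOS]
            a += [PAD] * (dec_len - len(a))
            return q, a
    return None  # too long for every bucket: drop or truncate the pair

q, a = bucket_and_pad(["How", "are", "you", "?"], ["I", "am", "fine", "."])
print(q)  # ['PAD', '?', 'you', 'are', 'How']
print(a)  # ['GO', 'I', 'am', 'fine', '.', 'EOS', 'PAD', 'PAD', 'PAD', 'PAD']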
Word Embedding
Word Embedding is a technique for learning dense representation of words in a low dimensional vector space. Each word can be seen as a point in this space, represented by a fixed length vector. Semantic relations between words are captured by this technique. The word vectors have some interesting properties.
paris – france + poland = warsaw.
The vector difference between paris and france captures the concept of capital city.
Word Embedding is typically done in the first layer of the network: the embedding layer, which maps a word (an index into the vocabulary) to a dense vector of a given size. In the seq2seq model, the weights of the embedding layer are jointly trained with the other parameters of the model.
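A minimal sketch of such an embedding layer in the TensorFlow 1.x API (variable names are illustrative):

import tensorflow as tf

vocab_size, embedding_size = 20000, 128

# One trainable row per vocabulary word; trained jointly with the model.
embedding = tf.get_variable("embedding", [vocab_size, embedding_size])

word_ids = tf.placeholder(tf.int32, shape=[None])            # word indices
word_vectors = tf.nn.embedding_lookup(embedding, word_ids)   # dense vectors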
Sampled Softmax
The idea is to avoid the growing cost of computing the normalization constant over the full vocabulary. We approximate the negative term of the gradient by importance sampling with a small number of samples, and at each step we update only the vectors associated with the correct word w and with the sampled words in V’. Once training is over, we use the full target vocabulary to compute the output probability of each target word.
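A hedged sketch of what this looks like with TensorFlow 1.x’s built-in sampled softmax loss (tensor names and sizes are illustrative assumptions):

import tensorflow as tf

vocab_size, hidden_size, num_sampled = 20000, 512, 512

# Output projection: weights and biases over the full target vocabulary.
w = tf.get_variable("proj_w", [vocab_size, hidden_size])
b = tf.get_variable("proj_b", [vocab_size])

decoder_output = tf.placeholder(tf.float32, [None, hidden_size])  # one time step
target_ids = tf.placeholder(tf.int32, [None, 1])                  # correct words

# During training, the softmax is evaluated over num_sampled words
# plus the true target instead of the full vocabulary.
loss = tf.nn.sampled_softmax_loss(weights=w, biases=b,
                                  labels=target_ids,
                                  inputs=decoder_output,
                                  num_sampled=num_sampled,
                                  num_classes=vocab_size)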
Seq2seq in TensorFlow
outputs, states = basic_rnn_seq2seq(encoder_inputs, decoder_inputs, cell)
encoder_inputs: a list of tensors representing inputs to the encoder
decoder_inputs: a list of tensors representing inputs to the decoder
cell: a single- or multiple-layer cell
outputs: a list of decoder_size tensors, each of dimension 1 x DECODE_VOCAB corresponding to the probability distribution at each time-step
states: a list of decoder_size tensors, each corresponds to the internal state of the decoder at every time-step.
outputs, states = embedding_rnn_seq2seq(
    encoder_inputs, decoder_inputs, cell,
    num_encoder_symbols, num_decoder_symbols,
    embedding_size, output_projection=None,
    feed_previous=False)
To embed your inputs and outputs, you need to specify the number of input and output tokens (num_encoder_symbols and num_decoder_symbols). Set feed_previous if you want to feed the previously predicted word back into the decoder, even when the model makes mistakes.
output_projection: a tuple of the projection weights and biases, used when you use sampled softmax.
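A hedged sketch of the attention variant used for this chatbot, based on the TensorFlow 1.x legacy seq2seq module (the module path, cell choice and placeholder shapes are assumptions; the project’s wrapper adds bucketing on top of this):

import tensorflow as tf
from tensorflow.contrib.legacy_seq2seq import embedding_attention_seq2seq

ENC_VOCAB, DEC_VOCAB, LAYER_SIZE, NUM_LAYERS, EMBED_SIZE = 20000, 20000, 512, 3, 512

cell = tf.contrib.rnn.MultiRNNCell(
    [tf.contrib.rnn.GRUCell(LAYER_SIZE) for _ in range(NUM_LAYERS)])

# One int32 placeholder per time step, as the legacy API expects.
encoder_inputs = [tf.placeholder(tf.int32, [None]) for _ in range(5)]
decoder_inputs = [tf.placeholder(tf.int32, [None]) for _ in range(10)]

outputs, states = embedding_attention_seq2seq(
    encoder_inputs, decoder_inputs, cell,
    num_encoder_symbols=ENC_VOCAB,
    num_decoder_symbols=DEC_VOCAB,
    embedding_size=EMBED_SIZE,
    output_projection=None,   # pass (w, b) here when using sampled softmax
    feed_previous=False)      # True at test time: feed back predicted words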
Configuration
INI Key | Description | Current Value
mode | train or test | test
train_enc | encoder inputs file for training (X_train) | data/train.enc
train_dec | decoder inputs file for training (Y_train) | data/train.dec
test_enc | encoder inputs file for testing (X_test) | data/test.enc
test_dec | decoder inputs file for testing (Y_test) | data/test.dec
working_directory | folder where checkpoints, vocabulary and temporary data are stored | working_dir/
pretrained_model | previously trained model saved to file | checkpoint
enc_vocab_size | encoder vocabulary size | 20000
dec_vocab_size | decoder vocabulary size | 20000
num_layers | number of layers | 3
layer_size | number of units in a layer | 512
max_train_data_size | limit on the amount of training data | 0 (no limit)
batch_size | batch size for training; modify this based on your hardware | 64
steps_per_checkpoint | at each checkpoint, parameters are saved and the model is evaluated | 300
learning_rate | learning rate | 0.5
learning_rate_decay_factor | learning rate decay factor | 0.99
max_gradient_norm | gradient clipping threshold | 5.0
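Put together, the configuration file might look like the following sketch (the file name and section names are assumptions; the project’s actual INI file may differ):

; seq2seq.ini -- assumed file name
[strings]
mode = train
train_enc = data/train.enc
train_dec = data/train.dec
test_enc = data/test.enc
test_dec = data/test.dec
working_directory = working_dir/

[ints]
enc_vocab_size = 20000
dec_vocab_size = 20000
num_layers = 3
layer_size = 512
max_train_data_size = 0
batch_size = 64
steps_per_checkpoint = 300

[floats]
learning_rate = 0.5
learning_rate_decay_factor = 0.99
max_gradient_norm = 5.0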
II. LITERATURE SURVEY
III. EXISTING SYSTEM
DISADVANTAGES OF EXISTING SYSTEM:
1. The performance was limited because a sentence contains much less information than a review.
2. They propose to use opinion-oriented “scenario templates” to act as summary representations of the opinions expressed in a document, or a set of documents.
3. Using noun phrases tends to produce too many non-terms, while using recurring phrases misses many low-frequency terms, terms with variations, and terms with only one word.
IV. PROPOSED SYSTEM
V. SOFTWARE REQUIREMENTS
TOOLS USED : POS Tagger, SENTIWORD.NET
OPERATING SYSTEM : WINDOWS
LANGUAGES : JAVA
DATABASE : FILE SYSTEMS
VI. ALGORITHMS
Apriori Algorithm
Step-1:
XML Pre-processing:
VII. OVERVIEW OF TECHNOLOGIES USED
POS TAGGER
Problems?
The bot is very dramatic (thanks to Hollywood screenwriters)
Topics of conversations aren’t realistic.
Responses are always fixed for one encoder input
Inconsistent personality
Uses only the last previous utterance as the input for the encoder
Doesn’t keep track of information about users
Future Scope
1. Train on multiple datasets
Bots can only talk as well as the data they are trained on, so datasets are pretty key. If you play around with the starter chatbot, you’ll realize that it can’t really hold normal conversations such as “how are you?”, “what do you want for lunch?”, or “bye”, and it’s prone to saying dramatic things like “what about the gun?”, “you’re in trouble”, “you’re in love”. The bot also tends to answer with questions. This makes sense, since Hollywood screenwriters need dramatic details and questions to advance the plot. However, training on movie dialogues makes your bot sound like a dummy version of the Terminator.
2. Use more than just one utterance as the encoder
For the chatbot, the encoder input is the last single utterance, and the decoder output is the response to it. You can see that this is problematic because conversations often go on for more than two utterances, and you have to rely on the previous utterances to construct an appropriate response. You can modify the model to be able to use more than one utterance as the encoder input. This will make your model more like a summarization model, in which the encoder is longer than the decoder.
3. Make your chatbot remember information from the previous conversation
Right now, if I tell the bot my name and ask what my name is right after, the bot will be unable to answer. This makes sense, since we only use the last previous utterance as the input to predict the response without incorporating any previous information; however, this is unacceptable in real-life conversation.
4. Create a chatbot with personality
Right now, the chatbot is trained on the responses from thousands of characters, so you can expect the responses to be rather erratic. It also can’t answer simple questions about personal information like “what’s your name?” or “where are you from?”, because those tokens are mostly unknown tokens due to the pre-processing phase that gets rid of rare words.
You can change this by using one of the two approaches (or another, this is a very open field). Approach 1:
At the decoder phase, inject consistent information about the bot such as name, age, hometown, current location, job.
Approach 2:
Use the decoder inputs from one character only. For example: your own Sheldon Cooper bot!
There are also some pretty good Quora answers to this. I’m super excited about this direction and I’m working on a project dealing with this, so if you’re interested in talking more about this, hit me up!
5. Use character-level sequence to sequence model for the chatbot
We’ve built a character-level language model and it seems to be working pretty well, so is there any chance a character-level sequence to sequence model will work?
An obvious advantage of this model is that it uses a much smaller vocabulary so we can use full softmax instead of sampled softmax, and there will be no unknown tokens! An obvious disadvantage is that the sequence will be much longer — it’ll be approximately 4 times longer than the token-level one.
6. Create a feedback loop that allows users to train your chatbot
That’s right, you can create a feedback loop so that users can help the bot learn the right response — treat the bot like a baby. So when the bot says something incorrect, users can say: “That’s wrong. You should have said xyz.” and the bot will correct its response to xyz.
It can be dangerous, because users are mean and can turn your chatbot into something utterly racist and sexist. Microsoft did this with their chatbot Tay, and look what happened!
CONCLUSION
A brief overview of the history of chatbots has been given and the encoder-decoder model has been described in detail. Afterwards an in-depth survey of scientific literature related to conversational models, published in the last 3 years, was presented. Various techniques and architectures were discussed that were proposed to augment the encoder-decoder model and to make conversational agents more natural and human-like. Criticism was also presented regarding some of the properties of current chatbot models, and it was shown how and why several of the techniques currently employed are inappropriate for the task of modeling conversations. Furthermore, preliminary experiments were run by training the Transformer model on two different dialog datasets. The performance of the trained models was analyzed with the help of automatic evaluation metrics and by comparing output responses for a set of source utterances. Finally, it was concluded that further, more detailed experiments are needed in order to determine whether the Transformer model is truly worse than the standard RNN-based seq2seq model for the task of conversational modeling. In addition to presenting directions and experiments that should be conducted with the Transformer model, several ideas were presented related to solving some of the issues brought up in the criticism given earlier in the paper. These ideas are an important direction for future research in the domain of conversational agents, since they are not model-related, but rather try to solve fundamental issues with current dialog agents. Continuation of this work will focus on trying to make open-domain conversational models as human-like as possible by implementing the ideas presented.
References
Telegram Bots : An introduction for developers
Botfather
Mitsuku
Deep Learning for Chatbots : Introduction
Understanding LSTMS
Padding and Bucketing
On word embeddings – Part 1
On word embeddings – Part 2 (sampled softmax)
Attention and Memory in Deep Learning and NLP
easy_seq2seq
English-French Translation in TensorFlow
Cornell Movie Dialog Corpus
Cornell Movie Dialog Corpus – Preprocessed
Flask : Quick Start