This is intended to be the first part of a multi-part blog on implementing Google's seq2seq architecture for the purpose of making a chatbot.
Full disclaimer: I am totally new to the field of deep learning, so this is less a tutorial than a record of how I went about it. There will likely be many egregious errors on my part, so take everything I do with a grain of salt (and if you know a better way to do it, let me know!).
The code being referenced will be on my github: here
Quick Intro to Seq2Seq
Seq2Seq is a neural network architecture that makes use of the time-dependent nature of RNNs. It allows one to map sequences of vectors to sequences of vectors. This is powerful: it underlies really awesome software like translators, chatbots, and more.
I chose to use the same OpenSubtitles movie data set Google used in their neural conversational model paper. To begin with, I created a script which downloaded, unzipped, and tokenized the lines in the movie subtitle files. This script output a raw.txt file containing space-delimited tokens on each line, where a line represents a phrase spoken by an individual.
I made the same assumption Google made in their paper, which is that each successive line is a 'response' to the previous source line. This will introduce a lot of noise into the training data, but hopefully it is an acceptable assumption.
When considering the topic of word embeddings, I considered using Torch's built-in nn.LookupTable() to act as a projection from the index space to the continuous representation space. In the end I decided against it, as it did not seem to allow for minibatch training. Instead I chose to use rotmanmi's word2vec wrapper to create my own lookup table of sorts based on my input vocabulary. Using this method, each token was represented by a 300-dimensional vector.
In the future I would like to build my own application specific word embeddings software.
To begin with, I mapped each token to an index in a vocabulary mapping. I then mapped each phrase to a Lua table of indices, for example:
["The", "cat", "is", "on", "the", "mat", ".", "EOS"] = [1, 2, 3, 4, 1, 5, 6, 7]
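The mapping above can be sketched in a few lines. This is a minimal Python illustration, not the Lua/Torch code from the repository: the names `build_vocab` and `encode` are hypothetical, and the lowercasing step is an assumption inferred from "The" and "the" sharing index 1 in the example.

```python
def build_vocab(phrases):
    """Give each distinct token a 1-based index in order of first appearance.

    Assumption: tokens are lowercased before lookup, so "The" and "the"
    share an index. Indices start at 1 to mirror Lua's convention.
    """
    vocab = {}
    for phrase in phrases:
        for token in phrase:
            key = token.lower()
            if key not in vocab:
                vocab[key] = len(vocab) + 1
    return vocab


def encode(phrase, vocab):
    """Replace every token in a phrase with its vocabulary index."""
    return [vocab[token.lower()] for token in phrase]


phrase = ["The", "cat", "is", "on", "the", "mat", ".", "EOS"]
vocab = build_vocab([phrase])
indices = encode(phrase, vocab)
print(indices)  # [1, 2, 3, 4, 1, 5, 6, 7]
```

In a real run the vocabulary would be built over the whole corpus first and then reused to encode every phrase.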
Note the addition of the EOS token. This was appended to every phrase during the word-to-index mapping process. Infrequent tokens were removed and replaced with the UNK token. I then found the length of the longest phrase in the corpus file, and padded every other phrase to this length, using a padding index of -1. This was done so that everything could be put into a single torch Tensor. As far as I can tell from experimentation and the documentation, a tensor cannot hold rows of different lengths along any dimension.
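As a rough sketch of these preprocessing steps (again in Python rather than the Lua/Torch used in the repository), the snippet below appends EOS, replaces rare tokens with UNK, and pads every phrase to the longest length with -1. The function name `prepare`, the `min_count` threshold, and the choice to reserve indices 1 and 2 for UNK and EOS are all assumptions made for illustration; the post does not say what frequency cutoff was used.

```python
from collections import Counter

EOS, UNK = "EOS", "UNK"
PAD_INDEX = -1  # padding index mentioned in the post


def prepare(phrases, min_count=2):
    """Encode phrases as equal-length lists of indices.

    min_count is a hypothetical cutoff: tokens seen fewer times than
    this across the corpus are replaced by UNK.
    """
    counts = Counter(tok for phrase in phrases for tok in phrase)
    vocab = {UNK: 1, EOS: 2}  # assumed reserved indices
    encoded = []
    for phrase in phrases:
        row = []
        for tok in phrase + [EOS]:  # EOS is appended to every phrase
            if tok != EOS and counts[tok] < min_count:
                tok = UNK
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
            row.append(vocab[tok])
        encoded.append(row)
    # Pad every row to the length of the longest one so the result is rectangular.
    max_len = max(len(row) for row in encoded)
    padded = [row + [PAD_INDEX] * (max_len - len(row)) for row in encoded]
    return padded, vocab


phrases = [["hi", "there"], ["hi"], ["hi", "how", "are", "you"]]
padded, vocab = prepare(phrases)
for row in padded:
    print(row)
```

Because every row ends up the same length, the result could be loaded directly into a 2D tensor. The cost is that one very long phrase forces padding onto every other row.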
The final step of my minibatch preparation was to actually create the minibatch files. In code, my function takes in three parameters:
MiniBatchLoader.createMiniBatches(evalFrac, testFrac, trainFrac,...)
These numbers represent the fractional quantities of minibatches to place in each file. By default I have it set to:
trainFrac = 0.95, evalFrac = 0, testFrac = 0.05
The minibatches were created by iteratively building (source, target) pairs. The ith and (i+1)th lines were paired up as source and target respectively. Each source-target pair was placed in a tensor, which was then divided into train/test/eval minibatches.
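The pairing and splitting logic might look roughly like the following Python sketch. It is not the `MiniBatchLoader` implementation: `make_pairs`, `split_batches`, and `batch_size` are hypothetical names, and it assumes that every line except the last serves as a source for the line after it.

```python
def make_pairs(lines):
    """Pair line i (source) with line i + 1 (target)."""
    return [(lines[i], lines[i + 1]) for i in range(len(lines) - 1)]


def split_batches(pairs, batch_size, train_frac=0.95, eval_frac=0.0, test_frac=0.05):
    """Group pairs into minibatches, then divide the batches by fraction.

    test_frac is listed for readability; the test set simply receives
    whatever is left after the train and eval batches are taken.
    """
    batches = [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]
    n_train = int(round(len(batches) * train_frac))
    n_eval = int(round(len(batches) * eval_frac))
    train = batches[:n_train]
    evaluation = batches[n_train:n_train + n_eval]
    test = batches[n_train + n_eval:]
    return train, evaluation, test


lines = list(range(41))  # stand-ins for 41 encoded phrases
pairs = make_pairs(lines)
train, evaluation, test = split_batches(pairs, batch_size=2)
print(len(train), len(evaluation), len(test))
```

Splitting at the batch level, rather than the pair level, matches the description of the fractions as "quantities of minibatches to place in each file".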
Unfortunately, using the above method led to many memory issues. I repeatedly hit LuaJIT's 1 GB memory limit. To sidestep this issue I divided the data into multiple files, and then split my minibatches into multiple files as well. I am not sure what negative effects this might have.
Although I may have been doing something stupid, a quick calculation can show that the memory issues were expected:
If each integer index is stored as a 32-bit (4-byte) number, and there are 582 of them per line, with about 2.6 million lines in the corpus file (these are rough approximations of the actual numbers), then the overall expected processed data size is:
4 bytes * 582 * 2.6 million = 6052800000 bytes = 6.0528 GB
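The same estimate can be reproduced with a few lines of Python. The function name `estimate_bytes` is hypothetical, and the inputs are the rough figures quoted above; a 32-bit index occupies 4 bytes, and the 8-byte case is included only to show how the total scales if each index were stored in a 64-bit type instead.

```python
def estimate_bytes(bytes_per_index, indices_per_line, num_lines):
    """Size of a dense, fully padded table of indices, in bytes."""
    return bytes_per_index * indices_per_line * num_lines


INDICES_PER_LINE = 582   # padded phrase length (approximate)
NUM_LINES = 2_600_000    # lines in the corpus file (approximate)

size_32bit = estimate_bytes(4, INDICES_PER_LINE, NUM_LINES)
size_64bit = estimate_bytes(8, INDICES_PER_LINE, NUM_LINES)

print(f"32-bit indices: {size_32bit} bytes = {size_32bit / 1e9:.2f} GB")
print(f"64-bit indices: {size_64bit} bytes = {size_64bit / 1e9:.2f} GB")
```

Either way the total is several times larger than a 1 GB limit, and most of it is padding, since every phrase is stretched to 582 entries.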
For the next part of this blog I will look for a way to decrease memory usage, as this seems very high. I will also begin the actual implementation of the LSTM neural network with the encoder-decoder structure.