This is the second entry in my series about implementing Google's seq2seq neural network architecture. My goal is to implement it for the purpose of building a chatbot using Torch 7. Just a reminder: this is not a tutorial so much as a description of events by someone entirely new to deep learning, so everything done here should be taken with a grain of salt. When I run into an issue I will try to be as transparent about it as possible, describing both the problem and the solutions I attempted.
I was having trouble loading the minibatches into memory, having to split them across multiple files due to exceeding LuaJIT's 1gb memory limit. While I could solve this problem by creating thousands of small files, it seemed like a highly inefficient thing to do. Here is some numerical justification for why this becomes an issue with fairly large sequences of data.
Suppose I have n integer sequences of varying lengths. In order to put them in a torch tensor file I need to pad them all to the same length. Unfortunately, some of the sequences are drastically longer than others. Suppose all sequences are padded to length max_length. The final variable in approximating memory usage is the choice of numerical type; in this case all the sequence elements are integers. Storing them as standard 32-bit ints uses 4 bytes per sequence element. So the estimated size of the data set is:
n * max_length * 4 bytes
Using a decently sized data set like the OpenSubtitles movie corpus, this comes out to approximately:
2.7e6 * 582 * 4 bytes ≈ 6.29gb
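As a sanity check, this back-of-the-envelope estimate can be reproduced in a few lines (Python here, purely for the arithmetic; the counts are the ones from the post):

```python
# Rough memory estimate for a padded integer dataset.
# Numbers from the post: ~2.7 million sequences padded to the
# longest sequence length, stored as 32-bit integers.
n = 2_700_000        # number of sequences
max_length = 582     # padding length (longest sequence)
bytes_per_elem = 4   # 32-bit integer

total_bytes = n * max_length * bytes_per_elem
print(f"{total_bytes / 1e9:.2f} GB")  # ≈ 6.29 GB
```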
Of course, there are numerous problems with this; most importantly, almost all of the data set is just padded integers!
Solving Memory Issues
In the last part I mentioned the memory problems I was having with LuaJIT: after processing my data into arrays of integers, it was taking up more than the 1gb of allowed memory. I made some small changes that drastically improved my memory usage. The first was switching to a short tensor. A short uses only 2 bytes, half the memory of a 32-bit int. The new number is:
2.7e6 * 582 * 2 bytes = 3.142gb
This is still a large number, however, and it mostly consists of padded values! Using a short also introduced another problem: my vocabulary, at about 55,000 tokens, was just over the signed short limit of 32,767, which resulted in overflow! As far as I can tell, Torch has no unsigned short tensor type, only signed. Of course, I could restrict my vocabulary--and maybe I should, but I have no idea whether my vocab size is 'too big' or not--however, I don't want to impose such a serious limitation at such an early stage in development. As the next largest integer tensor type is the regular 32-bit variant, I resigned myself to working with that.
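The overflow is easy to reproduce outside of torch; a signed 16-bit integer simply wraps around once a value passes 32,767 (illustrated here in Python with ctypes, not actual torch code):

```python
import ctypes

vocab_id = 55_000  # a token id beyond the signed short max of 32,767

# What a signed 16-bit storage cell actually holds after wrap-around:
stored = ctypes.c_int16(vocab_id).value
print(stored)  # -10536: the id has silently overflowed
```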
For option two, I decided to limit my sequence length. If I am building a chatbot, how long are the inputs (and outputs) really going to be? I looked through some of my Facebook Messenger conversations, and the longest reply I could find was about 100 tokens. Great: this will both reduce memory usage and make the sequences denser, with fewer padded values. At first I considered bucketing by sequence length to reduce padding further, but I suspect it would negatively affect training. After doing some research, I've come to understand that the (source, target) pairs in a conversation are continuous, not discrete as in the case of a seq2seq translator. I interpret this to mean that the order of phrase pairs matters in training, which makes it difficult to sort them into buckets by sequence length. I decided to try a global sequence limit of 45 tokens; all pairs containing a phrase over that limit were thrown out (there weren't many). The new size is:
2.7e6 * 45 * 4 bytes = 486mb
This is much better, but still not quite what I would like. Since some pairs were thrown out for exceeding the 45-token limit, the actual size after running the program came out somewhat smaller than this estimate.
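The filtering and padding steps described above can be sketched roughly like this (a Python illustration; `PAD`, `MAX_LEN`, and the list-of-lists layout are my own choices for the sketch, not the actual torch preprocessing code):

```python
MAX_LEN = 45  # the global sequence limit chosen above
PAD = 0       # illustrative padding id

def filter_and_pad(pairs):
    """Drop (source, target) pairs where either side exceeds MAX_LEN,
    then right-pad the survivors so every sequence has length MAX_LEN."""
    kept = [(s, t) for s, t in pairs
            if len(s) <= MAX_LEN and len(t) <= MAX_LEN]
    return [(s + [PAD] * (MAX_LEN - len(s)),
             t + [PAD] * (MAX_LEN - len(t))) for s, t in kept]

pairs = [([1, 2, 3], [4, 5]),
         (list(range(50)), [6])]  # second pair is over the limit
padded = filter_and_pad(pairs)
print(len(padded))         # 1: the long pair was dropped
print(len(padded[0][0]))   # 45: padded up to MAX_LEN
```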
I still don't consider this problem 'solved', and I plan to revisit it, but for now it is workable. I think the next step will be to go back to the short tensor and use an offset in the vocab integer indices, which would allow vocab indices up to 65,535. I can achieve this by mapping the range
0 to 65,535 -> -32,768 to 32,767
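The offset trick amounts to a simple shift on the way into and out of the tensor; a minimal sketch (the helper names are my own, not existing torch functions):

```python
OFFSET = 32_768  # shifts unsigned ids 0..65535 into the signed short range

def to_short(vocab_id):
    # Map 0..65535 -> -32768..32767 before storing in a signed 16-bit tensor.
    assert 0 <= vocab_id <= 65_535, "vocab id out of range for the offset trick"
    return vocab_id - OFFSET

def from_short(stored):
    # Invert the shift when reading ids back out.
    return stored + OFFSET

print(to_short(55_000))              # 22232: now safely inside signed short range
print(from_short(to_short(55_000)))  # 55000: round-trips exactly
```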
My main hesitation in doing this is the potential for introducing bugs in my code down the line. I would also need to add logic to handle the case where the vocabulary of the corpus file exceeds 65,535 tokens, falling back to a regular 32-bit int. That is added complexity that could prove error prone in the future, but I still plan to test this method out.
Being relatively new to processing larger data sets, I encountered several problems relating to memory. I managed to reduce the memory footprint by looking at the data and introducing a max sequence size, removing the outlier long sequences that exceeded it. This reduced the size of the data in memory by roughly an order of magnitude. When I revisit the memory problem I will attempt to halve this again by switching to short int types for storage, rather than 32-bit ones.
In my next post I will briefly continue the conversation about memory as a follow-up, and I will also begin discussing how I am actually designing the LSTM encoder-decoder model.