10 Jan 2017

Tensorflow and AI at a web development consultancy

Categories: Artificial Intelligence, Machine Learning, Tensorflow

In 2016 we didn't see a day go by without some major AI story on the popular tech blogs. It's a hot topic, no one can deny it. I won't bore you with yet another reference to AlphaGo, but cool stories about generative audio models, AI-generated paintings and neural machine translation keep coming. Most of these models are not directly applicable for developers. However, as developers, we can't stay oblivious to the underlying techniques that make this new kind of computing possible. That is why we started building something useful for one of our clients.

Tl;dr: As a web development consultancy, we built a neural suggestion system for human translators in a couple of weeks. We did so by using Tensorflow and free open data. The resulting system is fast and gives good results.

Predictive models are becoming more prevalent in web applications. End users now expect applications to become smarter and learn from their interactions. The Googles and Facebooks of the world have figured out how to do this splendidly. On the other hand—we have to admit—a small consultancy like ours really hadn't, only a year ago. So because of all this, and maybe because we were just a little hyped, we wanted to start offering shiny AI to clients as well. However, we didn't really know where to start. These predictive models still felt as much like "hocus pocus" to us as they may to you right now.

We could just have integrated artificial intelligence by letting one of our applications consume some new machine learning API—like one from Google Cloud or AWS. But those solutions are often domain specific and not very flexible. So instead, we truly embraced AI, made a plan and started building models of our own. This is the story of how we built our first AI system for one of our clients, back in May of last year.

A Neural Translation Aid

Fairlingo is a Dutch startup and a sharing economy platform for human translations. If you are a translator you can earn an income by translating documents on the site. If you want to have something translated, relatively cheaply and quickly, you can submit your document to the site and have the quality guarantees Fairlingo provides. Or as a startup hipster would say: “It’s like Uber for translation”.

When I started reading some interesting stuff on neural machine translation and was studying the documentation of the TensorFlow machine learning library, it dawned on me that we could fairly easily build a suggestion system for Fairlingo's translators. After convincing our client, and four weeks of hacking, research and implementation, we ended up with a prototype that looked like this:

Accepting only the first suggestion

This picture shows Fairlingo’s interface, where we translate from Dutch into English. The system suggests the words in auto-complete fashion. The suggestions pop up fast, and the resulting translation is accurate. The model underneath it is powered by a large recurrent neural network (RNN).

A simple model architecture

So, there are several steps involved in building such a system. First we need to define an architecture for our model. When you are prototyping a model, the same rules apply as when you are prototyping a web application: start simple, and do not add bells and whistles immediately. Otherwise you'll quickly get in over your head and lose track of things.

In order to construct an architecture in TensorFlow, you define a computation graph. It is a bit like defining a program before running it with actual data. The computation graph is one of those new programming paradigms that is very important to deep learning models. This video introduces the concept well.
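To make that concrete, here is a minimal sketch of the define-then-run style in TensorFlow 1.x: we first describe the computation, and only afterwards push actual numbers through it inside a session.

```python
import tensorflow as tf

# Define the graph first: nothing is computed yet, we only describe operations.
a = tf.placeholder(tf.float32, shape=[], name="a")
b = tf.placeholder(tf.float32, shape=[], name="b")
total = a + b  # a node in the graph, not a number

# Only when we run the graph inside a session do actual values flow through it.
with tf.Session() as sess:
    print(sess.run(total, feed_dict={a: 2.0, b: 3.0}))  # prints 5.0
```

The same pattern holds for a full neural network: the graph describes the model once, and the session runs it on batch after batch of data.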

Because we would like to give suggestions for the next word given the previous words, we start off with a simple RNN architecture. This is a neural network that passes along a memory vector over each time step. At each time step you feed in a specific token, which is word-based in our case. The model's task at each time step is to predict the next token. The memory is represented by the output of an RNN unit such as an LSTM or a GRU.

Language model rnn architecture

Both Tensorflow and Theano have the scan() function which can be used to pass the appropriate output state along for each time step. This article has some good examples on how to use it. Furthermore, Tensorflow also has its own RNNCell which is an alternative way to implement an RNN.
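As a rough sketch of what that looks like with an RNNCell (assuming a TensorFlow 1.x version; the module path of the cells has moved between releases, e.g. tf.contrib.rnn in early 1.x), the time loop can be left to dynamic_rnn, which threads the memory state from step to step much like scan() would:

```python
import tensorflow as tf

batch_size, max_steps, embedding_dim, hidden_size = 32, 20, 1024, 512

# Embedded input tokens: one vector per time step (more on embeddings below).
inputs = tf.placeholder(tf.float32, [batch_size, max_steps, embedding_dim])

# A GRU cell holds the recurrent weights.
cell = tf.nn.rnn_cell.GRUCell(hidden_size)

# dynamic_rnn handles the loop over time steps and passes the memory state
# from one step to the next, similar to what tf.scan() would do by hand.
outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
# outputs: [batch_size, max_steps, hidden_size], one memory vector per step
```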

The actual input of the network is a large one-hot vector, i.e. a vector where all entries have a zero value except for one. This large vector has the size of the chosen vocabulary, e.g. 50,000. This means that the 50,000 most common words will be used in the vocabulary. All other words will be assigned to a special <unk> token. Additionally, we add a special <eos> token to indicate the end of a sentence. So in the end, each word in the vocabulary is represented by an index in this large one-hot vector.
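A vocabulary like that is just a mapping from words to indices. A small, illustrative sketch (not our exact preprocessing code) could look like this:

```python
from collections import Counter

def build_vocab(tokenized_sentences, size=50000):
    """Keep the `size` most common words; everything else becomes <unk>."""
    counts = Counter(w for sent in tokenized_sentences for w in sent)
    words = ["<unk>", "<eos>"] + [w for w, _ in counts.most_common(size - 2)]
    return {w: i for i, w in enumerate(words)}

def to_ids(sentence, vocab):
    """Map a tokenized sentence to vocabulary indices, ending with <eos>."""
    return [vocab.get(w, vocab["<unk>"]) for w in sentence] + [vocab["<eos>"]]

vocab = build_vocab([["the", "cat", "sat"], ["the", "dog", "sat"]], size=10)
print(to_ids(["the", "cat", "barked"], vocab))  # unknown word maps to <unk>
```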

Before entering the RNN, the one-hot vector will be transformed into an embedding. These word embeddings represent the token as an n-dimensional vector, where n is much smaller than 50,000 (1024 in our case). These embedding vectors allow the network to capture the relationships between words. It is pretty neat and maybe sounds a bit complicated—luckily Tensorflow has a very simple method for adding an embedding layer to the network's architecture.
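That method is tf.nn.embedding_lookup. A minimal sketch, using the sizes from our setup:

```python
import tensorflow as tf

vocab_size, embedding_dim = 50000, 1024

# Token indices for a batch of sentences (padded to the same length).
token_ids = tf.placeholder(tf.int32, [None, None])

# The embedding matrix is just another trainable weight matrix.
embedding_matrix = tf.get_variable("embedding", [vocab_size, embedding_dim])

# embedding_lookup selects the rows for the given indices, which is equivalent
# to multiplying a one-hot vector with the matrix, but far cheaper.
embedded = tf.nn.embedding_lookup(embedding_matrix, token_ids)
# embedded has shape [batch, time, 1024] and is what we feed into the RNN.
```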

So, these embedding vectors are the actual input of the RNN. Its output can then be used to calculate probabilities for the next token. This is often done using the softmax function, which results in probabilities for each of our 50,000 tokens. The only thing left for us to do is to sort the tokens and suggest the ones with the highest probability to the user.
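In code, that projection plus softmax and "sorting" step could look roughly like this; it is a sketch of the idea rather than our exact implementation:

```python
import tensorflow as tf

hidden_size, vocab_size = 512, 50000

# Output of the RNN at the time step we want a suggestion for.
rnn_output = tf.placeholder(tf.float32, [None, hidden_size])

# Project the memory vector onto the vocabulary (the output or "softmax" layer).
proj_w = tf.get_variable("proj_w", [hidden_size, vocab_size])
proj_b = tf.get_variable("proj_b", [vocab_size])
logits = tf.matmul(rnn_output, proj_w) + proj_b

# Softmax turns the scores into probabilities over all 50,000 tokens,
# and top_k gives us the most probable candidates to show the translator.
probs = tf.nn.softmax(logits)
top_probs, top_ids = tf.nn.top_k(probs, k=5)
```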

Now that we have defined an architecture, the model needs to learn its parameters. Using an optimizer we can train the model's parameters to minimize the error between the predicted token and the actual next token. This optimization procedure is called stochastic gradient descent, and it is one of the reasons neural networks work as well as they do.
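A hedged sketch of such a training step, with the output projection included so the optimizer has parameters to update. In practice the RNN and embedding weights are trained as well, and you loop over many batches:

```python
import tensorflow as tf

hidden_size, vocab_size = 512, 50000

# RNN outputs flattened over batch and time, plus the actual next tokens.
rnn_outputs = tf.placeholder(tf.float32, [None, hidden_size])
targets = tf.placeholder(tf.int32, [None])

# Trainable output projection onto the vocabulary.
proj_w = tf.get_variable("proj_w", [hidden_size, vocab_size])
proj_b = tf.get_variable("proj_b", [vocab_size])
logits = tf.matmul(rnn_outputs, proj_w) + proj_b

# Cross-entropy between the predicted distribution and the actual next token.
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits))

# One gradient descent step nudges the parameters to make the actual next
# token more probable.
train_op = tf.train.GradientDescentOptimizer(learning_rate=0.5).minimize(loss)
```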

So now we have a model that can predict the next word given the previous words. Such a model is often called a language model, and it is what powers the next-word predictions on our smartphone keyboards. However, such a model is sadly not able to translate anything yet, since it can't take the source language into account.

A more elaborate model architecture

So now that we have built a fully functioning language model and understand the basics of recurrent neural networks for natural language processing (NLP), we can expand our architecture into a so-called sequence to sequence network.

Sequence to sequence rnn architecture

These networks can encode a sequence of tokens into a memory vector (or hidden state) in much the same way as a language model does. So in essence, at each time step you have a representation of not just the input word, through the word embedding, but also of all previous words, through the memory vector. A sequence to sequence model is based around the idea that we can encode the entire sentence and then use the latest memory vector to decode that representation into another sequence. This decoding happens token by token, by feeding the previously predicted token back in as input at the next time step.

An important thing to mention is that the neural network has separate weights for the encoding and decoding steps. Furthermore, for a simple model, it is a good idea to reverse the source sentence, as shown in the figure above. More elaborate bi-directional approaches are even better, but let's leave those out for now.
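Putting the pieces together, a stripped-down encoder-decoder graph could look like the sketch below. It illustrates the idea rather than our production model; note the separate variable scopes, which give the encoder and decoder their own weights (reversing the source sentence is assumed to happen during preprocessing):

```python
import tensorflow as tf

vocab_size, embedding_dim, hidden_size = 50000, 1024, 512

source_ids = tf.placeholder(tf.int32, [None, None])  # e.g. reversed Dutch tokens
target_ids = tf.placeholder(tf.int32, [None, None])  # English tokens so far

# Separate embeddings and RNN weights for the encoder...
with tf.variable_scope("encoder"):
    src_embedding = tf.get_variable("embedding", [vocab_size, embedding_dim])
    src_embedded = tf.nn.embedding_lookup(src_embedding, source_ids)
    encoder_cell = tf.nn.rnn_cell.GRUCell(hidden_size)
    _, encoder_state = tf.nn.dynamic_rnn(encoder_cell, src_embedded,
                                         dtype=tf.float32)

# ...and for the decoder.
with tf.variable_scope("decoder"):
    tgt_embedding = tf.get_variable("embedding", [vocab_size, embedding_dim])
    tgt_embedded = tf.nn.embedding_lookup(tgt_embedding, target_ids)
    decoder_cell = tf.nn.rnn_cell.GRUCell(hidden_size)
    # The encoder's final memory vector becomes the decoder's starting state.
    decoder_outputs, _ = tf.nn.dynamic_rnn(decoder_cell, tgt_embedded,
                                           initial_state=encoder_state)
```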

So I hope you now see where I’m going with this. The actual suggestions produced by our system are the most probable words at a specific decoding step from our sequence to sequence model. Such a model can also be used to predict entire translations by simply predicting until the <eos> token is seen. Actually, Google Translate is starting to shift to this approach.

One of the big advantages of this model, especially for suggestion systems, is that we can enter our own input into the decoder and thus ignore the model's previous predictions. This is exactly what we need to compensate for the different prefix an end user might have entered. To clarify, looking at the architecture figure, we could enter something different than X or Y into the decoder and the model will adjust its predictions accordingly.

Adjusting the decoder input and then getting updated suggestions
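The sketch below shows that mechanism end to end: whatever prefix we feed into the decoder determines the suggestions that come out. The weights are freshly initialized here, so the actual ids are meaningless; the point is the mechanics, not the output.

```python
import tensorflow as tf

vocab_size, embedding_dim, hidden_size = 50000, 1024, 512

source_ids = tf.placeholder(tf.int32, [1, None])  # reversed source sentence
prefix_ids = tf.placeholder(tf.int32, [1, None])  # whatever the user typed so far

def embed_and_run(name, ids, initial_state=None):
    """Embed a token sequence and run it through a GRU with its own weights."""
    with tf.variable_scope(name):
        embedding = tf.get_variable("embedding", [vocab_size, embedding_dim])
        cell = tf.nn.rnn_cell.GRUCell(hidden_size)
        return tf.nn.dynamic_rnn(cell, tf.nn.embedding_lookup(embedding, ids),
                                 initial_state=initial_state, dtype=tf.float32)

_, encoder_state = embed_and_run("encoder", source_ids)
decoder_outputs, _ = embed_and_run("decoder", prefix_ids, encoder_state)

# Project only the memory vector at the *last* prefix position onto the
# vocabulary and keep the five most probable next tokens as suggestions.
proj_w = tf.get_variable("proj_w", [hidden_size, vocab_size])
proj_b = tf.get_variable("proj_b", [vocab_size])
logits = tf.matmul(decoder_outputs[:, -1, :], proj_w) + proj_b
_, suggestion_ids = tf.nn.top_k(tf.nn.softmax(logits), k=5)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # untrained weights, demo only
    # Feeding a different prefix (e.g. the word the translator typed instead of
    # our suggestion) immediately changes the suggested token ids.
    print(sess.run(suggestion_ids, feed_dict={source_ids: [[4, 8, 15, 2]],
                                              prefix_ids: [[16, 23]]}))
```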

Data and training

A neural network needs training data like Cookie Monster needs cookies. But as a small consultancy, with a client that does not have gigabytes of data, where do we get more? Luckily, open parallel corpora are freely available all over the internet and we used those to extend our training data. One obvious candidate was the OpenSubtitles corpus, where you can get almost any language pair. Omnomnom🍪, thank you pirates!
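Such a parallel corpus is usually distributed as two aligned plain-text files, one sentence per line, where line i in the source file corresponds to line i in the target file. Reading it is as simple as the sketch below; the file names are placeholders for whatever corpus you download:

```python
def read_parallel_corpus(source_path, target_path):
    """Yield (source_tokens, target_tokens) pairs from two aligned files."""
    with open(source_path, encoding="utf-8") as src, \
         open(target_path, encoding="utf-8") as tgt:
        for source_line, target_line in zip(src, tgt):
            yield source_line.strip().split(), target_line.strip().split()

# Placeholder file names, following the naming used by many OPUS downloads.
pairs = read_parallel_corpus("OpenSubtitles.nl-en.nl", "OpenSubtitles.nl-en.en")
```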

An obvious downside is the specific domain of each dataset. A system trained on only subtitles will have great suggestions for when you are translating subtitles, but not so much for anything else. We have to start somewhere though, and as more data from Fairlingo itself comes in, we could use that to enhance the system even further.

We trained our models on an NVIDIA GeForce GTX 1080 Ti, which is a high-end consumer gaming card. When selecting a GPU, you should pay specific attention to the amount of VRAM the card has. This is important since the card should be able to hold the entire model in its memory. The more VRAM, the larger the model you can train, and larger models generally perform better.

So, we let it spin for a couple of days (something like 60 hours) and the results for our Dutch-English system are pretty good, as you can see.

Serving the suggestions

Because Tensorflow has Python bindings, we can implement the entire model in Python. That way we can easily and efficiently couple the entire thing to a Tornado web server and serve our suggestions directly to clients over a WebSocket. This allows for a fast connection from the client straight into the GPU without much overhead. The Fairlingo application itself is written in PHP with a Laravel backend, but by exposing the Tornado WebSocket on a different endpoint we can keep the suggestion system separate from the main application.
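A bare-bones version of such an endpoint could look like the sketch below. The suggest() function and the /suggest path and port are placeholders; in the real system that function runs the sequence to sequence model on the GPU.

```python
import json

import tornado.ioloop
import tornado.web
import tornado.websocket

def suggest(source_sentence, prefix):
    """Placeholder for the actual model call returning the top suggestions."""
    return ["..."]

class SuggestionHandler(tornado.websocket.WebSocketHandler):
    def on_message(self, message):
        # The client sends the source sentence plus whatever the translator
        # has typed so far; we answer with a list of suggested next words.
        request = json.loads(message)
        suggestions = suggest(request["source"], request["prefix"])
        self.write_message(json.dumps({"suggestions": suggestions}))

if __name__ == "__main__":
    app = tornado.web.Application([(r"/suggest", SuggestionHandler)])
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()
```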

In a production environment, it is important to combine multiple incoming requests into a single batch in order to utilize the GPU at its full capacity. This will always be a trade-off between latency and utilization.
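One way to do that batching is to let request handlers hand their work to a small background loop that waits a few milliseconds, collects whatever came in, and runs it through the model in one go. The sketch below uses plain threads and a placeholder run_model_on_batch function; in a real Tornado setup you would integrate this with the IOLoop instead.

```python
import threading
import time

class SuggestionBatcher:
    """Collects incoming requests for a short window and runs them as one
    batch on the GPU. `run_model_on_batch` is a placeholder for the model."""

    def __init__(self, run_model_on_batch, max_batch_size=32, max_wait=0.01):
        self.run_model_on_batch = run_model_on_batch
        self.max_batch_size = max_batch_size
        self.max_wait = max_wait            # seconds to wait for more requests
        self.pending = []                   # entries waiting to be batched
        self.lock = threading.Lock()

    def submit(self, request):
        """Called per incoming request; blocks until the batched result is in."""
        entry = {"request": request, "event": threading.Event(), "result": None}
        with self.lock:
            self.pending.append(entry)
        entry["event"].wait()
        return entry["result"]

    def run_forever(self):
        """Background loop: trade a little latency for much better GPU use."""
        while True:
            time.sleep(self.max_wait)
            with self.lock:
                batch = self.pending[:self.max_batch_size]
                self.pending = self.pending[self.max_batch_size:]
            if not batch:
                continue
            results = self.run_model_on_batch([e["request"] for e in batch])
            for entry, result in zip(batch, results):
                entry["result"] = result
                entry["event"].set()
```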

A big disadvantage is the number of language pairs Fairlingo offers. With our current method we would need to create and train a separate model for each language pair in order to make the system available to every translator. Google has tackled this problem by creating an interlingua representation, with which they only need one encoder and one decoder model per language instead of one model per language pair.

The last hurdle for running large deep learning models in production is the actual use of GPUs. If you spin up a GPU machine in the cloud you are easily paying hundreds of dollars per month. Managing the machines yourself, as we did, is cheaper but a hassle. That is why I'm excited to see whether Google's new TPUs will become available through Google Cloud directly in the coming years. These chips offer interesting new possibilities for running models in production without managing expensive GPU servers. Coupling them through the gRPC-compatible Tensorflow library should be easy. I imagine they would make it possible to scale a production system similar to how we scale web servers on AWS or Google Cloud. We will have to wait and see how this unfolds.

Testing it with users

User study at Vertaalbureau Perfect

We have successfully created a suggestion system. But does it actually help translators do their job any better? We asked 6 translators to test our new suggestion system. During the user study, we recorded and measured each keystroke the translators made, to deduce which words were entered by the translators themselves and which were entered through the suggestion system. We found that the translators entered 23% of all characters through the suggestion system, that they did not slow down considerably, and that the system did not compromise the translation quality.

This may sound modest, but if you consider other suggestion systems that have been tested by researchers in the past, it is actually pretty good. The impact on translation speed was minimal and all but one translator liked the system and would use it if it were available to them.

To verify that the translation quality was not compromised, we had people rank different translations, some of them produced with the system enabled and some without. CrowdFlower is a great website for setting up these kinds of experiments.

Wrap up

So after four weeks we managed to implement and test a suggestion system that incorporates a neural network and runs on a GPU. In our experience, four weeks is not out of the ordinary for a big feature. So what is stopping you or your company from starting to play around with these things? I can only encourage everyone to start informing their clients of the exciting new possibilities.

Maybe one final catch: we all know the programmer's fallacy when it comes to time estimation. This is sadly even more the case when building models. You do not know beforehand if it will work, and if it doesn't, whether that is because of a slight error or just the wrong architecture. No unit test is going to give you a definite answer. It really should be approached as research at first. But when it works, it is awesome.

So what's next for AI at Label305? We will continue offering AI services and want to make it one of our key services in the coming years—next to web, native mobile and UX. We believe that there should be an expert on our management team for each of our key services. That means we will have to find an expert on AI. Well… suffice it to say, that is not very easy. These people are in short supply, it seems. So, I'm very fortunate that Label305 is doing great and my three co-founders allowed me two years to get a master's degree in Artificial Intelligence at the UvA. A co-founder not dropping out, but starting his studies, is a rare thing, I know 😉.

But as you have seen, even though I’m now planning to get one, you definitely do not need a degree in Artificial Intelligence to start building these kinds of systems yourself.

As true members of the artificial intelligence research community, we wrote a paper on our project and user study. You can find the paper over here: Interactive Neural Translation Assistance for Human Translators (PDF). Work on the user study was done in collaboration with the Institute for Logic, Language and Computation at the University of Amsterdam. We had great help from Philip Schulz, who is a PhD candidate there.

Written by: Thijs Scheepers