📚 Language Models

Statistical language models are fundamental components of Natural Language Processing (NLP). Have you ever heard names like ELMo, BERT or GPT-2 in the news? Quoted as breakthroughs in natural language processing, they are all language models.

You may assume that we need complicated and intimidating systems to make machines generate coherent paragraphs of text. Instead, language models are simple: given a sequence of words, they predict what word is likely to follow.

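For instance, if you read the sentence “I am so energetic, last night I slept so …” you know that the word “well” is more likely to come next than the word “chicken”. Language models represent this intuition using probabilities:

$P(\text{well} \mid \text{I am so energetic, last night I slept so}) > P(\text{chicken} \mid \text{I am so energetic, last night I slept so})$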

As with most math, the sophistication comes from the number of things we can build upon this elegant foundation. For instance, language models may compute the above probability with things other than words, such as characters. We know that given the utterance “I a …”, “I am” is more likely than “I az”, and we can express this as $P(m \mid I\ a) > P(z \mid I\ a)$. Some language models, such as ELMo and BERT, are bi-directional, which means that they predict a word not just from what came before it but also from what came after it.

What is harder to conceive of is the complexity of different relationships between words that may exist in a large body of texts and the multi-dimensionality of the probability distributions that language models need to represent. Try, for instance, to imagine a probability distribution that models the next word for each sequence of four words in the English Wikipedia, which contains (as of 2019) an estimated 3.9 billion words.

Simply estimating from a source text how often each possible word follows four input words gives what is known as a 5-gram model, because each count covers the four context words plus the word that follows. N-gram models (where n, the length of the counted word or character sequence, becomes a parameter) are extensively used, and often very effective, as language models.
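A count-based model of this kind fits in a few lines. The sketch below is a minimal illustration, not a production implementation: the function names, the `context_size` parameter and the tiny corpus are all made up for the example, and it applies no smoothing for unseen contexts.

```python
from collections import Counter, defaultdict

def train_counts(tokens, context_size):
    """Count how often each token follows each context of `context_size` tokens."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i:i + context_size])
        counts[context][tokens[i + context_size]] += 1
    return counts

def next_token_probs(counts, context):
    """Turn the raw counts for one context into relative frequencies."""
    followers = counts.get(tuple(context), Counter())
    total = sum(followers.values())
    return {token: count / total for token, count in followers.items()}

# Toy corpus, invented for illustration.
corpus = "i slept so well . i slept so well . i slept so badly .".split()
counts = train_counts(corpus, context_size=2)
probs = next_token_probs(counts, ["slept", "so"])
print(probs)  # "well" gets 2/3 of the probability mass, "badly" 1/3
```

Scaling `context_size` up to four words over a Wikipedia-sized corpus uses exactly the same logic; the difficulty is that most four-word contexts are seen rarely or never, which is one reason neural models are attractive.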

🧠 Neural Language Models

Within this context, Neural Language Models consist of using Neural Networks to estimate the probability function described above. Given the flexibility of Neural Networks, this can be done in a variety of ways, from predicting the next character given a sequence of preceding and following characters (ELMo) to predicting randomly masked words in a sentence (BERT).

A few days ago, I was reading a 2015 blog post by Andrej Karpathy titled “The Unreasonable Effectiveness of Recurrent Neural Networks”. Andrej developed and shared one of the first online courses on Convolutional Neural Networks while he was a PhD student at Stanford, and is now the Director of AI at Tesla.

In his article, Andrej describes how to develop a character-level language model using a Recurrent Neural Network (RNN) and shows how this model can generate text surprisingly well. While the ability to generate coherent text pre-dates neural networks, the scientific community was impressed by the context-awareness of the model, shown by things such as its ability to indent properly and balance brackets while writing code. This sensitivity to context became even more apparent with larger and more complex Neural Language Models, such as OpenAI’s GPT-2. GPT-2 impressed observers by referring back to its own previous paragraphs when generating a text.

While the ability to generate text does not constitute general intelligence on its own, such impressive examples made me wonder whether I still needed to write my blog myself or could automate my writing using a language model. With this half-serious excuse to learn more about Neural Language Models in mind, I set out to train a model on my notes and use it to generate text. Here is how it went.

🤖 Automating Myself

🛠 Warning. From here onwards the article gets into the nitty-gritty of developing a neural language model. If you are not interested in the details, even if they are spiced up with numerous anecdotes, you can skip to the final output of my experiment.

Phase 0: Tutorial

After skimming through the code that Andrej Karpathy originally released alongside his blog post, I decided to start from scratch and use Keras. I started from the tutorial How to Develop a Character-Based Neural Language Model in Keras from Jason Brownlee of Machine Learning Mastery.

The tutorial shows how to learn a language model of the nursery rhyme “Sing a Song of Sixpence”. Although I was able to quickly implement a model that reached 99% accuracy, it still got stuck in loops.

Prediction Accuracy (on the rhyme data) = 99%

Input = sing a song

Output =

Sing a song of sixpence,
a pocket full of rye.
four and twenty blackbirds,
baked in a pie.

When the pie was opened
the birds began to sing;
wasn't that a dainty dish,
to set before the king.

The king was in his
counting out his money;
the queen was in the parlour,

The maid was in the parlour,
The maid was in the parlour,
The maid was in the parlour,
...


First, the model cuts the sentence “the king was in his counting-house, counting out his money;” down to “the king was in his counting out his money;”, since it predicted that after “counting” the most likely continuation is “out”. Also, instead of learning that the maid was in the garden, the model got trapped in predicting that “was in the” must be followed by “parlour” (which is where the queen was). Sadly, since “parlour” is followed by “eating bread and honey”, which is in turn followed by “the maid was in the”, learning this pattern means getting stuck in an endless loop 😅

Phase 1: Iterating Over Bugs

Since I have not written thousands of articles yet, the best data source I could find was the 1400 personal notes I wrote over the past four years. Using a sequence length of ten characters, which means predicting the next character given the previous nine, this yields around two million character sequences.
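To make the sequence construction concrete, here is a minimal sketch of slicing a text into overlapping ten-character windows and splitting each into nine input characters and one target character. The helper names are hypothetical and the nursery rhyme line stands in for the notes.

```python
def make_sequences(text, length=10):
    """Slide a window of `length` characters over the text, one step at a time."""
    return [text[i:i + length] for i in range(len(text) - length + 1)]

def split_input_target(sequence):
    """The first length-1 characters are the input, the last one is the target."""
    return sequence[:-1], sequence[-1]

text = "sing a song of sixpence"
sequences = make_sequences(text, length=10)
pairs = [split_input_target(s) for s in sequences]

print(len(sequences))  # one sequence per starting position
print(pairs[0])        # ('sing a so', 'n')
```

Because the window moves one character at a time, the number of sequences is roughly the number of characters in the corpus, so about two million sequences correspond to about two million characters of notes.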

After quickly adapting the tutorial code to my data and decreasing some parameters to get results faster, I let the model train for about 10 epochs and then collected my output, which was a depressingly long series of repeated “the”.

I changed a few things, but nothing improved the performance (still stuck on “the the the ...”), so I went back to the tutorial data (the nursery rhyme) and noticed that the model was performing badly on it as well.

📙 Keep test data that you can guarantee the model will perform perfectly on and go back to it whenever the model misbehaves on any other data.

By going back to the sample data, I figured out that the bug was caused by having reduced the capacity of the model (the number of LSTM cells) too drastically, from 75 to 5. While this made training on my data faster, it created a bottleneck which likely constrained the model to memorize nothing more than the most basic patterns (such as “the”). After fixing the bug, the model still performed poorly on my data, but small improvements could be noticed. Given the input “today I am going to write about”, the model predicted the output “the send the send the send”.

At this point, given the increased work I was putting into my experiments, I developed a small pipeline that would automatically save the structure, parameters and performance of each experiment, and created a sample dataset of similar size to the nursery rhyme but made of my notes. These few steps took no more than a few hours but went a long way in helping me progress.
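Such a pipeline does not need to be elaborate. The sketch below shows one possible shape for it, not the code used in this experiment: the `save_experiment` function, the file layout and every value passed to it are hypothetical.

```python
import json
import tempfile
import time
from pathlib import Path

def save_experiment(root, name, structure, params, metrics):
    """Write one JSON file per training run so runs can be compared later."""
    run_dir = Path(root) / name
    run_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "name": name,
        "saved_at": time.strftime("%Y-%m-%d %H:%M:%S"),
        "structure": structure,  # e.g. a list of layer descriptions
        "params": params,        # hyperparameters used for the run
        "metrics": metrics,      # e.g. final accuracy and loss
    }
    path = run_dir / "experiment.json"
    path.write_text(json.dumps(record, indent=2))
    return path

# Hypothetical values, for illustration only.
root = tempfile.mkdtemp()
path = save_experiment(
    root,
    name="example_run",
    structure=["LSTM(75)", "Dense(softmax)"],
    params={"sequence_length": 10, "epochs": 10},
    metrics={"accuracy": 0.5},
)
loaded = json.loads(path.read_text())
print(loaded["name"], loaded["params"])
```

Keeping one directory per run also gives a natural place to store the trained weights and generated samples next to the settings that produced them.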

📙 Developing a good structure for your model development and training is a quickly rewarding time investment.

By now I had achieved language-like but meaningless results on the sample of my notes. The experiment structure made it much easier for me to record the following output from a training run labelled test_network_look_alike_070819_1900.

Prediction Accuracy (on the sample notes data) = 85%

Input = On automating myself

Output =

On automating myself lost. how at the frest and stayyor als it htt slose the city was bakeocord that in the for door.  htmen-bed experess i sto last difl m new angayels ware  5 as ape this a strated packeas use mand the fron meed fastion)  doig-toper -s troid to manage that you can stlll less . e man > invis sectras/ 1/3 loot, the /xtorks where and heal, the cresericas 1. must be going here?  we nee desigrions the trisl, but reffull fustcrout money that the someth, collore and kindss.


Phase 2: Hardware Muscles

Training machine learning models of this size on CPUs (i.e. most laptops) is impractically slow. Thus, it was now time for me to move my model training to more powerful hardware in the cloud. Conscious that the hardware costs of machine learning can get out of control, I decided to find a setup with a good speed/price ratio. Following the helpful guidance of the documentation from fast.ai, I set up a cloud instance on Google Cloud Platform (GCP) using the n1-highmem-8 instance type with an NVIDIA Tesla P4 as an accelerator.

My final machine cost £0.752 per hour. I started my training on a Friday evening and recorded that one training epoch took 255 seconds. Since I had set up the training for 500 epochs, I prepared myself to collect the results on Sunday morning (after ~35 hours). After 500 epochs, 35 hours and £27 spent, I asked the newly trained network to generate a sample.

Prediction Accuracy (on my full notes data) = 63%

Input = On automating myself

Output =

This is about the problem is to be a lot of statement to a state of the strength and weakness of the problem is to be a lot of statement to a state of the strength and weakness of the problem is to be a lot of statement to a state of the strength and weakness of the problem is ...


Phase 3: Better Generation

💸 Looking at the loss function value over training time, I noticed that after about the 100th epoch the model was not learning anything useful anymore and that I had just wasted about 400 epochs and £20.

Given the above, I decided to use Keras's checkpointing and early-stopping callbacks (ModelCheckpoint and EarlyStopping) together with a validation split. The model is then trained only on a specified percentage of the data, and the remaining data points are held out. At the end of every epoch the model is validated on the holdout data. If the performance on the holdout data stops improving for a specific number of epochs (a parameter referred to as patience), training stops, and the model that performed best overall on the holdout data is the one that gets saved.
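The patience rule itself is easy to express without any framework. The sketch below is not Keras's implementation, just a plain-Python illustration of the logic; the function name is hypothetical and the validation losses are made up.

```python
def early_stopping_epochs(validation_losses, patience):
    """Walk through per-epoch validation losses and stop when they stall.

    Returns (best_epoch, stopped_epoch), both counted from 1.
    """
    best_loss = float("inf")
    best_epoch = 0
    stopped_epoch = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        stopped_epoch = epoch
        if loss < best_loss:
            # A real training loop would save ("checkpoint") the model here.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: give up.
            break
    return best_epoch, stopped_epoch

# Made-up loss curve: improves for three epochs, then plateaus.
losses = [2.0, 1.5, 1.2, 1.3, 1.25, 1.4, 1.35, 1.5]
best, stopped = early_stopping_epochs(losses, patience=3)
print(best, stopped)  # 3 6
```

With a rule like this, a run whose validation loss flattens around epoch 100 ends shortly afterwards instead of burning through the remaining epochs.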

However, even if it now felt safer to try more complex models, I still was not able to get improvements. So I started looking for solutions elsewhere. Specifically, I decided to improve how the model was generating text. Text generation consists of taking the trained model and using its predictions to produce text one character at a time. There are three common ways to do it:

• Greedily picking the character with the highest predicted probability at each step
• Sampling each character from the predicted conditional probability distribution
• Keeping track of several likely variants at each step (known as Beam Search)
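The three strategies can be compared side by side on a toy model. In the sketch below a character bigram table stands in for the trained network, so every name and the training sentence are assumptions made for illustration; only the decoding logic is the point.

```python
import math
import random
from collections import Counter, defaultdict

def fit_bigram(text):
    """Toy stand-in for a trained model: P(next char | previous char)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return {p: {c: k / sum(f.values()) for c, k in f.items()}
            for p, f in counts.items()}

def greedy(model, seed, steps):
    """Always append the single most probable next character."""
    out = seed
    for _ in range(steps):
        probs = model[out[-1]]
        out += max(probs, key=probs.get)
    return out

def sample(model, seed, steps, rng):
    """Draw each next character at random, weighted by its probability."""
    out = seed
    for _ in range(steps):
        probs = model[out[-1]]
        out += rng.choices(list(probs), weights=list(probs.values()))[0]
    return out

def beam_search(model, seed, steps, width):
    """Keep the `width` most probable partial texts at every step."""
    beams = [(0.0, seed)]  # (log-probability, text)
    for _ in range(steps):
        candidates = [
            (score + math.log(p), text + ch)
            for score, text in beams
            for ch, p in model[text[-1]].items()
        ]
        beams = sorted(candidates, reverse=True)[:width]
    return beams[0][1]

notes = "the maid was in the garden, the queen was in the parlour, "
model = fit_bigram(notes)
rng = random.Random(0)  # fixed seed so the sampled text is repeatable
print("greedy:  ", greedy(model, "t", 30))
print("sampling:", sample(model, "t", 30, rng))
print("beam:    ", beam_search(model, "t", 30, width=3))
```

Greedy decoding is deterministic, so once it revisits a context it has already seen it repeats itself forever, whereas sampling trades that repetition for randomness.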

Armed with more hope, I implemented all three generation approaches and compared the results.

Greedy Output

This is about the problem is to be a lot of statement to a state of the strength and weakness of the problem is to be a lot of statement to a state of the strength and weakness of the problem is to be a lot of statement to a state of the strength and weakness of the problem


Sampling Output

This is about speeches naws, because we do pers customer more. doing my better. 1. i could intelligence â¢ be resporl new 10-2019 practical aug 27 - 47 meeting. i would need sometimes i would lack about the fields? how would decide it solve memomic cheiches mean sorm of preining my energiving when walkness to a function.


Beam Search Output

This is about this text for? what do i expect to find in the text? is there any major empirical conclusion and reading after reading what i have (or have not) understood? (make your own note of the text? is there any major empirical conclusion and reading after reading what i have (or have not) understood?


From the results, I noticed that greedy search was by far the worst approach, while beam search could create much more cohesive text but would still get stuck in loops. Sampling, on the other hand, did not get stuck in loops but produced gibberish.

📙 Model architectures are not everything: refining all other components in your model can change things significantly!

Accordingly, I decided to combine Beam Search and sampling into what I called Stochastic Beam Search (a beam search using sampling), which gave me surprising results.
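There is more than one way to mix the two ideas. The sketch below shows one plausible reading, which is an assumption rather than the exact procedure used here: at each step the surviving beams are drawn at random in proportion to their probability instead of always keeping the top ones. The toy transition table and all names are hypothetical.

```python
import math
import random

def stochastic_beam_search(next_probs, seed, steps, width, rng):
    """Beam search whose survivors are sampled rather than taken from the top.

    `next_probs(text)` must return a dict {char: probability} for the next char.
    """
    beams = [(0.0, seed)]  # (log-probability, text)
    for _ in range(steps):
        pool = [
            (score + math.log(p), text + ch)
            for score, text in beams
            for ch, p in next_probs(text).items()
            if p > 0
        ]
        if not pool:
            break
        survivors = []
        while pool and len(survivors) < width:
            # Subtract the best score before exponentiating to avoid underflow.
            top = max(score for score, _ in pool)
            weights = [math.exp(score - top) for score, _ in pool]
            pick = rng.choices(range(len(pool)), weights=weights)[0]
            survivors.append(pool.pop(pick))  # sample without replacement
        beams = survivors
    return max(beams)[1]  # most probable text among the final beams

# Toy transition table standing in for a trained model.
TOY = {
    "a": {"b": 0.6, "c": 0.4},
    "b": {"a": 0.7, "c": 0.3},
    "c": {"a": 0.5, "b": 0.5},
}

def toy_next_probs(text):
    return TOY[text[-1]]

print(stochastic_beam_search(toy_next_probs, "a", 10, width=3, rng=random.Random(0)))
```

Likely continuations still tend to survive, which keeps the coherence of beam search, while the random draw gives less likely branches a chance and makes endless loops less probable.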

Prediction Accuracy (on my full notes data) = 63%

Input = This is about

Output =

This is about the problem with the company and the problem and personal information and reading what i have (or have not) understood? (make your own note of the decision making and complexity is a consulting in the text? is there any major empirical conclusion and reading what are the main ideas presented in the company and then and the problem and predictions and problems and discussion of the problem and responsibilities and discussions and then and have the relationship of the company to learn from the problem with the world.


Grammatical inconsistencies aside, the model started producing text that sounded like me. You may notice that many of the notes I trained the model with were personal rather than public ones. Also, I studied decision making and struggled with decision making at a company I co-founded during the last two years 😅

Phase 4: Transfer Learning

While my latest results were more encouraging, it was clear that the model still lacked many required writing skills, such as completing a sentence. At this point, I started doubting the quality and quantity of the data I was training the model with. Thankfully, the latest research in NLP provides approaches to develop good models even for small datasets. The approach is called transfer learning and was kick-started by a paper from Jeremy Howard and Sebastian Ruder (ULMFiT). In a nutshell, transfer learning consists of taking a language model that was pre-trained on a large dataset (such as WikiText-103) and then fine-tuning this model on a specific small dataset.

To apply transfer learning to automate myself, I used the powerful fast.ai library. This means that these last results are not comparable with those obtained in the other phases, since both the training data and the model architecture changed significantly. Also, the large amount of data used to pre-train the model meant that I could remove a lot of lower-quality data from the notes used for fine-tuning. For instance, I removed all sentences that had no verb or were left unfinished. However, from a qualitative point of view, transfer learning changed the game.

Prediction Accuracy (on my full notes data minus low-quality sentences) = 22%

Input = How can we perform better on the workplace?

Output =

How can we perform better on the workplace? It is challenging the clients that make sense of the public complexity , transform it in directly prioritising our data science model and transform it in mock - ups of actions ( that are showed of different people ) to build a da point. Workload is destructive and the people left is avoiding problems. Must the increase of the solution market will see satisfaction from it even in the actually broad market and too high need to tackle new part of it.


🥁 Output

Here are some sample final outputs from trying to automate my writing. They were obtained using the last approach I tried, transfer learning.

Me: I think that

Model:

I think that AI is a key element of our mission.


Me: Today I am writing about deep learning.

Model:

Today I am writing about deep learning harness (the technology) to solve real problems. I tried in seeing proved of two structures where there are always the very best solving algorithms to model the complexity of living cells. I developed the so - called forecasting more and more speculative ve interests in more data science and artificial intelligence - despite its lack of practical focus. Since entering of neural networks I tried to uncover difference with the algorithms and then try to find computers and more then API terms and then use literature data.


Me: Riccardo

Model:

Riccardo designs and support technology solutions, with a focus on data science, machine learning and blockchain technologies. He has a background in machine learning, a technique in machine learning and machine learning a further psychology.


In summary, I would not say that my blog is at risk of a machine take-over, but neither would I say that such a day is that far away!