Natural Language Processing

Back in the days before the deep learning era – when a Neural Network was more of a frightening, enigmatic mathematical curiosity than a powerful tool – there were surprisingly many relatively successful applications of classical machine learning algorithms in the Natural Language Processing (NLP) domain.

It seemed that problems like spam filtering or part-of-speech tagging could be solved with rather straightforward and interpretable models.

However, not every problem can be solved this way. Simple models fail to adequately capture linguistic subtleties like context, idioms, or irony (though humans often fail at that last one too). Algorithms based on aggregate representations (e.g. bag-of-words) were not strong enough to capture the sequential character of text, whereas n-grams struggled to model general context and suffered severely from the curse of dimensionality. Even HMM-based models had trouble overcoming these problems due to their memorylessness.

First breakthrough – Word2Vec

One of the chief challenges in language analysis is transforming text into numerical input, which makes modeling possible. This is not a problem in computer vision tasks, since in an image each pixel is represented by three numbers depicting the intensities of three base colors. For many years, researchers tried numerous algorithms for finding so-called embeddings, which represent text as vectors. Initially, most of these approaches were based on counting words or short sequences of words (n-grams).

The initial approach to tackle this problem is one-hot encoding, where every word in the vocabulary is represented as a distinctive binary vector with only one nonzero entry. A simple generalization is to encode n-grams (sequences of consecutive words) rather than single words. The significant drawback of this approach is its very high dimensionality: each vector has the dimension of the vocabulary size (or even larger in the case of n-grams), which makes modeling difficult.
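The idea can be sketched in a few lines of plain Python; the toy vocabulary and word below are invented for illustration:

```python
# A minimal sketch of one-hot encoding over a toy vocabulary.

def one_hot(word, vocab):
    """Return a binary vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

vocab = ["the", "cat", "sat", "on", "mat"]
encoding = one_hot("sat", vocab)
# Each vector has the same dimension as the vocabulary
# and exactly one nonzero entry.
```

Note that the vector length grows with the vocabulary, which is exactly the dimensionality problem described above.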

Another drawback of this method is the lack of semantic information: all vectors representing single words are equidistant, so in this embedding close synonyms are as far from each other as entirely unrelated words. Using this type of word representation unnecessarily makes tasks much more difficult, as it forces the model to memorize particular phrases instead of capturing their semantics.


The first key leap forward for natural language processing came in 2013 with the debut of Word2Vec – a neural network-based model used exclusively for producing embeddings. Imagine starting with a sequence of words, removing the middle one, and having a model predict it only by looking at the context words (this is the Continuous Bag of Words variant, CBOW). The alternative version of the model predicts the context given the middle word (skip-gram). This idea may seem counterintuitive, since such a model could at best be used in information retrieval tasks (a certain word is missing and the problem is to predict it from its context), but that is seldom the point.

Instead, it turns out that if you initialize your embeddings randomly and then use them as learnable parameters while training CBOW or skip-gram, you obtain a vector representation of every word that can be used for any task. These powerful representations emerge during training because the model is forced to recognize words that appear in the same context. This way, you avoid memorizing particular words and instead convey the semantic meaning of a word, described not by the word itself but by its context.
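To make the setup concrete, here is a sketch of how (context, center) training pairs are extracted from a sentence; the sentence and window size are illustrative, and a real Word2Vec implementation then learns embeddings by predicting one side of each pair from the other:

```python
# Extract (context, center) pairs from a token sequence.
# CBOW predicts the center word from its context;
# skip-gram predicts each context word from the center.

def training_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, center))
    return pairs

sentence = "the quick brown fox jumps".split()
pairs = training_pairs(sentence)
# e.g. the pair for "brown" is (["the", "quick", "fox", "jumps"], "brown")
```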

In 2014, Stanford’s research group challenged Word2Vec with a strong competitor: GloVe. They proposed a different approach, arguing that the best way to encode the semantic meaning of words in vectors is through the global word-word co-occurrence matrix, instead of the local co-occurrences used by Word2Vec. As you can see in Figure 2, the ratio of co-occurrence probabilities can discriminate between words when compared against a context word. It is around 1 when both target words co-occur very frequently or very seldom with the context word. Only when the context word co-occurs with one of the target words is the ratio either very small or very big.

This is the intuition behind GloVe. The exact algorithm involves representing words as vectors in such a way that the difference between two word vectors, dotted with a context word vector, equals the logarithm of the ratio of their co-occurrence probabilities.
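The ratio intuition can be illustrated with a toy co-occurrence table (the counts below are invented; GloVe builds the matrix from a real corpus):

```python
# A toy illustration of the co-occurrence probability ratio behind GloVe.
# X[word][context] = co-occurrence count within some window (invented values).
X = {
    "ice":   {"solid": 80, "gas": 2,  "water": 30},
    "steam": {"solid": 2,  "gas": 75, "water": 32},
}

def P(word, ctx):
    """Co-occurrence probability P(ctx | word)."""
    total = sum(X[word].values())
    return X[word][ctx] / total

# Large for 'solid' (related only to ice), small for 'gas' (related only
# to steam), and near 1 for 'water' (related to both).
ratio_solid = P("ice", "solid") / P("steam", "solid")
ratio_gas   = P("ice", "gas")   / P("steam", "gas")
ratio_water = P("ice", "water") / P("steam", "water")
```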

Further improvements

Even though the new, powerful Word2Vec representations boosted the performance of many classical algorithms, there was still a need for a solution capable of capturing sequential dependencies in text (both long- and short-term). Vanilla RNNs take advantage of the temporal nature of text data by feeding words into the network sequentially while using the information about previous words stored in a hidden state.

[Figure: an unrolled recurrent neural network]
These networks proved quite powerful at handling local temporal dependencies, but performed quite badly when presented with long sequences. This failure was caused by the fact that, after each time step, the content of the hidden state was overwritten by the output of the network. To address this issue, computer scientists and researchers designed a new RNN architecture known as long short-term memory (LSTM).

LSTM deals with the problem by introducing an additional unit in the network called a memory cell, a mechanism responsible for keeping long-term dependencies, together with several gates responsible for controlling the flow of information in the unit. At every time step, the forget gate computes a fraction depicting how much of the memory cell’s content to forget. Then, the input gate determines how much of the new input will be added to the memory cell’s content. Finally, the output gate decides how much of the memory cell’s content to produce as the whole unit’s output.

All the gates act like regular neural network layers with learnable parameters, meaning that over time the network adapts and gets better at deciding what kind of input is relevant for the task and what information can and should be forgotten.
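The gate arithmetic above can be made concrete with a single-unit LSTM step in pure Python. All weights here are illustrative scalars; a real LSTM uses learned weight matrices over vectors:

```python
# One time step of a 1-dimensional LSTM cell (toy weights, for intuition only).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """w holds (input weight, recurrent weight, bias) per gate."""
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev + w["f"][2])    # forget gate
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev + w["i"][2])    # input gate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev + w["o"][2])    # output gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h_prev + w["g"][2])  # candidate
    c = f * c_prev + i * g   # memory cell: keep a fraction, add gated new input
    h = o * math.tanh(c)     # the unit's output
    return h, c

weights = {k: (0.5, 0.1, 0.0) for k in ("f", "i", "o", "g")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, w=weights)
```

Because the forget gate multiplies the previous cell state instead of overwriting it, information can survive many time steps.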

LSTMs have been around since the late 1990s, but they are rather expensive computationally and memory-wise, so it is only recently, thanks to remarkable improvements in hardware, that training LSTM networks in reasonable time became feasible. Nowadays, there exist many variants of LSTM, such as mLSTM, which introduces a multiplicative dependence on the input, or GRU, which, thanks to an intelligent simplification of the memory cell update mechanism, significantly reduces the number of trainable parameters.

After a while, it became apparent that these models significantly outperform classic approaches, but researchers were hungry for more. They studied the astonishing success of Convolutional Neural Networks in Computer Vision and wondered whether those ideas could be incorporated into NLP. It quickly turned out that a straightforward replacement of 2D filters (processing a small section of the image, e.g. regions of 3×3 pixels) with 1D filters (processing a small part of the sentence, e.g. 5 consecutive words) made it possible.

Similarly to 2D CNNs, these models learn more and more abstract features as the network gets deeper, with the first layer processing raw input and all succeeding layers processing the outputs of their predecessor. Of course, a single word embedding (the embedding dimension is usually around 300) carries much more information than a single pixel, which means it is not necessary to use networks as deep as in the case of images. You may think of the embedding as doing the work meant to be carried out by the first few layers, so they can be skipped.

Those intuitions proved right in experiments on various tasks. 1D CNNs were much lighter and more accurate than RNNs and could be trained even an order of magnitude faster thanks to easier parallelization.
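A single 1D filter sliding over a sentence can be sketched as follows; the embeddings and filter weights below are toy values, and real models learn many filters over roughly 300-dimensional embeddings:

```python
# One 1D convolutional filter sliding over a sequence of word vectors.

def conv1d(embeddings, filt):
    """Slide a filter spanning len(filt) consecutive word vectors and
    return one activation per window (dot product, no padding)."""
    k = len(filt)
    outputs = []
    for start in range(len(embeddings) - k + 1):
        window = embeddings[start:start + k]
        act = sum(w * x
                  for w_vec, x_vec in zip(filt, window)
                  for w, x in zip(w_vec, x_vec))
        outputs.append(act)
    return outputs

# 5 words with 2-dimensional toy embeddings; the filter spans 3 words.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
filt = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
feature_map = conv1d(sentence, filt)
```

Each activation depends only on a small window of words, which is what makes the computation so easy to parallelize.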

Despite the amazing contributions made by CNNs, these networks still suffered from several drawbacks. In a classic arrangement, a convolutional network consists of numerous convolutional layers responsible for producing so-called feature maps, and a module transforming them into predictions. Feature maps are essentially high-level features extracted from the text (or image), preserving the location where they emerged in the text (or image).

The prediction module performs aggregating operations on feature maps and either ignores the location of a feature (fully convolutional networks) or, more commonly, learns where specific features appear most frequently (fully connected modules). The problem with these approaches appears, for example, in the Question Answering task, in which the model is supposed to produce an answer given a text and a question.

In cases like this, it is difficult and often unnecessary to store all the information carried by the text in a single vector, as classic prediction modules do. Instead, we would like to focus on the particular part of the text where the most crucial information for a specific question is stored.

This problem is addressed by the Attention Mechanism, which weighs regions of the text depending on what may be relevant for the given input. This strategy has also proved useful for classic applications like text translation or classification. Will attention transform the NLP field? Ilya Sutskever, co-founder and Research Director of OpenAI, stated in an interview:
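The core weighting step can be sketched as a softmax over similarity scores; the vectors below are toy values, and real models learn projections that produce the queries and keys:

```python
# Score each position of a text against a query, then softmax the scores
# into weights that sum to 1.
import math

def attention_weights(query, keys):
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # one key per text position
weights = attention_weights(query, keys)
# The position most similar to the query gets the largest weight.
```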

“I am very excited by the recently introduced attention models, due to their simplicity and due to the fact that they work so well. Although these models are new, I have no doubt that they are here to stay, and that they will play a very important role in the future of deep learning.”

Ilya Sutskever, OpenAI

Natural Language Generation

You may have noticed that all of the aforementioned tasks share a common denominator. For sentiment analysis, a text is simply positive, negative, or neutral. In document classification, each example belongs to a single class. These are supervised learning tasks, where the model is presented with an example and a correct value associated with it. Things get tricky when you need your model to generate text.

Andrej Karpathy provides a comprehensive review of how RNNs handle this problem in his excellent blog post. He shows examples of deep learning used to generate new Shakespeare-like texts, or to produce source code that appears to have been written by a human but in fact doesn’t do anything. These are excellent examples that show how powerful such a model can be, but there are also real-life business applications of these algorithms. Imagine you want to target clients with advertisements, and you don’t want them to be generic, with the same message copied and pasted to everybody.

There is definitely no time for writing thousands of different variations of it, so an ad-generating tool may come in handy. RNNs seem to perform fairly well at generating text at a character level, meaning that the network predicts consecutive letters (as well as spaces, punctuation, and so on) without actually being aware of the concept of a word. However, it turned out that these models really struggled with sound generation. That is because producing a word requires only a few letters, but when producing sound in high quality, with even 16kHz sampling, there are hundreds or perhaps even thousands of points that form a spoken word.

Again, researchers turned to CNNs, with great success. Researchers at DeepMind developed a very sophisticated convolutional generative model called WaveNet, which deals with the problem of an extremely large receptive field (the length of the actual raw input) by using so-called atrous (dilated) convolutions, which increase the receptive field exponentially with every layer. This is currently the state-of-the-art model, significantly outperforming all other available baselines, but it is very expensive to use: it takes 90 seconds to generate 1 second of raw audio. This means that there is still a great deal of room for improvement, but we are definitely on the right path.
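The exponential growth of the receptive field is easy to verify with a little arithmetic; the kernel size and layer count below are illustrative, in the spirit of the WaveNet design:

```python
# Receptive field of stacked 1D convolutions with given per-layer dilations.
# With no dilation, the field grows linearly with depth; doubling the
# dilation at each layer makes it grow exponentially.

def receptive_field(kernel_size, dilations):
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field

# 10 layers, kernel size 2, dilations 1, 2, 4, ..., 512
dilations = [2 ** i for i in range(10)]
field = receptive_field(2, dilations)
# With 10 dilated layers the network sees 1024 input samples,
# versus only 11 for the same stack without dilation.
```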

Natural Language Processing Algorithms for Machine Translation

Now we proceed to the real deal: Machine Translation, which has posed a serious challenge for quite some time. It is important to understand that this is an entirely different task than the two previous ones we’ve discussed: here, we need a model to predict a sequence of words, rather than a label. Machine Translation makes apparent what the fuss about Deep Learning is, since it has been an incredible breakthrough when it comes to sequential data.

In this blog post, you can read more about how — yep, you guessed it — Recurrent Neural Networks tackle translation, and in this one, you can learn about how they achieve state-of-the-art results. Say you need an automated text summarization model, and you want it to extract only the most important parts of a text while preserving all of the meaning.

This requires an algorithm that can understand the entire text while focusing on the specific parts that carry most of the meaning. This problem is solved by the previously mentioned attention mechanisms, which can be introduced as modules inside an end-to-end solution. Lastly, there is question answering, which comes as close to Artificial Intelligence as you can get. For this task, not only does the model need to understand a question, but it also needs a full understanding of the text of interest and must know exactly where to look to produce an answer.

For a detailed explanation of a question answering solution (using Deep Learning, of course), have a look at this article.

[Figure: the attention mechanism]
Since Deep Learning offers vector representations for various kinds of data (e.g., text and images), you can build models that operate across different domains. This is how researchers came up with visual question answering. The task is “trivial”: just answer a question about an image. Sounds like a job for a 7-year-old, right? Nonetheless, deep models are the first to produce any reasonable results without human supervision. Results and a description of such a model are in this paper.

Typical NLP problems

There are a variety of language tasks that, while simple and second-nature to humans, are very difficult for a machine. The confusion is mostly due to linguistic nuances like irony and idioms. Let’s take a peek at some of the areas of NLP that researchers are trying to tackle (roughly in order of their complexity). The most common and possibly easiest one is sentiment analysis: basically, determining the attitude or emotional reaction of a speaker/writer toward a particular topic (or in general).

Check out this fantastic article about using Deep Convolutional Neural Networks for gauging sentiment in tweets. Another interesting experiment showed that a Deep Recurrent Net could learn sentiment by accident.

A natural generalization of the previous case is document classification, where instead of assigning one of three possible labels to each article, we solve an ordinary classification problem. According to a comprehensive comparison of algorithms, it is safe to say that Deep Learning is the way to go for text classification.


So, now you know.

Deep Learning arrived in Natural Language Processing only quite recently, due to computational constraints, and we had to learn much more about Neural Networks to understand their capabilities.

DevsData is a software and Machine Learning consulting company from New York City with extensive experience in NLP. Also, if you are interested in other applications of Deep Learning, be sure to read our case study on real-time detection for a military company.

DevsData – a premium technology partner

DevsData is a boutique software and recruitment agency. Get your software project done by Google-level engineers or scale up an in-house tech team with developers with experience relevant to your industry.
