But not every problem can be solved this way. Simple models fail to adequately capture linguistic subtleties like context, idioms, or irony (though humans often fail at that one too). Algorithms based on overall summarization (e.g. bag-of-words) turned out to be not powerful enough to capture sequential nature of text , whereas n-grams struggled to model general context and suffered severely from a curse of dimensionality. Even HMM-based models had trouble overcoming these issues due to their their memorylessness.
One of the main challenges in language analysis is the method of transforming text into numerical input, which makes modeling feasible. It is not a problem in computer vision tasks due to the fact that in an image, each pixel is represented by three numbers depicting the saturations of three base colors. For many years, researchers tried numerous algorithms for finding so called embeddings, which refer, in general, to representing text as vectors. At first, most of these methods were based on counting words or short sequences of words (n-grams).
The initial approach to tackle this problem is one-hot encoding, where each word from the vocabulary is represented as a unique binary vector with only one nonzero entry. A simple generalization is to encode n-grams (sequence of n consecutive words) instead of single words. The major disadvantage to this method is very high dimensionality, each vector has a size of the vocabulary (or even bigger in case of n-grams) which makes modeling difficult. Another drawback to this approach is lack of semantic information. This means that all vectors representing single words are equidistant. In this embedding, space synonyms are just as far from each other as completely unrelated words. Using this kind of word representations unnecessarily makes tasks much more difficult as it forces your model to memorize particular words instead of trying to capture the semantics.
The first major leap forward for natural language processing algorithm came in 2013 with the introduction of Word2Vec – a neural network based model used exclusively for producing embeddings. Imagine starting from a sequence of words, removing the middle one, and having a model predict it only by looking at context words (i.e. Continuous Bag of Words, CBOW). The alternative version of that model is asking to predict the context given the middle word (skip-gram). This idea is counterintuitive because such model might be used in information retrieval tasks (a certain word is missing and the problem is to predict it using its context), but that’s rarely the case. Instead, it turns out that if you initialize your embeddings randomly and then use them as learnable parameters in training CBOW or a skip-gram model, you obtain a vector representation of each word that can be used for any task. Those powerful representations emerge during training, because the model is forced to recognize words that appear in the same context. This way you avoid memorizing particular words, but rather convey semantic meaning of the word explained not by a word itself, but by its context.
In 2014 Stanford’s research group challenged Word2Vec with a strong competitor: GloVe. They proposed a different approach, arguing that the best way to encode semantic meaning of words in vectors is through global word-word co-occurrence matrix as opposed to local co-occurrences as in Word2Vec. As you can see in Figure 2 the ratio of co-occurrence probabilities is able to discriminate words when compared to the context word. It is around 1 when both target words co-occur very often or very rarely with the context word. Only when the context word co-occurs with one of the target words is the ratio either very small or very big. This is the intuition behind GloVe. The exact algorithm involves representing words as vectors in a way that their difference, multiplied by a context word, is equal to the ratio of the co-occurrence probabilities.
Even though the new powerful Word2Vec representation boosted the performance of many classical algorithms, there was still a need for a solution capable of capturing sequential dependencies in a text (both long- and short-term). The first concept for this problem was so-called vanilla Recurrent Neural Networks (RNNs). Vanilla RNNs take advantage of the temporal nature of text data by feeding words to the network sequentially while using the information about previous words stored in a hidden-state.
I am very excited by the recently introduced attention models, due to their simplicity and due to the fact that they work so well. Although these models are new, I have no doubt that they are here to stay, and that they will play a very important role in the future of deep learning.”
llya Sutskever, OpenAI
There are a variety of language tasks that, while simple and second-nature to humans, are very difficult for a machine. The confusion is mostly due to linguistic nuances like irony and idioms. Let’s take a look at some of the areas of NLP that researchers are trying to tackle (roughly in order of their complexity):The most common and possibly easiest one is sentiment analysis. This is, essentially, determining the attitude or emotional reaction of a speaker/writer toward a particular topic (or in general). Possible sentiments are positive, neutral, and negative. Check out this great article about using Deep Convolutional Neural Networks for gauging sentiment in tweets. Another interesting experiment showed that a Deep Recurrent Net could the learn sentiment by accident.
Fill in an enquiry form and we’ll get back to you as soon as possible.