The story of NLP algorithms, NLP techniques for machine translation & typical NLP problems and how to solve them
Many organizations and enterprises of all sizes and types today must be able to tap into what is now an essential but complicated resource: big data. This ever-growing resource contains everything from customer and sales information, transactional data and research, to social media and open-source information. It is also mostly unstructured and is mostly text. This is where natural language processing (NLP) has presented itself as the next great opportunity in big data.
Natural language processing is a form and application of artificial intelligence that helps computers “read” text, similar to giving machines the human ability to understand language. It incorporates numerous methods such as linguistics, semantics, machine learning, and statistics to extract context and meaning from data, which then allows machines to comprehensively understand what is being said or written.
Rather than decoding single words or short phrases, NLP aims to help computers understand the complete thoughts in a sentence typed or spoken by a human. Tasks such as spam filtering or part-of-speech tagging help in this interpretation, but the results are hit-and-miss. Like many humans, most of these models fail to catch linguistic subtleties such as context, idioms, irony, or sarcasm.
Algorithm models like Bag-of-Words (which reduces a text to its word counts and ignores word order), n-grams, and Hidden Markov Models (HMM) could not adequately capture and decode the complexities of human speech in big data.
One of the main challenges in language analysis is how to turn text into numbers, which makes it readable by machines and thus makes modeling possible. For many years, researchers have designed algorithms to find word embeddings, which are essentially vector representations of particular words. Early approaches relied mainly on counting words or short phrases (n-grams).
One of the earliest approaches to this challenge is one-hot encoding, where each word is represented as a unique binary vector with just one nonzero entry. A simple generalization of this is to encode n-grams (contiguous sequences of n items from a given text or speech sample) rather than individual words.
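As a minimal sketch of both ideas, the snippet below builds one-hot vectors for words and then treats bigrams as the units to encode. The helper names (build_vocabulary, one_hot, ngrams) and the sample sentence are illustrative assumptions, not part of any particular library.

```python
def build_vocabulary(items):
    """Map each distinct item to an index, in order of first appearance."""
    vocab = {}
    for item in items:
        if item not in vocab:
            vocab[item] = len(vocab)
    return vocab


def one_hot(item, vocab):
    """Return a binary vector with a single 1 at the item's index."""
    vector = [0] * len(vocab)
    vector[vocab[item]] = 1
    return vector


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat sat on the mat".split()
vocab = build_vocabulary(tokens)
print(one_hot("cat", vocab))

# The same encoding works for n-grams: each distinct bigram gets its own index.
bigrams = ngrams(tokens, 2)
bigram_vocab = build_vocabulary(bigrams)
print(one_hot(("the", "cat"), bigram_vocab))
```

Every new word or n-gram adds one more dimension to every vector, so the vector length grows with the size of the vocabulary.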
One major problem with this approach is the high dimensionality: the number of dimensions is so large that calculations and modeling become extremely difficult.
Another obstacle is the lack of semantic context. Since the vectors only identify words and how frequently they show up with other words, synonyms and completely unrelated words are seen by the models as equidistant. The model learns frequencies in sentence construction without being able to understand semantics.
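To make the equidistance point concrete, here is a small sketch using one-hot vectors for three hypothetical words. The word list is an assumption chosen for illustration: two near-synonyms and one unrelated word.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


words = ["good", "great", "banana"]
# One-hot vector per word: a 1 at the word's own position, 0 elsewhere.
vectors = {
    word: [1 if i == j else 0 for j in range(len(words))]
    for i, word in enumerate(words)
}

d_synonyms = euclidean(vectors["good"], vectors["great"])
d_unrelated = euclidean(vectors["good"], vectors["banana"])
print(d_synonyms, d_unrelated)  # the two distances are identical
```

Any two distinct one-hot vectors differ in exactly two positions, so the encoding carries no information about which words are related.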
The first breakthrough in this field was the introduction of Word2Vec in 2013. This two-layer neural network makes highly accurate guesses about the meaning of a word based on the contexts in which it has appeared. These guesses can then be used to establish word associations (e.g., ‘boy’ is to ‘man’ as ‘girl’ is to ‘woman’), or to group and classify words by topic. Word2Vec does this in two ways: continuous bag-of-words (CBOW), which predicts a word from its surrounding context, and skip-gram, which predicts the surrounding context from a word.
If you initialize the embeddings randomly and then treat them as learnable parameters while training a CBOW or skip-gram model, you get a vector representation of each word that is applicable to different tasks. The training forces the model to recognize words that occur in the same context rather than memorizing specific words; it looks at the context instead of the individual words.
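The difference between the two training setups is easiest to see in the examples each one is trained on. The sketch below only generates those training examples; it does not train any embeddings. The function names and the window size are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window):
    """Skip-gram: for each center word, one (center, context) pair per neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """CBOW: for each target word, one (context words, target) example."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples


tokens = ["the", "quick", "brown", "fox"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

In both cases the word itself is never the input and the label at the same time, which is why the learned vectors end up describing context rather than identity.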
Soon after, in 2014, Word2Vec found a competitor in GloVe, the brainchild of a Stanford research group. This approach suggests that models train better on aggregated global word-word co-occurrence statistics from a corpus than on local co-occurrences alone. Words are mapped into a meaningful space where the distance between words reflects how often or how seldom they appear together, which in turn indicates whether a target word is semantically similar to its context (nearby) words or phrases.
The logic behind GloVe is to treat words as vectors such that the difference between two word vectors, multiplied by a context word vector, determines the ratio of their co-occurrence probabilities with that context word.
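Written out, the relation looks roughly as follows. The symbols are assumptions borrowed from the usual presentation of GloVe rather than from this article: w_i and w_j are word vectors, \tilde{w}_k is a context word vector, X_{ik} counts how often word k appears in the context of word i, X_i is the total count for word i, and F is a function chosen during the derivation (an exponential in the standard treatment).

```latex
F\!\left((w_i - w_j)^{\top}\,\tilde{w}_k\right) = \frac{P_{ik}}{P_{jk}},
\qquad
P_{ik} = \frac{X_{ik}}{X_i}
```

With F taken to be the exponential, this reduces to requiring that the dot product of a word vector and a context vector, plus bias terms, approximates the logarithm of their co-occurrence count: w_i^{\top}\tilde{w}_k + b_i + \tilde{b}_k \approx \log X_{ik}.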
Word2Vec and GloVe have proven to be well-suited to language processing problems that involve classification or regression prediction. However, most data are not in the form of fixed-size vectors, but rather in sequences.
Enter Recurrent Neural Networks (RNNs). In simple terms, RNNs were designed to handle sequential data. The vanilla RNN, the earliest design for this problem, leverages the temporal nature of text data by feeding words into the network one at a time while combining each word with information about previous words kept in a hidden state.
These networks were effective in analyzing local temporal dependencies but were not as good when the sequences started getting longer. This was because the content of the hidden state keeps getting overwritten by the output of the network after every time step. As such, a new RNN architecture was designed: long short-term memory (LSTM).
This specialized RNN uses new internal mechanisms such as a cell (responsible for keeping the long-term dependencies) and gates (which regulate the flow of information). This structure helps the network learn which data to keep (through the input gate, which stores it in the cell) and which to throw away (through the forget gate). An output gate then decides which cell contents make it to the final output.
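A single step of a vanilla RNN can be sketched in a few lines. The weights below are arbitrary toy values, not trained ones, and the names (rnn_step, run_rnn, W_xh, W_hh) are assumptions chosen for readability.

```python
import math


def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_prev + b)."""
    h = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h.append(math.tanh(total))
    return h


def run_rnn(inputs, W_xh, W_hh, b):
    """Feed the inputs one at a time, carrying the hidden state forward."""
    h = [0.0] * len(b)
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b)
    return h


# Toy weights for 2-dimensional inputs and a 2-dimensional hidden state.
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
final_hidden = run_rnn(sequence, W_xh, W_hh, b)
print(final_hidden)
```

Because the hidden state is recomputed from scratch at every step, the same inputs in a different order produce a different final state, and information from early steps has to survive every later update.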
All these structures work together like regular neural network layers with learnable parameters that, over time, make the network more capable of learning long-term dependencies.
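The interplay of the cell and the gates can be sketched for a single hidden unit with a scalar input. This is a toy illustration under assumed names (lstm_step, and a params dictionary mapping each gate to an input weight, a recurrent weight, and a bias); real implementations work on vectors and learn these values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step for one hidden unit; params[name] = (w_x, w_h, bias)."""
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    forget = sigmoid(pre_activation("forget"))       # how much old cell content to keep
    write = sigmoid(pre_activation("input"))         # how much new content to store
    output = sigmoid(pre_activation("output"))       # how much of the cell to expose
    candidate = math.tanh(pre_activation("candidate"))

    c = forget * c_prev + write * candidate
    h = output * math.tanh(c)
    return h, c


# Gates biased to "remember everything, write nothing".
keep_params = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, params=keep_params)
print(h, c)  # c stays very close to the previous cell value
```

With the forget gate near one and the input gate near zero, the cell passes its content through almost untouched, which is what lets the network carry information across many time steps.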
Over the years, numerous LSTM variants have appeared that are much more resource-efficient, such as mLSTM, which combines LSTM and multiplicative recurrent neural network architectures, and the Gated Recurrent Unit (GRU), which is similar to LSTM but has no output gate, which means fewer trainable parameters and faster learning.
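The parameter saving can be estimated with simple arithmetic. The sketch below assumes the standard textbook formulations with one bias vector per block (some libraries add a second bias), and the layer sizes are arbitrary examples.

```python
def block_params(input_size, hidden_size):
    """Parameters in one gate-sized block: input weights, recurrent weights, bias."""
    return hidden_size * input_size + hidden_size * hidden_size + hidden_size


def lstm_params(input_size, hidden_size):
    # Four blocks: forget gate, input gate, output gate, candidate cell values.
    return 4 * block_params(input_size, hidden_size)


def gru_params(input_size, hidden_size):
    # Three blocks: update gate, reset gate, candidate hidden state.
    return 3 * block_params(input_size, hidden_size)


input_size, hidden_size = 300, 128
print(lstm_params(input_size, hidden_size))  # 219648
print(gru_params(input_size, hidden_size))   # 164736
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size.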
As these deep learning models improved in efficiency, experts kept striving for better ones. They turned their attention to the success of Convolutional Neural Networks (CNNs) in Computer Vision and to the possibility of applying them to natural language processing. In their experiments, they found that it was possible to simply switch from 2D filters (analyzing small segments of an image, such as 3×3 pixels) to 1D filters (analyzing phrases in a sentence, such as five words in a row). Just as with 2D CNNs, the first layer processes the raw input and each subsequent layer processes the output of the previous layer.
A single word or phrase embedding always carries more information than a single pixel (as the embedding space has about 300 dimensions), which means you do not need networks as deep as those used for images to analyze it. The embedding does the work meant for the first few layers, which means they can be skipped.
This line of thinking proved to be correct in various experiments on several tasks. Compared to RNNs, 1D CNNs were much lighter resource-wise and more accurate. They could even be trained an order of magnitude faster thanks to their easier parallelization.
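The switch from 2D to 1D filters can be sketched as a window that slides along the sentence and covers five word vectors at a time. The function name conv1d and the tiny 2-dimensional "embeddings" are assumptions for illustration; real embeddings are far wider and the kernel values are learned.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Slide a kernel spanning len(kernel) consecutive word vectors over the sequence."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        total = bias
        for offset in range(width):
            word_vec = sequence[start + offset]
            kernel_vec = kernel[offset]
            total += sum(w * k for w, k in zip(word_vec, kernel_vec))
        outputs.append(total)
    return outputs


# Seven "words", each a 2-dimensional vector.
sentence = [[float(i), 1.0] for i in range(7)]
# One filter that looks at five words in a row.
kernel = [[1.0, 0.0]] * 5

feature_map = conv1d(sentence, kernel)
print(feature_map)  # [10.0, 15.0, 20.0]
```

Each output position depends only on its own window, so all positions can be computed independently of each other.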
Despite the successes of CNNs, they still encountered their own set of roadblocks. In a classic setup, a convolutional network has a number of convolutional layers that create ‘feature maps’ and a module that turns them into predictions. The maps are basically high-level features extracted from a text or image, each retaining the location it came from in that text or image. The prediction module then performs aggregation operations on these maps and does one of two things:
The problem with this approach comes up in scenarios like the Question Answering task, where a text and a question are provided, and the module is supposed to come up with an answer. In this scenario, it is often complicated and redundant to store all the information carried by the analyzed text in a single representation, which is what classic prediction modules do. Instead, the focus should be on the particular part of the text where the most important information for a specific question is stored.
This is where Attention Mechanisms come in. This approach weighs parts of the text according to their relevance to the input. It is also often useful for classical applications such as text classification or translation.
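One common way to implement this idea is dot-product attention, sketched below. It is one variant among several and is not necessarily the one the systems mentioned in this article use. The names query, keys, and values, and the toy numbers, are assumptions for illustration.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Weight each value by how well its key matches the query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


# Three text segments; only the first one matches the query.
query = [1.0, 0.0]
keys = [[5.0, 0.0], [0.0, 5.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

weights, context = attend(query, keys, values)
print(weights)
print(context)
```

The result is a weighted mix of the segments dominated by the most relevant one, so the model does not have to squeeze the whole text into one fixed summary.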
Speech and language, with all their intricacies, are relatively easy for us humans, but not so much for machines. The difficulties usually lie in linguistic nuances such as idioms, irony, and sarcasm.
Researchers are currently looking into certain areas of NLP that continue to be major problems:
The first and most common is sentiment analysis. It is relatively easy for us humans to gauge the attitude, emotions, or sentiments of someone who is talking to us, or even of someone writing a post. We can determine with relative ease if their words (either oral or written) are positive, negative, or neutral. Computers and machines lack this ability to have and analyze empathy.
However, Deep Learning has made strides in understanding sentiments better. For more about this, here is an article about Deep CNNs and how they are being used to determine sentiments in tweets. There are even several experiments showing that a deep recurrent network can unintentionally learn sentiments.
Related to the previous case is document classification. In this scenario, we solve a general classification problem rather than assigning one of three possible flags to each article. Several studies have backed the idea that Deep Learning is a well-suited and practical approach to text classification.
One of the most important and valuable applications of natural language processing is in translations. While languages around the world come from only a few language families, these have branched out to countless countries and cultures. English is different in the USA, the UK, across Asia and Europe; Spanish has its own versions in Spain, and Central and South America.
These and other linguistic concerns have been a huge hurdle for machine translation for years now. It is important to note that machine translation is entirely different from sentiment analysis and document classification.
This difficult task needs a model that predicts word sequences rather than labels. Machine translation also shows why there is so much talk and excitement about Deep Learning, as it has been a game-changer in analyzing sequential data.
There is also the need for more efficient paraphrasing and rewording. Say you need an automatic text summarization model, and you want it to summarize the text while keeping its complete and original meaning. Attention mechanisms (introduced as modules in end-to-end solutions) prove to be valuable in these tasks, as they can make sense of whole blocks of text by looking for the meaning only in specific segments.
Moreover, there is question answering, which is as close to a Hollywood-level Artificial Intelligence as we can get for now. This requires the algorithm model not only to understand a question, but also to have a full understanding of context and related texts, as well as knowing where to look for the answers.
This article explains in more detail a question answering solution using Deep Learning.
Deep Learning represents different types of data in vectors. As such, one can easily build algorithm models focusing on different domains. This is how ‘visual question answering’ came about. It is a task easy enough for a child: answering a question about an image.
However, think of it in the context and capabilities of a machine that is still only starting to understand words and images. Deep models were the first to come up with significant results in this task without human supervision. If you would like to read more about the description and results of this model, you can find it here.
In all the aforementioned tasks, there is an underlying theme, a common denominator. In sentiment analysis, texts can be positive, negative, or neutral. In document classification, each entry is assigned to one class. Both problems belong to a family known as supervised learning, where the model is always given an example together with its associated correct value.
However, things become more complicated when you want an algorithmic model to generate text. For this problem, RNNs prove to be the solution. Andrej Karpathy discussed this at length in this article of his.
He goes on to show how deep learning can write Shakespeare-like text or come up with source code that, at first look, seems to have been written by a human but actually has no purpose or function. These examples highlight how powerful and practical such models can be.
Moreover, there are a number of real-life business applications for these algorithms. Personalized marketing and 1:1 customer journeys are all the rage right now in the field of digital marketing. However, targeting clients one by one with personalized ads and messages is time-consuming and impractical. An ad-generating tool powered by a text-generating algorithm can do the trick in this scenario.
RNNs can reliably generate text at the character level, which means the network can predict consecutive letters (even spaces, punctuation, and special characters) without being aware of the concept of a word. However, this approach does have one weakness: it struggles with sound generation.
This is because producing a word as text only requires a few letters, and therefore a few data inputs. With sound, however, a single spoken word takes thousands of data points, especially in high-quality audio with a 16 kHz sampling rate, where one second of speech amounts to 16,000 samples.
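The generation loop itself (predict the next character, append it, repeat) can be shown with a much simpler stand-in for the network. The sketch below is not an RNN: it swaps in a table of character-to-next-character counts and always picks the most frequent follower, purely to illustrate character-level prediction. The function names and the training string are assumptions.

```python
from collections import Counter, defaultdict


def train_char_bigrams(text):
    """Count, for every character, which characters follow it and how often."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts


def generate(counts, seed, length):
    """Extend the seed one character at a time with the most frequent follower."""
    out = seed
    for _ in range(length):
        options = counts.get(out[-1])
        if not options:
            break  # no known follower for this character
        out += options.most_common(1)[0][0]
    return out


model = train_char_bigrams("to be or not to be, that is the question")
sample = generate(model, "t", 12)
print(sample)
```

The model treats spaces and punctuation like any other symbol and has no notion of a word. An RNN replaces the count table with a learned hidden state that can take much more than one previous character into account.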
Again, CNNs found a way to handle this successfully. Researchers at DeepMind, a UK-based AI company and research laboratory that was acquired by Google, succeeded in creating WaveNet, a deep generative model of raw audio waveforms. In its simplest terms, WaveNet takes a raw signal as an input and then produces an output one sample at a time. It does so by sampling from a softmax distribution over signal values that are encoded using the μ-law companding transformation and quantized to 256 possible values.
This CNN can accurately model different voices, complete with accents and tones. It is also able to make music if fed with music inputs. It has beaten the conventional baselines for text-to-speech systems. However, even if it is miles ahead of competitors, it still has a long way to go when it comes to real-life uses, as it currently takes up too much processing power to be practical.
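The μ-law step can be sketched as follows, assuming samples normalized to the range [-1, 1] and μ = 255, which yields the 256 levels mentioned above. The function names are illustrative assumptions, and the sketch covers only the encoding and decoding of sample values, not the network that predicts them.

```python
import math

MU = 255  # 255 gives 256 quantization levels (0 through 255)


def mu_law_encode(x, mu=MU):
    """Compress a sample in [-1, 1] and quantize it to an integer in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((compressed + 1.0) / 2.0 * mu + 0.5)


def mu_law_decode(index, mu=MU):
    """Map a quantized level back to an approximate sample in [-1, 1]."""
    compressed = 2.0 * index / mu - 1.0
    magnitude = math.expm1(abs(compressed) * math.log1p(mu)) / mu
    return math.copysign(magnitude, compressed)


for sample in (-1.0, -0.1, 0.0, 0.1, 1.0):
    level = mu_law_encode(sample)
    print(sample, level, round(mu_law_decode(level), 4))
```

The compression spends more of the 256 levels on quiet samples than on loud ones, which is why 256 classes are enough for a softmax to predict a usable audio signal.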
There is a lot of room for improvement, but the current milestones prove we are on the right track.
Deep Learning has come a long way since its early days and the days of Word2Vec. Its use in Natural Language Processing came onto our radar relatively recently because of computational issues, and we needed to understand more than the tip of the iceberg to comprehend neural networks and their capabilities.
DevsData is a software and Machine Learning consulting company from New York City with extensive experience in NLP. Also, if you are interested in other applications of Deep Learning, be sure to read our case study on real-time detection for a military company.