Natural language processing (NLP) is a branch of science sitting at the intersection of computer science, artificial intelligence, and computational linguistics. To put it simply, it gives computers the ability to read, understand and generate text and speech just like we do. Within the last decade, NLP based products have become ubiquitous. Here are some common NLP applications/products you might be aware of:
- Amazon Alexa or Siri
- Voice input in Google or Apple Maps
- Conversational chatbots
- Google or Siri Translate
From a computer vision scientist’s perspective, NLP is becoming increasingly important as time goes by. From the use of pre-trained embeddings to vision transformers, we are starting to observe how architectures used in both fields have concepts in common. For all aspiring AI practitioners, it is important to know the basic algorithms in both fields.
NLP can be broadly broken down into 4 key areas:
- Language Modeling
Language modeling is the task of assigning a probability to sentences in a language. Besides assigning a probability to each sequence of words, the language model also assigns a probability for the likelihood of a given word (or a sequence of words) to follow another sequence of words. 
Morphology is the study of words, how they are formed, and their relationship to other words in the same language. It analyzes the structure of words and parts of words such as stems, root words, prefixes, and suffixes. 
- Parsing / Syntactic Analysis
Syntactic analysis, also referred to as syntax analysis or parsing, is the process of analyzing natural language with the rules of a formal grammar. Grammatical rules are applied to categories and groups of words, not individual words, thereby assigning a semantic structure to text. 
- Semantic Analysis
Semantic analysis refers to the process of understanding the meaning behind sentences. it is a study of what words mean when they are put together in a sentence. It takes into account the individual meaning of words, how they relate and modify neighboring words, and the context these words appear in.
Modern day NLP mostly relies on deep neural networks(DNN) for predictions. NLP engineers have to make 3 important choices before they train any neural network:
- Defining the “Task”
- Choosing word embeddings
- Choosing the neural network architecture
Defining the “Task”
Here are some broad categories defined in the AllenNLP model library. Code + paper + weights are available at Papers with Code website.
Here are some of the most common tasks:
a. Classification : Predicting one or more labels for each input. For ex. Sentiment analysis
b. Coreference Resolution: The goal is to find all expressions that related to the same entity.
c. Generation : Generating unstructured text of variable length. For ex, GPT-3
d. Language Modeling : Learning a probability distribution over sequence of tokens.
e. Sequence Tagging : Information extraction task where each word in a sentence is classified with a pre-defined label set.
PS: This is not an exhaustive list.
Choosing word embeddings
NLP deals with massive vocabularies, ranging from 100k to millions of words. It would be inefficient to train a network if each word is represented with a vector of size V (where V is the size of this vocabulary). Word embeddings solve this problem as they reduce the dimensions of the encoding (compared to V), while adding meaning to the vectors. Here are some common word embeddings:
This is a neural network layer with weight dimensions (V x N), where V is the size of the vocabulary, and N is the embedding dimension. It is initialized with random values and the word embeddings are learnt during the training phase. [Detailed Explanation]
Word2Vec was developed by Mikolov et al. paper in 2013. This technique uses the local context around a word to develop a meaningful representation for it. It has two flavors:
- Continuous Bag Of Words Model
- Skip-gram Model
More details can be found here
GloVe (Global Vectors for Word Representation)
GloVe was developed by Manning et al. At Stanford University. The main intuition underlying the model is the simple observation that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning. More details can be found here
- ELMO (Embeddings from Language Models)
All the embeddings discussed so aim to provide 1 embedding per word, but we know that the meaning of a word changes with context. To solve this problem, ELMO trains custom deep embeddings for each word, based on the context/usage. For example, the word “right” will have different embeddings for the following sentences:
- Take the next right.
- You are right about that problem.
- BERT (Bidirectional Encoder Representations from Transformers)
Both Elmo and Bert are deep embeddings which involve training wider/deeper neural networks on a big corpus. The key difference between the two lies in the training objective.
- ELMO uses the traditional Language Modeling training objective to generate the next word given previous words in the sentence. It tries to calculate P(Xi| Xi-1, Xi-2, Xi-n) such that the
resulting distribution mimics the distribution in the training corpus.
- Bert uses the ‘masked’ lanuage modeling objective. It randomly masks words (with some probability) in the training corpus, and uses the surrounding words to predict the missing word.
- ELMO uses the traditional Language Modeling training objective to generate the next word given previous words in the sentence. It tries to calculate P(Xi| Xi-1, Xi-2, Xi-n) such that the resulting distribution mimics the distribution in the training corpus.
Choosing the neural network architecture
There are two classes of architectures we should look at:
Traditional Sequence models
The most commonly used sequence models are :
a. RNN (Recurrent neural networks)
Recurrent neural networks are a class of networks that consume sequential inputs. They update the current state(memory) based on the previous state and the current input.
b. LSTM (Long short term memory) networks
LSTM is a type of RNN that gets rid of the vanishing gradient problem by regulating the gradient flow using gates.
c. GRU (Gated recurrent unit) networks
GRU is a simplified version of LSTM, with only 2 gates instead of 3. It’s faster to train as compared to LSTMs and provides similar performance metrics on a variety of datasets.
A step by step explanation on these models can be found at Colah’s Blog.
Transformer networks are one of the most widely used architectures in NLP over the last 4 years. The architecture deploys a novel use of the familiar concepts of ‘self attention’ , ‘multi head attention’ and positional encodings inside an encoder-decoder design , thereby enhancing the performance of the model. GPT-3 uses a modified version of this architecture and almost passed the turing test for short articles !
Please refer to this blog for more details on transformers.
Paid APIs for NLP
The API industry for NLP has really blossomed over the years. You can find APIs for all the tasks listed here. Here are a few APIs that I came across:
- Google Cloud
- NLP Cloud
- IBM Watson
Python Libraries for NLP
I’ve been experimenting with AllenNLP, which provides pre-built models, encoder/decoder blocks, vocabulary builder, indexer, and tokenizers. It’s a layer on top of Pytorch and seems pretty intuitive. Moreover, the documentation provided is sufficient to get started with, which isn’t the case with TorchText as of now. Other than that, they don’t differ a lot. Here are some open source python libraries which can be used for model training and inference:
- TorchText : Torch Library
- Spacy : This is equivalent to ‘scipy’ for NLP !
- HuggingFace: primarily used for transformers, sharing open-source models, datasets and metrics
- AllenNLP: research framework built on pytorch
This concludes this short intro to NLP. I’ll try to snipe at certain topics and provide a detailed analysis in the upcoming articles. Thanks for reading!
- Page 105, Neural Network Methods in Natural Language Processing, 2017.
- Introduction to NLP
- A Survey of the Usages of Deep Learning for Natural Language Processing
- Allen Institute for AI
- The Illustrated BERT, ELMo, and co.