BERTology: Transfer Learning in Natural Language Processing


initial_googleai
BERT: Google AI

Representation learning is of prime importance in machine learning; it is ultimately the representation of the input data that plays a significant role in determining a model’s performance. There are scenarios where learning good representations of the input data requires transferring knowledge from related but different tasks. This is where transfer learning comes in: it lets us reuse a more generalizable set of features and fine-tune them for the downstream task. To learn more about transfer learning, please go through my blog linked here.

In my recent NLP work, I have been using BERT and related models for downstream tasks like text classification on domain-specific data. I am writing this blog to introduce you to the recent boom in transfer learning in natural language processing brought about by architectures like BERT.

I will focus on describing BERT and talk about using it in your own work with Hugging Face.

Motivation

transformers
Transformers: Image from Jay Alammar

The 2017 paper “Attention Is All You Need” by Vaswani et al. at Google introduced the Transformer and brought about a revolution in deep learning for natural language processing, outperforming traditional recurrent neural network architectures like LSTMs and GRUs. In general, Transformers handle long-range dependencies in text better than LSTMs, which is why they perform significantly better than LSTM-based architectures on tasks like machine translation.

Transformers are the reason for the birth of BERT and the series of architectures that followed. Motivated by developments in computer vision, NLP researchers were aiming to create a model capable of generating better input text representations that could be used directly for downstream tasks.

Using the decoder layers of the Transformer, we can train it for language modeling, i.e., the next-word prediction task. We can train a stack of Transformer decoders on a massive corpus of data and allow it to learn the “context” of the language. This trained decoder stack can then be used to obtain better representations of the input data for downstream tasks like text classification.
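This is essentially the idea behind GPT-style models. As an illustration (not part of the original post's code), here is a minimal sketch of next-word prediction with a pre-trained Transformer decoder stack via Hugging Face, assuming a recent transformers version; the gpt2 checkpoint and the prompt are just arbitrary choices:

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load a pre-trained Transformer decoder stack (GPT-2) and its tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Score the next token for an arbitrary prompt.
inputs = tokenizer("Transfer learning in NLP is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# The logits at the last position give the distribution over the next word.
next_token_id = logits[0, -1].argmax().item()
print(tokenizer.decode([next_token_id]))
```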

Entry of BERT

bert_pretraining
BERT Pre-Training and Fine-Tuning Procedure

Well, everything is fine until now. But there is a catch. The Transformer decoder architecture mentioned above doesn’t learn “bidirectional” contextualized embeddings, making it less expressive in this respect than ELMo (a bidirectional-LSTM-based architecture for generating dynamic contextualized embeddings).

Well, BERT asked: why not combine the ideas of both ELMo and Transformers?

BERT stands for Bidirectional Encoder Representations from Transformers. It uses the Transformer encoder and trains a “masked language model”: 15% of the input tokens are randomly masked, and the task is to predict these masked tokens correctly.
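To make this concrete, here is a minimal sketch (not from the original post) of querying a pre-trained masked language model through Hugging Face, assuming a recent transformers version; the bert-base-uncased checkpoint and the example sentence are illustrative choices:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Mask one token and let BERT fill it in.
text = "Transfer learning is [MASK] in natural language processing."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (batch, seq_len, vocab_size)

# Find the position of the [MASK] token and take the most likely prediction.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
predicted_id = logits[0, mask_pos].argmax().item()
print(tokenizer.decode([predicted_id]))
```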

BERT introduces another task in pre-training – “next sentence prediction” – for incorporating sentence-level knowledge. The task here is to determine whether the second sentence follows the first one. For doing so, BERT introduces its own special way of tokenization. BERT’s first token is the [CLS] token, whose representation carries the information used for the next-sentence classification task. Apart from this, the two sentences are separated by a special [SEP] token.
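As an illustrative sketch (again, not from the original post), this is how the pre-trained next-sentence-prediction head can be queried with Hugging Face, assuming a recent transformers version; the sentence pair is arbitrary, and the [CLS]/[SEP] tokens are inserted automatically by the tokenizer:

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

sentence_a = "I went to the store."
sentence_b = "I bought some milk and bread."

# The tokenizer builds: [CLS] sentence_a [SEP] sentence_b [SEP]
encoding = tokenizer(sentence_a, sentence_b, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits  # shape: (batch, 2)

# Index 0 = "sentence B follows sentence A", index 1 = "B is a random sentence".
is_next_prob = torch.softmax(logits, dim=-1)[0, 0].item()
print(f"Probability that B follows A: {is_next_prob:.2f}")
```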

In addition to single-sentence classification and single-sentence tagging tasks, this next-sentence-prediction pre-training allows BERT to solve sentence-pair problems. These include sentence-pair classification tasks like natural language inference as well as question answering.

bert_types
BERT Paper showing different tasks

USING BERT

If you have been wondering, “When do I get to learn how to use BERT in my project?”, then thanks for hanging on until this point. Here we will talk about two popular ways of using BERT:

  1. For Generating Contextualized Embeddings
  2. For Fine-Tuning.

Before diving into it, let us set up our environment for running BERT.

We will use both PyTorch and TensorFlow 2.0 with Hugging Face for running BERT. Hugging Face is an NLP startup which releases open-source, state-of-the-art NLP models that can be used by anyone.

huggingface
Hugging Face

These models run on top of deep learning libraries like PyTorch or TensorFlow 2.0.

With this in mind, let us install the transformers package, which contains the actual implementations of BERT and related models: pip install transformers

For installing PyTorch -> pip install torch

For installing TensorFlow 2.0 -> pip install tensorflow

(You could use conda install as well, depending on your virtual environment. Note: using Anaconda is highly recommended.)
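As a quick sanity check (my own addition, not part of the original post), you can verify that everything installed correctly by importing the libraries and printing their versions:

```python
# Verify the installation; drop the torch or tensorflow import
# if you only installed one of the two backends.
import torch
import tensorflow as tf
import transformers

print("PyTorch:", torch.__version__)
print("TensorFlow:", tf.__version__)
print("Transformers:", transformers.__version__)
```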

Here, I will highlight the main steps involved in running BERT; the full code is made available on GitHub, along with a dataset to experiment on, here.

TOKENIZATION:

tokenization
Tokenization Procedure in BERT

After data pre-processing, tokenization is the first step of almost every natural language processing task. We break sentences into word/sub-word chunks and use them to generate the features that will train the model.

Hugging Face Transformers comes with its own tokenizer, which varies depending on the BERT model you are using.

The entire process of tokenization can be divided into three parts:

  1. Converting the sentence/sentence pair into numerical tokens.
  2. Padding the sequences to a maximum length.
  3. Defining the attention masks and the token_type_ids.

Attention masks – These indicate whether a token corresponds to a real word or to padding. Real tokens are marked with 1, and padding tokens are marked with 0.

Token_type_ids – These are special ids that distinguish between the two sentences in the case of a sentence-pair input. The first sentence, from the [CLS] token up to the first [SEP], is represented by 0, and the second sentence, up to the final [SEP], is represented by 1.

code
Code Snippet for automatic Tokenization. For manual tokenization check my GitHub link
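Since the snippet above is an image, here is a minimal sketch of what the automatic tokenization step looks like (assuming a recent transformers version and the bert-base-uncased checkpoint; the sentences and max_length are arbitrary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentence_a = "The movie was surprisingly good."
sentence_b = "I would definitely watch it again."

# One call adds the special tokens, pads, and builds attention masks and token_type_ids.
encoding = tokenizer(
    sentence_a,
    sentence_b,
    max_length=32,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)

print(encoding["input_ids"])       # numerical tokens, padded to max_length
print(encoding["attention_mask"])  # 1 for real tokens, 0 for padding
print(encoding["token_type_ids"])  # 0 for sentence A, 1 for sentence B
```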

Having finished the tokenization phase, we can now either train a classifier on top of the embeddings obtained from BERT, or fine-tune the entire BERT architecture for about 2-4 epochs. In general, fine-tuning is observed to give better performance than using the embeddings directly.
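As a rough illustration of the fine-tuning route (a minimal sketch with made-up data, not the code from my repository), the core PyTorch training loop looks roughly like this:

```python
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy labelled data; in practice this comes from your domain-specific dataset.
texts = ["great product, works well", "terrible, broke after a day"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):  # 2-4 epochs is usually enough
    optimizer.zero_grad()
    outputs = model(**batch, labels=labels)  # passing labels makes the model return a loss
    outputs.loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss = {outputs.loss.item():.4f}")
```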

The entire code for this can be found in my GitHub repository here. Please do check it out.

The repository contains two scripts:

  1. bert_embeddings.py (using the embeddings for classification; coded with Hugging Face and TensorFlow 2.0)
  2. bert_fine_tune.py (fine-tuning BERT on a downstream task; coded with Hugging Face and PyTorch)

References and Acknowledgements:

  1. Jay Alammar’s blog posts on the Transformer and BERT
  2. Vaswani et al., “Attention Is All You Need” (2017)
  3. Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (2018)