XLNet — A new pre-training method outperforming BERT on 20 tasks

In 2018, Google published BERT, a bidirectional, transformer-based pre-trained large-scale language model that broke 11 state-of-the-art records in Natural Language Processing and brought great excitement to the NLP field.

Very quickly, BERT spread like wildfire through the research community, and derivative research work started to emerge.

While the shockwaves BERT created have yet to calm down, a brand-new model emerged today.

Researchers from Carnegie Mellon University and Google Brain propose a new pre-trained language model, XLNet, which surpasses BERT on 20 tasks such as SQuAD, GLUE, and RACE.

So, what improvements does XLNet have over BERT?

According to the authors, a pre-training model based on a denoising autoencoder (like BERT) can capture bidirectional context, giving it better performance than pre-training methods based on autoregressive language models. However, because part of the input must be masked out, BERT ignores the dependencies between the masked positions and also suffers from a pretrain-finetune discrepancy: the [MASK] tokens seen during pre-training never appear during fine-tuning.

Weighing these trade-offs, this study proposes XLNet, a generalized autoregressive pre-training model.

In short, XLNet can

  1. Learn bi-directional context information by maximizing the log likelihood of all possible factorization sequences;
  2. Overcome the shortcomings of BERT with the characteristics of auto-regression.
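The first point can be made concrete with a small sketch. The code below is not XLNet itself but a toy illustration of the permutation objective: it estimates the expected chain-rule log-likelihood of a sequence over uniformly sampled factorization orders, using a caller-supplied conditional model (the function names and the uniform toy model are assumptions for illustration).

```python
import math
import random

def sequence_log_likelihood(tokens, cond_log_prob, order):
    """Chain-rule log-likelihood of `tokens` under a factorization `order`.

    `order` is a permutation of position indices; the token at order[t]
    is predicted from the tokens at order[:t] (its context in this order).
    """
    total = 0.0
    for t, pos in enumerate(order):
        context = {order[k]: tokens[order[k]] for k in range(t)}
        total += cond_log_prob(tokens[pos], pos, context)
    return total

def permutation_lm_objective(tokens, cond_log_prob, num_samples=4, seed=0):
    """Monte-Carlo estimate of the expected log-likelihood over sampled
    factorization orders (the XLNet-style permutation objective)."""
    rng = random.Random(seed)
    positions = list(range(len(tokens)))
    total = 0.0
    for _ in range(num_samples):
        order = positions[:]
        rng.shuffle(order)  # sample one factorization order uniformly
        total += sequence_log_likelihood(tokens, cond_log_prob, order)
    return total / num_samples
```

Because each position appears at every point of some factorization order, in expectation it is predicted from contexts containing tokens on both its left and its right, which is how bidirectional context emerges from a purely autoregressive objective.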

In addition, XLNet incorporates ideas from Transformer-XL, the current best autoregressive model.

Finally, XLNet surpassed BERT’s performance on 20 tasks and achieved state-of-the-art on 18 tasks, including machine QA (question-answering), NLI (natural language inference), sentiment analysis, and document ranking.


Many of the models that surpassed BERT in the last few months are modifications built on top of it; in essence, neither the model architecture nor the training tasks changed much. In this new paper, however, the authors analyze current pre-trained language models through the two paradigms of autoregressive (AR) and autoencoding (AE) modeling, and find that while each has its own advantages, both face challenges that are difficult to solve. To this end, the researchers propose XLNet in hopes of combining the best attributes of both approaches.

Faced with the pros and cons of existing language pretraining objectives, in this work, we propose XLNet, a generalized autoregressive method that leverages the best of both AR language modeling and AE while avoiding their limitations.

  • Firstly, instead of using a fixed forward or backward factorization order as in conventional AR models, XLNet maximizes the expected log likelihood of a sequence w.r.t. all possible permutations of the factorization order. Thanks to the permutation operation, the context for each position can consist of tokens from both left and right. In expectation, each position learns to utilize contextual information from all positions, i.e., capturing bidirectional context.
  • Secondly, as a generalized AR language model, XLNet does not rely on data corruption. Hence, XLNet does not suffer from the pretrain-finetune discrepancy that BERT is subject to. Meanwhile, the autoregressive objective also provides a natural way to use the product rule for factorizing the joint probability of the predicted tokens, eliminating the independence assumption made in BERT.

In addition to a novel pretraining objective, XLNet improves architectural designs for pretraining.

  • Inspired by the latest advancements in AR language modeling, XLNet integrates the segment recurrence mechanism and relative encoding scheme of Transformer-XL into pretraining, which empirically improves the performance especially for tasks involving a longer text sequence.
  • Naively applying a Transformer(-XL) architecture to permutation-based language modeling does not work because the factorization order is arbitrary and the target is ambiguous. As a solution, the authors propose to reparameterize the Transformer(-XL) network to remove the ambiguity.
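The segment recurrence mechanism borrowed from Transformer-XL can be sketched in a few lines. The snippet below is a simplified single-head illustration, not the paper's implementation: the previous segment's hidden states are cached and reused as extra keys and values for the next segment, extending the effective context beyond one segment (function names and shapes are assumptions for this sketch; relative positional encoding and gradient stopping are omitted).

```python
import numpy as np

def attend(q, k, v):
    """Plain scaled dot-product attention (single head, no masking)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def segment_recurrent_layer(h_current, memory):
    """One layer with Transformer-XL-style segment recurrence.

    The previous segment's cached hidden states (`memory`) are
    concatenated to the current segment along the length axis and used
    as extra keys/values; queries come only from the current segment.
    In the real model the memory is treated as a constant (no gradient).
    """
    if memory is None:
        kv = h_current
    else:
        kv = np.concatenate([memory, h_current], axis=0)
    out = attend(h_current, kv, kv)
    new_memory = h_current.copy()  # cached for the next segment
    return out, new_memory
```

Processing segments sequentially and threading the returned memory through each call is what lets tokens in one segment attend to hidden states computed for the previous one, which is why the mechanism helps especially on long-text tasks.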


To better understand the difference between BERT and XLNet, let’s consider a concrete example:

[New, York, is, a, city]

Suppose both BERT and XLNet select the two tokens [New, York] as the prediction targets and maximize log p(New York | is a city). Also suppose that XLNet samples the factorization order [is, a, city, New, York].

In this case, BERT and XLNet respectively reduce to the following objectives:

    J_BERT  = log p(New | is a city) + log p(York | is a city)
    J_XLNet = log p(New | is a city) + log p(York | New, is a city)

Notice that XLNet is able to capture the dependency between the pair (New, York), which is omitted by BERT. Although in this example, BERT learns some dependency pairs such as (New, city) and (York, city), it is obvious that XLNet always learns more dependency pairs given the same target and contains “denser” effective training signals.
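To make the comparison concrete, here is a toy computation of the two objectives from the [New, York, is, a, city] example. The conditional probabilities are made-up numbers chosen only to illustrate the structural difference (the dictionary and function names are assumptions, not anything from the paper).

```python
import math

# Hypothetical conditional log-probabilities for the example tokens.
LOGP = {
    ("New", ("is", "a", "city")): math.log(0.20),
    ("York", ("is", "a", "city")): math.log(0.10),
    ("York", ("New", "is", "a", "city")): math.log(0.60),
}

def bert_objective():
    # BERT predicts each masked token independently given the unmasked context.
    return (LOGP[("New", ("is", "a", "city"))]
            + LOGP[("York", ("is", "a", "city"))])

def xlnet_objective():
    # With factorization order [is, a, city, New, York], XLNet predicts
    # York conditioned on New as well, via the product rule.
    return (LOGP[("New", ("is", "a", "city"))]
            + LOGP[("York", ("New", "is", "a", "city"))])
```

The only difference is the second term: XLNet's chain-rule factorization lets the prediction of "York" see "New", which is exactly the (New, York) dependency that BERT's independence assumption drops.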

Results comparison on machine comprehension

A single-model XLNet outperforms humans and the best ensemble models on SQuAD 1.1 and 2.0.

RACE is a reading comprehension dataset built from middle and high school English exams.

On the RACE dataset, XLNet outperforms the best ensemble by 7.6 points in accuracy. That is the equivalent of a student improving from a B to an A!


Pretrained Models and Code

Progress in Natural Language Processing research has been accelerating in 2019. Very soon, we will have machines that can understand our language better than we can.

Are you ready for the revolution?
