Machine Learning

Recent Progress in Language Modeling

This page is a high-level summary of, and notes on, various recent results in language modeling, with little explanation

Ran Ding

Oct 9, 2018 • 3 min read

Overview

This page is a high-level summary of, and notes on, various recent results in language modeling, with little explanation. The papers covered are as follows:

[1] AWD Language Model

Merity, Stephen, Nitish Shirish Keskar, and Richard Socher. "Regularizing and optimizing LSTM language models." arXiv preprint arXiv:1708.02182 (2017).

[2] Neural Cache

Grave, Edouard, Armand Joulin, and Nicolas Usunier. "Improving neural language models with a continuous cache." arXiv preprint arXiv:1612.04426 (2016).

[3] Dynamic Evaluation

Krause, Ben, et al. "Dynamic evaluation of neural sequence models." arXiv preprint arXiv:1709.07432 (2017).

[4] Memory-based Parameter Adaptation (MbPA)

Sprechmann, Pablo, et al. "Memory-based parameter adaptation." arXiv preprint arXiv:1802.10542 (2018).

[5] Hebbian Softmax

Rae, Jack W., et al. "Fast Parametric Learning with Activation Memorization." arXiv preprint arXiv:1803.10049 (2018).

[6] Higher-rank LM / Mixture-of-Softmax (MoS)

Yang, Zhilin, et al. "Breaking the softmax bottleneck: A high-rank RNN language model." arXiv preprint arXiv:1711.03953 (2017).

This is by no means an exhaustive literature review; it is only a selection of a few of the most recent state-of-the-art results. AWD LM [1] has become almost the de facto baseline LM for many of the other papers. Its main innovations are a special version of averaged SGD (ASGD) and DropConnect-based weight-dropping regularization of the hidden-to-hidden mapping of an LSTM model.
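To make the weight-dropping idea concrete, here is a minimal sketch in plain Python. It is not the code from [1]: the function name `weight_drop`, the toy matrix `W_hh`, and the rescaling of surviving weights by 1/(1 - p) (the usual inverted-dropout convention) are assumptions made for illustration. The point it shows is that the mask is applied to the recurrent weights themselves rather than to the activations.

```python
import random

def weight_drop(matrix, drop_prob, rng):
    """DropConnect-style mask on a weight matrix (illustrative helper).

    Each weight is zeroed independently with probability drop_prob.
    Surviving weights are divided by (1 - drop_prob) so that the expected
    value of every entry is unchanged (assumed inverted-dropout convention).
    """
    keep = 1.0 - drop_prob
    return [
        [(w / keep) if rng.random() < keep else 0.0 for w in row]
        for row in matrix
    ]

# Toy 3x3 hidden-to-hidden matrix; the values are made up.
W_hh = [
    [0.5, -0.2, 0.1],
    [0.3, 0.8, -0.6],
    [-0.4, 0.7, 0.9],
]

rng = random.Random(0)
# Sample one mask and reuse the masked matrix for every time step of the
# sequence being processed, instead of resampling at each step.
W_hh_dropped = weight_drop(W_hh, drop_prob=0.5, rng=rng)
```

At evaluation time the unmasked `W_hh` would be used, which is the same as calling the helper with `drop_prob=0.0`.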

It has been found that a global LM is ineffective at reacting to local patterns at test time: for example, once a rare word appears, its reappearance nearby is much more likely than a global LM predicts. To allow faster reaction to local patterns, [2-5] propose various schemes that add a fast-learning non-parametric component and blend its predictions or parameters with those of the globally learned parametric LM. A quick comparison of these four papers is in the table below.

Ref	Method	Modifications to training?	Adaptation needed at test time?
[2]	Keeps a key-value store whose keys are previous (fixed-size) output hidden states and whose values are the correct labels. This non-parametric cache provides a local LM based on nearest-neighbor lookup, which is then interpolated with the global LM for the final prediction.	No	No
[3]	Similar to [2], but instead of doing a nearest-neighbor lookup over saved hidden states, it fits recent history with gradient descent, yielding a slightly adjusted model; i.e., the parameters, not just the predictions, are adapted to recent history. One concern I would have is whether the continuous adaptation would let the model drift too far from the initially trained model.	No	Yes
[4]	Similar to [3], but the test-time gradient descent produces a local model that is discarded after being used for prediction; i.e., unlike [3], the change in parameters due to local memory does not carry over to the next time step. This makes it quite closely related to meta-learning. Another minor point: the gradient descent does not go through the full network but stops at the so-called embedding layer, which is usually a layer close to the output that extracts fairly abstract features.	No	Yes
[5]	Recent output hidden states are accumulated into one vector using an exponential moving average, which is then written directly into the parameter matrix of the output linear mapping. Two sets of update rules are used during training. The non-parametric learning is tapered off as words are seen more frequently. Unlike [2-4], this method incorporates fast learning at training time, not just fast adaptation at test time.	Yes	No

Table 1. Comparison of methods in Ref [2-5]
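As a rough illustration of the cache mechanism in [2], the sketch below builds a cache distribution from stored (hidden state, next word) pairs and interpolates it with a global LM distribution. It uses only the Python standard library. The function names, the toy vectors, and the values chosen for the flatness parameter `theta` and the interpolation weight `lam` are assumptions for the example, not values from the paper.

```python
import math

def cache_probs(keys, values, query, vocab_size, theta):
    """Cache distribution over the vocabulary.

    keys:   stored hidden states, one per past time step
    values: word id that actually followed each stored hidden state
    query:  current hidden state
    Each stored entry gets weight exp(theta * <query, key>); the weights are
    normalized and summed per word id.
    """
    scores = [theta * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    probs = [0.0] * vocab_size
    for weight, word_id in zip(weights, values):
        probs[word_id] += weight / total
    return probs

def interpolate(p_global, p_cache, lam):
    """Linear interpolation: (1 - lam) * global LM + lam * cache LM."""
    return [(1.0 - lam) * g + lam * c for g, c in zip(p_global, p_cache)]

# Toy example with a 4-word vocabulary; all numbers are made up.
keys = [[1.0, 0.0], [0.0, 1.0]]   # two stored hidden states
values = [2, 3]                   # the words that followed them
query = [1.0, 0.0]                # current hidden state, close to the first key
p_global = [0.4, 0.3, 0.2, 0.1]   # prediction of the global LM

p_cache = cache_probs(keys, values, query, vocab_size=4, theta=1.0)
p_final = interpolate(p_global, p_cache, lam=0.5)
```

Because the query matches the first stored state, word 2 receives most of the cache mass, so its final probability ends up higher than the global LM alone would give it. No gradient step is involved, which is why [2] needs neither training changes nor test-time adaptation.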

Finally, [6] highlights, and largely solves, a fairly general problem: a softmax over logits produced by a product of rank-limited matrices, which is common in the decoder of an LM.
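The mixture-of-softmaxes idea can be sketched as follows, assuming K context vectors per position and one prior logit per mixture component. The function names and toy numbers are illustrative and are not taken from [6]. Each component is an ordinary softmax over dot-product logits; the components are then averaged in probability space with softmax-normalized mixture weights, so the resulting log-probabilities are no longer tied to a single low-rank logit matrix.

```python
import math

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mixture_of_softmaxes(context_vectors, prior_logits, word_embeddings):
    """Weighted average of K softmax distributions over the vocabulary.

    context_vectors: K vectors, one per mixture component
    prior_logits:    K unnormalized mixture weights
    word_embeddings: one output embedding per vocabulary word
    """
    mixture_weights = softmax(prior_logits)
    mixed = [0.0] * len(word_embeddings)
    for pi_k, h_k in zip(mixture_weights, context_vectors):
        component = softmax([dot(h_k, w) for w in word_embeddings])
        for i, p in enumerate(component):
            mixed[i] += pi_k * p
    return mixed

# Toy example: 3-word vocabulary, 2-dimensional embeddings, K = 2.
word_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context_vectors = [[2.0, 0.0], [0.0, 2.0]]
prior_logits = [0.0, 0.0]  # equal mixture weights

p_mos = mixture_of_softmaxes(context_vectors, prior_logits, word_embeddings)
# With a single component the function reduces to a plain softmax decoder.
p_single = mixture_of_softmaxes(context_vectors[:1], prior_logits[:1], word_embeddings)
```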