In-context Learning - A New Paradigm in NLP?

published on 08 January 2023
Photo by Markus Spiske: https://www.pexels.com/photo/coding-script-965345/

In-context learning (ICL) is an exciting new paradigm in NLP where large language models (LLMs) make predictions based on contexts augmented with just a few training examples. LLMs are able to extract patterns from the examples provided in the context, and use them to perform many complex NLP tasks.

We'll be discussing recent optimisations of ICL, covering methods for finetuning LLMs for ICL, how to pick good training examples, and how to design great prompts.

So if you're curious about ICL, this is the post for you. Stay until the end to learn more about the mystery of why ICL works so well! So, let's get started!

This blog post is based on the information provided in this paper. We also made a video covering the same material; check it out here.

How does In-context Learning work?

Here is an illustration of how in-context learning works in practice:

Figure from https://arxiv.org/abs/2301.00234. 

ICL requires preparing a context that contains a few examples; in this figure, the task is sentiment analysis. The examples are arranged to follow a template written in natural language, for example "Review: Delicious food!" followed by "Sentiment: Positive". After the examples, a query is inserted into the context. A query is simply an unseen input formatted according to the same template ("Review: Good meal" in the figure).

The full context (also called a prompt in other papers) is presented to a large language model (LLM) that uses the examples to autocomplete the query, resulting in a prediction for the missing sentiment.
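To make this concrete, here is a minimal sketch (in Python) of how such a context could be assembled. The demonstrations and the template are illustrative, loosely following the figure:

```python
# A minimal sketch of assembling an ICL context for sentiment analysis.
# The demonstrations and template below are illustrative examples.
demonstrations = [
    ("Delicious food!", "Positive"),
    ("The food is awful.", "Negative"),
    ("Terrible dishes!", "Negative"),
]

def build_context(demonstrations, query):
    """Format demonstrations and the query with a shared natural-language template."""
    parts = [f"Review: {text}\nSentiment: {label}" for text, label in demonstrations]
    # The query follows the same template, with the label left blank
    # for the LLM to autocomplete.
    parts.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(parts)

print(build_context(demonstrations, "Good meal!"))
```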

ICL works not only for simple tasks such as classification, but even for complex text generation and reasoning tasks, such as summarisation, commonsense reasoning, or even data annotation. The power lies in the flexibility of the approach: the same LLM can be used to tackle multiple tasks just by presenting a different context.

Procedures of in-context learning. 

Now, how does ICL typically work in practice? The figure above illustrates a typical ICL pipeline. Each step in the figure plays a role in ICL and can be optimised to yield better performance. We're going to discuss a few of those steps now.

Pre-training or selection of the Large Language Model (LLM)

The first step is, of course, the pre-training of the LLM, or selecting a pre-trained model to be used out of the box, such as GPT-3 or Bloom. It's important to note that the ICL capacity of LLMs seems to increase as the number of model parameters or the number of pretraining steps increases, so keep that in mind.
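As a quick, hedged example, a small publicly available checkpoint can be loaded with the Hugging Face transformers library and used for ICL out of the box. The bloom-560m checkpoint is chosen here only to keep the sketch lightweight; larger models generally show stronger ICL:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small publicly available checkpoint; larger models tend to
# show stronger ICL abilities.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

context = (
    "Review: Delicious food!\nSentiment: Positive\n\n"
    "Review: Terrible dishes!\nSentiment: Negative\n\n"
    "Review: Good meal!\nSentiment:"
)
inputs = tokenizer(context, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=3)
# Decode only the newly generated tokens (the predicted sentiment).
print(tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:]))
```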

Once you have a pre-trained LLM you want to use, you can either use it as it is, or you can optionally do something called model warmup.

Model warmup

Model warmup fine-tunes the LLM to specifically increase its capacity for in-context learning during inference. Let's look at two ways to do model warmup: supervised in-context training and self-supervised in-context training.

Supervised in-context training fine-tunes the LLM on a dataset containing a broad range of tasks prepared in ICL formats similar to the format you would like to use for your task. The training improves the overall few-shot ICL abilities of the model, because it learns to process diverse ICL examples. Supervised in-context training works best when the tasks and datasets used are close to the domain of the target task the LLM will ultimately be used for.
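Roughly speaking, supervised in-context training boils down to converting labelled data from many tasks into one shared prompt-completion format and fine-tuning on it. A minimal sketch, with invented task names and templates:

```python
# Hypothetical multi-task data converted into a shared ICL-style format.
# Task names, templates, and examples below are invented for illustration.
tasks = {
    "sentiment": ("Review: {x}\nSentiment:", [("Great service!", "Positive")]),
    "topic": ("Headline: {x}\nTopic:", [("Stocks rally on Fed news", "Business")]),
}

def to_finetuning_instances(tasks):
    """Yield (prompt, target) pairs in a shared ICL format across tasks."""
    for template, examples in tasks.values():
        for x, y in examples:
            yield template.format(x=x), f" {y}"

# Each (prompt, target) pair becomes one fine-tuning example: the model is
# trained to produce the target given the ICL-formatted prompt.
for prompt, target in to_finetuning_instances(tasks):
    print(repr(prompt), "->", repr(target))
```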

Self-supervised in-context training, on the other hand, involves constructing self-supervised training data based on the ICL formats of downstream tasks. Basically, you use the frozen LLM to generate some synthetic training data for your target task in the target ICL format, then you fine-tune the LLM on that synthetic training data. This approach has been shown to improve the ICL capacity of the model, since it becomes more focused on the specific format that will be presented during inference.
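One way to picture this: prompt the frozen LLM with a handful of seed demonstrations and let it propose new (input, label) pairs in the same template, then fine-tune on what it generates. A sketch, where llm_generate is a hypothetical helper wrapping the frozen LLM:

```python
def generate_synthetic_pairs(llm_generate, seed_context, n=100):
    """Sketch: use the frozen LLM to propose new examples in the target format.

    `llm_generate` is a hypothetical callable wrapping the frozen LLM:
    it takes a prompt string and returns a completion string.
    """
    pairs = []
    for _ in range(n):
        # Ask the model to continue the demonstration pattern with one more example.
        completion = llm_generate(seed_context + "\n\nReview:")
        text, _, label = completion.partition("\nSentiment:")
        if label.strip():
            pairs.append((text.strip(), label.strip()))
    return pairs

# The resulting pairs are then used to fine-tune the LLM, making it more
# attuned to the exact format it will see at inference time.
```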

Overall, model warmup has the benefit that the model becomes more attuned to ICL, leading to better ICL performance. However, keep in mind that model warmup requires updating the model weights, or adding additional weights to the model. This can be prohibitive for LLMs, since it might require keeping a separate version of the LLM for each target task.

Prompt design

Now, let's talk about another step of ICL, which is prompt design, or demonstration design. This step is very important, and can make or break your model's performance. There are two main areas of prompt design: demonstration organization and demonstration formatting.

First, let's focus on demonstration organization. This involves selecting and then ordering a subset of the training examples you have available for your target tasks. The selected examples will be used as demonstrations to the LLM.

Example selection can be either unsupervised or supervised. Unsupervised methods include using pre-defined metrics such as L2 distance or cosine similarity to select the closest neighbors of the input as demonstrations, or selecting prompts with low perplexity. Supervised methods involve using a scoring language model to evaluate the concatenation of each candidate example and the input, labeling high-scoring candidates as positive examples and low-scoring candidates as hard negatives. Reinforcement learning has also been used to model the example selection process.
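As an example of the unsupervised route, here is a minimal sketch of kNN-style demonstration selection using sentence embeddings and cosine similarity. The sentence-transformers encoder and the candidate reviews are just illustrative choices:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# One plausible sentence encoder; any embedding model would work here.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

candidates = ["Delicious food!", "The waiter was rude.", "Great view, slow service."]
query = "Good meal!"

cand_emb = encoder.encode(candidates)
query_emb = encoder.encode([query])

# Pick the k candidates most similar to the query as demonstrations.
k = 2
scores = cosine_similarity(query_emb, cand_emb)[0]
selected = [candidates[i] for i in scores.argsort()[::-1][:k]]
print(selected)
```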

The ordering of the selected examples can also play a big role. Previous studies have proposed training-free methods for sorting examples, such as sorting by distance to the input or using entropy-based metrics.
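One such heuristic explored in prior work, shown as a hedged sketch below, is to order the selected demonstrations by similarity so that the most similar example ends up closest to the query in the prompt. The similarity scores here are made up for illustration:

```python
# One training-free heuristic: place the most similar demonstration last,
# i.e. closest to the query in the prompt.
selected = ["Great view, slow service.", "Delicious food!"]
scores = [0.44, 0.81]  # illustrative similarity scores to the query
ordered = [demo for _, demo in sorted(zip(scores, selected))]
print(ordered)  # least similar first, most similar last
```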

Let's move on to demonstration formatting. This involves the design of the prompt itself, including its language and structure. One aspect of formatting is the instructions included in the prompt. Good instructions are key to good ICL performance, but they can be hard to come by because they rely on human-written sentences. LLMs can actually help generate task instructions on their own, given several demonstration examples. Methods like Automatic Prompt Engineer (APE) use LLMs for the automatic generation and selection of instructions.
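The core loop of an APE-style method can be sketched as follows: ask the LLM to propose candidate instructions from a few demonstrations, then keep the candidate that scores best on held-out examples. Here, llm_generate and score_instruction are hypothetical helpers standing in for the LLM call and the evaluation:

```python
def ape_select_instruction(llm_generate, score_instruction,
                           demonstrations, n_candidates=10):
    """APE-style sketch: generate candidate instructions, keep the best one.

    `llm_generate` and `score_instruction` are hypothetical helpers wrapping
    the LLM call and a held-out evaluation of an instruction, respectively.
    """
    demo_text = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in demonstrations)
    meta_prompt = (
        f"{demo_text}\n\n"
        "The instruction that produced these input-output pairs was:"
    )
    candidates = [llm_generate(meta_prompt) for _ in range(n_candidates)]
    # Keep the instruction that scores best on held-out examples.
    return max(candidates, key=score_instruction)
```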

In summary, prompt design is crucial to ICL success and there are many approaches to optimize different aspects of the prompts to boost performance. Optimising each of the prompt components is likely to have a big impact on the performance and robustness of ICL, so make sure you don't skip that.

Performance of ICL

On traditional tasks and benchmarks such as SuperGLUE and SQuAD, there is still some room for improvement for ICL compared to fine-tuning, although the gap is narrowing. This is perhaps due to the fact that fine-tuned methods are trained on the whole training dataset, unlike ICL, which only sees a few demonstrations.

The intrinsic capabilities of language models have prompted researchers to propose new, more challenging benchmarks that fit better with the few-shot nature of ICL. There have been promising results on benchmarks such as BIG-bench, where ICL methods outperform human-rater results on 65% of tasks.

Why does ICL work so well?

Given all of these capabilities of ICL, it is interesting to ask the question - Why does ICL work so well? We are, after all, getting this amazing capacity "for free" as a result of the causal LM training objective. How are LLMs able to do this?

There have been a few studies aiming to uncover this in the literature.

One factor that might play a role is the distribution of the training data. When training on a very large dataset, the ICL ability of LLMs seems to emerge when the data appears in clusters and there are a sufficient number of rare classes present.

Another factor is that Transformer models might be learning to encode learning algorithms implicitly during the training process, due to the properties of their architecture. During inference, transformer LLMs might be performing an implicit finetuning using the provided examples in the context. This is certainly very fascinating.

Conclusion

So there you have it: an in-depth overview of in-context learning in 2023. It is certainly a very exciting field, and we are looking forward to seeing how it evolves. If you're looking for more details, check out this review paper.

If you are looking to start a project in NLP using Large Language Models such as GPT-3, and would like to get advice from an advanced team of experts with PhDs that have been working in the area for 5+ years, you should get in touch with us at The Global NLP Lab. Looking forward to talking to you! 
