Should you use a large language model to solve your NLP problem? 4 points to consider in 2023

published on 25 January 2023
Photo by Towfiqu barbhuiya on Unsplash.

The field of NLP has seen rapid advancements in recent years. Large causal language models such as GPT-3 have been at the forefront of this progress, and have been behind the buzz of recent systems such as ChatGPT. These models are trained to predict the most likely next token using huge amounts of raw text from across the whole internet.

The hallmark of these models is their ability to perform in-context learning: they can be primed to solve virtually any NLP task using just a few training examples through a carefully constructed input prompt.

The impressive results lead to the natural question:

When should we use large language models (LLMs) to solve our NLP problem? And when should we shy away from LLMs and stick to simpler, more established methods?

In this article, we will cover 4 important points you should consider before switching to LLMs for your use case. So, let's get started!

We also made a video on this topic; you can check it out here.

1. Task Complexity

The first area to consider is the complexity of the task you're trying to solve. Large language models can perform reasonably well on a wide range of NLP tasks. However, if your task is relatively simple, using a large language model is probably overkill. Examples include identifying the named entities in a piece of text or predicting the sentiment of a restaurant review.

These tasks are well studied, and there are a lot of pre-trained models that can solve them. Instead of testing an LLM, your first step should be to try existing models provided through libraries such as spaCy and Hugging Face. Those libraries are very easy to get started with, and they provide tutorials on how to fine-tune one of their models on your data.

On the other hand, if your task is much more complex, involving text generation or complex reasoning, a large language model is definitely worth looking into, especially if you have little to no training data.

You will probably have to look into the largest language models provided by OpenAI, such as the latest Davinci model, alternative LLM vendors such as Cohere, or open-source alternatives such as BLOOM. These models are easy to use through APIs, and there are a lot of tutorials online to help you get started. LLMs will give you a great starting point, and will allow you to quickly develop a proof-of-concept prototype for your target use case. However, there are many other aspects to consider, especially if you are planning to develop a production application.

2. Data Availability

The second area to consider is the availability of training data for your target task. If you have a large amount of high-quality, task-specific training data available in the target domain and language you are focusing on, then your best bet is probably to fine-tune an existing pre-trained model on your data. For example, if your task is related to text classification, a good approach is to fine-tune a pre-trained BERT model using your dataset. Or if your task involves text-to-text generation, a model such as BART or T5 is probably a good choice.

Fine-tuning a model will allow you to leverage your full training dataset, and you are likely to achieve performance that's on par with or better than GPT-3 without having to break the bank for GPU servers or API calls.

However, if you don't have a lot of training data, or if the data that is available is not specific to your target task, domain, or language, it might be difficult to fine-tune a pre-trained checkpoint. In that case, you have two options.

One option is to annotate some training data for your target problem. You can use a paid annotation service such as Amazon Mechanical Turk. Once you have the data, you can proceed to fine-tune a model as previously discussed.

Annotating data can actually be a very good and relatively cost-effective approach, especially if your task is relatively simple, and it might be worth doing even if you are planning to use a large language model. A few thousand high-quality data points can get you very far for many tasks.

The other option apart from fine-tuning a model is to use a large language model, which might be able to solve the target problem in a few-shot setting, using only a few training examples. To do this, you will need to create a good prompt for the target task, which contains instructions for how to solve the task, as well as a few high-quality examples demonstrating how to do the task well.
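As a rough illustration of the prompt structure described above, here is a minimal sketch that assembles instructions, a few demonstrations, and a new input into a single prompt string. The function name `build_few_shot_prompt`, the `Review`/`Sentiment` labels, and the example reviews are all made up for illustration; the exact format that works best will depend on your task and the model you use.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    instructions: str,
    examples: List[Tuple[str, str]],
    query: str,
    input_label: str = "Review",       # hypothetical field name
    output_label: str = "Sentiment",   # hypothetical field name
) -> str:
    """Assemble instructions, demonstrations, and the new input into one prompt."""
    parts = [instructions.strip(), ""]
    for text, label in examples:
        parts.append(f"{input_label}: {text}")
        parts.append(f"{output_label}: {label}")
        parts.append("")
    # Leave the final output slot empty so the model completes it.
    parts.append(f"{input_label}: {query}")
    parts.append(f"{output_label}:")
    return "\n".join(parts)


prompt = build_few_shot_prompt(
    instructions="Classify the sentiment of each restaurant review as positive or negative.",
    examples=[
        ("The pasta was delicious and the staff were friendly.", "positive"),
        ("We waited an hour and the food arrived cold.", "negative"),
    ],
    query="Great view, but the soup was bland and overpriced.",
)
print(prompt)
```

The resulting string is what you would send to the LLM of your choice. Sentiment classification is used here only because it is easy to read; the same structure applies to more complex tasks.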

3. Performance Requirements

Another very important aspect to take into account is the performance requirements of your target application. I know, it is tempting to just throw an LLM at your target problem. But it's important to keep in mind that these systems are still far from perfect. The fact that they are so fluent when generating text makes them that much more dangerous when they do get things wrong, which they still often do.

Using a model blindly, hoping for excellent performance without performing thorough benchmarking or user tests, can be very risky.

So, before you release that GPT application, you should collect some evaluation data representing the type of outputs you might expect from your model. This dataset should contain a sufficient number of diverse examples to ensure that your benchmark is reliable. There might also be good open-source datasets out there that could be useful. You can then run your best models on the dataset, compute a score using a relevant metric for the task, and see whether the LLMs can really outshine other approaches.
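The benchmarking step can be sketched in a few lines. The code below is only an illustration under simple assumptions: the task is classification, the metric is accuracy, and each system is wrapped in a function that maps an input text to a label. The two systems shown (`keyword_baseline` and `always_positive`) are toy stand-ins; in practice you would plug in your fine-tuned model and your LLM call instead.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input text, gold label)


def accuracy(predict: Callable[[str], str], eval_set: List[Example]) -> float:
    """Fraction of evaluation examples where the prediction matches the gold label."""
    if not eval_set:
        raise ValueError("evaluation set is empty")
    correct = sum(1 for text, gold in eval_set if predict(text) == gold)
    return correct / len(eval_set)


def benchmark(
    systems: Dict[str, Callable[[str], str]], eval_set: List[Example]
) -> Dict[str, float]:
    """Score every system on the same evaluation set."""
    return {name: accuracy(fn, eval_set) for name, fn in systems.items()}


# Toy evaluation set; a real one should be much larger and more diverse.
eval_set = [
    ("The food was great", "positive"),
    ("Terrible service", "negative"),
    ("Great atmosphere and great wine", "positive"),
    ("The soup was cold and bland", "negative"),
    ("Not great, would not return", "negative"),
]


def keyword_baseline(text: str) -> str:
    # Toy stand-in for a simple model.
    return "positive" if "great" in text.lower() else "negative"


def always_positive(text: str) -> str:
    # Trivial majority-style baseline.
    return "positive"


scores = benchmark(
    {"keyword_baseline": keyword_baseline, "always_positive": always_positive},
    eval_set,
)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.2f}")
```

Scoring every candidate on the same evaluation set is what makes the comparison fair. For generation tasks you would swap accuracy for a metric that suits the task.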

Another aspect to consider apart from the overall performance is the robustness of LLMs for your target task. For that, you can look at how much the accuracy of your models varies across your evaluation data points, for example across different types of inputs.
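One possible way to look at robustness is to group evaluation examples into slices and compare the accuracy per slice. This is a sketch of one interpretation, not a standard recipe: the slice names (`short_reviews`, `long_reviews`) and the correctness values are invented for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple

# (slice name, whether the model's prediction was correct)
Result = Tuple[str, bool]


def accuracy_by_slice(results: List[Result]) -> Dict[str, float]:
    """Compute accuracy separately for each slice of the evaluation data."""
    buckets = defaultdict(list)
    for slice_name, is_correct in results:
        buckets[slice_name].append(1.0 if is_correct else 0.0)
    return {name: mean(values) for name, values in buckets.items()}


def robustness_summary(results: List[Result]) -> Dict[str, float]:
    """Summarise how much accuracy varies between slices."""
    scores = list(accuracy_by_slice(results).values())
    return {
        "mean": mean(scores),    # average of the per-slice accuracies
        "std": pstdev(scores),   # spread between slices
        "worst": min(scores),    # accuracy on the weakest slice
    }


# Hypothetical per-example outcomes for one model.
results = [
    ("short_reviews", True),
    ("short_reviews", True),
    ("short_reviews", True),
    ("short_reviews", True),
    ("long_reviews", True),
    ("long_reviews", False),
]

summary = robustness_summary(results)
print(accuracy_by_slice(results))
print(summary)
```

A model with a high average but a low worst-slice score may still fail badly on the inputs that matter most to your users.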

4. Latency and Inference Cost

The final area to consider is the target latency and inference cost of your application, that is, the amount of time, computational resources, and money you have available to serve the users of your application.

Fine-tuned small NLP models have the benefit that they are relatively cheap to deploy, either to the cloud, or to on-premise hardware. They are also easy to parallelise and scale. Furthermore, they have low inference time, making them perfect for large-scale production applications.

Large language models on the other hand are very computationally intensive and require powerful hardware to run efficiently. This means that it may be challenging to deploy them in a production environment, especially if you need to process large amounts of text data in real-time. The cost of building such a server is also currently very high, requiring an investment of at least several hundred thousand dollars.

If you are not able to make such an investment, your only option might be to use large language models by calling an API. This is not cheap either: at the time of writing, the best model from OpenAI costs 2 cents for every 1,000 tokens processed. The cost quickly adds up and is something important to be aware of.
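To see how quickly the cost adds up, here is a back-of-the-envelope calculation. The price of 2 cents per 1,000 tokens comes from the paragraph above; the workload figures (10,000 requests per day, 500 tokens per request, 30 days) are purely hypothetical, so plug in your own numbers.

```python
def estimate_api_cost(
    requests_per_day: int,
    tokens_per_request: int,
    price_per_1k_tokens: float = 0.02,  # dollars, from the figure quoted above
    days: int = 30,
) -> float:
    """Estimate the API bill in dollars for a given workload."""
    total_tokens = requests_per_day * tokens_per_request * days
    return total_tokens / 1000 * price_per_1k_tokens


# Hypothetical workload: 10,000 requests per day, 500 tokens each
# (prompt plus completion), over a 30-day month.
monthly_cost = estimate_api_cost(requests_per_day=10_000, tokens_per_request=500)
print(f"Estimated monthly cost: ${monthly_cost:,.2f}")
```

With these assumed numbers, the bill comes to roughly 3,000 dollars per month. Note that few-shot prompts count towards the token total on every request, so long prompts with many examples make each call more expensive.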

Calling an API, however, also comes with some benefits. You can get started building your application quickly without having to worry about maintaining a server with complex hardware. Furthermore, your total cost scales with your usage requirements of the API, which is good for small businesses.


To sum up, large language models offer a lot of benefits, and are great for building quick prototypes. However, they can be overkill for many simple and well-studied NLP problems. Still, they are definitely worth a try, since they are very easy to test by signing up for an API account.

However, before switching to a large language model for a production application, it is important to consider your specific situation, such as your budget, target tasks, domain, training data availability, and performance requirements.

Thanks for reading! If you are looking to build a state-of-the-art solution in Natural Language Processing, you should check out our services at The Global NLP Lab:
