The recent release of ChatGPT by OpenAI has attracted a great deal of attention from the natural language processing community. ChatGPT is a large language model that can generate high-quality responses to human input. It can respond to a wide range of questions and can correct its earlier mistakes based on follow-up messages in the conversation.
Despite the impressive capabilities of ChatGPT, it is still unclear whether it can serve as a generalist model, delivering state-of-the-art performance on a wide range of NLP tasks in a zero-shot manner. The paper we are looking at today aims to answer this question.
Introduction
The study evaluates ChatGPT's performance on 20 popular NLP datasets covering 7 representative task categories: reasoning, natural language inference, question answering, dialogue, summarization, named entity recognition, and sentiment analysis. The systems compared in the paper are:
- ChatGPT, where the authors used the web interface directly.
- The latest GPT-3.5 model from OpenAI, which shares some of the properties of ChatGPT.
- Baseline models fine-tuned for each of the target tasks. The authors use models such as FLAN, T0 and PaLM.
It is important to note that both ChatGPT and GPT-3.5 are evaluated purely in a zero-shot manner: the authors provide only instructions describing how to solve the task, without any examples for in-context learning.
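To make this concrete, here is a minimal sketch of what a zero-shot prompt looks like, using the legacy OpenAI Python client. The task, prompt wording, and model name are illustrative assumptions on my part; the paper's authors accessed ChatGPT through the web interface, not the API.

```python
import openai  # pip install openai (legacy 0.x interface assumed here)

openai.api_key = "YOUR_API_KEY"  # placeholder

# A zero-shot prompt: only the task instruction and the input,
# with no solved examples included.
prompt = (
    "Determine whether the sentiment of the following review is "
    "positive or negative. Answer with a single word.\n\n"
    "Review: The battery died after two days and support never replied.\n"
    "Sentiment:"
)

response = openai.Completion.create(
    model="text-davinci-003",  # a GPT-3.5 model; the paper's exact setup may differ
    prompt=prompt,
    max_tokens=5,
    temperature=0,  # deterministic output, useful for evaluation
)

print(response["choices"][0]["text"].strip())  # e.g. "Negative"
```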
Results
The results of the study are summarized in the accompanying figure. The main finding is that although ChatGPT is a generalist model that can perform well across multiple tasks, it still tends to perform worse than models fine-tuned on a given task. ChatGPT performs well on tasks that rely on reasoning capabilities, such as arithmetic reasoning, but still struggles with specific tasks such as sequence tagging.
Comparing ChatGPT with GPT-3.5, the former performs better on natural language inference and question answering, but often underperforms on commonsense, symbolic, and logical reasoning tasks. ChatGPT is also better at handling factually consistent text and is superior to GPT-3.5 on dialogue tasks. However, it tends to generate longer summaries and performs worse than GPT-3.5 on summarization.
One limitation of the study is that it only considers ChatGPT and GPT-3.5 in the zero-shot setting. As we have seen in previous videos on this channel, the performance of these models can increase significantly with in-context learning, where a few good training examples are inserted into the prompt used for the respective task. It remains to be investigated in future work whether the current gap between fine-tuned models and large language models can be closed through techniques such as in-context learning or prompt tuning.
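For contrast with the zero-shot setup above, a few-shot (in-context learning) prompt simply prepends a handful of solved examples before the actual input, without updating any model weights. The examples and wording below are illustrative, not taken from the paper.

```python
# In-context (few-shot) learning: a few solved examples are placed in the
# prompt before the new input; no model weights are updated.
few_shot_examples = [
    ("The acting was superb and I was hooked from start to finish.", "Positive"),
    ("Two hours of my life I will never get back.", "Negative"),
]

instruction = (
    "Determine whether the sentiment of the following review is "
    "positive or negative. Answer with a single word.\n\n"
)

demos = "".join(
    f"Review: {text}\nSentiment: {label}\n\n" for text, label in few_shot_examples
)

new_input = (
    "Review: The battery died after two days and support never replied.\n"
    "Sentiment:"
)

few_shot_prompt = instruction + demos + new_input
# This string is sent to the model exactly like the zero-shot prompt above;
# only the added demonstrations differ.
print(few_shot_prompt)
```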
Conclusion
To sum it up: the paper we looked at today investigated how large language models such as ChatGPT and GPT-3.5 perform against fine-tuned models on various NLP tasks. The results show that although models such as ChatGPT can act as generalist models, working well across many NLP tasks, there is still a gap with fine-tuned models. This suggests that, if training data is available for your target task, you should definitely consider fine-tuning a small model rather than jumping straight to using a large language model.
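As a rough sketch of what that recommendation looks like in practice, here is a minimal fine-tuning setup using the Hugging Face transformers and datasets libraries. The model checkpoint, dataset, and hyperparameters are only examples, not the baselines used in the paper.

```python
# pip install transformers datasets
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "distilbert-base-uncased"  # a small model; any suitable checkpoint works
dataset = load_dataset("sst2")          # example sentiment dataset (GLUE SST-2)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize(batch):
    # Tokenize without padding; batches are padded dynamically during training.
    return tokenizer(batch["sentence"], truncation=True)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sst2-distilbert", num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)

trainer.train()
```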
Thanks for reading! If you are looking for state-of-the-art expertise in Natural Language Processing, you should check out our services at The Global NLP Lab.