Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

published on 13 March 2023

Today, we are taking a look at an impressive demo system recently released by Microsoft, called Visual ChatGPT.

Visual ChatGPT bridges the gap between conversational language models and visual foundation models by combining the two: users can send and receive images, ask questions that require reasoning about them, and have them edited.

So let’s get started!


Large language models (LLMs) have made incredible progress in recent years, with ChatGPT being a significant breakthrough. However, ChatGPT is limited in its ability to process visual information, which is where Visual Foundation Models (VFMs) come into play.

VFMs, such as BLIP and Stable Diffusion, have shown great potential in computer vision but are less flexible than conversational language models in human-machine interaction.

One challenge is building a ChatGPT-like system that also supports image understanding and generation. Another challenge is incorporating modalities beyond language and images. The team behind Visual ChatGPT addresses these challenges by proposing a system that directly incorporates a variety of VFMs into ChatGPT.

Visual ChatGPT allows users to interact with ChatGPT in new ways. Firstly, users can send and receive not only text but also images. Secondly, users can ask complex visual questions or give visual editing instructions that require several AI models to collaborate over multiple steps. Thirdly, users can provide feedback and ask for corrected results.

Architecture of Visual ChatGPT

[Figure: Architecture of Visual ChatGPT]

As shown in the figure, Visual ChatGPT has an architecture that incorporates different Visual Foundation Models. A user uploads an image and enters a complex language instruction. With the help of the Prompt Manager, Visual ChatGPT starts a chain of execution of the relevant Visual Foundation Models. The Prompt Manager serves as a dispatcher for ChatGPT: it tells ChatGPT which visual formats are involved and records how information is transformed along the way. Finally, in the figure's example, once Visual ChatGPT obtains the “cartoon” hint from the Prompt Manager, it ends the execution pipeline and shows the final result.
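To make the idea of a chain of execution more concrete, here is a minimal Python sketch. It is not the authors' code: the three functions are made-up stand-ins for visual models that only rename files, and the plan is hard-coded, whereas in the real system ChatGPT decides which model to call next. The final step is a cartoon step only to echo the “cartoon” hint mentioned above.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for Visual Foundation Models. Each one takes an
# image file name and returns the file name of the image it "produced".
# No real model is called; the functions only derive a new file name.
def estimate_depth(image: str) -> str:
    return image.replace(".png", "_depth.png")

def depth_to_image(image: str) -> str:
    return image.replace(".png", "_generated.png")

def cartoonize(image: str) -> str:
    return image.replace(".png", "_cartoon.png")

# Registry the dispatcher uses to look up a tool by name (names are assumptions).
TOOLS: Dict[str, Callable[[str], str]] = {
    "estimate_depth": estimate_depth,
    "depth_to_image": depth_to_image,
    "cartoonize": cartoonize,
}

def run_chain(image: str, plan: List[str]) -> List[str]:
    """Run the tools in `plan` one after another, feeding each tool the
    output of the previous one, and record every intermediate file name."""
    history = [image]
    for tool_name in plan:
        tool = TOOLS[tool_name]
        history.append(tool(history[-1]))
    return history

# In the real system the LLM picks the next step; here the plan is fixed.
history = run_chain("flower.png", ["estimate_depth", "depth_to_image", "cartoonize"])
print(history[-1])
```

The list returned by `run_chain` plays the role of the record of information transformation: every intermediate image stays addressable by name, so a later instruction can refer back to it.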

In terms of models, the authors use the latest GPT-3 checkpoint and implement the Prompt Manager using the LangChain library. The authors integrate 22 Visual Foundation Models in total, covering models for image-to-text, such as BLIP; text-to-image, such as Stable Diffusion; image style transfer, such as Pix2Pix; and fine-grained image synthesis, such as ControlNet. All of these models are accessible to ChatGPT, which can call them to manipulate and generate images as needed.
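One way to picture how a language model can "access" a set of visual models is that each model is described in plain text and those descriptions are placed in the prompt. The sketch below illustrates that idea in plain Python. It does not use LangChain's API, and `ToolSpec`, `build_tool_prompt`, and the wording of the prompt are all assumptions made for illustration, not taken from the Visual ChatGPT code.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical description of one visual model exposed to the LLM."""
    name: str
    description: str

# Three illustrative entries; the real system registers 22 models.
TOOLS: List[ToolSpec] = [
    ToolSpec("Image Captioning", "describes an image in text (e.g. BLIP)"),
    ToolSpec("Text to Image", "generates an image from a text prompt (e.g. Stable Diffusion)"),
    ToolSpec("Image Style Transfer", "restyles an image following an instruction (e.g. Pix2Pix)"),
]

def build_tool_prompt(tools: List[ToolSpec]) -> str:
    """Turn the tool list into a block of text that can be prepended to the
    conversation, so the LLM knows which tools exist and what they do."""
    lines = ["You can use the following tools:"]
    for tool in tools:
        lines.append(f"- {tool.name}: {tool.description}")
    lines.append("Reply with the name of one tool, or 'Final Answer' when done.")
    return "\n".join(lines)

prompt = build_tool_prompt(TOOLS)
print(prompt)
```

Keeping the descriptions as data rather than hard-coding them in the prompt means that adding another model only requires one more `ToolSpec` entry.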

What I find remarkable about Visual ChatGPT is that it's implemented in only 1000 lines of code. There is fairly complex logic that dictates the chatbot's behaviour and handles things such as accessing files and generating diverse prompts for each visual model. Overall, however, building the system seems to be relatively straightforward.

The code of Visual ChatGPT is also available.

Example dialogues

So let's look at some examples of what kind of dialogues Visual ChatGPT is capable of.


In this dialogue, the user asks for an image of an apple: "I like drawing, but I'm not good at drawing, can you help me? like drawing an apple.", and Visual ChatGPT generates a drawing.

Then, the user provides a sketch of an apple and a glass, and asks for the image to be improved: "The image is my sketch of an apple and a drinking glass, can you please help me to improve it?". Visual ChatGPT creates a realistic image based on the sketch.

Then, the user asks "Looks good. Can you make the image into a watercolor painting?", and gets the result.

Then the user asks a semantic question about the image: "Can you tell me what color this background is?".

Finally, we see some examples of manipulating the image: removing the apple ("Can you remove this apple in this picture?") and switching the style of the table ("can you help me to replace the table with a black table?").

We can see that Visual ChatGPT is capable of understanding human intent and accomplishing complex multimodal goals for the user. There are several more example conversations to check out in the paper.


Visual ChatGPT is a remarkable prototype put together by Microsoft that combines several advanced models into a single system. The models are orchestrated by ChatGPT, which is capable of accessing them and using them to realise the goals of the dialogue. The results seem very promising.

What's amazing is that putting together such a complex system can nowadays be achieved in a relatively simple way through the powerful APIs and libraries we have available to play with. All modules are united by the powerful capacity of the LLM to generate and interpret precise instructions.

Of course, Visual ChatGPT doesn't come without limitations. For one thing, such a complex system requires a fair amount of manual prompt engineering and assembly, which is prone to mistakes. An error produced by one module can propagate through the rest of the dialogue.

Still, Visual ChatGPT is an impressive system that opens up the potential of building even more complex prototypes in the future. And the fact that most of the components of Visual ChatGPT are either open source or easily accessible is also amazing.

Thanks for reading! If you are looking for state-of-the-art expertise in Natural Language Processing, you should check out our services at The Global NLP Lab.
