DocArray and the Otter Model: Native multi-modal AI meets native multi-modal data structures

We all love when a perfect couple finds each other, like DocArray and Otter.

Two otters face each other, holding a heart with the Otter and DocArray logos, surrounded by greenery

Some things go together, like peanut butter and jelly, tea and biscuits, gin and tonic, or DocArray and Otter.


We live in a world where information is usually multimodal — where texts, images, videos, and other media co-exist and come mixed together. The emergence of large language models has transformed the way we process, understand, and gain insights from this kind of complex, poorly structured multimedia data.

The nature of the Otter model — reading in combinations of images, videos, and texts as prompts — makes DocArray almost perfectly suited to it.

With DocArray, you can represent, send, and store multi-modal data easily. It integrates seamlessly with a range of vector databases, including Weaviate, Qdrant, and Elasticsearch, and presents a consistent API for vector search. With DocArray, you can navigate your multimodal data effortlessly.

DocArray was designed for the data our complex, multimedia, multi-modal world naturally generates, exactly the same kind of data that Otter was designed to process. This article will show you how to bring the two together to get the most out of the Otter model.

The Otter model

The Otter model is a multimodal model fine-tuned for in-context instruction following. It is adapted from the OpenFlamingo model from LAION, an open-source reproduction of DeepMind's Flamingo model. To train it, Otter's developers created MIMIC-IT, a partially human-curated dataset of 2.8 million multi-modal instruction-response pairs with in-context examples.

In-context instruction means not just asking the AI model to respond to some prompt, but giving it, as part of that prompt, a few examples of how you want it to respond. Instead of “zero-shot learning”, where the model answers with no examples to guide it, this is “few-shot learning”, where the prompt includes a small set of example inputs and appropriate responses that show the model what kind of answer is expected.

For example:

Now using the same images, the same question, but different answers in the examples:

You can see in this example how Otter learns from the two examples given in each prompt what class of answer is required. When the question is “Where was this picture taken?” “Paris” is the same kind of answer as “Rome” and “London”, while “at the Eiffel Tower” is the same kind of answer as “at the Colosseum” and “near Tower Bridge”. This is the kind of processing Otter was trained for.
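To make that concrete, here is a purely illustrative sketch of the two prompt variants described above, written out as (image, question, example answer) triples. The image labels are placeholders; in practice the images are supplied as pixel data, as shown later in this article.

# Variant 1: the example answers are city names, so Otter answers "Paris".
city_examples = [
    ("<image 1>", "Where was this picture taken?", "Rome"),
    ("<image 2>", "Where was this picture taken?", "London"),
    ("<image 3>", "Where was this picture taken?", ""),  # Otter fills this in
]

# Variant 2: the example answers name landmarks, so Otter answers
# "at the Eiffel Tower" instead.
landmark_examples = [
    ("<image 1>", "Where was this picture taken?", "at the Colosseum"),
    ("<image 2>", "Where was this picture taken?", "near Tower Bridge"),
    ("<image 3>", "Where was this picture taken?", ""),  # Otter fills this in
]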

Building multi-modal prompts with DocArray

Prerequisites

First, Otter is an enormous model and requires a great deal of memory and very powerful GPUs. You will need either two RTX 3090 (24 GB) GPUs or a single A100.

⚠️
Otter has very high hardware requirements. You may have to rent a suitable cloud instance.

If you have secured the necessary hardware, the next step is to install DocArray. Assuming you have Python and pip installed and on your path, run the following command in a terminal:

pip install docarray

Next, clone the Otter repository from GitHub:

git clone https://github.com/Luodian/Otter.git

Finally, use conda to create a new virtual Python environment and install all necessary dependencies in it:

conda env create -f environment.yml

The environment.yml file is in the root directory of the repository cloned in the previous step. Activate the new environment with conda before running the code in the following sections.

Creating the classes and functions with DocArray

A multi-modal prompt is a combination of different data types in a single prompt. This is where DocArray can shine because it allows us to tap into the richer, multi-faceted contexts that diverse data types offer.

For this article, we will build a small application that answers questions about pictures, with prompts that contain example images with example answers.

First, we will define the prompt object format in DocArray:

from docarray import BaseDoc
from docarray.typing import ImageUrl

class OtterDoc(BaseDoc):
    url: ImageUrl
    prompt: str
    answer: str

A prompt contains an image URL, a string prompt, and a string answer.

Now we can build the model's input as a DocList containing multiple OtterDoc instances, using the three images below:

from docarray import DocList

docs = DocList[OtterDoc](
    [
        OtterDoc(
            url="https://upload.wikimedia.org/wikipedia/commons/1/1f/Colosseo_Romano_Rome_04_2016_6289.jpg",
            prompt="Where was this picture taken?",
            answer="Rome",
        ),
        OtterDoc(
            url="https://upload.wikimedia.org/wikipedia/commons/a/a3/Westminster%2C_2023.jpg",
            prompt="Where was this picture taken?",
            answer="London",
        ),
        OtterDoc(
            url="https://upload.wikimedia.org/wikipedia/commons/d/d4/Eiffel_Tower_20051010.jpg",
            prompt="Where was this picture taken?",
            answer="",  ## leave this blank and Otter will answer
        ),
   ]
)
💡
You can provide more than two examples to the Otter model. The answer is generated only for the final image and prompt; the earlier pairs serve purely as in-context examples.

The Otter model has been trained with in-context instruction-response pairs, so it requires a specific text template to correctly process its input.

prompt = f"<image>User: {first_instruction} "
	   + f"GPT:<answer> {first_response}<endofchunk>"
       + f"<image>User: {second_instruction} GPT:<answer>"

The User and GPT role labels are essential and must be used in the way shown above.

Let’s also write a helper function to convert a DocList of OtterDoc instances into a prompt string:

def format_prompt(docs: DocList) -> str:
    conversation = ""
    for doc in docs:
        conversation += f"<image>User: {doc.prompt} GPT:<answer> {doc.answer}"
        # Completed examples end with <|endofchunk|>; the final, unanswered
        # prompt is left open so that Otter generates the answer.
        if len(doc.answer) != 0:
            conversation += "<|endofchunk|>"
    return conversation
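As a quick sanity check, calling the helper on the three documents above should produce a single string in Otter's template format. The output below is shown purely for illustration (wrapped here for readability; the real string has no line breaks):

print(format_prompt(docs))
# <image>User: Where was this picture taken? GPT:<answer> Rome<|endofchunk|>
# <image>User: Where was this picture taken? GPT:<answer> London<|endofchunk|>
# <image>User: Where was this picture taken? GPT:<answer>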

We will also need a list of images ordered to match the text prompts. For this, we can use the load_pil() method of DocArray's ImageUrl type:

images = [doc.url.load_pil() for doc in docs]

Building the inference engine

Next, we need to import Otter's model class, which automatically downloads the pretrained Otter weights the first time it is used:

from otter.modeling_otter import OtterForConditionalGeneration
import transformers

model = OtterForConditionalGeneration.from_pretrained(
    "luodian/otter-9b-hf", device_map="auto"
)
tokenizer = model.text_tokenizer
image_processor = transformers.CLIPImageProcessor()
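If GPU memory is tight, it may help to load the weights in half precision. The snippet below is only a sketch: it relies on the standard Hugging Face from_pretrained arguments and assumes the Otter checkpoint behaves well in bfloat16, so check the Otter README for the precisions it actually supports.

import torch

# Assumption: bfloat16 works for this checkpoint; fall back to the default
# precision if you run into numerical issues.
model = OtterForConditionalGeneration.from_pretrained(
    "luodian/otter-9b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)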

Now that we're all set, we can preprocess the input text and image data into Otter’s input tensors:

vision_x = (
    image_processor.preprocess(images, return_tensors="pt")["pixel_values"]
    .unsqueeze(1)
    .unsqueeze(0)
)

model.text_tokenizer.padding_side = "left"

lang_x = model.text_tokenizer([format_prompt(docs)], return_tensors="pt")

Finally, we can query the Otter model with the preprocessed input:

generated_text = model.generate(
    vision_x=vision_x.to(model.device),
    lang_x=lang_x["input_ids"].to(model.device),
    attention_mask=lang_x["attention_mask"].to(model.device),
    max_new_tokens=256,
    num_beams=3,
    no_repeat_ngram_size=3,
)

print(f"Result: {model.text_tokenizer.decode(generated_text[0])}")

For the example given, this will produce:

Result: This picture was taken in Paris.

This is correct, recalling that the final image in the prompt, the one Otter is being asked about, shows the Eiffel Tower.
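Depending on your generation settings, the decoded string may also echo the prompt itself. If it does, a small post-processing step like the one below (a sketch that assumes the <answer> and <|endofchunk|> markers appear in the decoded text) isolates just the model's reply:

decoded = model.text_tokenizer.decode(generated_text[0])
# Take everything after the last "<answer>" marker and cut it off at
# "<|endofchunk|>", leaving only the newly generated answer.
answer = decoded.split("<answer>")[-1].split("<|endofchunk|>")[0].strip()
print(f"Result: {answer}")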

Beyond in-context prompting

DocArray and Otter are useful for more than just multi-modal in-context learning. Together they can handle any multimedia prompt. For example, consider this prompt:

prompt = "<image>User: tell me what's the relation between " 
	   + "the first image and this image <image> GPT:<answer>"

We could readily define a DocArray document schema for these kinds of queries, as sketched below.
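Here is a minimal sketch of what such a schema and its prompt formatter could look like. The ImagePairDoc class and format_pair_prompt function are our own illustrations, not part of DocArray or Otter:

from docarray import BaseDoc
from docarray.typing import ImageUrl

class ImagePairDoc(BaseDoc):
    first_url: ImageUrl   # first image referenced in the prompt
    second_url: ImageUrl  # second image referenced in the prompt
    prompt: str           # instruction that mentions both images
    answer: str = ""      # left empty so Otter generates it

def format_pair_prompt(doc: ImagePairDoc) -> str:
    # One <image> placeholder per image, in the same order as the images
    # are passed in the vision input.
    return f"<image>User: {doc.prompt} <image> GPT:<answer> {doc.answer}"

The images themselves would then be loaded with load_pil() and stacked in the same order, just as in the single-image example above.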

For example, using the two images below:

When we did it, we got results like this:

The first image shows two cats sleeping on a couch, while the second
image shows a forest and the sunlight shining through them. The
relation between these two images is that the cats are resting in a
cozy indoor environment, which contrasts with the natural setting
of the forest. The sunlight streaming through the trees, and the
forests adds a sense of warmth and tranquility to the scene, creating
a peaceful atmosphere.

This does not exhaust the possibilities for constructing multi-modal queries with DocArray and Otter.

Conclusion

So, we've had a look at the Otter model and the powerful DocArray library and how they go together like macaroni and cheese. Joining Otter with DocArray brings multi-modal AI to life, enabling it to interpret and respond to mixed data types like images and text in a way that feels entirely natural.

And hey, we didn't just talk about it, we showed you just how easy it is to build a multi-modal prompt.

Explore the possibilities

Check out DocArray’s documentation, GitHub repo, and Discord to explore multi-modal data modeling and what it can do for your use case.


You can also check Otter's GitHub repo and paper.
