How Embeddings Drive AI: A Comprehensive Presentation

Embeddings are a core concept and technology in AI, present in some form across the spectrum of AI applications. Properly understanding and mastering this technology positions you to leverage AI models to add the most value to your business.


Embeddings are a core concept and technology in artificial intelligence, used in some form in practically all recent AI models, including ChatGPT and image generators like Stable Diffusion. It is one of the fundamental technologies behind the most visible and useful AI applications.

Embeddings are also important in less flashy applications, like search, face and object recognition, machine translation, recommender systems, and many more.

💡
What is an embedding?

An embedding is a vector that stands in for some other object (digital, physical, or abstract), such that some properties of the objects represented are preserved as geometric relationships among their corresponding vectors.

This post tries to explain what that means concretely.

Vectors

A vector is just a list of numbers that we can interpret as the coordinates of a point in a space with as many dimensions as the list has entries.

Imagine a list of two numbers $(x,y)$. We could treat them as coordinates on an x-y graph, with each list corresponding to a point in a two-dimensional space:

Coordinate plane with labeled x and y axes, including a marked point, presumably at coordinates (x, y)

If we have three numbers $(x,y,z)$, then this corresponds to a point in a three-dimensional space:

Diagram of a 3D coordinate system with x, y, and z axes and a labeled point on a sphere

We can readily extend this logic into more dimensions: four, five, a hundred, a thousand, even millions or billions. Drawing a space with a thousand dimensions is very hard, and imagining one is all but impossible, but mathematically, it’s trivial to manage.
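
Nothing about the arithmetic changes as the dimensions grow. Here is a minimal sketch in Python (using NumPy, which is our assumption here, not something the post prescribes) showing that vectors in two, three, or a thousand dimensions are handled identically:

```python
import numpy as np

# A vector is just a list of numbers, however long that list is.
point_2d = np.array([1.0, 2.0])        # a point in two dimensions
point_3d = np.array([1.0, 2.0, 3.0])   # a point in three dimensions
point_1000d = np.random.rand(1000)     # a point in a thousand dimensions

# Distances are computed the same way in any number of dimensions.
other_1000d = np.random.rand(1000)
print(np.linalg.norm(point_1000d - other_1000d))
```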

Embeddings as Vectors

For example, let’s say we had a way to assign a small vector — just two numbers long — to pictures of apples and oranges.

Red and green apple on the left, with an orange on the right, against a white background

We can display vectors of just two numbers as points on an XY graph, so imagine we took pictures of apples and oranges and assigned each a random vector, then plotted them on a graph:

Scatterplot with red and orange dots representing different data points labeled "Apple" and "Orange" on a black background

The apples and oranges are all mixed together, and nothing about how they’re placed on the graph tells us whether a picture is of an apple or an orange.

What we’d like to have is something like this:

Scatter plot with red dots for apples and orange dots for oranges, plotted against taste and vision on a black background

If we had the means to assign vectors like these to pictures of apples and oranges, it would be trivial to write a program to tell them apart. All it would have to do is check which side of a line the vector fell on.

Scatter plot on a black background with red points for apples, orange points for oranges, and labels "Apple" and "Orange" on axes

AI and neural networks provide a way to do exactly that.

Using machine learning, we can construct an AI model that takes images and then outputs vectors such that the oranges and apples are easily separated. These vectors typically have a lot more than two dimensions, so they aren’t easy to draw.

The property of the original object that we care about — whether it is an apple or an orange — is preserved in the geometry of the embedding vectors: the apples are all to one side and the oranges to the other, easily separated by what in high-dimensional geometry is called a hyperplane, but is really just a simple way of splitting the embedding space.

Red apple on the left with arrows pointing to 'Trained AI Model' and 'Embedding Vector' signifying a process
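
To see how simple the final apple-or-orange check can be, here is a two-dimensional sketch in Python. The values of w and b are made-up stand-ins for parameters a trained model would learn, and real embeddings would come from the model rather than being written by hand:

```python
import numpy as np

# Hypothetical hyperplane parameters; in a real system these are learned
# from data, and the embeddings come from a trained model.
w = np.array([1.0, -1.0])   # normal vector of the separating hyperplane
b = 0.0                     # offset of the hyperplane

def classify(embedding: np.ndarray) -> str:
    """Label an embedding by which side of the hyperplane it falls on."""
    return "apple" if np.dot(w, embedding) + b > 0 else "orange"

print(classify(np.array([2.0, 0.5])))   # falls on the apple side
print(classify(np.array([-1.0, 3.0])))  # falls on the orange side
```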

This example is very simple and concerns only one distinction: apples or oranges. Embeddings can capture a lot of information about many different aspects of the objects they represent. If we trained an AI model to distinguish apples from oranges, we would likely find that the embeddings encoded many of the characteristics that distinguish them, like color, texture, and the presence of a stem. We might additionally find that the green apples cluster together, farther from the red ones, or that highly mottled apples have embeddings closer to the oranges.
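
One common way to look for clusters like these is to project the high-dimensional embeddings down to two dimensions for plotting. A minimal sketch, assuming scikit-learn and using random stand-in vectors in place of real model outputs:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 200 random 512-dimensional "embeddings". In practice,
# these would come from a trained model applied to real fruit images.
embeddings = np.random.rand(200, 512)

# Project to 2D so that clusters (e.g., green vs. red apples) can be plotted.
projected = PCA(n_components=2).fit_transform(embeddings)
print(projected.shape)  # (200, 2)
```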

Word Embeddings

Embeddings emerge almost automatically from neural network-based AI models, as we will show in the next section, but they were first developed as a way to better encapsulate the semantics (i.e., meanings) of words for computers. You will still sometimes see people talk about "semantic embeddings" because of the role word embeddings have played in this technology, even though embeddings aren't specifically about real-world meaning but about any useful property that an AI model can learn to recognize.

In natural language processing, embeddings have long been tied to an idea in linguistics called the distributional hypothesis, most frequently associated with the linguist J.R. Firth:

You shall know a word by the company it keeps.

A Synopsis of Linguistic Theory, 1930-55.
(In Studies in Linguistic Analysis, 1957, pp. 1-31)

The distributional hypothesis is the idea that the semantics — that is to say, the meaning — of a word is reflected in the contexts of its usage. The more closely related the meanings of two words, the more often we should expect to see them in the same contexts.

For example, when two words are spelled or pronounced the same, we use context to figure out which word is intended:

  • She took her bat to baseball practice.
  • She was chased away from the cave by a bat.

It is clear from the surrounding context which meaning of the word bat is intended.

Frightened young girl in a yellow dress running from a flying dragon with bats against a cloudy sky

Similarly, we can often expect that words from the same semantic class, like tomato, carrot, and zucchini, appear in the same contexts:

  • He grew tomatoes in his garden.
  • He grew carrots in his garden.
  • He grew zucchini in his garden.

But not:

  • He grew bicycles in his garden.

Creative display of colorful bicycles artistically mounted on a large tree with extended roots in an open field

Words that can fill the blank in “He grew _____ in his garden” share a property that they do not share with bicycles, and we can build an AI model that encompasses this information, along with thousands of other contexts where some words can go and others can’t.

Word embeddings are a way to encapsulate this information in a vector space.

Black background with a white border displaying red words: "bicycle" at top right and "zucchini", "tomato", "carrot" below in a list
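
Word2vec-style models learn embeddings like these directly from co-occurrence patterns. Here is a minimal sketch using the Gensim library (an assumption of this example; the toy corpus is far too small to yield good vectors, and serves only to illustrate the mechanics):

```python
from gensim.models import Word2Vec

# A toy corpus; real word embeddings are trained on billions of words.
sentences = [
    ["he", "grew", "tomatoes", "in", "his", "garden"],
    ["he", "grew", "carrots", "in", "his", "garden"],
    ["he", "grew", "zucchini", "in", "his", "garden"],
    ["she", "rode", "her", "bicycle", "to", "the", "market"],
]

model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=100)

# Words that appear in the same contexts should get more similar vectors.
print(model.wv.similarity("tomatoes", "carrots"))
print(model.wv.similarity("tomatoes", "bicycle"))
```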

Image Recognition

We can even construct embeddings where we don’t explicitly say what features are relevant.

For example, an embedding model for a face recognition system translates pictures of people into embedding vectors distributed in a high-dimensional space.

Grid of black-and-white portraits of famous people including J.K. Rowling and Emma Thompson on a black background

We might find that people whose faces share some common properties are grouped together in one part of the embedding space. For example, grey-haired people and bald people might cluster together:

Chart with celebrity faces in red and green outlined boxes on a black background, indicating clusters in the embedding space

But more importantly and usefully, we would expect that if we gave the model two pictures of the same person, their embeddings would be closer to each other than to the embedding of any other person's picture:

A central woman with smaller headshots linked to her by arrows, set against a graded background

Of course, it would also find people who merely look similar, although we would hope their embeddings wouldn't be as close together as those of two pictures of the same person:

Black background with multiple portraits arranged around a central white arrow pointing at a specific portrait
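
In code, "closer to each other" is typically measured with cosine similarity against a tuned threshold. A minimal sketch; the face-embedding source and the 0.8 threshold are hypothetical placeholders:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in vectors; in practice these would come from a face-embedding
# model, e.g. emb_a = embed_face("photo_a.jpg") (embed_face is hypothetical).
emb_a = np.random.rand(512)
emb_b = np.random.rand(512)

# Declare "same person" when similarity clears a threshold tuned on
# labeled data; 0.8 here is an arbitrary placeholder.
print("same person" if cosine_similarity(emb_a, emb_b) > 0.8 else "different")
```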

Multimodal Embeddings

Embedding spaces can also act as a bridge between different kinds of inputs.

For example, if we have a database of pictures with descriptive captions, we can co-train two embedding models — one for images and one for texts — to output vectors in the same embedding space. The result is that embeddings of descriptions of images and embeddings of the images themselves will be near each other.

Black background with labeled fruits: three apples (McIntosh, Fuji, Apple) and three oranges (Clementine, Orange, Valencia)
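
To sketch what this looks like in practice, here is one possible off-the-shelf route using the sentence-transformers wrapper around a CLIP-style model; the model name and the image file are assumptions of this example, not something the post prescribes:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# A co-trained image/text model; "clip-ViT-B-32" is one public example.
model = SentenceTransformer("clip-ViT-B-32")

# Encode an image and two candidate captions into the same embedding space.
img_emb = model.encode(Image.open("apple.jpg"))
text_embs = model.encode(["a photo of an apple", "a photo of an orange"])

# The caption that actually describes the image should score higher.
print(util.cos_sim(img_emb, text_embs))
```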

Embeddings Everywhere

Embeddings provide a very generalized framework for making useful comparisons between things. They are so useful that they see applications all over AI and machine learning. Any application that requires similarity/dissimilarity evaluation, relies on hidden or non-obvious features, or requires implicit context-sensitive mappings between different inputs and outputs, likely uses embeddings in some form.

They are actively in use for:

  • Search and information retrieval for all media types
  • Face and object recognition in machine vision
  • Question-answering systems
  • Recommender systems
  • Outlier detection (often as part of fraud detection)
  • Spellchecking and grammar correction
  • Natural language understanding
  • Machine translation

This list is far from exhaustive, because nearly every AI application uses embeddings. In some scenarios, embeddings are used internally and are invisible to users and developers, but many AI applications hinge on the ability to use embeddings to identify and compare things in useful, human-like ways.
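
As one example of the pattern behind many of these applications, semantic search amounts to embedding both the query and the documents, then ranking by similarity. A minimal sketch; the embed function is a hypothetical stand-in for any text-embedding model:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding function; returns a random stand-in vector."""
    return rng.random(256)

docs = ["How to grow tomatoes", "Bicycle repair basics", "Orange varieties"]
doc_embs = np.stack([embed(d) for d in docs])
query_emb = embed("gardening tips")

# Rank documents by cosine similarity to the query.
scores = doc_embs @ query_emb / (
    np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb)
)
for i in np.argsort(-scores):
    print(f"{scores[i]:.3f}  {docs[i]}")
```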

Jina AI is developing a collection of high-performance specialized embedding models that you can download or use via our public API, without any of the complexities of AI development. We are also preparing a suite of open-source tools in Python to help you integrate embedding models into your tech stack.

Jina AI is committed to providing you with tools and help for creating, optimizing, evaluating, and implementing embedding models for your enterprise. We’re here to help you navigate the new world of business AI. Contact us via our website or join our community on Discord to get started.
