How to Use Every Vector Database in Python with DocArray

DocArray handles multimodal data and works with every vector database using a universal API. Here's how you can do it in Python.


Back in the day, pre-Google, the Internet was mostly text. Whether it was news updates, sports scores, blog posts or emails, ASCII and Unicode were the way to go.

Aaah, the good old days. Just pure ASCII as God intended.

But nowadays, data is becoming increasingly complex and multimodal, mostly coming in unstructured forms such as images, videos, text, and 3D meshes. Gone are the days of being limited to 26 letters and 10 digits (or more for other character sets). Now there's much more stuff to deal with.

Just think about your favorite YouTube videos, Spotify songs, or game NPCs.

Typical databases can't handle these kinds of multimodal data. They can only store and process structured data (like simple text strings or numbers). This really limits our ability to extract valuable business insights from a huge chunk of the 21st century's data.

Lucky for us, recent advancements in machine learning techniques and approximate nearest neighbor search have made it possible to better utilize unstructured data:

  • Deep learning models and representation learning effectively represent complex data as vector embeddings.
  • Vector databases leverage vector embeddings to store and analyze unstructured data.

What are vector databases?

A vector database is a type of database that can index and retrieve data using vectors, similar to how a traditional database uses keys or text to search for items using an index.

A vector database uses a vector index to enable fast retrieval and insertion by a vector, and also offers typical database features such as CRUD operations, filtering, and scalability.

This gives us the best of both worlds - we get the CRUDiness of traditional databases, coupled with the ability to store complex, unstructured data like images, videos, and 3D meshes.
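
To make the core idea concrete, here is a minimal, purely illustrative sketch of what vector search boils down to: compare a query embedding against stored embeddings and return the closest items. A vector database does the same thing, but with specialized indexes (such as HNSW) and persistence, so it scales far beyond this brute-force loop. Everything below is made up for illustration:

import numpy as np

# A toy "index": five items, each represented by a 3-dimensional embedding.
# Real embeddings (e.g. from CLIP) typically have hundreds of dimensions.
index = np.random.rand(5, 3)

# A query vector we want to match against the index.
query = np.random.rand(3)

# Cosine similarity between the query and every indexed vector.
scores = index @ query / (np.linalg.norm(index, axis=1) * np.linalg.norm(query))

# The most similar items come first.
top_3 = np.argsort(-scores)[:3]
print(top_3, scores[top_3])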

So, vector databases are great, right? What's even more awesome is having a library to use them all while being capable of handling unstructured data at the same time! One unstructured data library to rule them all!

We are, of course, talking about DocArray. Let's see what this project is all about.

Welcome to DocArray!
DocArray is a library for nested, unstructured, multimodal data in transit, including text, image, audio, video, 3D mesh, etc. It allows deep-learning engineers to efficiently process, embed, search, recommend, store, and transfer multimodal data with a Pythonic API.

DocArray's universal Pythonic API to all vector databases

As the description suggests on the project home page, DocArray is a library for nested, unstructured and multimodal data.

This means that if you want to process unstructured data and represent it as vectors, DocArray is perfect for you.

DocArray is also a universal entry point for many vector databases.

For the remainder of this post, we'll be using DocArray to index and search data in the Amazon Berkeley Objects Dataset. This dataset contains product items with accompanying images and metadata such as brand, country, and color, and represents the inventory of an e-commerce website.

Although a traditional database can perform filtering on metadata, it is unable to search image data or other unstructured data formats. That's why we're using a vector database!

We'll start by loading a subset of the Amazon Berkeley Objects Dataset that comes in CSV format into DocArray and computing vector embeddings.

Sample images from the dataset

Then, we'll use DocArray with each database to perform search and insertion operations using vectors.

We'll use the following databases via DocArray in Python:

  • Milvus - cloud-native vector database with storage and computation separated by design
  • Weaviate - vector search engine that stores both objects and vectors and can be accessed through REST or GraphQL
  • Qdrant - vector database written in Rust and designed to be fast and reliable under high loads
  • Redis - in-memory key-value database that supports different kinds of data structures with vector search capabilities
  • ElasticSearch - distributed, RESTful search engine with Approximate Nearest Neighbor search capabilities
  • OpenSearch - open-source search software based on Apache Lucene, originally forked from ElasticSearch
  • AnnLite - a Python library for fast Approximate Nearest Neighbor Search with filtering capabilities

For each database, we'll:

  • Set up the database and install requirements
  • Index the data in the vector database
  • Perform a vector search operation with filtering
  • Display the search results
💡
The returned results will be the same for each database, since we use the same vectors each time. The key differences are in resource usage, latency, etc.

Preparing the data

First, we'll install a few dependencies, namely DocArray, Jina (for cloud authentication), and the client for CLIP-as-service (for generating embeddings):

pip install "docarray[common]" jina clip-client

Let's download a sample CSV dataset and load it into a DocumentArray with DocumentArray.from_csv():

wget https://github.com/jina-ai/product-recommendation-redis-docarray/raw/main/data/dataset.csv
from docarray import DocumentArray, Document

with open('dataset.csv') as fp:
  da = DocumentArray.from_csv(fp)

We get an overview using the summary() method:

╭────────────────────── Documents Summary ───────────────────────╮
│                                                                 │
│   Type                   DocumentArrayInMemory                  │
│   Length                 5809                                   │
│   Homogenous Documents   True                                   │
│   Common Attributes      ('id', 'mime_type', 'uri', 'tags')     │
│   Multimodal dataclass   False                                  │
│                                                                 │
╰─────────────────────────────────────────────────────────────────╯
╭───────────────────── Attributes Summary ─────────────────────╮
│                                                               │
│   Attribute   Data type   #Unique values   Has empty value   │
│  ───────────────────────────────────────────────────────────  │
│   id          ('str',)    5809             False             │
│   mime_type   ('str',)    1                False             │
│   tags        ('dict',)   5809             False             │
│   uri         ('str',)    4848             False             │
│                                                               │
╰───────────────────────────────────────────────────────────────╯

We can also display the images of the first few items using the plot_image_sprites() method:

da[:12].plot_image_sprites()

Each product contains the metadata as a dict in the tags attribute:

da[0].tags
{'height': '1926',
 'color': 'Blue',
 'country': 'CA',
 'width': '1650',
 'brand': 'Thirty Five Kent'}
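
Even before we touch a vector database, this metadata can be filtered in memory using DocArray's MongoDB-like query language. A quick sketch (the exact filter syntax may vary slightly between DocArray versions):

# Plain metadata filtering on the in-memory DocumentArray, no vectors involved yet.
blue_items = da.find({'tags__color': {'$eq': 'Blue'}})
print(len(blue_items))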

Generating embeddings

Next, we'll encode the Documents into vectors using CLIP-as-service.

💡
CLIP-as-service is a low-latency high-scalability service for embedding images and text. It can be easily integrated as a microservice into neural search solutions.

First, we need to log in to Jina AI Cloud:

jina auth login

Let's create an authentication token to use the service:

jina auth token create abo -e 30

Then, we can actually use the service to generate embeddings:

from clip_client import Client

c = Client(
    'grpcs://api.clip.jina.ai:2096', credential={'Authorization': 'your-auth-token'}
)

encoded_da = c.encode(da, show_progress=True)
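
Once encoding finishes, each Document carries a vector in its embedding attribute. A quick sanity check (assuming the service returns 768-dimensional CLIP embeddings, which matches the n_dim we configure for every database below):

# Expect a (number_of_documents, 768) matrix for this dataset and model.
print(encoded_da.embeddings.shape)

# Peek at the first few values of a single embedding.
print(encoded_da[0].embedding[:5])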

Preparing a search Document

If we're going to search our database, we need something to search it with. As with everything in DocArray, the fundamental unit is the Document. So let's prepare a query Document to search with. We'll just select the first product in our dataset:

doc = encoded_da[0]
doc.display()
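
The query doesn't have to come from the dataset, of course. Here's a rough sketch of how you could build a query Document from any image URI and encode it with the same CLIP-as-service client, so its embedding lives in the same vector space as the indexed data (the URI is just a placeholder):

from docarray import Document, DocumentArray

# Hypothetical query image; replace the URI with your own.
query_doc = Document(uri='https://example.com/some-product-photo.jpg')

# Encode it with the same client used for the dataset.
query_doc = c.encode(DocumentArray([query_doc]), show_progress=True)[0]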

Indexing the data

Now that the data is ready, we can index it and start performing search queries. In the next sections, we will index with each supported database.

Milvus

Milvus is an open-source vector database built to power embedding similarity search and AI applications. It is a cloud-native database with storage and computation separated by design.

This means that scaling each layer individually is possible. Thus, Milvus offers a scalable and reliable architecture.

Start a Milvus service using the following YAML:

version: '3.5'

services:
  etcd:
    container_name: milvus-etcd
    image: quay.io/coreos/etcd:v3.5.0
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_SNAPSHOT_COUNT=50000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd

  minio:
    container_name: milvus-minio
    image: minio/minio:RELEASE.2022-03-17T06-34-49Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
    command: minio server /minio_data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3

  standalone:
    container_name: milvus-standalone
    image: milvusdb/milvus:v2.1.4
    command: ["milvus", "run", "standalone"]
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
    ports:
      - "19530:19530"
      - "9091:9091"
    depends_on:
      - "etcd"
      - "minio"

networks:
  default:
    name: milvus
docker-compose up

Then, create a DocumentArray instance connected to Milvus. Make sure to install DocArray using the milvus tag:

pip install "docarray[milvus]"
milvus_da = DocumentArray(storage='milvus', config={
    'n_dim': 768,
    'columns': {
        'color': 'str',
        'country': 'str',
        'width': 'int',
        'height': 'int',
        'brand': 'str',
    }
})
 
# Index data
with milvus_da:
    milvus_da.extend(encoded_da)

Now, make a search query for items similar to doc with filter color='Blue':

filter = 'color == "Blue"'
results = milvus_da.find(doc, filter=filter, limit=5)
results[0].plot_image_sprites()
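
Besides plotting the images, it can be useful to peek at the matches' metadata and similarity scores. A small sketch; note that the key under which each backend stores its score differs, so we simply print whatever is attached:

# find() returns one DocumentArray of matches per query Document.
for match in results[0]:
    print(match.tags.get('color'), match.scores)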

As part of the LF AI & Data Foundation, Milvus is a production-ready, cloud-native vector database.

Read more about Milvus support in the DocArray documentation.

Weaviate

Weaviate is an open source vector search engine that stores both objects and vectors, allowing for combining vector search with structured filtering.

It offers features like fault-tolerance and scalability and is accessible either through REST or GraphQL.

Start a Weaviate server using the following YAML:

version: '3.4'
services:
  weaviate:
    command:
      - --host
      - 0.0.0.0
      - --port
      - '8080'
      - --scheme
      - http
    image: semitechnologies/weaviate:1.16.1
    ports:
      - "8080:8080"
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'none'
      ENABLE_MODULES: ''
      CLUSTER_HOSTNAME: 'node1'
docker-compose up

Then, create a DocumentArray instance connected to Weaviate. Make sure to install DocArray using the weaviate tag:

pip install "docarray[weaviate]"
weaviate_da = DocumentArray(storage='weaviate', config={
    'n_dim': 768,
    'columns': {
        'color': 'str',
        'country': 'str',
        'product_type': 'str',
        'width': 'int',
        'height': 'int',
        'brand': 'str',
    }
})
 
# Index data
weaviate_da.extend(encoded_da)

Now, make a search query for items similar to doc with filter color='Blue':

filter = {'path': 'color', 'operator': 'Equal', 'valueString': 'Blue'}
results = weaviate_da.find(doc, filter=filter, limit=5)
results[0].plot_image_sprites()

In short, Weaviate offers vector search with filtering support, plus features like replication, hybrid search, and dynamic batching.

Read more about the Weaviate integration in the DocArray documentation.

Qdrant

Qdrant is an open-source vector database written in Rust, designed to provide fast and reliable search even under high load. In fact, it ranks as the fastest on-disk vector database in DocArray's one-million-document benchmarks (as of the versions used to conduct the experiment).

Qdrant comes with filtering support and a convenient API using HTTP or gRPC.

Start a Qdrant server using the following YAML:

version: '3.4'
services:
  qdrant:
    image: qdrant/qdrant:v0.10.1
    ports:
      - "6333:6333"
      - "6334:6334"
    ulimits: # Only required for tests, as there are a lot of collections created
      nofile:
        soft: 65535
        hard: 65535
docker-compose up

Then, create a DocumentArray instance connected to Qdrant. Make sure to install DocArray using the qdrant tag:

pip install "docarray[qdrant]"
qdrant_da = DocumentArray(storage='qdrant', config={
    'n_dim': 768,
    'columns': {
        'color': 'str',
        'country': 'str',
        'product_type': 'str',
        'width': 'int',
        'height': 'int',
        'brand': 'str',
    }
})
 
# Index data
qdrant_da.extend(encoded_da)

Now, make a search query for items similar to doc with filter color='Blue':

filter = {'must': [{'key': 'color', 'match': {'value': 'Blue'}}]}
results = qdrant_da.find(doc, filter=filter, limit=5)
results[0].plot_image_sprites()

Qdrant offers a fast and reliable search service. It supports filtering and vector search at scale.

Its gRPC support also makes it convenient for indexing in batches, since indexing datasets over gRPC is much faster than over HTTP.

Read more about the Qdrant integration in the DocArray documentation.

Redis

Redis is an open-source, in-memory key-value database. It supports different kinds of data structures and is accessed through a set of commands over TCP sockets.

With version 2.4 of its RediSearch module, Redis added vector search capabilities. This means Redis can be viewed as an in-memory vector database.

Start a Redis Stack server using Docker:

docker run -d -p 6379:6379 redis/redis-stack:latest

Then, create a DocumentArray instance connected to Redis. Make sure to install DocArray using the `redis` tag:

pip install "docarray[redis]"
redis_da = DocumentArray(storage='redis', config={
    'n_dim': 768,
    'columns': {
        'color': 'str',
        'country': 'str',
        'product_type': 'str',
        'width': 'int',
        'height': 'int',
        'brand': 'str',
    }
})
 
# Index data
redis_da.extend(encoded_da)

Now, make a search query for items similar to doc with filter color='Blue':

filter = '@color:Blue'
results = redis_da.find(doc, filter=filter, limit=5)
results[0].plot_image_sprites()

Since Redis is an in-memory store, it offers faster search queries than on-disk databases. It ranks in DocArray's one-million-document benchmarks as the fastest database server.

Use it if you need fast vector search and database operations and can afford to keep the index in memory.

Read more about the Redis integration in the DocArray documentation.

ElasticSearch

ElasticSearch is an open-source, distributed and RESTful search engine. It can be used to search, store and manage data.

💡
ElasticSearch introduced Approximate Nearest Neighbor search in version 8.0.

This means that any ElasticSearch server running version 8.0 or later has vector search capabilities.

Start an ElasticSearch server using the following YAML:

version: "3.3"
services:
  elastic:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.2.0
    environment:
      - xpack.security.enabled=false
      - discovery.type=single-node
    ports:
      - "9200:9200"
    networks:
      - elastic

networks:
  elastic:
    name: elastic
docker-compose up

Then, create a DocumentArray instance connected to ElasticSearch. Make sure to install DocArray using the elasticsearch tag:

pip install "docarray[elasticsearch]"
elasticsearch_da = DocumentArray(storage='elasticsearch', config={
    'n_dim': 768,
    'columns': {
        'color': 'str',
        'country': 'str',
        'product_type': 'str',
        'width': 'int',
        'height': 'int',
        'brand': 'str',
    }
})
 
# Index data
elasticsearch_da.extend(encoded_da, request_timeout=60)

Now, make a search query for items similar to doc with filter color='Blue':

filter = {'match': {'color': 'Blue'}}
results = elasticsearch_da.find(doc, filter=filter, limit=5)
results[0].plot_image_sprites()

ElasticSearch is convenient for production use cases. It offers features like scalability, data distribution, filtering, and hybrid search, and it comes with security, observability, and cloud-native deployment options.

Read more about the ElasticSearch integration in the DocArray documentation.

OpenSearch

OpenSearch is a scalable, flexible, and extensible open-source program for search, licensed under Apache 2.0.

OpenSearch is powered by Apache Lucene and was originally forked from ElasticSearch. This means OpenSearch includes most features of ElasticSearch.

Like ElasticSearch, OpenSearch includes Approximate Nearest Neighbor Search, allowing it to perform vector similarity search.

Start an OpenSearch server using the following YAML:

version: "3.3"
services:
  opensearch:
    image: opensearchproject/opensearch:2.4.0
    environment:
      - plugins.security.disabled=true
      - discovery.type=single-node
    ports:
      - "9900:9200"
    networks:
      - os
networks:
  os:
    name: os
docker-compose up

Then, create a DocumentArray instance connected to OpenSearch. Make sure to install DocArray using the opensearch tag:

pip install "docarray[opensearch]"
opensearch_da = DocumentArray(storage='opensearch', config={
    'n_dim': 768,
    'columns': {
        'color': 'str',
        'country': 'str',
        'product_type': 'str',
        'width': 'int',
        'height': 'int',
        'brand': 'str',
    }
})
 
# Index data
opensearch_da.extend(encoded_da)

Now, make a search query for items similar to doc with filter color='Blue':

filter = {'match': {'color': 'Blue'}}
results = opensearch_da.find(doc, filter=filter, limit=5)
results[0].plot_image_sprites()

Like ElasticSearch, OpenSearch is convenient for production use cases. It offers better integration with AWS Cloud and has a more permissive license than ElasticSearch.

Read more about the OpenSearch integration in the DocArray documentation.

AnnLite

AnnLite is a Python library for fast Approximate Nearest Neighbor search with filtering capabilities. Built by Jina AI, it offers an easy vector search experience as a library (no client-server architecture).

To use AnnLite, install DocArray using the annlite tag:

pip install "docarray[annlite]"
annlite_da = DocumentArray(storage='annlite', config={
    'n_dim': 768,
    'columns': {
        'color': 'str',
        'country': 'str',
        'product_type': 'str',
        'width': 'int',
        'height': 'int',
        'brand': 'str',
    }
})
 
# Index data
annlite_da.extend(encoded_da)

Now, make a search query for items similar to doc with filter color='Blue':

filter = {'color': {'$eq': 'Blue'}}
results = annlite_da.find(doc, filter=filter, limit=5)
results[0].plot_image_sprites()

AnnLite is easy to install and use. With DocArray, it offers a great local vector search with filtering capabilities. Since it does not rely on a client-server architecture, there is no network overhead, yet it implements HNSW for fast vector search.

This explains why it ranks first on Jina AI's one-million-scale benchmark.
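
Because AnnLite builds its HNSW index locally, you can also tune the usual recall/speed trade-offs at construction time. A hedged sketch, assuming the config keys ef_construction, ef_search, and max_connection are available in your DocArray/AnnLite version:

# Higher values generally mean better recall at the cost of slower
# indexing/search and more memory.
tuned_da = DocumentArray(storage='annlite', config={
    'n_dim': 768,
    'metric': 'cosine',
    'ef_construction': 200,  # build-time accuracy/speed trade-off
    'ef_search': 100,        # query-time accuracy/speed trade-off
    'max_connection': 16,    # graph connectivity per node
})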

Read more about the AnnLite integration in the DocArray documentation.

Conclusion

Vector databases let us efficiently leverage unstructured data and extract useful insights from it. They can perform vector searches, which are useful for similarity matching, recommendations, analysis, etc.

Choosing the right database can be challenging, depending on your resources, use cases, and requirements. For example, you would pick a different database when optimizing for raw speed than when optimizing for a small memory footprint.

To help you decide, we've published benchmarks of vector databases using DocArray. That said, you may need to test out a few databases to find your match.

Normally that testing would mean learning each database before you could use it, taking lots of time and effort. But with DocArray's unified API you can speak to all of these databases the same way. All it takes is changing one or two lines of code, and bam, you're using a new database. You can check the docs for more information and install it with pip install docarray.
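
To make the "one or two lines of code" point concrete, here is a rough sketch of how all the examples above collapse into a single helper where only the storage name, its config, and the backend-specific filter syntax change (the function name and parameters are ours, not part of DocArray):

def index_and_search(storage, config, data, query_doc, filter_query, limit=5):
    """Index `data` into the chosen backend and run one filtered vector search."""
    da = DocumentArray(storage=storage, config=config)
    with da:  # the context manager persists/syncs the data where needed
        da.extend(data)
    return da.find(query_doc, filter=filter_query, limit=limit)

# Example: the Qdrant setup from above, expressed through the helper.
results = index_and_search(
    storage='qdrant',
    config={'n_dim': 768, 'columns': {'color': 'str'}},
    data=encoded_da,
    query_doc=doc,
    filter_query={'must': [{'key': 'color', 'match': {'value': 'Blue'}}]},
)
results[0].plot_image_sprites()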