DocArray 0.31 Update

DocArray is a library for representing, sending and storing multi-modal data, perfect for Machine Learning applications.


This release contains 2 breaking changes, 4 new features, 11 bug fixes, and several documentation improvements.

Release Note (v0.31.0)

💥 Breaking changes

Return type of DocVec Optional Tensor (#1472)

Optional tensor fields in a DocVec now return None instead of an array of NaN values if the column does not hold any tensor.

This code snippet shows the breaking change:

from typing import Optional

from docarray import BaseDoc, DocVec
from docarray.typing import NdArray

class MyDoc(BaseDoc):
    tensor: Optional[NdArray[10]]

docs = DocVec[MyDoc]([MyDoc() for j in range(2)])

print(docs.tensor)

| Version | Return type |
|---------|-------------|
| 0.30.0  | [nan nan]   |
| 0.31.0  | None        |
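Code that relied on the old NaN placeholders now needs an explicit None check. The sketch below is one way to restore placeholder values; `nan_column_if_missing` is a hypothetical helper and not part of DocArray, and it assumes a plain list of NaN values is an acceptable stand-in for the old return value.

```python
import math
from typing import Any, List, Union


def nan_column_if_missing(column: Any, n_docs: int) -> Union[Any, List[float]]:
    """Return `column` unchanged, or `n_docs` NaN placeholders if it is None.

    Hypothetical helper for illustration; not part of DocArray.
    """
    # Compare with `is None`: array-like objects do not support truthiness checks.
    if column is None:
        return [math.nan] * n_docs
    return column


# With v0.31.0, `docs.tensor` may be None, so it could be wrapped like this:
#   values = nan_column_if_missing(docs.tensor, len(docs))
empty = nan_column_if_missing(None, 2)          # column without tensors
filled = nan_column_if_missing([0.5, 1.5], 2)   # column with values is untouched
```

In most cases it is simpler to branch on `docs.tensor is None` directly; the helper only matters for downstream code that expects one value per document.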

Default index collection names

Most vector databases have a concept similar to a 'table' in a relational database; this concept is usually called 'collection', 'index', 'class' or similar.

In DocArray v0.30.0, every Document Index backend defined its own default name for this, i.e. a default index_name or collection_name.

Starting with DocArray v0.31.0, the default index_name/collection_name will be derived from the document schema name:

from docarray.index.backends.weaviate import WeaviateDocumentIndex
from docarray import BaseDoc

class MyDoc(BaseDoc):
    pass

# With v0.30.0, the line below defaults to `index_name='Document'`.
# This was the default regardless of the Document Index schema.
# With v0.31.0, the line below defaults to `index_name='MyDoc'`
# The default now depends on the schema, i.e. the `MyDoc` class.
store = WeaviateDocumentIndex[MyDoc]()

If you create and persist a Document Index with v0.30.0 and then try to access it with v0.31.0 without manually specifying an index name, an exception will be raised.

You can fix this by manually specifying the index name to match the old default:

# Create new Document Index using v0.30.0
store = WeaviateDocumentIndex[MyDoc](host=..., port=...)
# Access it using v0.31.0
store = WeaviateDocumentIndex[MyDoc](host=..., port=..., index_name='Document')

The table below summarizes the change for all database backends:

| DB                    | Config argument | Default in v0.30.0        | Default in v0.31.0 |
|-----------------------|-----------------|---------------------------|--------------------|
| WeaviateDocumentIndex | index_name      | 'Document'                | Schema class name  |
| QdrantDocumentIndex   | collection_name | 'documents'               | Schema class name  |
| ElasticDocIndex       | index_name      | 'index__' + a random id   | Schema class name  |
| ElasticV7DocIndex     | index_name      | 'index__' + a random id   | Schema class name  |
| HnswDocumentIndex     | n/a             | n/a                       | n/a                |
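If you have several indexes created with v0.30.0, the old defaults can be kept in one place. This is a minimal sketch, not a DocArray API: `LEGACY_DEFAULT_NAMES` and `legacy_name_kwargs` are hypothetical names, and the values are copied from the table above. The Elasticsearch backends are left out on purpose, because their old default included a random id that cannot be reconstructed.

```python
from typing import Dict

# Defaults used by v0.30.0, copied from the table above.
# Hypothetical lookup table for illustration; not part of DocArray.
LEGACY_DEFAULT_NAMES: Dict[str, Dict[str, str]] = {
    "WeaviateDocumentIndex": {"index_name": "Document"},
    "QdrantDocumentIndex": {"collection_name": "documents"},
}


def legacy_name_kwargs(backend: str) -> Dict[str, str]:
    """Return the keyword argument that selects the v0.30.0 default name."""
    try:
        # Return a copy so callers cannot mutate the lookup table.
        return dict(LEGACY_DEFAULT_NAMES[backend])
    except KeyError:
        raise ValueError(
            f"No fixed v0.30.0 default name is known for {backend!r}"
        ) from None


weaviate_kwargs = legacy_name_kwargs("WeaviateDocumentIndex")
# Possible usage with v0.31.0:
#   store = WeaviateDocumentIndex[MyDoc](host=..., port=..., **weaviate_kwargs)
```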

🆕 Features

Add InMemoryDocIndex (#1441)

This version introduces InMemoryDocIndex, a Document Index that performs exact vector search in memory (as opposed to the approximate nearest neighbor search used by vector databases).

InMemoryDocIndex is intended for prototyping and for small datasets (roughly 1k-10k documents). Vector databases are better suited to larger scales, but add performance overhead at smaller ones.

from docarray import BaseDoc, DocList
from docarray.index.backends.in_memory import InMemoryExactNNIndex
from docarray.typing import NdArray

import numpy as np

class MyDoc(BaseDoc):
    tensor: NdArray[512]

docs = DocList[MyDoc](MyDoc(tensor=i*np.ones(512)) for i in range(10))

doc_index = InMemoryExactNNIndex[MyDoc]()
doc_index.index(docs)

print(doc_index.find(3*np.ones(512), search_field='tensor', top_k=3))
FindResult(documents=<DocList[MyDoc] (length=10)>, scores=array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))
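To make "exact" search concrete: the query is compared against every stored vector and the best matches are kept, with no approximation. The sketch below shows the idea in plain Python using cosine similarity. It is an illustration only, not DocArray's implementation; `cosine_similarity` and `exact_search` are hypothetical names, and the choice of cosine similarity is an assumption.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_search(
    query: Sequence[float], vectors: Sequence[Sequence[float]], limit: int = 3
) -> List[Tuple[int, float]]:
    """Score every vector against the query and return the `limit` best matches.

    Each result is an (index, score) pair, sorted by descending score.
    """
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matches = exact_search([1.0, 0.0], vectors, limit=2)
```

Because every vector is scored, the cost grows linearly with the number of documents, which is why this approach fits small collections.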

DocList inherits from Python list (#1457)

DocList is now a subclass of Python's list. This means that all methods available on Python lists can now be used on DocList objects. For example, you can call len on a DocList, and tools like Pydantic or FastAPI can work with it more easily.
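The benefit of subclassing list can be seen with a small stand-in class. `TypedList` below is hypothetical and much simpler than DocList; it only shows that a list subclass inherits sorting, slicing, len and isinstance checks while still adding its own validation.

```python
from typing import Any, Iterable


class TypedList(list):
    """Hypothetical stand-in for DocList: a list restricted to one item type."""

    def __init__(self, item_type: type, items: Iterable[Any] = ()) -> None:
        self.item_type = item_type
        items = list(items)
        for item in items:
            self._check(item)
        super().__init__(items)

    def _check(self, item: Any) -> None:
        if not isinstance(item, self.item_type):
            raise TypeError(f"expected {self.item_type.__name__}, got {item!r}")

    def append(self, item: Any) -> None:
        # Only `append` is validated here, to keep the sketch short.
        self._check(item)
        super().append(item)


names = TypedList(str, ["b", "a"])
names.append("c")
names.sort()            # inherited from list
first_two = names[:2]   # slicing is inherited too
```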

Add len to DocIndex (#1454)

You can now call len(vector_index), which is equivalent to vector_index.num_docs().
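The equivalence can be pictured as `__len__` delegating to `num_docs()`. The class below is a hypothetical stand-in used only to illustrate that pattern; it makes no claim about how DocArray implements it.

```python
from typing import Any, Iterable, List


class TinyIndex:
    """Hypothetical stand-in for a Document Index; not DocArray's code."""

    def __init__(self) -> None:
        self._docs: List[Any] = []

    def index(self, docs: Iterable[Any]) -> None:
        self._docs.extend(docs)

    def num_docs(self) -> int:
        return len(self._docs)

    def __len__(self) -> int:
        # len(index) simply reports the same count as num_docs().
        return self.num_docs()


tiny_index = TinyIndex()
tiny_index.index(["doc-a", "doc-b", "doc-c"])
```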

Other minor features

  • Add a to_json alias to BaseDoc (#1494)

๐Ÿž Bug Fixes

Point to older versions when importing Document or DocumentArray (#1422)

Trying to load Document or DocumentArray from DocArray would previously raise an error, saying that you needed to downgrade your version of DocArray if you wanted to use these two objects. This behavior has been fixed.

Fix AnyDoc.from_protobuf (#1437)

AnyDoc can now read any BaseDoc protobuf file. The same applies to DocList.

Other bug fixes

  • Fix extend to DocList (#1493)
  • Fix bug when calling dict() on BaseDoc (#1481)
  • Fix bug when calling json() on BaseDoc (#1481)
  • Support Pandas 2.0 by using pd.concat() instead of df.append() in to_dataframe() to avoid warning (#1478)
  • Add logs to Elasticsearch index (#1427)
  • Fix a bug in Document Index where Torch tensors that required grad were not able to be converted to ndarray (#1429)
  • Fix a bug with HNSW (#1426)
  • Hubble Binary format version bump (#1414)
  • Save index during creation for hnswlib (#1424)

📗 Documentation Improvements

  • Fix FastAPI docs (#1453)
  • Index predefined Documents (#1434)
  • Clean up data types section (#1412)
  • Remove duplicate API reference section (#1408)
  • Docindex URLs (#1433)
  • Fix Install commands hint (#1421)
  • Add Google Analytics (#1432)
  • Add install instructions for hnswlib and elastic document indexes (#1431)
  • Various fixes (#1436, #1417, #1423, #1418, #1411, #1419)

🤘 Contributors

We would like to thank all contributors to this release: