DocArray 0.31 Update

DocArray is a library for representing, sending and storing multi-modal data, perfect for Machine Learning applications.

Engineering Group

May 8, 2023 • 4 min read

This release contains 2 breaking changes, 4 new features, 11 bug fixes, and several documentation improvements.

Release Note (`v0.31.0`)

💥 Breaking changes

Return type of `DocVec` Optional Tensor (#1472)

Optional tensor fields in a DocVec now return None instead of a list of NaN values if the column does not hold any tensor.

This code snippet shows the breaking change:

from typing import Optional

from docarray import BaseDoc, DocVec
from docarray.typing import NdArray

class MyDoc(BaseDoc):
    tensor: Optional[NdArray[10]]

docs = DocVec[MyDoc]([MyDoc() for j in range(2)])

print(docs.tensor)

Version	Return value
0.30.0	`[nan nan]`
0.31.0	`None`
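
Code that consumed the old NaN-filled column may need to accept both return values while an upgrade is in progress. The following is a minimal sketch, assuming the column is either None or a one-dimensional sequence of floats; `tensor_column_is_empty` is a hypothetical helper and not part of DocArray.

```python
import math
from typing import Iterable, Optional


def tensor_column_is_empty(column: Optional[Iterable[float]]) -> bool:
    """Return True if the column holds no tensor data.

    Hypothetical helper: covers `None` (the v0.31.0 behavior) and an
    all-NaN sequence (the v0.30.0 behavior).
    """
    if column is None:
        return True
    # An all-NaN (or empty) sequence is treated as "no tensor data".
    return all(math.isnan(value) for value in column)
```

With such a helper, a call like `tensor_column_is_empty(docs.tensor)` would give the same answer before and after the upgrade.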

Default index collection names

Most vector databases have a concept similar to a 'table' in a relational database; this concept is usually called 'collection', 'index', 'class' or similar.

In DocArray v0.30.0, every Document Index backend defined its own default name for this, i.e. a default index_name or collection_name.

Starting with DocArray v0.31.0, the default index_name/collection_name will be derived from the document schema name:

from docarray.index.backends.weaviate import WeaviateDocumentIndex
from docarray import BaseDoc

class MyDoc(BaseDoc):
    pass

# With v0.30.0, the line below defaults to `index_name='Document'`.
# This was the default regardless of the Document Index schema.
# With v0.31.0, the line below defaults to `index_name='MyDoc'`
# The default now depends on the schema, i.e. the `MyDoc` class.
store = WeaviateDocumentIndex[MyDoc]()

If you create and persist a Document Index with v0.30.0 and then try to access it with v0.31.0 without manually specifying an index name, an exception will be raised.

You can fix this by manually specifying the index name to match the old default:

# Create new Document Index using v0.30.0
store = WeaviateDocumentIndex[MyDoc](host=..., port=...)
# Access it using v0.31.0
store = WeaviateDocumentIndex[MyDoc](host=..., port=..., index_name='Document')

The table below summarizes the change for all database backends:

Document Index	DBConfig argument	Default in v0.30.0	Default in v0.31.0
WeaviateDocumentIndex	`index_name`	'Document'	Schema class name
QdrantDocumentIndex	`collection_name`	'documents'	Schema class name
ElasticDocIndex	`index_name`	'index__' + a random id	Schema class name
ElasticV7DocIndex	`index_name`	'index__' + a random id	Schema class name
HnswDocumentIndex	n/a	n/a	n/a
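
For migration scripts, the old defaults from the table can be collected into a small lookup. This is only a sketch: `LEGACY_DEFAULT_NAMES` and `legacy_index_name` are hypothetical names, not DocArray APIs. Backends whose old default contained a random id, or that have no such argument, map to None because the old name cannot be reconstructed from the table alone.

```python
from typing import Dict, Optional

# Old (v0.30.0) default names, copied from the table above.
# None marks backends whose old default cannot be reconstructed
# (random suffix) or that have no such argument.
LEGACY_DEFAULT_NAMES: Dict[str, Optional[str]] = {
    "WeaviateDocumentIndex": "Document",
    "QdrantDocumentIndex": "documents",
    "ElasticDocIndex": None,
    "ElasticV7DocIndex": None,
    "HnswDocumentIndex": None,
}


def legacy_index_name(backend: str) -> Optional[str]:
    """Return the fixed v0.30.0 default name for a backend class name, if any."""
    if backend not in LEGACY_DEFAULT_NAMES:
        raise KeyError(f"unknown backend: {backend}")
    return LEGACY_DEFAULT_NAMES[backend]
```

The returned value would be passed as index_name for Weaviate or as collection_name for Qdrant. For the Elasticsearch backends, the actual name of the existing index has to be looked up in the cluster.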

🆕 Features

Add `InMemoryDocIndex` (#1441)

In this version we have introduced the InMemoryDocIndex Document Index, exposed as the InMemoryExactNNIndex class, which performs exact vector search in memory (as opposed to the approximate nearest neighbor search used by vector databases).

It can be used for prototyping and suits small datasets (roughly 1k-10k documents); a vector database suits larger scales but comes with a performance overhead at smaller ones.

from docarray import BaseDoc, DocList
from docarray.index.backends.in_memory import InMemoryExactNNIndex
from docarray.typing import NdArray

import numpy as np

class MyDoc(BaseDoc):
    tensor: NdArray[512]

docs = DocList[MyDoc](MyDoc(tensor=i*np.ones(512)) for i in range(10))

doc_index = InMemoryExactNNIndex[MyDoc]()
doc_index.index(docs)

# `find` takes `limit` (not `top_k`) to cap the number of returned matches
print(doc_index.find(3*np.ones(512), search_field='tensor', limit=3))

FindResult(documents=<DocList[MyDoc] (length=3)>, scores=array([1., 1., 1.]))
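
Exact search means scoring the query against every stored vector and keeping the best matches, rather than consulting an approximate index structure. The sketch below illustrates the idea in pure Python. It assumes cosine similarity as the metric (an assumption, not stated in this release note) and is not how the Document Index is implemented. The function names and the zero-vector handling are choices made for this example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; zero-length vectors score 0.0 (a choice made here)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_search(
    query: Sequence[float], vectors: Sequence[Sequence[float]], limit: int
) -> List[Tuple[int, float]]:
    """Score every stored vector against the query and keep the best `limit`."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    # Sort by score, highest first; ties keep their insertion order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]
```

Because cosine similarity ignores vector length, stored vectors that are scaled copies of the query all score 1. If the index uses this metric, that would explain the tied scores in the output above, where every document is a multiple of a vector of ones.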

`DocList` inherits from Python `list` (#1457)

DocList is now a subclass of Python's list, so all the methods available on Python lists can be used on DocList objects. For example, you can now use len on DocList objects, and tools like Pydantic or FastAPI will be able to work with it more easily.

Add `len` to `DocIndex` (#1454)

You can now call len(vector_index), which is equivalent to vector_index.num_docs().
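
The equivalence can be pictured as `__len__` delegating to `num_docs()`. The toy class below is a hypothetical stand-in for a Document Index, written only to show the pattern; it does not reflect DocArray's actual implementation.

```python
from typing import Any, List


class TinyIndex:
    """Toy stand-in for a Document Index (hypothetical, not a DocArray class)."""

    def __init__(self) -> None:
        self._docs: List[Any] = []

    def index(self, docs: List[Any]) -> None:
        """Store the given documents."""
        self._docs.extend(docs)

    def num_docs(self) -> int:
        """Return the number of stored documents."""
        return len(self._docs)

    def __len__(self) -> int:
        # Delegating keeps `len(index)` and `index.num_docs()` in sync.
        return self.num_docs()
```

One consequence of defining `__len__` in Python is that an object with length 0 is falsy, so `if index:` is False for an empty instance of a class like this.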

Other minor features

Add a to_json alias to BaseDoc (#1494)

🐞 Bug Fixes

Point to older versions when importing `Document` or `DocumentArray` (#1422)

Trying to load Document or DocumentArray from DocArray would previously raise an error, saying that you needed to downgrade your version of DocArray if you wanted to use these two objects. This behavior has been fixed.

Fix `AnyDoc.from_protobuf` (#1437)

AnyDoc can now read any BaseDoc protobuf file. The same applies to DocList.

Other bug fixes

Fix extend to DocList (#1493)
Fix bug when calling dict() on BaseDoc (#1481)
Fix bug when calling json() on BaseDoc (#1481)
Support Pandas 2.0 by using pd.concat() instead of df.append() in to_dataframe() to avoid warning (#1478)
Add logs to Elasticsearch index (#1427)
Fix a bug in Document Index where Torch tensors that required grad were not able to be converted to ndarray (#1429)
Fix a bug with HNSW (#1426)
Hubble Binary format version bump (#1414)
Save index during creation for hnswlib (#1424)

📗 Documentation Improvements

Fix FastAPI docs (#1453)
Index predefined Documents (#1434)
Clean up data types section (#1412)
Remove duplicate API reference section (#1408)
Docindex URLs (#1433)
Fix Install commands hint (#1421)
Add Google Analytics (#1432)
Add install instructions for hnswlib and elastic document indexes (#1431)
Various fixes (#1436, #1417, #1423, #1418, #1411, #1419)

🤘 Contributors

We would like to thank all contributors to this release:

Alex Cureton-Griffiths (@alexcg1)
samsja (@samsja)
Johannes Messner (@JohannesMessner)
Anne Yang (@AnneYang720)
Scott Martens (@scott-martens)
カレン (@RStar2022)
Aman Agarwal (@agaraman0)
Yanlong Wang (@nomagick)
Charlotte Gerhaher (@anna-charlotte)
