DocArray 0.33 Update

DocArray is a library for representing, sending and storing multi-modal data, perfect for Machine Learning applications.

[Promotional graphic: docarray, "What's new in 0.33?"]

Release Note (0.33.0)

This release contains 1 new feature, 1 performance improvement, 9 bug fixes, and 4 documentation improvements.


🆕 Features

Allow coercion between different Tensor types (#1552) (#1588)

You can now coerce an NdArray or TensorFlowTensor to a TorchTensor, and the other way around.

from docarray import BaseDoc
from docarray.typing import TorchTensor
import numpy as np

class MyTensorsDoc(BaseDoc):
    tensor: TorchTensor

doc = MyTensorsDoc(tensor=np.zeros(512))
doc.summary()
📄 MyTensorsDoc : 0a10f88 ...
╭─────────────────────┬────────────────────────────────────────────────────────╮
│ Attribute           │ Value                                                  │
├─────────────────────┼────────────────────────────────────────────────────────┤
│ tensor: TorchTensor │ TorchTensor of shape (512,), dtype: torch.float64      │
╰─────────────────────┴────────────────────────────────────────────────────────╯
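
Coercion works in the other direction too. Below is a minimal sketch (assuming torch is installed; the MyNdArrayDoc schema is made up for illustration) that assigns a torch tensor to an NdArray field:

import torch

from docarray import BaseDoc
from docarray.typing import NdArray

class MyNdArrayDoc(BaseDoc):
    tensor: NdArray

# the torch tensor is coerced to an NdArray on assignment
doc = MyNdArrayDoc(tensor=torch.zeros(512))
doc.summary()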

🚀 Performance

Avoid stacking embeddings for every search (#1586)

We have improved the performance of the find interface of InMemoryExactNNIndex, giving a ~2x speedup.

The script used to measure this is as follows:

from torch import rand
from time import perf_counter

from docarray import BaseDoc, DocList
from docarray.index.backends.in_memory import InMemoryExactNNIndex
from docarray.typing import TorchTensor

class MyDocument(BaseDoc):
    embedding: TorchTensor
    embedding2: TorchTensor
    embedding3: TorchTensor

def generate_doc_list(num_docs: int, dims: int) -> DocList[MyDocument]:
    return DocList[MyDocument](
        [
            MyDocument(
                embedding=rand(dims),
                embedding2=rand(dims),
                embedding3=rand(dims),
            )
            for _ in range(num_docs)
        ]
    )

num_docs, num_queries, dims = 500000, 1000, 128
data_list = generate_doc_list(num_docs, dims)
queries = generate_doc_list(num_queries, dims)

index = InMemoryExactNNIndex[MyDocument](data_list)

start = perf_counter()
for _ in range(5):
    matches, scores = index.find_batched(queries, search_field='embedding')

print(f"Number of queries: {num_queries} \n"
      f"Number of indexed documents: {num_docs} \n"
      f"Total time: {(perf_counter() - start)/5} seconds")

๐Ÿž Bug Fixes

Respect limit parameter in filter for index backends (#1618)

InMemoryExactNNIndex and HnswDocumentIndex now respect the limit parameter in the filter API.
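
As an illustration, here is a minimal sketch (the FilterDoc schema and the MongoDB-style $eq filter query are assumptions made for this example):

import numpy as np

from docarray import BaseDoc, DocList
from docarray.index import InMemoryExactNNIndex
from docarray.typing import NdArray

class FilterDoc(BaseDoc):
    category: str
    embedding: NdArray[16]

docs = DocList[FilterDoc](
    [FilterDoc(category='news', embedding=np.random.rand(16)) for _ in range(50)]
)
index = InMemoryExactNNIndex[FilterDoc](docs)

# even though all 50 documents match, only `limit` of them are returned
filtered = index.filter({'category': {'$eq': 'news'}}, limit=5)
assert len(filtered) == 5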

HnswDocumentIndex can search with limit greater than number of documents (#1611)

HnswDocumentIndex now allows calling find with a limit larger than the number of indexed documents.
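
For example (a sketch, not taken from the release notes), asking for ten results over three indexed documents now simply returns at most three matches:

import numpy as np

from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray

class SmallDoc(BaseDoc):
    embedding: NdArray[128]

index = HnswDocumentIndex[SmallDoc](work_dir='./tmp_limit')
index.index(DocList[SmallDoc]([SmallDoc(embedding=np.random.rand(128)) for _ in range(3)]))

# limit exceeds the number of indexed documents; at most 3 matches come back
matches, scores = index.find(np.random.rand(128), search_field='embedding', limit=10)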

Allow updating HnswDocumentIndex (#1604)

HnswDocumentIndex now allows re-indexing documents with an existing id, which updates the stored documents.
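
A rough sketch of what this enables (the UpdateDoc schema, the work_dir and the id-based lookup via index[doc.id] are assumptions for this example):

import numpy as np

from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray

class UpdateDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

index = HnswDocumentIndex[UpdateDoc](work_dir='./tmp_update')

doc = UpdateDoc(text='before', embedding=np.random.rand(128))
index.index(DocList[UpdateDoc]([doc]))

# re-indexing a document with the same id now updates the stored copy
doc.text = 'after'
index.index(DocList[UpdateDoc]([doc]))
assert index[doc.id].text == 'after'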

Dynamically resize internal index to adapt to increasing number of documents (#1602)

HnswDocumentIndex now allows indexing more documents than max_elements, dynamically resizing the internal index as it grows.
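
A sketch of the behaviour, assuming max_elements can be set per column via pydantic's Field (as HnswDocumentIndex supports for its column configuration):

import numpy as np
from pydantic import Field

from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray

class GrowingDoc(BaseDoc):
    # assumed per-column setting: start with capacity for only 10 vectors
    embedding: NdArray[16] = Field(max_elements=10)

index = HnswDocumentIndex[GrowingDoc](work_dir='./tmp_resize')

# indexing past max_elements now resizes the internal HNSW index instead of failing
docs = DocList[GrowingDoc]([GrowingDoc(embedding=np.random.rand(16)) for _ in range(100)])
index.index(docs)
assert index.num_docs() == 100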

Fix simple usage of HnswDocumentIndex (#1596)

from docarray.index import HnswDocumentIndex
from docarray import DocList, BaseDoc
from docarray.typing import NdArray
import numpy as np

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

docs = [MyDoc(text='hey', embedding=np.random.rand(128)) for i in range(200)]
index = HnswDocumentIndex[MyDoc](work_dir='./tmp', index_name='index')
index.index(docs=DocList[MyDoc](docs))
resp = index.find_batched(queries=DocList[MyDoc](docs[0:3]), search_field='embedding')

Previously, this basic usage threw an exception:

TypeError: ModelMetaclass object argument after ** must be a mapping, not MyDoc

Now, it works as expected.

Fix InMemoryExactNNIndex index initialization with nested DocList (#1582)

Instantiating an InMemoryExactNNIndex with a document schema that has a nested DocList previously threw an error:

from docarray import BaseDoc, DocList
from docarray.documents import TextDoc
from docarray.index import InMemoryExactNNIndex

class MyDoc(BaseDoc):
    text: str
    d_list: DocList[TextDoc]

index = InMemoryExactNNIndex[MyDoc]()
TypeError: docarray.index.abstract.BaseDocIndex.__init__() got multiple values for keyword argument 'db_config'

Now it can be successfully instantiated.

Fix summary of document with list (#1595)

Calling summary on a document with a List attribute previously showed the wrong type:

from docarray import BaseDoc, DocList
from typing import List

class TestDoc(BaseDoc):
    str_list: List[str]

dl = DocList[TestDoc]([TestDoc(str_list=[]), TestDoc(str_list=["1"])])
dl.summary()

Previous output:

╭─────── DocList Summary ───────╮
│                               │
│   Type     DocList[TestDoc]   │
│   Length   2                  │
│                               │
╰───────────────────────────────╯
╭─── Document Schema ───╮
│                       │
│   TestDoc             │
│   └── str_list: str   │
│                       │
╰───────────────────────╯

New output:

╭─────── DocList Summary ───────╮
│                               │
│   Type     DocList[TestDoc]   │
│   Length   2                  │
│                               │
╰───────────────────────────────╯
╭────── Document Schema ──────╮
│                             │
│   TestDoc                   │
│   └── str_list: List[str]   │
│                             │
╰─────────────────────────────╯

Solve issues caused by issubclass (#1594)

DocArray relies heavily on Python's issubclass, which caused multiple issues. We now use a safe variant that accounts for edge cases and typing constructs.
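
The gist of the change, as a rough sketch (safe_issubclass here is illustrative; DocArray's actual helper may differ):

from typing import List

def safe_issubclass(candidate, cls) -> bool:
    # plain issubclass() raises TypeError for typing constructs such as List[str];
    # the guarded version treats those edge cases as "not a subclass" instead
    try:
        return isinstance(candidate, type) and issubclass(candidate, cls)
    except TypeError:
        return False

print(safe_issubclass(List[str], object))  # False, instead of raising TypeError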

Make example payload a string rather than bytes (#1587)

The example payload generated for a document schema with a Tensor attribute was previously of type bytes. It is now a str.

from docarray import DocList, BaseDoc
from docarray.documents import TextDoc
from docarray.typing import NdArray
import numpy as np


class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

print(f'{type(MyDoc.schema()["properties"]["embedding"]["example"])}')
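# output: <class 'str'> (previously <class 'bytes'>)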

📗 Documentation Improvements

  • Add forward declaration steps to example to avoid pickling error (#1615)
  • Fix n_dim to dim (#1610)
  • Add "in memory" to documentation as list of supported vector indexes (#1607)
  • Add a tensor section (#1576)

🤟 Contributors

We would like to thank all contributors to this release: