DocArray 0.34 Update

DocArray is a library for representing, sending and storing multi-modal data, perfect for Machine Learning applications.



Release Note (0.34.0)

This release contains 2 breaking changes, 3 new features, 11 bug fixes, and 2 documentation improvements.

Release v0.34.0 · docarray/docarray (release time: 2023-06-21 08:15:43)

💣 Breaking Changes

Terminate Python 3.7 support

⚠️
DocArray now requires Python 3.8 or later. We can no longer guarantee compatibility with Python 3.7.

We decided to drop it for two reasons:

  • Several dependencies of DocArray require Python 3.8.
  • Python long-term support for 3.7 is ending this week. This means there will no longer be security updates for Python 3.7, making this a good time for us to change our requirements.
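
If your own project needs to fail early on an unsupported interpreter, a small guard along these lines may help. This is a sketch, not part of DocArray: the helper name meets_minimum and the error message are hypothetical, and only the 3.8 minimum comes from this release note.

```python
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in this release note


def meets_minimum(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


# Hypothetical guard: stop before importing anything that needs 3.8+
if not meets_minimum():
    raise RuntimeError(
        f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]} or later is required, "
        f"found {sys.version_info.major}.{sys.version_info.minor}"
    )
```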

Changes to DocVec Protobuf definition (#1639)

In order to fix the bug in the DocVec Protobuf serialization described in #1561, we have changed the DocVec .proto definition.

This means that DocVec objects serialized with DocArray v0.33.0 or earlier cannot be deserialized with DocArray v0.34.0 or later, and vice versa.

⚠️
We strongly recommend that everyone using Protobuf with DocVec upgrade to DocArray v0.34.0 or later.
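
If several services exchange DocVec Protobuf messages, it can be useful to check whether a writer and a reader sit on opposite sides of this change. The sketch below is an illustration only: the function names are hypothetical, it assumes plain numeric version strings such as 0.33.0, and it encodes nothing beyond the v0.34.0 boundary described above.

```python
PROTO_CHANGE = (0, 34, 0)  # first release with the new DocVec .proto definition


def parse_version(version: str) -> tuple:
    """Turn '0.33.0' or 'v0.34.0' into a comparable tuple of three ints."""
    parts = [int(part) for part in version.lstrip("v").split(".")]
    parts += [0] * (3 - len(parts))  # pad '0.34' to (0, 34, 0)
    return tuple(parts)


def crosses_docvec_proto_change(writer: str, reader: str) -> bool:
    """True if one version predates v0.34.0 and the other does not."""
    writer_is_new = parse_version(writer) >= PROTO_CHANGE
    reader_is_new = parse_version(reader) >= PROTO_CHANGE
    return writer_is_new != reader_is_new
```

A result of True means the pair straddles the change, so DocVec Protobuf messages written by one side cannot be read by the other.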

🆕 Features

Allow users to check if a Document is already indexed in a DocIndex (#1633)

You can now check if a Document has already been indexed by using the in keyword:

from docarray.index import InMemoryExactNNIndex
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
import numpy as np

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

docs = DocList[MyDoc](
        [MyDoc(text="Example text", embedding=np.random.rand(128))
         for _ in range(2000)])

index = InMemoryExactNNIndex[MyDoc](docs)
assert docs[0] in index
assert MyDoc(text='New text', embedding=np.random.rand(128)) not in index

Support subindexes in InMemoryExactNNIndex (#1617)

You can now use the find_subindex method with InMemoryExactNNIndex.

import numpy as np
from pydantic import Field

from docarray import BaseDoc, DocList
from docarray.index import InMemoryExactNNIndex
from docarray.typing import AnyTensor, ImageUrl, VideoUrl

class ImageDoc(BaseDoc):
    url: ImageUrl
    tensor_image: AnyTensor = Field(space='cosine', dim=64)


class VideoDoc(BaseDoc):
    url: VideoUrl
    images: DocList[ImageDoc]
    tensor_video: AnyTensor = Field(space='cosine', dim=128)


class MyDoc(BaseDoc):
    docs: DocList[VideoDoc]
    tensor: AnyTensor = Field(space='cosine', dim=256)

doc_index = InMemoryExactNNIndex[MyDoc]()
...

# find by the `ImageDoc` tensor when index is populated
root_docs, sub_docs, scores = doc_index.find_subindex(
    np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3
)

Flexible tensor types for Protobuf deserialization (#1645)

You can deserialize any DocVec Protobuf message to any tensor type, by passing the tensor_type parameter to from_protobuf.

This means that you can choose at deserialization time whether to work with NumPy, PyTorch, or TensorFlow tensors.

from docarray import BaseDoc, DocVec
from docarray.typing import TensorFlowTensor

class MyDoc(BaseDoc):
    tensor: TensorFlowTensor

da = DocVec[MyDoc](...)  # doesn't matter what tensor_type is here

proto = da.to_protobuf()
da_after = DocVec[MyDoc].from_protobuf(proto, tensor_type=TensorFlowTensor)

assert isinstance(da_after.tensor, TensorFlowTensor)

Add DBConfig to InMemoryExactNNIndex

InMemoryExactNNIndex used to take a single constructor parameter, index_file_path, unlike the other index classes, which accept their own DBConfig. index_file_path is now part of DBConfig, so an index can be initialized from a config object. This also lets us extend the config if more parameters are needed.

The parameters of DBConfig can be passed at construction time as **kwargs making this change compatible with old usage.

These two initializations are equivalent.

from docarray.index import InMemoryExactNNIndex
db_config = InMemoryExactNNIndex.DBConfig(index_file_path='index.bin')

index = InMemoryExactNNIndex[MyDoc](db_config=db_config)
index = InMemoryExactNNIndex[MyDoc](index_file_path='index.bin')
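
The general pattern behind this equivalence, folding loose keyword arguments into a config object, can be sketched in plain Python. This is not DocArray's implementation: SketchDBConfig and SketchIndex are hypothetical stand-ins, and rejecting a mix of db_config and keyword arguments is a choice made only for this sketch.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SketchDBConfig:
    """Hypothetical stand-in for an index's DBConfig."""

    index_file_path: Optional[str] = None


class SketchIndex:
    """Hypothetical index that accepts a config object or keyword arguments."""

    def __init__(self, db_config: Optional[SketchDBConfig] = None, **kwargs):
        if db_config is not None and kwargs:
            # Sketch-only rule: avoid ambiguity about which value wins
            raise ValueError("pass either db_config or keyword arguments")
        if db_config is None:
            allowed = {f.name for f in fields(SketchDBConfig)}
            unknown = set(kwargs) - allowed
            if unknown:
                raise TypeError(f"unknown config parameters: {sorted(unknown)}")
            # Build the config from the keyword arguments
            db_config = SketchDBConfig(**kwargs)
        self.db_config = db_config


from_config = SketchIndex(db_config=SketchDBConfig(index_file_path="index.bin"))
from_kwargs = SketchIndex(index_file_path="index.bin")
```

Both objects end up holding equal configs, which is why old call sites that pass index_file_path directly keep working after such a change.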

🐞 Bug Fixes

Allow Protobuf deserialization of BaseDoc with Union type (#1655)

Serialization of BaseDoc types that have fields typed as a Union of native Python types is now supported.

from typing import Union

from docarray import BaseDoc, DocList


class MyDoc(BaseDoc):
    union_field: Union[int, str]


docs1 = DocList[MyDoc]([MyDoc(union_field="hello")])
# round-trip through a DataFrame and back
docs2 = DocList[MyDoc].from_dataframe(docs1.to_dataframe())
assert docs1 == docs2

When these Union types involve other BaseDoc types, an exception is thrown.

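from typing import Union

from docarray import BaseDoc, DocList
from docarray.documents import ImageDoc, TextDoc

class CustomDoc(BaseDoc):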
    ud: Union[TextDoc, ImageDoc] = TextDoc(text='union type')

docs = DocList[CustomDoc]([CustomDoc(ud=TextDoc(text='union type'))])

# raises an Exception
DocList[CustomDoc].from_dataframe(docs.to_dataframe())

Cast limit to integer when passed to HNSWDocumentIndex (#1657, #1656)

If you call find or find_batched on an HNSWDocumentIndex, the limit parameter will automatically be cast to an integer.

Moved default_column_config from RuntimeConfig to DBConfig (#1648)

default_column_config contains backend-specific configuration for the columns and tables inside the backend's database. It was previously part of RuntimeConfig, which caused an error because this information is required at initialization time. It has been moved to DBConfig, where you can edit it.

from docarray.index import HNSWDocumentIndex
import numpy as np

db_config = HNSWDocumentIndex.DBConfig()
db_config.default_column_config.get(np.ndarray).update({'ef': 2500})
index = HNSWDocumentIndex[MyDoc](db_config=db_config)

Fix issue with Protobuf (de)serialization for DocVec (#1639)

This bug caused raw Protobuf objects to be stored as DocVec columns after they were deserialized from Protobuf, making the data essentially inaccessible. This has now been fixed, and DocVec objects are identical before and after (de)serialization.

Fix order of returned matches when find and filter are combined in InMemoryExactNNIndex (#1642)

Hybrid search (find + filter) on InMemoryExactNNIndex returned low-similarity (low-score) matches first. This was fixed by adding an option to sort matches in reverse order of their scores.

# prepare a query
q_doc = MyDoc(embedding=np.random.rand(128), text='query')

query = (
    db.build_query()
    .find(query=q_doc, search_field='embedding')
    .filter(filter_query={'text': {'$exists': True}})
    .build()
)

results = db.execute_query(query)
# Before: results was sorted from worst to best matches
# Now: It's sorted in the correct order, showing better matches first
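
The effect of the sort direction can be shown without an index. This is a standalone sketch, not DocArray code: rank_matches is a hypothetical helper, the scores are made up for illustration, and it assumes a similarity score where higher means a better match.

```python
from typing import List, Tuple


def rank_matches(
    scored: List[Tuple[str, float]], limit: int, reverse: bool = True
) -> List[Tuple[str, float]]:
    """Sort (doc_id, score) pairs by score; reverse=True puts high scores first."""
    return sorted(scored, key=lambda pair: pair[1], reverse=reverse)[:limit]


# Illustrative similarity scores, higher is better
candidates = [("a", 0.12), ("b", 0.91), ("c", 0.55)]

best_first = rank_matches(candidates, limit=2)  # intended behavior
worst_first = rank_matches(candidates, limit=2, reverse=False)  # the old ordering
```

With a limit applied, the wrong direction does more than reorder results: it drops the best match ("b") entirely, which is why the ordering matters for hybrid queries.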

Working with external Qdrant collections (#1632)

Using QdrantDocumentIndex to connect to a Qdrant DB initialized outside of DocArray used to raise a KeyError. This has been fixed, and you can now use QdrantDocumentIndex to connect to externally initialized collections.

Other bug fixes

  • Update text search to match Weaviate client's new signature (#1654)
  • Fix DocVec equality (#1641, #1663)
  • Fix exception when summary() is called for LegacyDocument (#1637)
  • Fix DocList and DocVec coercion (#1568)
  • Fix update() on BaseDoc with tensor fields (#1628)

📗 Documentation Improvements

  • Enhance DocVec section (#1658)
  • Qdrant in memory usage (#1634)

🤟 Contributors

We would like to thank all contributors to this release: