Releases

DocArray 0.37 Update

DocArray is a library for representing, sending and storing multi-modal data, perfect for Machine Learning applications.

Engineering Group

Aug 3, 2023 • 4 min read

Release Note (`0.37.0`)

This release contains 6 new features, 5 bug fixes, 1 performance improvement and 1 documentation improvement.

🆕 Features

Milvus Integration (#1681)

Leverage the power of Milvus in your DocArray project with this latest integration. Here's a simple usage example:

import numpy as np
from docarray import BaseDoc
from docarray.index import MilvusDocumentIndex
from docarray.typing import NdArray
from pydantic import Field


class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[10] = Field(is_embedding=True)

docs = [MyDoc(text=f'text {i}', embedding=np.random.rand(10)) for i in range(10)]
query = np.random.rand(10)
db = MilvusDocumentIndex[MyDoc]()
db.index(docs)
results = db.find(query, limit=10)

In this example, we're creating a document class with both textual and numeric data. Then, we initialize a Milvus-backed document index and use it to index our documents. Finally, we perform a search query.

Supported Functionalities

Find: Vector search for efficient retrieval of similar documents.
Filter: Use Milvus filter expressions to filter based on textual and numeric data.
Get/Del: Fetch or delete specific documents from the index.
Hybrid Search: Combine find and filter functionalities for more refined search.
Subindex: Search through nested data.

Support filtering in `HnswDocumentIndex` (#1718)

With our latest update, you can use filtering in HnswDocumentIndex either on its own or, through the query builder, in combination with vector search.

The code below shows how the new feature works:

import numpy as np

from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray


class SimpleSchema(BaseDoc):
    year: int
    price: int
    embedding: NdArray[128]


# Create dummy documents.
docs = DocList[SimpleSchema](
    SimpleSchema(year=2000 - i, price=i, embedding=np.random.rand(128))
    for i in range(10)
)

doc_index = HnswDocumentIndex[SimpleSchema](work_dir="./tmp_5")
doc_index.index(docs)

# Independent filtering operation (year == 1995)
filter_query = {"year": {"$eq": 1995}}
results = doc_index.filter(filter_query)

# Filtering combined with vector search
hybrid_query = (
    doc_index.build_query()  # get empty query object
    .filter(filter_query={"year": {"$gt": 1994}})  # pre-filtering (year > 1994)
    .find(
        query=np.random.rand(128), search_field="embedding"
    )  # add vector similarity search
    .filter(filter_query={"price": {"$lte": 3}})  # post-filtering (price <= 3)
    .build()
)
results = doc_index.execute_query(hybrid_query)

First, we create and index some dummy documents. Then, we use the filter function in two ways: on its own, to find documents from a specific year, and combined with a vector search, where we first filter by year, then run the vector search, and finally filter the results by price.
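To make the filter dictionaries above more concrete, here is a minimal pure-Python sketch of how operators such as `$eq`, `$gt` and `$lte` can be read. The `OPERATORS` table and the `matches` helper are hypothetical names introduced for illustration only; this is not how HnswDocumentIndex evaluates filters internally.

```python
import operator

# Hypothetical operator table: maps the filter operators used in the
# example above to Python comparison functions.
OPERATORS = {
    "$eq": operator.eq,
    "$gt": operator.gt,
    "$lte": operator.le,
}


def matches(record: dict, filter_query: dict) -> bool:
    """Return True if the record satisfies every condition in filter_query."""
    for field, conditions in filter_query.items():
        for op_name, expected in conditions.items():
            if not OPERATORS[op_name](record[field], expected):
                return False
    return True


# Same dummy data as in the example: year = 2000 - i, price = i.
records = [{"year": 2000 - i, "price": i} for i in range(10)]

# Standalone filter: year == 1995
exact = [r for r in records if matches(r, {"year": {"$eq": 1995}})]

# The pre-filter from the hybrid query: year > 1994
recent = [r for r in records if matches(r, {"year": {"$gt": 1994}})]
```

With this dummy data, the standalone filter keeps a single record, while the `year > 1994` condition keeps the six most recent ones.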

Pre-filtering in `InMemoryExactNNIndex` (#1713)

You can now add a pre-filter to your queries in InMemoryExactNNIndex. This lets you create flexible queries where you can set up as many pre- and post-filters as you want. Here's a simple example:

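import numpy as np

# Assumes `doc_index` is an InMemoryExactNNIndex whose schema has
# `price`, `text` and a 10-dimensional `tensor` field.
query = (
    doc_index.build_query()
    .filter(filter_query={'price': {'$lte': 3}})  # Pre-filter: price <= 3
    .find(query=np.ones(10), search_field='tensor')  # Vector search
    .filter(filter_query={'text': {'$eq': 'hello 1'}})  # Post-filter: text == 'hello 1'
    .build()
)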

In this example, we first set a pre-filter to only include items priced 3 or less. We then do a vector search. Lastly, we add a post-filter to find items with the text 'hello 1'. This way, you can easily filter before and after your search!
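The order of the three stages matters, and a small standalone sketch may help to see why. The code below is a conceptual illustration in plain Python, not the InMemoryExactNNIndex implementation: `run_query`, `cosine_similarity` and the dictionary-based documents are all hypothetical, and cosine similarity is assumed as the metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def run_query(docs, pre_filter, query, limit, post_filter):
    # Stage 1: the pre-filter narrows the candidates before any vector math.
    candidates = [d for d in docs if pre_filter(d)]
    # Stage 2: exact nearest-neighbour search over the remaining candidates.
    ranked = sorted(
        candidates,
        key=lambda d: cosine_similarity(d["tensor"], query),
        reverse=True,
    )[:limit]
    # Stage 3: the post-filter only sees the search results.
    return [d for d in ranked if post_filter(d)]


# Hypothetical documents: the last tensor component grows with i.
docs = [
    {"text": f"hello {i}", "price": i, "tensor": [1.0] * 9 + [float(i + 1)]}
    for i in range(6)
]

results = run_query(
    docs,
    pre_filter=lambda d: d["price"] <= 3,
    query=[1.0] * 10,
    limit=3,
    post_filter=lambda d: d["text"] == "hello 1",
)
```

In this sketch the post-filter runs after the top-`limit` cut, so it can return fewer than `limit` results, whereas the pre-filter shrinks the candidate set before the search takes place.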

Support document updates in `InMemoryExactNNIndex` (#1724)

You can now easily update your documents in InMemoryExactNNIndex. Previously, when you tried to index the same set of documents again, it would just add duplicate copies instead of changing the existing ones. But not anymore! From now on, if you want to update documents, you just have to re-index them.
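The behaviour described here is essentially an "upsert" keyed by document id. As a rough illustration of that idea only (the `TinyIndex` class below is hypothetical and does not reflect how InMemoryExactNNIndex is implemented), re-indexing a document with an existing id replaces the stored copy rather than appending a new one:

```python
class TinyIndex:
    """Hypothetical in-memory index that stores documents keyed by id."""

    def __init__(self):
        self._docs = {}

    def index(self, docs):
        # Assigning by id replaces an existing entry instead of appending,
        # so indexing the same document twice keeps a single copy.
        for doc in docs:
            self._docs[doc["id"]] = doc

    def num_docs(self):
        return len(self._docs)

    def get(self, doc_id):
        return self._docs[doc_id]


index = TinyIndex()
index.index([{"id": "a", "text": "old"}, {"id": "b", "text": "other"}])

# Re-index document "a" with new content to update it.
index.index([{"id": "a", "text": "new"}])
```

After the second call, the index still holds two documents, and the one with id `"a"` carries the updated text.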

Choose tensor format with `DocVec` deserialization (#1679)

Now you can specify the format of your tensor during DocVec deserialization. You can do this with any method you're using to convert data, such as protobuf, json, pandas, bytes, binary, or base64. This means you'll always get your tensors in the format you want, whether it's a Torch tensor, a TensorFlow tensor, a NumPy ndarray, and so on.

Add description and example to `id` field of `BaseDoc` (#1737)

We added a description and example to the id field of BaseDoc, so that you get a richer OpenAPI specification when building FastAPI-based applications with it.

🚀 Performance

Improve `HnswDocumentIndex` performance (#1727, #1729)

We've implemented two key optimizations to improve the performance of HnswDocumentIndex. First, we no longer serialize embeddings to SQLite, which was costly and unnecessary because the embeddings can be reconstructed from the hnswlib index itself. Second, we've reduced how often num_docs() is computed, which previously involved a time-consuming full table scan to determine the number of documents in SQLite. As a result, we've seen a speed increase of approximately 10% in both indexing and searching.
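The second optimization follows a common pattern: keep a running counter instead of asking the database to count rows on every call. The sketch below shows that pattern with Python's built-in `sqlite3` module. The `CountedStore` class, its table layout and its method names are assumptions made for illustration; this is not the actual HnswDocumentIndex code.

```python
import sqlite3


class CountedStore:
    """Hypothetical SQLite-backed store that caches its row count."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute(
            "CREATE TABLE docs (doc_id TEXT PRIMARY KEY, payload TEXT)"
        )
        # Count once at start-up; afterwards the counter is kept in sync.
        self._num_docs = self._count_rows()

    def _count_rows(self):
        # The expensive path: asks SQLite to count every row in the table.
        return self._conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]

    def add(self, doc_id, payload):
        # A duplicate id raises sqlite3.IntegrityError before the counter
        # is touched, so the cached value never drifts.
        self._conn.execute("INSERT INTO docs VALUES (?, ?)", (doc_id, payload))
        self._num_docs += 1

    def num_docs(self):
        # The cheap path: return the cached counter without querying SQLite.
        return self._num_docs


store = CountedStore()
for i in range(3):
    store.add(f"doc-{i}", f"payload {i}")
```

Here `num_docs()` returns the cached value in constant time, while `_count_rows()` remains available as a slower way to verify it.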

🐞 Bug Fixes

Fix `TorchTensor` type comparison (#1739)

Comparing TorchTensor with the type keyword in the docarray.typing module previously raised a TypeError. This is now fixed, so the type comparison works as expected.

Add more info from dynamic class (#1733)

When using the method create_base_doc_from_schema to dynamically create a BaseDoc class, some information was lost. The new class now keeps FieldInfo information from the original class, such as description and examples.

Fix call to unsafe `issubclass` (#1731)

We fixed a bug caused by calling issubclass directly, replacing the call with a safer implementation that handles certain types correctly.
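For context, the built-in `issubclass` raises a `TypeError` when its first argument is not a class, which happens for some typing constructs such as `List[int]`. A minimal sketch of a defensive wrapper is shown below. The function name and exact behaviour are assumptions for illustration and may differ from what DocArray does internally.

```python
from typing import Any, List


def safe_issubclass(candidate: Any, parent: type) -> bool:
    """Like issubclass, but returns False instead of raising TypeError."""
    try:
        return isinstance(candidate, type) and issubclass(candidate, parent)
    except TypeError:
        # Some generic aliases look like classes but are rejected by
        # issubclass; treat them as "not a subclass".
        return False


# A regular class works as usual.
is_bool_an_int = safe_issubclass(bool, int)

# issubclass(List[int], int) would raise TypeError; the wrapper does not.
is_alias_an_int = safe_issubclass(List[int], int)
```

The wrapper trades an exception for a plain `False`, which is convenient when inspecting arbitrary field annotations.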

Align collection and index name in `QdrantDocumentIndex` (#1723)

We've corrected an issue where the collection name was not being updated to match a newly-initialized subindex name in QdrantDocumentIndex. This ensures consistent naming between collections and their respective subindexes.

Fix deepcopy TorchTensor (#1720)

We fixed a bug that prevented deep-copying documents that contain TorchTensors.

📗 Documentation Improvements

Make Document Indices self-contained (#1678)

🤘 Contributors

We would like to thank all contributors to this release:

Joan Fontanals (@JoanFM)
Johannes Messner (@JohannesMessner)
Saba Sturua (@jupyterjazz)
