
DocArray 0.36 Update

DocArray is a library for representing, sending and storing multi-modal data, perfect for Machine Learning applications.

Engineering Group

Jul 19, 2023 • 4 min read

Release Note (`0.36.0`)

This release contains 2 new features, 5 bug fixes, 1 performance improvement and 1 documentation improvement.

🆕 Features

JAX Integration (#1646)

You can now use JAX with DocArray. We have introduced JaxArray as a new type option for your documents, so that JAX can natively process array-like data stored in them. Here's how to use it:

from docarray import BaseDoc
from docarray.typing import JaxArray
import jax.numpy as jnp

class MyDoc(BaseDoc):
    arr: JaxArray
    image_arr: JaxArray[3, 224, 224] # For images of shape (3, 224, 224)
    square_crop: JaxArray[3, 'x', 'x']  # For any square image, regardless of dimensions
    random_image: JaxArray[3, ...]  # For any image with 3 color channels, and arbitrary other dimensions

As you can see, the JaxArray type is extremely flexible and can support a wide range of tensor shapes.

Creating a Document with Tensors

Creating a document with tensors is straightforward. Here is an example:

doc = MyDoc(
    arr=jnp.zeros((128,)),
    image_arr=jnp.zeros((3, 224, 224)),
    square_crop=jnp.zeros((3, 64, 64)),
    random_image=jnp.zeros((3, 128, 256)),
)

Redis Integration (#1550)

Leverage the power of Redis in your DocArray project with this latest integration. Here's a simple usage example:

import numpy as np
from docarray import BaseDoc
from docarray.index import RedisDocumentIndex
from docarray.typing import NdArray

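class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[10]


# Ten documents with random 10-dimensional embeddings
docs = [MyDoc(text=f'text {i}', embedding=np.random.rand(10)) for i in range(10)]
query = np.random.rand(10)

# Connect to a Redis instance on localhost and index the documents
db = RedisDocumentIndex[MyDoc](host='localhost')
db.index(docs)

# Vector search on the embedding field, returning up to 10 matches
results = db.find(query, search_field='embedding', limit=10)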

In this example, we define a document class with a text field and an embedding vector. We then initialize a Redis-backed document index, index the documents, and run a vector search against the embedding field.

Supported Functionalities

Find: Vector search for efficient retrieval of similar documents.
Filter: Use Redis syntax to filter based on textual and numeric data.
Text Search: Leverage text search methods, such as BM25, to find relevant documents.
Get/Del: Fetch or delete specific documents from the index.
Hybrid Search: Combine find and filter functionalities for more refined search. Currently, only these two can be combined.
Subindex: Search through nested data.

🚀 Performance

Speedup `HnswDocumentIndex` by caching num docs (#1706)

We've optimized the num_docs() operation by caching the document count, addressing previous slowdowns during searches. This change results in a minor increase in indexing time but significantly accelerates search times.

from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray
import numpy as np
import time

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]


# 20,000 documents with random 128-dimensional embeddings
docs = [MyDoc(text='hey', embedding=np.random.rand(128)) for _ in range(20000)]
index = HnswDocumentIndex[MyDoc](work_dir='tst', index_name='index')

# Time the indexing step
index_start = time.time()
index.index(docs=DocList[MyDoc](docs))
index_time = time.time() - index_start

# Time a single search, using the first document as the query
query = docs[0]
find_start = time.time()
matches, _ = index.find(query, search_field='embedding', limit=10)
find_time = time.time() - find_start

In the above experiment, we observed a 13x improvement in the speed of the search function, reducing its execution time from 0.0238 to 0.0018 seconds.
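The trade-off described above can be illustrated with a toy index. This is only a minimal sketch of count caching, not the HnswDocumentIndex implementation: `ToyIndex`, `_slow_count`, and the in-memory list are hypothetical stand-ins for a backend where counting documents is expensive.

```python
class ToyIndex:
    """Hypothetical index whose document count is costly to compute."""

    def __init__(self):
        self._docs = []
        self._num_docs = 0          # cached count, refreshed on every write
        self.slow_count_calls = 0   # instrumentation for this sketch only

    def _slow_count(self):
        # Stands in for an expensive backend query (assumption for illustration).
        self.slow_count_calls += 1
        return sum(1 for _ in self._docs)

    def index(self, docs):
        self._docs.extend(docs)
        # Pay the counting cost once, at indexing time.
        self._num_docs = self._slow_count()

    def num_docs(self):
        # Constant-time read of the cached value.
        return self._num_docs

    def find(self, query, limit):
        # Toy search: only clamps the limit with the cached count and
        # returns the first documents; no real vector search happens here.
        limit = min(limit, self.num_docs())
        return self._docs[:limit]
```

In this sketch every call to `find` reads the cached count, so the expensive count runs once per `index` call rather than once per search.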

⚙ Refactoring

Put `contains` method in the base class (#1701)

We've moved the contains method into the base class. With this refactoring, the responsibility for checking if a document exists is now delegated to individual backend implementations using the new _doc_exists method.
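The delegation pattern can be sketched roughly as follows. Only the `_doc_exists` name comes from the release note; `ToyBaseIndex`, `ToyDictIndex`, the use of `__contains__`, and the `id` attribute lookup are assumptions made for illustration and may differ from the actual base class.

```python
from abc import ABC, abstractmethod
from types import SimpleNamespace


class ToyBaseIndex(ABC):
    """Hypothetical base class: shared membership logic lives here."""

    def __contains__(self, doc):
        # Shared validation: anything without an id cannot be in the index.
        doc_id = getattr(doc, "id", None)
        if doc_id is None:
            return False
        # Backend-specific existence check.
        return self._doc_exists(str(doc_id))

    @abstractmethod
    def _doc_exists(self, doc_id: str) -> bool:
        """Each backend answers whether a document id is stored."""


class ToyDictIndex(ToyBaseIndex):
    """Hypothetical backend that keeps documents in a dict."""

    def __init__(self):
        self._store = {}

    def add(self, doc):
        self._store[str(doc.id)] = doc

    def _doc_exists(self, doc_id: str) -> bool:
        return doc_id in self._store


index = ToyDictIndex()
doc = SimpleNamespace(id="doc-1")  # stand-in for a document with an id
index.add(doc)
```

With this layout, a new backend only needs to implement `_doc_exists`; the `in` check itself is written once.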

More robust method to detect duplicate index (#1651)

We have implemented a more robust method of detecting existing indices for WeaviateDocumentIndex.

🐞 Bug Fixes

`WeaviateDocumentIndex` handles lowercase index names (#1711)

We've addressed an issue in the WeaviateDocumentIndex where passing a lowercase index name led to mismatches and subsequent errors. This was due to the system automatically capitalizing the index name when creating an index. To resolve this, we've added a post_init function that capitalizes the first letter of the provided index name, ensuring consistent naming and preventing potential errors.
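As a rough illustration of the fix, the normalization might look like the sketch below. `ToyDBConfig` is a hypothetical dataclass, and using `__post_init__` on it is an assumption about the shape of the hook, not the actual WeaviateDocumentIndex code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyDBConfig:
    """Hypothetical config object holding the index name."""

    index_name: Optional[str] = None

    def __post_init__(self):
        if self.index_name:
            # Upper-case only the first character. str.capitalize() is
            # avoided because it would also lower-case the rest of the name.
            self.index_name = self.index_name[0].upper() + self.index_name[1:]
```

Normalizing the name once at construction time means later lookups use the same capitalized form that the backend stores.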

`QdrantDocumentIndex` unable to see `index_name` (#1705)

We've resolved an issue where the QdrantDocumentIndex was not properly recognizing the index_name parameter. Previously, the specified index_name was ignored and the system defaulted to the schema name.

Fix search in `InMemoryExactNNIndex` with `AnyEmbedding` (#1696)

From now on, you can perform search operations in InMemoryExactNNIndex using AnyEmbedding.

Use `safe_issubclass` everywhere (#1691)

We now use safe_issubclass instead of issubclass because it supports non-class inputs, helping us to avoid unexpected errors.
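To see why a guarded check helps: the built-in `issubclass` raises `TypeError` when its first argument is not a class, which is common when inspecting type annotations such as `List[int]`. The helper below is a simplified sketch of the idea under a hypothetical name, `safe_issubclass_sketch`; it is not DocArray's implementation, which may handle these cases differently.

```python
from typing import List, Optional


def safe_issubclass_sketch(candidate, parent) -> bool:
    """Return False instead of raising when `candidate` is not a class."""
    try:
        return issubclass(candidate, parent)
    except TypeError:
        # Raised for non-class inputs such as List[int] or plain instances.
        # Note: this also hides a TypeError caused by an invalid `parent`.
        return False


is_bool_an_int = safe_issubclass_sketch(bool, int)            # ordinary class check
is_generic_a_list = safe_issubclass_sketch(List[int], list)   # generic alias, not a class
is_optional_a_str = safe_issubclass_sketch(Optional[str], str)
is_instance_an_int = safe_issubclass_sketch(3, int)           # an instance, not a class
```

Only the first check involves two real classes; the other three inputs would make the plain `issubclass` call raise.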

Avoid converting `DocLists` in the base index (#1685)

We added a check to avoid passing DocLists to a function that converts a list of dictionaries to a DocList.

📗 Documentation Improvements

Add docs for dict() method (#1643)

🤟 Contributors

We would like to thank all contributors to this release:

Puneeth K (@punndcoder28)
Joan Fontanals (@JoanFM)
Saba Sturua (@jupyterjazz)
Aman Agarwal (@agaraman0)
samsja (@samsja)
Shukri (@hsm207)
