DocArray 0.36 Update

DocArray is a library for representing, sending and storing multi-modal data, perfect for Machine Learning applications.


Release Note (0.36.0)

This release contains 2 new features, 5 bug fixes, 1 performance improvement and 1 documentation improvement.

Release time: 2023-07-18 14:43:28

🆕 Features

JAX Integration (#1646)

You can now use JAX with DocArray. We have introduced JaxArray as a new type option for your documents, so that JAX can natively process any array-like data stored in them. Here's how to use it:

from docarray import BaseDoc
from docarray.typing import JaxArray
import jax.numpy as jnp

class MyDoc(BaseDoc):
    arr: JaxArray
    image_arr: JaxArray[3, 224, 224] # For images of shape (3, 224, 224)
    square_crop: JaxArray[3, 'x', 'x']  # For any square image, regardless of dimensions
    random_image: JaxArray[3, ...]  # For any image with 3 color channels, and arbitrary other dimensions

As you can see, the JaxArray type is extremely flexible and can support a wide range of tensor shapes.
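To make the shape annotations above concrete, here is a minimal pure-Python sketch of how such a shape spec could be matched against a concrete array shape. The helper name `shape_matches` is hypothetical and is not part of DocArray; it only mirrors the three cases described in the comments above (fixed sizes, repeated axis names, and a trailing `...`). The library's own validation may behave differently.

```python
def shape_matches(shape, spec):
    """Return True if a concrete shape satisfies a spec.

    A spec may contain ints (fixed sizes), strings (named axes that must
    agree wherever the same name repeats), and a trailing Ellipsis
    (any number of further dimensions). Hypothetical helper for illustration.
    """
    spec = tuple(spec)
    shape = tuple(shape)
    if spec and spec[-1] is Ellipsis:
        fixed = spec[:-1]
        if len(shape) < len(fixed):
            return False
        shape = shape[: len(fixed)]  # only the leading dimensions are constrained
    else:
        fixed = spec
        if len(shape) != len(fixed):
            return False

    named_axes = {}
    for actual, expected in zip(shape, fixed):
        if isinstance(expected, int):
            if actual != expected:
                return False
        # A named axis: the first occurrence fixes its size for later ones.
        elif named_axes.setdefault(expected, actual) != actual:
            return False
    return True


print(shape_matches((3, 224, 224), (3, 224, 224)))  # fixed shape
print(shape_matches((3, 64, 64), (3, "x", "x")))    # square crop
print(shape_matches((3, 64, 32), (3, "x", "x")))    # not square
print(shape_matches((3, 128, 256), (3, ...)))       # 3 channels, anything after
```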

Creating a Document with Tensors

Creating a document with tensors is straightforward. Here is an example:

doc = MyDoc(
    arr=jnp.zeros((128,)),
    image_arr=jnp.zeros((3, 224, 224)),
    square_crop=jnp.zeros((3, 64, 64)),
    random_image=jnp.zeros((3, 128, 256)),
)

Redis Integration (#1550)

Leverage the power of Redis in your DocArray project with this latest integration. Here's a simple usage example:

import numpy as np
from docarray import BaseDoc
from docarray.index import RedisDocumentIndex
from docarray.typing import NdArray

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[10]

docs = [MyDoc(text=f'text {i}', embedding=np.random.rand(10)) for i in range(10)]
query = np.random.rand(10)

db = RedisDocumentIndex[MyDoc](host='localhost')
db.index(docs)
results = db.find(query, search_field='embedding', limit=10)

In this example, we're creating a document class with both textual and numeric data. Then, we initialize a Redis-backed document index and use it to index our documents. Finally, we perform a search query.

Supported Functionalities

Find: Vector search for efficient retrieval of similar documents.
Filter: Use Redis syntax to filter based on textual and numeric data.
Text Search: Leverage text search methods, such as BM25, to find relevant documents.
Get/Del: Fetch or delete specific documents from the index.
Hybrid Search: Combine find and filter functionalities for more refined search. Currently, only these two can be combined.
Subindex: Search through nested data.

🚀 Performance

Speedup HnswDocumentIndex by caching num docs (#1706)

We've optimized the num_docs() operation by caching the document count, addressing previous slowdowns during searches. This change results in a minor increase in indexing time but significantly accelerates search times.

from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray
import numpy as np
import time

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

docs = [MyDoc(text='hey', embedding=np.random.rand(128)) for _ in range(20000)]
index = HnswDocumentIndex[MyDoc](work_dir='tst', index_name='index')

index_start = time.time()
index.index(docs=DocList[MyDoc](docs))
index_time = time.time() - index_start

query = docs[0]
find_start = time.time()
matches, _ = index.find(query, search_field='embedding', limit=10)
find_time = time.time() - find_start

In the above experiment, we observed a 13x improvement in the speed of the search function, reducing its execution time from 0.0238 to 0.0018 seconds.
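The trade-off described above (slightly more work at indexing time, much cheaper lookups at search time) can be illustrated with a toy sketch. `CachedCountIndexSketch` and all of its members are hypothetical names invented for this example; this is not the HnswDocumentIndex implementation, only the general pattern of caching a count when data changes instead of recomputing it on every search.

```python
class CachedCountIndexSketch:
    """Toy index that caches its document count (illustrative only)."""

    def __init__(self):
        self._store = {}
        self._num_docs = 0    # cached count, refreshed on every write
        self.full_counts = 0  # how often the "expensive" count actually ran

    def _count_store(self):
        # Stand-in for an expensive count, e.g. a query against a database.
        self.full_counts += 1
        return len(self._store)

    def index(self, docs):
        self._store.update(docs)
        # The extra cost is paid once per indexing call...
        self._num_docs = self._count_store()

    def num_docs(self):
        # ...so reads are a plain attribute lookup.
        return self._num_docs

    def find(self, limit):
        # A search needs the count, for example to clamp the limit.
        limit = min(limit, self.num_docs())
        return list(self._store)[:limit]


index = CachedCountIndexSketch()
index.index({f"doc-{i}": i for i in range(5)})
hits = [index.find(limit=10) for _ in range(100)]
print(index.num_docs(), index.full_counts)
```

In this sketch, one hundred searches trigger no additional counting; only the single indexing call does.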

⚙ Refactoring

Put contains method in the base class (#1701)

We've moved the contains method into the base class. With this refactoring, the responsibility for checking if a document exists is now delegated to individual backend implementations using the new _doc_exists method.
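The delegation pattern can be sketched as follows, assuming that "the contains method" refers to Python's `__contains__` hook (an assumption on our part). The classes `BaseIndexSketch`, `DictIndexSketch` and `DocSketch` are hypothetical stand-ins, not DocArray classes; only the `_doc_exists` name is taken from the description above.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class DocSketch:
    """Hypothetical stand-in for a document with an id."""
    id: str


class BaseIndexSketch(ABC):
    def __contains__(self, item) -> bool:
        # Shared logic lives in the base class: reject anything without an id,
        # then ask the backend whether that id is present.
        doc_id = getattr(item, "id", None)
        if doc_id is None:
            return False
        return self._doc_exists(str(doc_id))

    @abstractmethod
    def _doc_exists(self, doc_id: str) -> bool:
        """Backend-specific existence check."""


class DictIndexSketch(BaseIndexSketch):
    """Toy backend that keeps documents in a dict."""

    def __init__(self):
        self._docs = {}

    def add(self, doc: DocSketch) -> None:
        self._docs[str(doc.id)] = doc

    def _doc_exists(self, doc_id: str) -> bool:
        return doc_id in self._docs


idx = DictIndexSketch()
idx.add(DocSketch(id="a"))
print(DocSketch(id="a") in idx, DocSketch(id="b") in idx)
```

Each backend then only has to answer "does this id exist?", while the `in` operator behaves the same across backends.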

More robust method to detect duplicate index (#1651)

We have implemented a more robust method of detecting existing indices for WeaviateDocumentIndex.

🐞 Bug Fixes

WeaviateDocumentIndex handles lowercase index names (#1711)

We've addressed an issue in the WeaviateDocumentIndex where passing a lowercase index name led to mismatches and subsequent errors. This was due to the system automatically capitalizing the index name when creating an index. To resolve this, we've added a post_init function that capitalizes the first letter of the provided index name, ensuring consistent naming and preventing potential errors.
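A minimal sketch of this kind of normalization is shown below, using a plain dataclass. `IndexConfigSketch` is a hypothetical name, and the sketch assumes the fix runs in a dataclass-style `__post_init__` hook; the actual WeaviateDocumentIndex code may differ.

```python
from dataclasses import dataclass


@dataclass
class IndexConfigSketch:
    """Hypothetical config object holding an index name."""
    index_name: str

    def __post_init__(self):
        # Upper-case only the first letter and keep the rest untouched.
        if self.index_name:
            self.index_name = self.index_name[0].upper() + self.index_name[1:]


print(IndexConfigSketch("myIndex").index_name)
```

Note that slicing is used instead of `str.capitalize()`, because `capitalize()` would also lowercase the remaining characters and turn `myIndex` into `Myindex`.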

QdrantDocumentIndex unable to see index_name (#1705)

We've resolved an issue where the QdrantDocumentIndex was not properly recognizing the index_name parameter. Previously, the specified index_name was ignored and the system defaulted to the schema name.

Fix search in InMemoryExactNNIndex with AnyEmbedding (#1696)

From now on, you can perform search operations in InMemoryExactNNIndex using AnyEmbedding.

Use safe_issubclass everywhere (#1691)

We now use safe_issubclass instead of issubclass because it supports non-class inputs, helping us avoid unexpected errors.
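To show why a guarded check helps, here is a simplified sketch. The built-in `issubclass` raises a `TypeError` when its first argument is not a class, which happens easily with type hints such as `Optional[int]`. The function `safe_issubclass_sketch` is a hypothetical simplification written for this example, not DocArray's actual `safe_issubclass`.

```python
from typing import Optional


def safe_issubclass_sketch(candidate, parent) -> bool:
    """Like issubclass, but returns False for inputs that are not classes."""
    try:
        return issubclass(candidate, parent)
    except TypeError:
        # Raised when `candidate` is not a class, e.g. a typing construct.
        return False


print(safe_issubclass_sketch(bool, int))
print(safe_issubclass_sketch(Optional[int], int))
```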

Avoid converting DocLists in the base index (#1685)

We added an additional check to avoid passing DocLists to a function that converts a list of dictionaries to a DocList.
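The idea behind such a check can be sketched with a toy container. `DocListSketch` and `to_doc_list` are hypothetical names used only for illustration; they are not the DocArray classes or functions touched by this fix.

```python
class DocListSketch(list):
    """Hypothetical stand-in for a typed document container."""


def to_doc_list(docs):
    # Guard: input that is already a container is returned unchanged
    # instead of being treated as a list of dictionaries.
    if isinstance(docs, DocListSketch):
        return docs
    return DocListSketch(dict(d) for d in docs)


already = DocListSketch([{"text": "a"}])
converted = to_doc_list([{"text": "b"}])
print(to_doc_list(already) is already, type(converted).__name__)
```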

📗 Documentation Improvements

  • Add docs for dict() method (#1643)

🤟 Contributors

We would like to thank all contributors to this release: