DocArray 0.21 Update

DocArray is a library for nested, unstructured, multimodal data in transit, including text, image, audio, video, 3D mesh, etc. It allows deep-learning engineers to efficiently process, embed, search, recommend, store, and transfer multimodal data with a Pythonic API.



💡 DocArray was released under the open-source Apache License 2.0 in January 2022. It is currently a sandbox project under LF AI & Data Foundation.

DocArray is the common data layer used in all Jina AI products.
💫 Release v0.21.0 · docarray/docarray
Release time: 2023-01-17 09:10:50

Release Note (0.21.0)

This release contains 3 new features, 7 bug fixes and 5 documentation improvements.

🆕 Features

OpenSearch Document Store (#853)

This version of DocArray adds a new Document Store: OpenSearch!

You can use the OpenSearch Document Store to index your Documents and perform ANN search on them:

from docarray import Document, DocumentArray
import numpy as np

n_dim = 3

# Connect to the OpenSearch instance
da = DocumentArray(
    storage='opensearch',
    config={'n_dim': n_dim},
)

# Index Documents
with da:
    da.extend(
        [
            Document(id=f'r{i}', embedding=i * np.ones(n_dim))
            for i in range(10)
        ]
    )

# Perform ANN search
np_query = np.ones(n_dim) * 8
results = da.find(np_query, limit=10)
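
To make the expected ranking concrete, here is a minimal brute-force sketch in plain Python that mirrors the data above without needing an OpenSearch instance. It assumes Euclidean distance, which is an assumption for illustration only: the metric configured in the Document Store may differ, and an ANN index returns approximate rather than exact neighbors.

```python
import math

n_dim = 3

# Same toy data as above: Document r{i} has the embedding i * ones(n_dim)
embeddings = {f'r{i}': [float(i)] * n_dim for i in range(10)}
query = [8.0] * n_dim

# Exact nearest-neighbor search: rank all ids by Euclidean distance to the query
ranked = sorted(
    embeddings,
    key=lambda doc_id: math.dist(embeddings[doc_id], query),
)
top3 = ranked[:3]
print(top3)
```

Under this assumption `r8` comes first, because its embedding equals the query, followed by its two equidistant neighbors `r7` and `r9`.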

Additionally, the OpenSearch Document Store can perform filter queries, search by text, and search by tags.

Learn more about its usage in the official documentation.

Add color to point cloud display (#961)

You can now include color information in your point cloud data, which can be visualized using display_point_cloud_tensor():

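import numpy as np
from docarray import Document, DocumentArray

# Load the point coordinates and the matching color values
coords = np.load('a_red_motorbike/coords.npy')
colors = np.load('a_red_motorbike/coord_colors.npy')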

doc = Document(
    tensor=coords,
    chunks=DocumentArray([Document(tensor=colors, name='point_cloud_colors')])
)
doc.display()

Add language attribute to Redis Document Store (#953)

The Redis Document Store now supports text search in various supported languages. To set a desired language, change the language parameter in the Redis configuration:

da = DocumentArray(
    storage='redis',
    config={
        'n_dim': 128,
        'index_text': True,
        'language': 'chinese',
    },
)

🐞 Bug Fixes

Replace newline with whitespace to fix display in plot embeddings (#963)

Whenever the string '\n' was contained in any Document field, doc.plot() would result in a rendering error. This release fixes those errors by rendering '\n' as whitespace.
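
The idea behind the fix can be sketched in plain Python. The helper name `sanitize_for_plot` is hypothetical and not part of DocArray. It only illustrates replacing newlines with spaces in string fields before they are rendered.

```python
def sanitize_for_plot(fields: dict) -> dict:
    """Return a copy of `fields` with '\\n' in string values replaced by a space.

    Hypothetical helper for illustration; not DocArray's actual implementation.
    """
    return {
        key: value.replace('\n', ' ') if isinstance(value, str) else value
        for key, value in fields.items()
    }


fields = {'id': 'r1', 'text': 'first line\nsecond line', 'weight': 1.0}
cleaned = sanitize_for_plot(fields)
print(cleaned['text'])
```

Non-string values such as `weight` pass through unchanged, so only the text that would break the rendering is touched.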

Fix unwanted coercion in to_pydantic_model (#949)

This bug caused all strings of the form 'Infinity' to be coerced to the string 'inf' when calling to_pydantic_model() or to_dict(). This is fixed now, leaving such strings unchanged.
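
One way such a coercion can arise is a serializer that tries to parse every value as a number. This is an assumption for illustration, not a description of DocArray's internals. Python's `float()` accepts the string `'Infinity'`, so a round trip through it yields `'inf'`. The helpers `naive_coerce` and `safe_coerce` below are hypothetical.

```python
def naive_coerce(value):
    """Buggy pattern: parse anything float() accepts, then stringify it again."""
    try:
        return str(float(value))
    except (TypeError, ValueError):
        return value


def safe_coerce(value):
    """Fixed pattern: strings are returned unchanged, whatever they contain."""
    if isinstance(value, str):
        return value
    return naive_coerce(value)


print(naive_coerce('Infinity'))  # the string has silently become 'inf'
print(safe_coerce('Infinity'))   # the string is left as 'Infinity'
```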

Calculate relevant docs on index instead of queries (#950)

In the embed_and_evaluate() method, the number of relevant Documents per label used to be calculated based on the Documents in self. This is not generally correct, so after this fix the quantity is calculated based on the Documents in the index data.
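
The difference matters for metrics that divide by the number of relevant Documents, such as recall, because the Documents that can be retrieved are the ones in the index. The sketch below uses made-up labels and plain Python counters. It does not call embed_and_evaluate() and only illustrates the two ways of counting.

```python
from collections import Counter

# Hypothetical labels, for illustration only
query_labels = ['cat', 'dog']
index_labels = ['cat', 'cat', 'cat', 'dog', 'bird']

# Before the fix: counts derived from the query Documents themselves
relevant_from_queries = Counter(query_labels)

# After the fix: counts derived from the indexed Documents being searched
relevant_from_index = Counter(index_labels)

for label in query_labels:
    print(label, relevant_from_queries[label], relevant_from_index[label])
```

For the label `cat`, the query-based count is 1 while the index actually holds 3 relevant Documents, so a recall value computed from the query-based count would be misleading.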

Remove offset index create on list like false (#936)

When a Document Store has list-like behavior disabled, it no longer creates an offset-to-id mapping, which improves performance.

Add support for remote audio files (#933)

Loading audio files from a remote URL would cause FileNotFoundError, which is now fixed.

Query operator $exists does not work correctly with tags (#911) (#923)

Before this fix, $exists would treat false-y values such as 0 or [] as non-existent. This is now fixed.
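
The distinction is the familiar one between checking truthiness and checking key presence. The two helpers below are hypothetical and written in plain Python. They sketch the described behavior before and after the fix, not DocArray's actual query engine.

```python
def exists_buggy(tags: dict, key: str) -> bool:
    """Behavior described before the fix: false-y values count as missing."""
    return bool(tags.get(key))


def exists_fixed(tags: dict, key: str) -> bool:
    """Behavior after the fix: only the presence of the key matters."""
    return key in tags


tags = {'count': 0, 'labels': [], 'name': 'r1'}

for key in ('count', 'labels', 'name', 'missing'):
    print(key, exists_buggy(tags, key), exists_fixed(tags, key))
```

Only `count` and `labels` differ between the two helpers: both keys are present, but their values are false-y.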

Document from dataclass with singleton list (#1018)

When casting from a dataclass to Document, singleton lists were treated like an individual element, even if the corresponding field was annotated with List[...]. Now this case is considered, and accessing such a field will yield a DocumentArray, even for singleton inputs.
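
A converter can tell the two cases apart by inspecting the field annotation rather than the value. The sketch below uses only the standard library. The `Gallery` dataclass and the `list_like_fields` helper are hypothetical and do not reflect how DocArray implements the check.

```python
from dataclasses import dataclass, fields
from typing import List, get_origin


@dataclass
class Gallery:  # hypothetical dataclass, for illustration only
    cover: str
    pages: List[str]


def list_like_fields(cls) -> set:
    """Return the names of dataclass fields annotated with List[...]."""
    return {f.name for f in fields(cls) if get_origin(f.type) is list}


gallery = Gallery(cover='cover.png', pages=['page1.png'])
print(list_like_fields(Gallery))
```

Because `pages` is annotated with `List[str]`, a converter following this approach would keep it as a collection even though the input holds a single element.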

📗 Documentation Improvements

  • Link to Discord (#1010)
  • Have less versions to avoid deployment timeout (#977)
  • Fix data management section not appearing in documentation (#967)
  • Link to OpenSearch docs in sidebar (#960)
  • Multimodal to datatypes (#934)

🤘 Contributors

We would like to thank all contributors to this release: