DocArray 0.20 Update

DocArray is a library for nested, unstructured, multimodal data in transit, including text, image, audio, video, 3D mesh, etc.


DocArray is a library for nested, unstructured, multimodal data in transit, including text, image, audio, video, 3D mesh, etc. It allows deep-learning engineers to efficiently process, embed, search, recommend, store, and transfer multimodal data with a Pythonic API.

💡
DocArray was released under the open-source Apache License 2.0 in January 2022. It is currently a sandbox project under LF AI & Data Foundation.

DocArray is the common data layer used in all Jina AI products.
💫 Release v0.20.0 · docarray/docarray
Release time: 2022-12-07 12:15:30

Release Note (0.20.0)

This release contains 8 new features, 3 bug fixes and 7 documentation improvements.

🆕 Features

Milvus document store (#587)

This release supports the Milvus vector database as a document store.

from docarray import DocumentArray
da = DocumentArray(storage='milvus', config={'n_dim': 3})  # n_dim: embedding dimensionality

Root_id for document stores (#808)

When working with a vector database you can now retrieve the root document even if you search at a nested level with sub-indices (for example at chunk level).

top_level_matches = da.find(query=np.random.rand(512), on='@.[image]', return_root=True)

To allow this we now store the root_id in the chunks' tags. You can enable this by passing root_id=True in your document store configuration.
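The idea can be sketched in plain Python, independent of DocArray's internals: if every chunk carries its root document's ID in its tags, a chunk-level match can be mapped back to the root. The tag key `_root_id_`, the helper `roots_of`, and the dict-based stand-ins for Documents are assumptions made for illustration only.

```python
# Hypothetical illustration: plain dicts stand in for Documents.
ROOT_ID_KEY = "_root_id_"  # assumed tag name, for illustration only


def roots_of(chunk_matches, root_index):
    """Map chunk-level matches to root documents, keeping rank order and dropping duplicates."""
    seen = set()
    roots = []
    for chunk in chunk_matches:
        root_id = chunk["tags"][ROOT_ID_KEY]
        if root_id not in seen:
            seen.add(root_id)
            roots.append(root_index[root_id])
    return roots


# Two root documents, looked up by ID.
root_index = {
    "doc-a": {"id": "doc-a", "text": "first root"},
    "doc-b": {"id": "doc-b", "text": "second root"},
}

# Chunk-level matches, best match first; two of them share a root.
chunk_matches = [
    {"id": "c1", "tags": {ROOT_ID_KEY: "doc-b"}},
    {"id": "c2", "tags": {ROOT_ID_KEY: "doc-a"}},
    {"id": "c3", "tags": {ROOT_ID_KEY: "doc-b"}},
]

top_level = roots_of(chunk_matches, root_index)
print([d["id"] for d in top_level])
```

Storing the ID on the chunk itself means the lookup needs no traversal of the document tree, which is why it works even when only the sub-index is searched.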

Filtering based on text keywords for Qdrant (#849)

You can now filter based on text keywords for the Qdrant document store.

filter = {
    'must': [
        {"key": "info", "match": {"text": "shoes"}}
    ]
}

results = da.find(np.random.rand(n_dim), filter=filter)

RGB-D representation of 3D meshes (#753)

DocArray already supports 3D mesh representation in different formats and this release adds support for RGB-D representation.

doc.load_uris_to_rgbd_tensor()

Load multi-page TIFF files into chunks (#845)

Multi-page TIFF images can now be loaded with load_uri_to_image_tensor(). Each page is loaded into a separate chunk.

d = Document(uri="foo.tiff")
d.load_uri_to_image_tensor()
print(d)
<Document ('id', 'uri', 'chunks') at 7f907d786d6c11ec840a1e008a366d49>
  └─ chunks
     ├─ <Document ('id', 'parent_id', 'granularity', 'tensor') at 7aa4c0ba66cf6c300b7f07fdcbc2fdc8>
     ├─ <Document ('id', 'parent_id', 'granularity', 'tensor') at bc94a3e3ca60352f2e4c9ab1b1bb9c22>
     └─ <Document ('id', 'parent_id', 'granularity', 'tensor') at 36fe0d1daf4442ad6461c619f8bb25b7>

Store key frame indices when loading video tensor from uri (#880)

keyframe_indices are now stored in a Document's tags when loading a video into a tensor. This makes it possible to extract the section of the video between two key frames.

d = Document(uri="video.mp4").load_uri_to_video_tensor()
print(d.tags['keyframe_indices'])
[0, 25, 196, ...]
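As a rough sketch of how such indices could be used, the helper below turns a list of key frame indices into frame ranges, one per key frame. The function name `keyframe_segments` and the total frame count of 300 are made up for the example; they are not part of DocArray.

```python
def keyframe_segments(keyframe_indices, n_frames):
    """Return (start, end) frame ranges, end exclusive, one per key frame.

    The last segment runs from the final key frame to the end of the video.
    """
    bounds = list(keyframe_indices) + [n_frames]
    return list(zip(bounds, bounds[1:]))


# Assumed values: three key frames in a 300-frame video.
segments = keyframe_segments([0, 25, 196], n_frames=300)
print(segments)  # [(0, 25), (25, 196), (196, 300)]
```

Assuming the video tensor's first axis is the frame index, slicing it with `tensor[start:end]` would then select the frames of one segment.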

Better plotting of embeddings for nested and complex data (#891)

You can now choose which metadata fields to exclude when calling DocumentArray's plot_embeddings() method. This makes it easier to plot embeddings for complex and nested data.

docs.plot_embeddings(exclude_fields_metas=['chunks'])

Better support for information retrieval evaluation (#826)

This release adds a max_rel_per_label parameter to better support metric calculations that require the number of relevant Documents.

metrics = da.evaluate(['recall_at_k'], max_rel_per_label={i: 1 for i in range(3)})
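To see why the number of relevant Documents matters, consider the textbook definition of recall@k: hits in the top k divided by the total number of relevant items. The sketch below is a generic plain-Python version of that definition, not DocArray's implementation; the function `recall_at_k` and the sample labels are assumptions for illustration.

```python
def recall_at_k(match_labels, query_label, max_rel, k):
    """Fraction of all relevant items that appear among the top-k matches.

    max_rel is the total number of relevant items for the query's label,
    which cannot be inferred from the returned matches alone.
    """
    if max_rel <= 0:
        return 0.0
    hits = sum(1 for label in match_labels[:k] if label == query_label)
    return hits / max_rel


# Labels of the returned matches, best match first (made-up data).
match_labels = ["cat", "dog", "cat", "bird"]

# Two of the top three matches are relevant, out of four relevant items overall.
print(recall_at_k(match_labels, "cat", max_rel=4, k=3))  # 0.5
```

Without a known denominator, the metric could only be computed relative to the matches that were returned, which overstates recall whenever relevant items are missing from the results.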

🐞 Bug Fixes

Support length calculation independently from list-like behavior (#840)

Our prior minor release, DocArray 0.19, added the ability to instantiate a document store without list-like behavior for improved performance. However, calculating the length of certain document stores relied on such list-like behavior. This release fixes length calculation for the Redis document store, making it independent from list-like behavior.

Remove cosine similarity field with false assignment (#835)

In the Weaviate document store, cosine distance is no longer mistakenly assigned to the cosine_similarity field.
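The two quantities are easy to confuse because they are complementary. Under the common convention that cosine distance equals one minus cosine similarity, a small plain-Python sketch (the helper names are illustrative, not DocArray or Weaviate APIs) shows the difference:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Common convention: distance = 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)


a, b = [1.0, 0.0], [1.0, 1.0]
sim = cosine_similarity(a, b)
dist = cosine_distance(a, b)
print(round(sim, 4), round(dist, 4))  # 0.7071 0.2929
```

Storing the distance under a similarity field would make close matches look like poor ones, since a small distance corresponds to a high similarity.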

Rebuild index after clearing storage (#837)

The index for Redis and Elasticsearch document stores is now rebuilt when _clear_storage is called.

📗 Documentation Improvements

  • Correct Document description (#842)
  • Minor correction in Document description (#834)
  • Add username to DocArray pull (#847)
  • Fix broken docs (#805)
  • Fix data management section (#801)
  • Change logic order according to blog (#797)
  • Move cloud support to integrations (#798)

🀘 Contributors

We would like to thank all contributors to this release: