DocArray 0.17 Update

DocArray 0.17 is a minor release. It contains 8 new features, 2 performance improvements, 7 bug fixes, and 2 documentation improvements.

DocArray is a library for nested, unstructured, multimodal data in transit, including text, image, audio, video, and 3D mesh. It allows deep-learning engineers to efficiently process, embed, search, recommend, store, and transfer multimodal data with a Pythonic API.

💡 DocArray is the common data layer used in all Jina AI products.

Release Note (0.17.0)

Release time: 2022-09-23 16:18:19

This release contains 8 new features, 2 performance improvements, 7 bug fixes, and 2 documentation improvements.

🆕 Features

Allow passing parameters to load_uri_to_* methods (#540)

The load_uri_to_* methods (load_uri_to_blob, load_uri_to_text, etc.) now accept kwargs so that you can pass a timeout parameter to the underlying request methods.

For example:

from docarray import Document

doc = Document(uri='uri_path')  # placeholder URI
doc.load_uri_to_blob(timeout=2)  # timeout is passed to the underlying request
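
To illustrate what forwarding keyword arguments means here, the following standard-library sketch passes any extra arguments straight to the call that opens the URI. The helper name load_uri_to_bytes is hypothetical and this is not DocArray's implementation; it only shows the pattern, and it uses a data: URI so that it runs without network access.

```python
import urllib.request


def load_uri_to_bytes(uri, **kwargs):
    """Hypothetical helper: read a URI, forwarding kwargs (e.g. timeout) to urlopen."""
    with urllib.request.urlopen(uri, **kwargs) as response:
        return response.read()


# A data: URI keeps the example offline; timeout is accepted and passed through.
payload = load_uri_to_bytes('data:text/plain;base64,aGVsbG8=', timeout=2)
print(payload)
```

Because the helper does not name timeout explicitly, any other argument that urlopen accepts can be passed the same way.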

Allow multiple DocumentArrays per Redis server (#540)

You can now store multiple DocumentArrays in a single Redis instance, as long as each DocumentArray has a different index_name:

da1 = DocumentArray(storage='redis', config={'host': 'localhost', 'port': 6379, 'n_dim': 128, 'index_name': 'da1'})
da2 = DocumentArray(storage='redis', config={'host': 'localhost', 'port': 6379, 'n_dim': 256, 'index_name': 'da2'})
da3 = DocumentArray(storage='redis', config={'host': 'localhost', 'port': 6379, 'n_dim': 512, 'index_name': 'da3'})

Login required for DocumentArray push and pull (#541)

You must now log in to Jina Cloud before pushing DocumentArrays to it or pulling them from it. You can log in either by creating a token at hub.jina.ai and setting it as an environment variable (JINA_AUTH_TOKEN=my_token) or by running the CLI command jina auth login.
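
As a rough sketch of the environment-variable route, a script can check for the token before attempting a push or pull. The helper name require_auth_token is hypothetical, and the sketch assumes only that the token is read from JINA_AUTH_TOKEN. It does not cover a login made through the CLI.

```python
import os


def require_auth_token(env=None):
    """Return the token from JINA_AUTH_TOKEN, or fail early with a hint."""
    env = os.environ if env is None else env
    token = env.get('JINA_AUTH_TOKEN')
    if not token:
        raise RuntimeError(
            'JINA_AUTH_TOKEN is not set; create a token or run `jina auth login`.'
        )
    return token


# A plain dict stands in for the real environment in this example.
token = require_auth_token({'JINA_AUTH_TOKEN': 'my_token'})
```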

Push metadata along with DocumentArray and add cloud_list and cloud_delete methods (#490)

DocumentArray.push will extract metadata about the DocumentArray and send it to Jina Cloud. Although this is transparent to users, it will help with visualization of DocumentArrays in Jina Cloud.

It is also possible to list and delete DocumentArrays in Jina Cloud using the following methods:

  • DocumentArray.cloud_list(): will list all DocumentArray objects owned by the authenticated user
  • DocumentArray.cloud_delete(da_name): will delete the DocumentArray by name if it is owned by the authenticated user

Full text search support in Redis backend (#535)

Full-text search is supported on the Document.text field or on Document tags, provided that you enable text indexing or specify which tag fields to index.

For example:

from docarray import Document, DocumentArray

da = DocumentArray(
    storage='redis', config={'n_dim': 2, 'index_text': True}
)
da.extend([
    Document(text='Redis allows you to search by text query,'),
    Document(text='by vector similarity'),
    Document(text='Or by filter conditions'),
]) # add documents with text field

da.find('my text query').texts

Result:

['Redis allows you to search by text query,']

Add logical operators $and and $or in Redis (#509)

The Redis backend now supports $and and $or logical operators. For example:

from docarray import DocumentArray

da = DocumentArray(storage='redis', config={'n_dim': 128, 'columns': {'col1': 'str', 'col2': 'int'}})

redis_filter = {
    "$or": {
        "col1": {"$eq": "value"},
        "col2": {"$lt": 100}
    }
}

# retrieve documents using filter
da.find(redis_filter)
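
To make the semantics of such a filter concrete, here is a small pure-Python sketch that evaluates the same dictionary against plain dicts. It is an illustration only, not how the Redis backend executes filters. The function name matches and the set of comparison operators it handles are assumptions made for this example.

```python
import operator

# Comparison operators covered by this sketch (an assumed subset).
COMPARATORS = {
    '$eq': operator.eq,
    '$ne': operator.ne,
    '$lt': operator.lt,
    '$lte': operator.le,
    '$gt': operator.gt,
    '$gte': operator.ge,
}


def matches(record, query):
    """Return True if the plain dict `record` satisfies `query`."""
    results = []
    for key, condition in query.items():
        if key == '$and':
            # Every nested condition must hold.
            results.append(matches(record, condition))
        elif key == '$or':
            # Any one of the nested conditions is enough.
            results.append(
                any(matches(record, {k: v}) for k, v in condition.items())
            )
        else:
            # `key` is a column name; `condition` maps an operator to a value.
            results.append(
                all(
                    key in record and COMPARATORS[op](record[key], value)
                    for op, value in condition.items()
                )
            )
    return all(results)


redis_filter = {'$or': {'col1': {'$eq': 'value'}, 'col2': {'$lt': 100}}}
rows = [
    {'col1': 'value', 'col2': 500},  # matches on col1
    {'col1': 'other', 'col2': 50},   # matches on col2
    {'col1': 'other', 'col2': 500},  # matches neither
]
print([matches(row, redis_filter) for row in rows])  # [True, True, False]
```

Swapping "$or" for "$and" in the filter would require both column conditions to hold, so none of the three rows would match.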

Columns in backend configuration should be a dictionary, not a list of tuples (#526)

The columns configuration parameter for storage backends has been changed from a list of tuples to a dictionary in the following format: {'column_name': 'column_type'}. This helps with YAML compatibility.

For example:

from docarray import DocumentArray
da = DocumentArray(storage='annlite', config={'n_dim': 128, 'columns': {'col1': 'str', 'col2': 'float'}})
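
If you have existing configurations in the old list-of-tuples form, converting them is straightforward. The helper below is a minimal sketch; the name columns_to_dict is hypothetical and is not part of DocArray.

```python
def columns_to_dict(columns):
    """Convert the old list-of-tuples form to the new dict form; copy dicts as-is."""
    if isinstance(columns, dict):
        return dict(columns)
    return {name: column_type for name, column_type in columns}


old_columns = [('col1', 'str'), ('col2', 'float')]
new_columns = columns_to_dict(old_columns)
print(new_columns)  # {'col1': 'str', 'col2': 'float'}
```

The resulting dictionary can then be passed as the 'columns' entry of the backend config.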

Allow displaying image documents using either tensor or URI (#518)

It is now possible to choose which field to use when displaying an image document:

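import os

from docarray import Document

# Assumes the sample image is stored next to this script.
cur_dir = os.path.dirname(os.path.abspath(__file__))
d = Document(uri=os.path.join(cur_dir, 'toydata/test.png'))
d.display()
d.display(from_='uri')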

or

d.load_uri_to_image_tensor()
d.display(from_='tensor')

Backwards incompatible API changes

Increased minimum versions for dependencies:

Package            Minimum version
jina-hubble-sdk    0.13.1
annlite            0.3.12

Other API Changes:

  • The columns configuration parameter for storage backends has been changed from a list of tuples to a dictionary in the following format: {'column_name': 'column_type'}.

🚀 Performance

Optimize find with an exists condition (#519)

We removed unnecessary and costly computation from DocumentArray.find when it is called with an exists filter. When running the following code:

from docarray import DocumentArray, Document

num = 10_000  # example size; any number of Documents per half works
da = DocumentArray(Document(text='text') for _ in range(num)) + \
     DocumentArray(Document(blob=b'blob') for _ in range(num))

da.find(query={'text': {'$exists': True}})

you should see find run roughly 200-300% faster than before.

This optimization only applies to DocumentArray.find and DocumentArray.match calls that use an exists condition with the in-memory document store.

Change default journal mode to WAL in SQLite backend (#506)

The default journal mode in the SQLite backend is now WAL. This should improve performance when using the SQLite backend.

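According to the SQLite documentation, WAL is significantly faster in most scenarios, provides more concurrency because readers and writers do not block each other, and is less vulnerable to problems on systems where fsync() is broken.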
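
For readers unfamiliar with journal modes, this standard-library sketch shows how WAL is enabled on a SQLite database file. It uses Python's sqlite3 module directly and is not DocArray's code; the file and table names are made up for the example.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    conn = sqlite3.connect(os.path.join(tmp_dir, 'example.db'))
    # The PRAGMA returns the journal mode that is in effect after the call.
    mode = conn.execute('PRAGMA journal_mode=WAL').fetchone()[0]
    conn.execute('CREATE TABLE docs (id TEXT PRIMARY KEY, payload BLOB)')
    conn.commit()
    conn.close()

print(mode)
```

WAL needs a database file on disk; an in-memory database keeps its own journal mode, which is why the sketch writes to a temporary directory.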

🐞 Bug Fixes

Keep default values for vector similarity parameters in Redis backend (#559)

DocumentArray's Redis backend previously initialized schemas in the Redis database with default values of vector similarity search parameters. Those default values came from DocArray, not Redis.

This altered the database's default behavior even though the user had not explicitly asked for it. We've changed the implementation so that it no longer overrides the database's default values. Default values now depend on the Redis database version.

Adapt to AnnLite changes (#543)

AnnLite introduced a breaking change in 0.3.12. Therefore, we have adapted our implementation to the latest version of AnnLite and increased the minimum required version to 0.3.12.

Keep out of mask docs in delete by mask (#534)

DocumentArray's delete-by-mask operation used to behave unexpectedly when the mask was shorter than the DocumentArray. Before this fix, the following code erased the last Document, even though it is not covered by the mask:

from docarray import DocumentArray

da = DocumentArray.empty(3)
mask = [True, False]
del da[mask]
print(len(da))  # printed 1 before this fix

We have fixed this behavior, and DocumentArray will now correctly keep documents that are not present in the mask.
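
The intended behavior can be pictured with plain Python lists. This is a sketch of the semantics described here, not DocArray's implementation, and the function name delete_by_mask is hypothetical: positions beyond the end of the mask are treated as False and are therefore kept.

```python
def delete_by_mask(items, mask):
    """Return the items to keep; positions past the end of the mask are kept."""
    # Pad the mask with False so that uncovered positions are never deleted.
    padded = list(mask) + [False] * (len(items) - len(mask))
    return [item for item, drop in zip(items, padded) if not drop]


kept = delete_by_mask(['doc0', 'doc1', 'doc2'], [True, False])
print(kept)  # ['doc1', 'doc2']
```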

We've fixed an incorrect link in the documentation.

Fix AnnLite type map (#533)

DocArray type mapping used the wrong types in AnnLite. We've now replaced the types specified in the document store implementation with the correct ones.

Create Strawberry types with kwargs (#527)

Strawberry introduced a breaking change in 0.128.0, making it necessary to pass parameters as keyword arguments. We've adapted our code base to this change.

Make device more generic (#515)

Some parts of in-memory distance computation used to restrict tensor device conversion to cuda. We've changed the implementation to make device conversion more generic.

📗 Documentation Improvements

Add benchmark reference to feature summary (#510)

We've added a "One Million Benchmark" section to the "Feature Summary" page.

Update push/pull setup instructions (#516)

We've updated the pip setup instruction required to use DocumentArray push/pull.

🤟 Contributors

We would like to thank all contributors to this release:

  • Leon Wolf (@fogx)
  • samsja (@samsja)
  • AlaeddineAbdessalem (@alaeddine-13)
  • Halo Master (@linkerlin)
  • Han Xiao (@hanxiao)
  • Wang Bo (@bwanglzu)
  • Anne Yang (@AnneYang720)
  • Joan Fontanals (@JoanFM)