DocArray 0.17 Update

Engineering Group

Sep 24, 2022 • 5 min read

DocArray 0.17 is a minor release. It contains 8 new features, 2 performance improvements, 7 bug fixes, and 2 documentation improvements.

DocArray is a library for nested, unstructured, multimodal data in transit, including text, image, audio, video, 3D mesh, etc. It allows deep-learning engineers to efficiently process, embed, search, recommend, store, and transfer the multi-modal data with a Pythonic API.

DocArray is the common data layer used in all Jina AI products.

Release Note (`0.17.0`)

Release time: 2022-09-23 16:18:19

This release contains 8 new features, 2 performance improvements, 7 bug fixes, and 2 documentation improvements.

🆕 Features

Allow passing parameters to `load_uri_to_*` methods (#540)

The load_uri_to_* methods (load_uri_to_blob, load_uri_to_text, etc.) now accept kwargs so that you can pass a timeout parameter to the underlying request methods.

For example:

from docarray import Document

doc = Document(uri='uri_path')
doc.load_uri_to_blob(timeout=2)  # timeout is forwarded to the underlying request
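
The forwarding pattern itself can be sketched outside DocArray. This is a minimal illustration, not DocArray's implementation: `fetch_bytes` is a hypothetical helper, and the sketch assumes the underlying request function is the standard library's `urllib.request.urlopen`, which accepts a `timeout` keyword.

```python
import urllib.request


def fetch_bytes(uri: str, **kwargs) -> bytes:
    """Hypothetical helper: forward keyword arguments (e.g. timeout) to urlopen."""
    with urllib.request.urlopen(uri, **kwargs) as response:
        return response.read()


# A data: URI keeps the example offline; the timeout is accepted but not exercised.
payload = fetch_bytes('data:text/plain,hello', timeout=2)
```

With a remote URI, the same `timeout=2` would bound how long the request may block.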

Allow multiple DocumentArrays per Redis server (#540)

You can now store multiple DocumentArrays in a single Redis instance, as long as each DocumentArray has a different index_name:

from docarray import DocumentArray

da1 = DocumentArray(storage='redis', config={'host': 'localhost', 'port': 6379, 'n_dim': 128, 'index_name': 'da1'})
da2 = DocumentArray(storage='redis', config={'host': 'localhost', 'port': 6379, 'n_dim': 256, 'index_name': 'da2'})
da3 = DocumentArray(storage='redis', config={'host': 'localhost', 'port': 6379, 'n_dim': 512, 'index_name': 'da3'})

Login required for DocumentArray push and pull (#541)

You must now log in to Jina Cloud before pushing DocumentArrays to it or pulling them from it. You can log in either by creating a token in hub.jina.ai and setting it as an environment variable (JINA_AUTH_TOKEN=my_token) or by using the CLI command jina auth login.
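
In a script, it can help to fail early when the token is missing. The sketch below is only an assumption about how such a check might look: `has_jina_auth_token` is a hypothetical helper, not part of DocArray, and it looks only at the JINA_AUTH_TOKEN environment variable, so it does not detect a session created with jina auth login.

```python
import os


def has_jina_auth_token(environ=os.environ) -> bool:
    """Hypothetical helper: True if JINA_AUTH_TOKEN is set to a non-empty value."""
    return bool(environ.get('JINA_AUTH_TOKEN', '').strip())


# Demonstrated on plain dicts so the real environment is left untouched.
logged_in = has_jina_auth_token({'JINA_AUTH_TOKEN': 'my_token'})
logged_out = has_jina_auth_token({})
```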

Push metadata along with DocumentArray and add `cloud_list` and `cloud_delete` methods (#490)

DocumentArray.push now extracts metadata about the DocumentArray and sends it to Jina Cloud. This is transparent to users and helps with visualization of DocumentArrays in Jina Cloud.

It is also possible to list and delete DocumentArrays in Jina Cloud using the following methods:

DocumentArray.cloud_list(): will list all DocumentArray objects owned by the authenticated user
DocumentArray.cloud_delete(da_name): will delete the DocumentArray by name if it is owned by the authenticated user

Full text search support in Redis backend (#535)

The Redis backend now supports full-text search on the Document.text field, provided that you enable text indexing, and on Document tags, provided that you specify which tag fields to index.

For example:

from docarray import Document, DocumentArray

da = DocumentArray(
    storage='redis', config={'n_dim': 2, 'index_text': True}
)
da.extend([
    Document(text='Redis allows you to search by text query,'),
    Document(text='by vector similarity'),
    Document(text='Or by filter conditions'),
]) # add documents with text field

da.find('my text query').texts

Result:

['Redis allows you to search by text query,']

Add logical operators `$and` and `$or` in Redis (#509)

The Redis backend now supports $and and $or logical operators. For example:

from docarray import DocumentArray

da = DocumentArray(storage='redis', config={'n_dim': 128, 'columns': {'col1': 'str', 'col2': 'int'}})

redis_filter = {
    "$or": {
        "col1": {"$eq": "value"},
        "col2": {"$lt": 100}
    }
}

# retrieve documents using filter
da.find(redis_filter)
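
To make the semantics of the two operators concrete, here is a minimal pure-Python sketch that evaluates the same filter shape against plain dictionaries. It is an illustration only, not how the Redis backend evaluates filters: `matches` is a hypothetical function, and it handles just `$and`, `$or`, `$eq`, and `$lt`.

```python
def matches(tags: dict, query: dict) -> bool:
    """Evaluate a small subset of the filter syntax against a dict of column values."""
    for key, condition in query.items():
        if key == '$and':
            # Every nested condition must hold.
            if not all(matches(tags, {k: v}) for k, v in condition.items()):
                return False
        elif key == '$or':
            # At least one nested condition must hold.
            if not any(matches(tags, {k: v}) for k, v in condition.items()):
                return False
        else:
            value = tags.get(key)
            for op, operand in condition.items():
                if op == '$eq':
                    if value != operand:
                        return False
                elif op == '$lt':
                    if value is None or not value < operand:
                        return False
                else:
                    raise ValueError(f'unsupported operator in this sketch: {op}')
    return True


conditions = {'col1': {'$eq': 'value'}, 'col2': {'$lt': 100}}
rows = [
    {'col1': 'value', 'col2': 500},
    {'col1': 'other', 'col2': 50},
    {'col1': 'other', 'col2': 500},
    {'col1': 'value', 'col2': 50},
]

or_hits = [matches(row, {'$or': conditions}) for row in rows]
and_hits = [matches(row, {'$and': conditions}) for row in rows]
```

Here `$or` keeps every row that satisfies at least one condition, while `$and` keeps only the last row, which satisfies both.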

Columns in backend configuration should be a dictionary, not a list of tuples (#526)

The columns configuration parameter for storage backends has been changed from a list of tuples to a dictionary in the following format: {'column_name': 'column_type'}. This helps with YAML compatibility.

For example:

from docarray import DocumentArray
da = DocumentArray(storage='annlite', config={'n_dim': 128, 'columns': {'col1': 'str', 'col2': 'float'}})
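
If existing code still builds columns as a list of tuples, converting it is a one-liner. The sketch below assumes the old form was a list of (name, type) pairs, as described above; `columns_to_dict` is a hypothetical helper, not part of DocArray.

```python
def columns_to_dict(columns):
    """Hypothetical helper: accept the old list-of-tuples form or the new dict form."""
    # dict() copies a dict and converts a list of (name, type) pairs.
    return dict(columns)


old_columns = [('col1', 'str'), ('col2', 'float')]
new_columns = columns_to_dict(old_columns)

# The converted mapping can then be placed in the backend config.
config = {'n_dim': 128, 'columns': new_columns}
```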

Allow displaying image documents using either tensor or URI (#518)

It is now possible to choose which field to use when displaying an image document:

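import os

from docarray import Document

# cur_dir is the directory that contains this script and the toydata folder
cur_dir = os.path.dirname(os.path.abspath(__file__))

d = Document(uri=os.path.join(cur_dir, 'toydata/test.png'))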
d.display()
d.display(from_='uri')

d.load_uri_to_image_tensor()
d.display(from_='tensor')

Backwards incompatible API changes

Increased minimum versions for dependencies:

Package	Minimum Version
`jina-hubble-sdk`	`0.13.1`
`annlite`	`0.3.12`

Other API Changes:

The columns configuration parameter for storage backends has been changed from a list of tuples to a dictionary in the following format: {'column_name': 'column_type'}.

🚀 Performance

Optimize find with an `exists` condition (#519)

We removed unnecessary and costly computation from DocumentArray.find when it is used with an exists filter. When running the following code:

from docarray import DocumentArray, Document

num = 10_000  # number of Documents of each kind; any size works

da = DocumentArray(Document(text='text') for _ in range(num)) + \
     DocumentArray(Document(blob=b'blob') for _ in range(num))

da.find(query={'text': {'$exists': True}})

you should expect find to run 200-300% faster.

This optimization applies only to DocumentArray.find or DocumentArray.match calls that use an exists condition on an in-memory document store.

Change default journal mode to WAL in SQLite backend (#506)

The default journal mode in the SQLite backend is now WAL. This should improve performance when using the SQLite backend.

According to the SQLite docs, WAL is significantly faster, provides more concurrency, and is more robust.
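
For reference, this is how the journal mode is switched with Python's standard sqlite3 module. The sketch talks to SQLite directly rather than through DocArray's backend, and the temporary file name is arbitrary. It uses a file-backed database because WAL does not apply to in-memory databases.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, 'example.db')
    conn = sqlite3.connect(db_path)
    try:
        # The PRAGMA returns the journal mode that is now in effect.
        journal_mode = conn.execute('PRAGMA journal_mode=WAL').fetchone()[0]
    finally:
        conn.close()
```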

🐞 Bug Fixes

Keep default values for vector similarity parameters in Redis backend (#559)

DocumentArray's Redis backend previously initialized schemas in the Redis database with default values of vector similarity search parameters. Those default values came from DocArray, not Redis.

This overrode the database's own defaults even though the user had not explicitly asked for it. We've changed the implementation so that it no longer alters the database's default values. Default values now depend on the Redis database version.

Adapt to AnnLite changes (#543)

AnnLite introduced a breaking change in 0.3.12. Therefore, we have adapted our implementation to the latest version of AnnLite and increased the minimum required version to 0.3.12.

Keep out of mask docs in delete by mask (#534)

DocumentArray's delete by mask operation used to behave unexpectedly. Before this fix, the following code also erased the last Document, even though it is not covered by the mask:

from docarray import DocumentArray

da = DocumentArray.empty(3)
mask = [True, False]
del da[mask]
print(len(da))  # printed 1 before the fix

We have fixed this behavior, and DocumentArray now correctly keeps documents that are not covered by the mask.
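
The intended semantics can be sketched on a plain list. This is an illustration of the rule, not DocArray's code: `delete_by_mask` is a hypothetical function that treats positions beyond the end of the mask as not selected.

```python
def delete_by_mask(items, mask):
    """Return the items that survive: those whose mask entry is False or missing."""
    return [
        item
        for index, item in enumerate(items)
        if not (index < len(mask) and mask[index])
    ]


docs = ['d0', 'd1', 'd2']
remaining = delete_by_mask(docs, [True, False])
```

Only the first item is removed, so two of the three items remain.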

Fix Finetuner link for Totally Looks Like (#532)

We've fixed an incorrect link in the documentation.

Fix AnnLite type map (#533)

DocArray type mapping used the wrong types in AnnLite. We've now replaced the types specified in the document store implementation with the correct ones.

Create Strawberry types with kwargs (#527)

Strawberry introduced a breaking change in 0.128.0, making it necessary to pass parameters as keyword arguments. We've adapted our code base to this change.

Make device more generic (#515)

Some parts of in-memory distance computation used to restrict tensor device conversion to cuda. We've changed the implementation to make device conversion more generic.

📗 Documentation Improvements

Add benchmark reference to feature summary (#510)

We've added a "One Million Benchmark" section to the "Feature Summary" page.

Update push/pull setup instructions (#516)

We've updated the pip installation instructions required to use DocumentArray push/pull.

🤟 Contributors

We would like to thank all contributors to this release:

Leon Wolf (@fogx)
samsja (@samsja)
AlaeddineAbdessalem (@alaeddine-13)
Halo Master (@linkerlin)
Han Xiao (@hanxiao)
Wang Bo (@bwanglzu)
Anne Yang (@AnneYang720)
Joan Fontanals (@JoanFM)
