DocArray 0.32 Update

DocArray is a library for representing, sending and storing multi-modal data, perfect for Machine Learning applications.

Release Note (v0.32.0)

This release contains 4 new features, 5 bug fixes and 4 documentation improvements.

🆕 Features

Subindex for document index (#1428)

The subindex feature allows you to index documents that contain another DocList by automatically creating a separate collection/index for each such DocList:

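import numpy as np

from docarray import BaseDoc, DocList
from docarray.index import ElasticDocIndex
from docarray.typing import NdArray


# create nested document schema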
class SimpleDoc(BaseDoc):
    tensor: NdArray[10]
    text: str


class MyDoc(BaseDoc):
    docs: DocList[SimpleDoc]


# create some docs
my_docs = [
    MyDoc(
        docs=DocList[SimpleDoc](
            [
                SimpleDoc(
                    tensor=np.ones(10) * (j + 1),
                    text=f"hello {j}",
                )
                for j in range(10)
            ]
        ),
    )
]

# index them into Elasticsearch
index = ElasticDocIndex[MyDoc](index_name="idx")
index.index(my_docs)  # creates two indices, named 'idx' and 'idx__docs'

# search on the nested level (subindex)
query = np.random.rand(10)
matches_root, matches_nested, scores = index.find_subindex(
    query, search_field="docs__tensor", limit=5
)
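
To make the naming and the return values concrete, here is a minimal pure-Python sketch of the idea. It is not DocArray's implementation: the class ToyIndex, its dictionary-based documents and the brute-force cosine search are hypothetical. Only the 'idx' / 'idx__docs' naming and the (root matches, nested matches, scores) result shape are taken from the example above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyIndex:
    """Hypothetical stand-in for a document index with one nested field."""

    def __init__(self, index_name, nested_field):
        self.index_name = index_name
        self.nested_field = nested_field
        self.sub_name = f"{index_name}__{nested_field}"
        # One collection for the root docs, one for the nested docs.
        self.collections = {index_name: [], self.sub_name: []}

    def index(self, docs):
        for doc in docs:
            self.collections[self.index_name].append(doc)
            for child in doc[self.nested_field]:
                # Remember the parent so a nested hit can be mapped back to its root.
                self.collections[self.sub_name].append({**child, "parent_id": doc["id"]})

    def find_subindex(self, query, limit):
        # Rank nested docs by similarity to the query and keep the best `limit`.
        nested = sorted(
            self.collections[self.sub_name],
            key=lambda child: cosine(query, child["tensor"]),
            reverse=True,
        )[:limit]
        roots_by_id = {doc["id"]: doc for doc in self.collections[self.index_name]}
        roots = [roots_by_id[child["parent_id"]] for child in nested]
        scores = [cosine(query, child["tensor"]) for child in nested]
        return roots, nested, scores


docs = [
    {
        "id": i,
        "docs": [
            {"text": f"hello {i}-{j}", "tensor": [float(i + 1), float(j + 1)]}
            for j in range(3)
        ],
    }
    for i in range(2)
]
toy = ToyIndex("idx", "docs")
toy.index(docs)
roots, nested, scores = toy.find_subindex([1.0, 1.0], limit=2)
```

Each nested match comes back with the root document that contains it, which is why a subindex search returns results at both levels.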

OpenAPI and FastAPI tensor shapes (#1510)

We have enabled shaped tensors to be properly represented in OpenAPI/SwaggerUI, both in examples and the schema.

This means that you can now build web APIs using FastAPI where the SwaggerUI properly communicates tensor shapes to your users:

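from fastapi import FastAPI

from docarray import BaseDoc
from docarray.base_doc import DocArrayResponse
from docarray.typing import TorchTensor


class Doc(BaseDoc):
    embedding_torch: TorchTensor[3, 4]


app = FastAPI()


@app.post("/foo", response_model=Doc, response_class=DocArrayResponse)
async def foo(doc: Doc) -> Doc:
    # echo the shaped tensor back to the caller
    return Doc(embedding_torch=doc.embedding_torch)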

[Two images: generated Swagger UI]
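
As a rough illustration of what "communicating a tensor shape" can mean in a schema, the sketch below builds a JSON Schema fragment for a fixed shape using the standard minItems and maxItems keywords, plus a matching example value. This is an assumption-laden sketch: the helper names shape_to_schema and example_for are hypothetical, and the schema DocArray actually emits may be structured differently.

```python
import json


def shape_to_schema(shape):
    """Build a nested JSON Schema array description for a fixed tensor shape."""
    schema = {"type": "number"}
    # Wrap from the innermost dimension outwards.
    for dim in reversed(shape):
        schema = {"type": "array", "items": schema, "minItems": dim, "maxItems": dim}
    return schema


def example_for(shape, fill=0.0):
    """Build a nested list of the given shape to serve as a schema example."""
    if not shape:
        return fill
    return [example_for(shape[1:], fill) for _ in range(shape[0])]


schema = shape_to_schema((3, 4))
schema["example"] = example_for((3, 4))
print(json.dumps(schema, indent=2))
```

A UI that renders such a schema can show both the expected dimensions and a correctly shaped example payload.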

Save and load in-memory index (#1534)

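We added a persist method to the InMemoryExactNNIndex class to save the index to disk. A saved index can be loaded again by passing its path as index_file_path when creating a new index.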

# Save your existing index as a binary file
doc_index.persist('docs.bin')
# Initialize a new document index using the saved binary file
new_doc_index = InMemoryExactNNIndex[MyDoc](index_file_path='docs.bin')
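
The save-and-reload round trip can be sketched with the standard library alone. The class ToyInMemoryIndex below is hypothetical and uses pickle purely for illustration; it does not reflect the binary format DocArray writes. Starting with an empty index when the file is missing is a choice made for this sketch, not documented behavior.

```python
import os
import pickle
import tempfile


class ToyInMemoryIndex:
    """Hypothetical stand-in for an in-memory index that can be saved and reloaded."""

    def __init__(self, index_file_path=None):
        self.docs = []
        # Only load when a path was given and the file is actually there
        # (sketch assumption: a missing file yields an empty index).
        if index_file_path is not None and os.path.exists(index_file_path):
            with open(index_file_path, "rb") as f:
                self.docs = pickle.load(f)

    def index(self, docs):
        self.docs.extend(docs)

    def persist(self, file_path):
        # Write the whole document list to a single binary file.
        with open(file_path, "wb") as f:
            pickle.dump(self.docs, f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "docs.bin")

    original = ToyInMemoryIndex()
    original.index([{"text": "hello", "tensor": [1.0, 2.0]}])
    original.persist(path)

    restored = ToyInMemoryIndex(index_file_path=path)
    missing = ToyInMemoryIndex(index_file_path=os.path.join(tmp, "nope.bin"))
```

After the round trip, restored holds the same documents as original, while missing starts out empty.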

🐞 Bug Fixes

search_field should be optional in hybrid text search (#1516)

The search_field argument of text_search() is now Optional and has a sensible default.

Check if file path exists for in-memory index (#1537)

We have added an internal check to see if index_file_path exists when passed to InMemoryExactNNIndex.

Add empty judgement to index search (#1533)

We have ensured that empty indices do not fail when find is called.

Detach torch tensors (#1526)

Serializing tensors with gradients no longer fails.

DocVec display fixes (#1522)

We have resolved DocVec display issues.

📗 Documentation Improvements

  • Remove erroneous info (#1531)
  • Fix link to documentation in readme (#1525)
  • Flatten structure (#1520)
  • Fix links (#1518)

🤘 Contributors

We would like to thank all contributors to this release: