CLIP-as-service 0.8.1 Update

CLIP-as-service is a low-latency, high-scalability service for embedding images and text. It can be easily integrated as a microservice into neural search solutions.

Release Note (0.8.1)

This release contains 1 new feature, 1 performance improvement, 2 bug fixes and 4 documentation improvements.

🆕 Features

Allow custom callback in clip_client (#849)

This feature allows clip_client users to send a request to a server and then process the response with custom callback functions. Three callbacks are supported: on_done (invoked when a request succeeds), on_error (invoked when it fails), and on_always (invoked after every request, regardless of outcome).

The following code snippet shows how to send a request to a server and save the response to a database.

from clip_client import Client

db = {}


def my_on_done(resp):
    # store each returned document in the database
    for doc in resp.docs:
        db[doc.id] = doc


def my_on_error(resp):
    # append the error response to a log file; convert to str before writing
    with open('error.log', 'a') as f:
        f.write(str(resp))


def my_on_always(resp):
    # runs after every request, whether it succeeded or failed
    print(f'{len(resp.docs)} docs processed')


c = Client('grpc://0.0.0.0:12345')
c.encode(
    ['hello', 'world'], on_done=my_on_done, on_error=my_on_error, on_always=my_on_always
)

For more details, please refer to the CLIP client documentation.

🚀 Performance

Integrate flash attention (#853)

We have integrated the flash attention module as a faster drop-in replacement for nn.MultiheadAttention. To take advantage of this feature, you will need to install the flash attention module manually:

pip install git+https://github.com/HazyResearch/flash-attention.git

If flash attention is present, CLIP Server will automatically try to use it.
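
To confirm that the module is importable in the same environment as the server, a quick check such as the following can be used. This is a minimal sketch; CLIP Server performs its own detection at startup, which may differ in detail.

# a minimal availability check (a sketch; CLIP Server runs its own detection
# at startup); the HazyResearch package installs under the name `flash_attn`
try:
    import flash_attn  # noqa: F401

    print('flash attention is available')
except ImportError:
    print('flash attention not found; nn.MultiheadAttention will be used')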

The table below compares CLIP performance with and without the flash attention module. We conducted all tests on a Tesla T4 GPU and measured how long it took (in seconds) to encode a batch of documents 100 times.

| Model | Input data | Input shape | w/o flash attention (s) | w/ flash attention (s) | Speedup |
|----------|-------|------------------|---------|---------|---------|
| ViT-B-32 | text | (1, 77) | 0.42692 | 0.37867 | 1.1274 |
| ViT-B-32 | text | (8, 77) | 0.48738 | 0.45324 | 1.0753 |
| ViT-B-32 | text | (16, 77) | 0.4764 | 0.44315 | 1.07502 |
| ViT-B-32 | image | (1, 3, 224, 224) | 0.4349 | 0.40392 | 1.0767 |
| ViT-B-32 | image | (8, 3, 224, 224) | 0.47367 | 0.45316 | 1.04527 |
| ViT-B-32 | image | (16, 3, 224, 224) | 0.51586 | 0.50555 | 1.0204 |

Based on our experiments, the size of the speedup varies with the model and GPU, but in general the flash attention module delivers a consistent improvement.
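
A measurement of this kind can also be approximated from the client side with a simple timing loop. The sketch below is an illustration, not the exact harness used for the numbers above; it assumes a CLIP server is already listening on the port from the earlier example.

# a minimal client-side timing sketch (not the exact benchmark harness
# used for the table above); assumes a server at grpc://0.0.0.0:12345
import time

from clip_client import Client

c = Client('grpc://0.0.0.0:12345')
texts = ['hello'] * 8  # tokenized server-side, e.g. to shape (8, 77)

start = time.perf_counter()
for _ in range(100):
    c.encode(texts)
print(f'100 encoding rounds took {time.perf_counter() - start:.5f}s')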

๐Ÿž Bug Fixes

Increase timeout at startup for Executor Docker images (#854)

During Executor initialization, downloading model parameters can take a long time. If a model is very large or the download is slow, the Executor may time out before it even starts. We have increased the startup timeout to 3,000 seconds (50 minutes).

Install transformers for Executor Docker images (#851)

We have added the transformers package to Executor Docker images, in order to support the multilingual CLIP model.
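
With transformers available in the image, a server configured with a multilingual CLIP model can embed non-English text through the same client API. The snippet below is a sketch: it assumes the server has already been started with a multilingual model (the specific model is configured on the server side, not in this code).

# a minimal sketch: embedding non-English text; assumes the server was
# started with a multilingual CLIP model (configured server-side)
from clip_client import Client

c = Client('grpc://0.0.0.0:12345')
embeddings = c.encode(['你好, 世界', 'bonjour le monde', 'hallo Welt'])
print(embeddings.shape)  # one embedding vector per input text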

📗 Documentation Improvements

  • Update Finetuner docs (#843)
  • Add tips for client parallelism usage (#846)
  • Move benchmark conclusion to beginning (#847)
  • Add instructions for using clip server hosted by Jina (#848)

🤟 Contributors

We would like to thank all contributors to this release: