
Cloud-Native Helps You Build Multimodal AI in Production: Here's How

How would cloud-native actually help you build multimodal AI in production? What is cloud-native anyway? Is it just a buzzword that attracts VC money yet delivers an empty promise?

Jina AI

Nov 24, 2022 • 10 min read

Jina is the framework for building multimodal AI applications on the cloud. With Jina, developers can easily build high-performance cloud-native applications, services, and systems in production. But at this point, you're not buying those marketing words: How would cloud-native actually help you? What is cloud-native anyway? Is it just a buzzword that attracts VC money yet delivers an empty promise?

In a parallel universe, this is VC's logic when seeing the phrase "cloud-native". (Un?)Fortunately in our universe, VCs are much more rational.

In this article, we start with a project from Bob, a machine learning engineer who just joined an e-commerce company. Bob's task is to build a shop-the-look service that lets users upload a photo and search for visually similar items in stock. We will walk through Bob's development journey and then answer the questions above.

Source: Van Gils - Spring Summer 2019 collection on Behance, licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

Bob's development journey

A shop-the-look system sounds cool and deep-learning-related, which is exactly Bob's area of expertise, so Bob quickly starts to prototype.

1. Prototyping

There are two modes to the shop-the-look system:

Indexing, which creates a visual representation of all items in stock.
Search, which takes a user-uploaded photo and finds visually similar items in stock.

For indexing, Bob has to extract features from the image of every stock item. These features could be extracted using a convolutional neural network and then stored in a database.

For search, Bob extracts features from the user-uploaded photo using the same convolutional neural network. Then he can use a similarity metric (like cosine similarity) to compare the query features against the stored ones and find visually similar items in stock.
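The two modes described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `extract_features` is a hypothetical stand-in for a convolutional neural network, the "images" are tiny hand-made lists of numbers, and the index is an in-memory dictionary rather than a real database or vector search engine.

```python
import math


def extract_features(pixels):
    """Hypothetical stand-in for a CNN: L2-normalize a flat list of numbers."""
    norm = math.sqrt(sum(p * p for p in pixels)) or 1.0
    return [p / norm for p in pixels]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(stock):
    """Indexing: map each item ID to its feature vector."""
    return {item_id: extract_features(pixels) for item_id, pixels in stock.items()}


def search(index, query_pixels, top_k=2):
    """Search: rank stock items by cosine similarity to the query photo."""
    query = extract_features(query_pixels)
    scored = [(item_id, cosine_similarity(query, vec)) for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up toy data: three stock items described by three numbers each.
stock = {
    "red-dress": [0.9, 0.1, 0.0],
    "blue-jeans": [0.1, 0.2, 0.9],
    "pink-skirt": [0.8, 0.3, 0.1],
}
index = build_index(stock)
results = search(index, [1.0, 0.2, 0.0])
print([item_id for item_id, _ in results])
```

With this toy data, the query is closest to "red-dress", followed by "pink-skirt". In a real prototype the brute-force loop in `search` is the part a vector search engine would replace.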

At this point, Bob needs a deep learning framework like PyTorch, a database like MongoDB to store item data, and possibly a vector search tool like Faiss or Elasticsearch. As a machine learning engineer, Bob is mostly familiar with PyTorch and prototyping. Bob is smart and full of energy, so there’s nothing he can’t learn. He easily glues them together as the first proof of concept (PoC).

Prototype, the sexiest thing in the engineering world

2. Make it a service

Is it done? Oh hell no, it's just getting started. Bob's goal is not a bunch of Python functions but a web service whose IO goes over the network. To get there, he needs to refactor the above logic into a web framework and expose it through an API so that other services can call it.

There are many ways to do this: One example would be to use the Django web framework. Bob creates an endpoint that accepts user-uploaded photos, then uses the above logic to find visually similar items from stock. Finally, results are returned to the user as a JSON object.
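As a rough sketch of such an endpoint, the snippet below uses Python's built-in `http.server` instead of Django so that it stays self-contained. The `/search` path, the `find_similar` placeholder, and the JSON shape are all assumptions made for illustration, not something the article prescribes.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def find_similar(photo_bytes):
    """Hypothetical placeholder for the search logic; returns item IDs."""
    # A real implementation would extract features and query the index.
    return ["item-1", "item-2"] if photo_bytes else []


def handle_search(photo_bytes):
    """Turn raw upload bytes into an (HTTP status, JSON body) pair."""
    if not photo_bytes:
        return 400, json.dumps({"error": "empty upload"})
    return 200, json.dumps({"matches": find_similar(photo_bytes)})


class SearchHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Only the assumed /search path is served.
        if self.path != "/search":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_search(self.rfile.read(length))
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Blocks forever; call this from a script to start the service."""
    HTTPServer(("127.0.0.1", port), SearchHandler).serve_forever()
```

Keeping the request logic in a plain function (`handle_search`) separate from the handler class makes it testable without opening a socket, whichever web framework ends up wrapping it.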

At this point, Bob has learned a few new things, such as REST APIs, web services, and web frameworks, which seem to go beyond his scope as a “machine learning engineer”. He wonders whether they are even worth learning. Still, a machine learning engineer is an engineer, and learning new things is always good. Deep down, though, Bob feels that his engineering skills may not be enough to take the system into production. After some time, he manages to glue everything together.

SaaS, the sexiest thing in the business world.

3. Deploy it to the cloud

The product team is impressed by Bob's progress and asks him to deploy it on AWS to serve some real traffic for an A/B test. This is exciting because it means Bob's PoC will face the public and have real users. Bob encounters many problems while migrating from his local machine to the cloud, mostly dependency issues, CUDA drivers, and GPU hassles. He finally solves them by wrapping everything in a 30GB Docker image. It's a big monolithic container, but it is easy to deploy and manage.

4. Improve scalability and performance

Is it done now? Not yet. The product team wants the service to scale in practice: feature extraction should be parallelized, and concurrent user requests should be handled without lag. The team requires a certain QPS (queries per second).

Bob tries straightforward multiprocessing and threading, but nothing works out of the box with his deep learning stack. He decides to learn high-performance computing frameworks such as Dask or Ray and tries to adopt them. After some trial and error, Bob finally glues everything together and makes it work. At this point, Bob feels exhausted, as the work has drifted too far from his expertise.
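To make the idea of parallelized feature extraction concrete, here is a minimal sketch that splits images into batches and maps them over a worker pool from the standard library. It is not what Bob ends up with: `extract_features` is a hypothetical stand-in for a model forward pass, and a thread pool only yields real speedups when the heavy work happens outside the Python interpreter lock. The same batching idea carries over to a process pool or a distributed framework.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_features(image):
    """Hypothetical stand-in for a model forward pass on one image."""
    return [sum(image) / len(image), max(image)]


def batched(items, batch_size):
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def extract_batch(batch):
    """Process one batch; this is the unit of work handed to a worker."""
    return [extract_features(image) for image in batch]


def extract_all(images, batch_size=2, workers=4):
    """Extract features for all images using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input batches.
        results = pool.map(extract_batch, batched(images, batch_size))
        return [features for batch in results for features in batch]


images = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]]
features = extract_all(images)
print(features)
```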

Scalability is now so important that Bob is willing to give up the other two legs.

5. Ensure availability and minimize downtime

“What if our database is down (due to an update) for a short time?”

So Bob designs some naive failsafe mechanism that he just learned from a blog post. Bob also picks up some AWS service in a rush to ensure availability of the service, hoping it can be fire-and-forget.
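The article does not say which failsafe Bob picks. One common naive choice is to retry a failed database call with exponential backoff, sketched below. `DatabaseUnavailable`, `FlakyDatabase`, and the delay values are hypothetical and exist only to make the example runnable.

```python
import time


class DatabaseUnavailable(Exception):
    """Raised by the hypothetical database while it is down."""


def with_retries(operation, attempts=4, base_delay=0.01, sleep=time.sleep):
    """Call `operation`, retrying with exponential backoff on outages."""
    for attempt in range(attempts):
        try:
            return operation()
        except DatabaseUnavailable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


class FlakyDatabase:
    """Hypothetical database that fails for the first `failures` calls."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def fetch(self, key):
        self.calls += 1
        if self.calls <= self.failures:
            raise DatabaseUnavailable("database is restarting")
        return {"id": key}


db = FlakyDatabase(failures=2)
record = with_retries(lambda: db.fetch("red-dress"))
print(record, db.calls)
```

Retries only paper over short outages. They do nothing for longer downtime, which is part of why this kind of quick fix tends not to survive in production.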

6. Gain observability

“How can I see the incoming traffic?”

Bob changes all print to logger.info and impatiently spins up a dashboard.
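A minimal sketch of that change, using only the standard `logging` module plus an in-process counter that a dashboard could read. The logger name, the log format, and the `record_request` helper are assumptions for illustration.

```python
import logging
from collections import Counter

logger = logging.getLogger("shop_the_look")  # hypothetical logger name
request_counts = Counter()  # (endpoint, status) -> number of requests


def configure_logging(level=logging.INFO):
    """Send timestamped log records to stderr."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )


def record_request(endpoint, status):
    """Count the request and log it instead of printing it."""
    request_counts[(endpoint, status)] += 1
    logger.info("request endpoint=%s status=%d", endpoint, status)


configure_logging()
record_request("/search", 200)
record_request("/search", 200)
record_request("/search", 500)
print(dict(request_counts))
```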

7. Address security concerns

“Can we add some authentication header to it?”

“Is this service prone to attack?”
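For the first question, the simplest answer is a shared token checked on every request. The sketch below assumes a bearer-token scheme and a hard-coded hypothetical `API_TOKEN`; a real service would load the secret from configuration and would need far more than this to answer the second question.

```python
import hmac

API_TOKEN = "change-me"  # hypothetical secret; load from config in practice


def is_authorized(headers, expected_token=API_TOKEN):
    """Return True if the Authorization header carries the expected bearer token."""
    value = headers.get("Authorization", "")
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(token.encode(), expected_token.encode())


print(is_authorized({"Authorization": "Bearer change-me"}))
print(is_authorized({"Authorization": "Bearer wrong-token"}))
```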

At this point, Bob is burning out. The work has strayed too far from his expertise. Bob decides to hand over the project to a senior backend engineer, a new hire with a lot of experience in infrastructure engineering and cloud services. She knows what she's doing and is willing to help Bob.

So Bob sits down with her, scrolling through the glued code, justifying all his tricks and design decisions, and explaining all the caveats. The senior engineer keeps nodding, and Bob takes it as some kind of recognition. Soon after, she takes a slow and thoughtful sip of her coffee and says:

“Why don’t we rewrite it?”

Lessons learned

The above example is real because Bob could be you, me or any machine learning engineer. Most importantly, it reveals some gaps when developing a multimodal AI system in production:

First, there is no design pattern for such a system. It's unclear how you should represent, compute, store, and transmit data of different modalities consistently, and how you can switch between different tools while avoiding glue code.
Second, there is a large gap between a proof of concept and a production system. A production system often requires cloud-native techniques to ensure its reliability and scalability. In particular, microservices, orchestration, containerization, and observability are four pillars of such a system. However, the learning curve is just too steep for many machine learning engineers, preventing them from building production-ready systems.
Third, the time to market is long. If a company chooses the wrong tech stack, it takes longer to bring the product to market, because the company has to spend more time and resources on developing the product, refactoring it, and going back and forth. In addition, the wrong stack can cause problems with the product itself, raising the risk that the product fails.

Cloud-native saves the world

"Cloud-native" is a term that refers to a system designed to run on the cloud. It consists of a group of concepts:

Microservices: The building blocks of a cloud-native system.
Orchestration: The process of managing the microservices.
Containerization: The process of packaging the microservices into containers.
Observability: The process of monitoring the system.
DevOps and CI/CD: The practice of automating the integration and delivery of the system.

Sounds cool but irrelevant. Do we really need all of these?

Yes! The table below summarizes the reasons nicely:

Characteristics of a multimodal AI system	How cloud-native helps
A multimodal system is not a single task. It usually consists of multiple components and forms a workflow or multiple workflows (e.g. indexing and search).	Turn each task into a microservice, then orchestrate the microservices into workflows.
A multimodal system involves complicated package dependencies.	Containerization ensures reproducibility and isolation.
A multimodal system is often a backend/infrastructure service that requires extra stability.	DevOps and CI/CD guarantee the integration, and observability provides the health information of the system.

The taste of Jina

With that, let's look at what Jina promises: Jina provides a unified, cloud-native solution for building multimodal systems from the get-go. It provides the best developer experience from a day-one PoC to production. No more tech debt, no more refactoring, and no more back and forth between different systems.

Now it starts to make sense, right? Let’s get our first taste of how a Jina project looks and works.

We'll look at a simple Jina hello world example: We write a function that appends "hello, world!" to the text of each Document, apply that function twice to two Documents, then return and print their texts.

from jina import DocumentArray, Executor, Flow, requests


class MyExec(Executor):
    @requests
    async def foo(self, docs: DocumentArray, **kwargs):
        for d in docs:
            d.text += 'hello, world!'


f = Flow().add(uses=MyExec).add(uses=MyExec)

with f:
    r = f.post('/', DocumentArray.empty(2))
    print(r.texts)

────────────────────────── 🎉 Flow is ready to serve! ──────────────────────────
╭────────────── 🔗 Endpoint ───────────────╮
│  ⛓     Protocol                    GRPC  │
│  🏠       Local           0.0.0.0:52570  │
│  🔒     Private     192.168.1.126:52570  │
│  🌍      Public    87.191.159.105:52570  │
╰──────────────────────────────────────────╯

['hello, world!hello, world!', 'hello, world!hello, world!']

It's a pretty straightforward program. It abstracts away the complexity of a real multimodal system, leaving only the basic logic: Make a data structure, operate on it, and return the result.

You can achieve the same in 14 lines of code (formatted with Black) in pure Python.

class Document:
    text: str = ''


def foo(docs, **kwargs):
    for d in docs:
        d.text += 'hello, world!'


docs = [Document(), Document()]
foo(docs)
foo(docs)
for d in docs:
    print(d.text)

So, why learn Jina when you could achieve the same result with pure Python? What’s the big deal?

Here’s the deal: The following features come out of the box with the short Jina snippet above:

Replicas, sharding, and scalability in just one line of code
Client/server architecture with duplex streaming
Async non-blocking data workflow
gRPC, WebSockets, HTTP, GraphQL gateway support
Microservice from day one, seamless Docker containerization
Explicit version and dependency control
Reusable building blocks from the Executor marketplace
Immediate observability via Prometheus and Grafana
Seamless Kubernetes integration

If that sounds like overpromising, it isn't. It barely scratches the surface of Jina’s capabilities.

Design principles

With so many powerful features, you might assume that the learning curve of Jina must be very steep. You couldn't be more wrong. You only need to know three concepts to master Jina: Document, Executor, and Flow, which are introduced in Basic Concepts. In particular:

The data type: represents the common data structure across the system; this corresponds to “Document” in Jina.
The logic: represents the product logic of each component; this corresponds to “Executor” in Jina.
The orchestration: represents the workflow of all components; this corresponds to “Flow” in Jina.

These are all you need.
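To show how the three roles fit together, here is a toy model in plain Python. It is emphatically not Jina's implementation or API: `ToyDocument`, `ToyExecutor`, and `ToyFlow` are hypothetical classes that only mirror the division of labor between data type, logic, and orchestration.

```python
from dataclasses import dataclass


@dataclass
class ToyDocument:
    """The data type: one common structure passed between components."""
    text: str = ""


class ToyExecutor:
    """The logic: each component implements `process` over a list of documents."""

    def process(self, docs):
        raise NotImplementedError


class AppendGreeting(ToyExecutor):
    def process(self, docs):
        for doc in docs:
            doc.text += "hello, world!"
        return docs


class ToyFlow:
    """The orchestration: runs the executors one after another."""

    def __init__(self):
        self.executors = []

    def add(self, executor):
        self.executors.append(executor)
        return self  # returning self allows chained .add() calls

    def run(self, docs):
        for executor in self.executors:
            docs = executor.process(docs)
        return docs


flow = ToyFlow().add(AppendGreeting()).add(AppendGreeting())
result = flow.run([ToyDocument(), ToyDocument()])
print([doc.text for doc in result])
```

The toy reproduces the hello world output, but everything it leaves out (networking, scaling, containers, monitoring) is exactly what the feature list above attributes to the real framework.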

The patterns are nice, and the cloud-native features are cool. But what’s the point if you need to spend months learning them? Jina’s design principles are simple and clear: flatten the learning curve of cloud-native techniques and make all the production-level features easily accessible.

Relation to MLOps

MLOps, or DevOps for machine learning, is the practice of combining DevOps practices and tools with machine learning workflows. Jina shares MLOps' aim of automating and improving the process of building, training, and deploying machine learning models.

The benefits of using Jina include:

Faster and more reliable machine learning model development: Automating machine learning workflows (Flows in Jina) speeds up model development and makes the process more reliable.
Increased collaboration between ML scientists and engineers: Using Jina helps to improve communication and collaboration between ML scientists and engineers, as the latter can be more involved in the machine learning model development process.
Improved model (Executor in Jina) quality: Automating machine learning workflows ensures models are of high quality, and improves model testing and validation.
Infrastructure as code (Flow YAML in Jina): Infrastructure as code is another key DevOps practice that can be applied to machine learning. It involves using code to provision and manage infrastructure, making it more flexible and scalable.
Monitoring and logging: Monitoring and logging are important to ensure the quality of machine learning models. They help to identify errors and issues early on so that they can be fixed quickly.
Model management (Executor Hub): Model management keeps track of different versions of machine learning models, ensuring the correct model is used for each task.

Bob is happy with his new gadget and doesn't have to play with glue anymore.

Summary

Before Jina, building multimodal AI on the cloud was tedious and time-consuming. It involved a lot of glue code, back-and-forth refactoring, and after months you eventually ended up with a fragile and unsustainable system. With Jina, you're doing it professionally from day one: from PoC to production; from deployment to scaling; from monitoring to analytics; everything is handled by Jina. We know the pain, we've learned the lessons ourselves, and that’s why we built Jina: to make developing multimodal AI easier, more productive, and enjoyable.