Making Visuals Vocal: SceneXplain's Impact on Product Image Accessibility

SceneXplain transforms product images into audio descriptions, ensuring visual content isn't just seen, but also heard and understood. It's a step forward in creating an inclusive digital world for everyone.

Joyful woman with arms outstretched against a vibrant orange and blue background, with the phrase "Making Visuals Vocal" besi

In a world that is increasingly connected and digital, we're experiencing an unprecedented volume of visual content. Yet, for a significant portion of the population, these visuals remain inaccessible. Digital accessibility, often abbreviated to 'a11y', is a cornerstone of inclusive digital design. It ensures that information is not only available but also perceivable and understandable to all, including people with visual impairments.

SceneXplain, Jina AI's image analysis tool, is here to transform online accessibility. Now you can leverage the power of artificial intelligence to generate audio descriptions for product images. With SceneXplain, images aren't just seen; they're heard and understood, ensuring that the growing volume of visual content is accessible to everyone.

Why is this so critical? For online businesses, making product images accessible can broaden their customer base, increase customer satisfaction, and improve overall brand reputation. It isn't just about inclusivity; it's also a strategic business decision.

That’s not to mention the legal side of things, which we’ve already covered in detail in the following post:

Enhancing Digital Accessibility: How SceneXplain Transforms Multimedia Content for Public Sector Organizations
Explore SceneXplain’s impact on digital accessibility, providing exceptional image descriptions and ensuring compliance with European standards for public sector organizations.

Bringing images to life with textual descriptions

In essence, SceneXplain's voiceover descriptions for product images are a significant step towards digital inclusivity.

Let's explore how this powerful tool works and the far-reaching impact it could have on our digital world. In this post, we'll look at how SceneXplain can transform the world of fashion e-commerce.

🖼️
We'll be using freely available images from Pexels.com in this article.

Since the day of its launch, SceneXplain has been able to use its large-language-model powers to give you a description of what's in an image:

In an outdoor setting featuring a mix of bold and vibrant colors, a
woman, fashionably dressed, poses confidently for a photograph. She is
clad in a green puffy jacket, accompanied by purple pants, and stylish
blue boots. Her head is slightly tilted, and she gently smiles at the
camera. The environment around her comprises two distinct red doors
with white windows, adding a pop of color to the scene. Underneath her
feet, a textured wooden surface supports her stance and adds an earthy
element to the picture.
🧠
For this blog post we're going to use the "Comet" algorithm. In our experience, it is less "creative" than the default Dune algorithm. That's what we want, considering we need accurate information rather than deep scene descriptions.

That's a nice description of the image - but if we're talking fashion, we want descriptions...of the fashion. We can do that with visual question answering by asking:

Describe each of the items of clothing in this image, in high detail,
in the form of a numbered list. Include information about color, cut,
style, fit. Do not mention the human in the picture.

Listen up: Converting images into audio descriptions

If you're an e-commerce store, you likely have some details about the specific item you're trying to showcase in your image. Often you just want to sell one part of the fashion ensemble, not the whole outfit. So assuming you've specified your item as (for example) puffy green jacket, you could ask for that in more detail:

Write a detailed product description for the puffy green jacket in
this image, focusing on color, style, cut and fit (if mentioned)

At the same time, let’s go to the add-ons menu and tick the box for an audio description:

We get the output:

Introducing our stylish and cozy green puffer coat, designed to
elevate your winter wardrobe while keeping you warm and comfortable.
This eye-catching coat features a vibrant green color that stands out
in the colder months, making a fashion-forward statement. The puffer
coat is thoughtfully designed with a contemporary and trendy cut,
ensuring a flattering fit while providing ample room for layering and
insulation.

The high-quality construction promises durability and warmth, while
the on-trend style showcases your fashion sense. With its effortlessly
chic vibe and standout color, this puffy green jacket is the perfect
addition to any urban adventure, blending playfulness and contrast
into a single piece. Accentuate your winter ensemble with this
irresistible, fashion-forward puffer coat and experience pure comfort
and style in the colder months.

And because we also asked for a voiceover, the following audio was created:

[Audio: Green jacket, 0:50]

Breaking language barriers: Multilingual audio descriptions

But what if you're selling in multiple countries? How can you get all of this in Spanish, French, and German? You can simply specify those languages in the add-ons menu:

Et voilà, we have output in more languages:

Spanish:

Presentamos nuestra elegante y vibrante chaqueta verde hinchada,
diseñada para hacer una declaración en cualquier entorno. Esta
llamativa pieza está elaborada con un tono verde audaz, lo que la
convierte en una adición destacada a cualquier guardarropa. La
chaqueta cuenta con un diseño contemporáneo, con una elegante silueta
hinchada que agrega volumen sin sacrificar la comodidad. El corte
distintivo de esta pieza asegura un ajuste favorecedor, con una
cuidadosa atención al detalle en su construcción. Abraza tu confianza
y muestra tu sentido único de la moda con esta chaqueta verde hinchada
imprescindible.

[Audio: Jacket Spanish, 0:34]

German:

Wir stellen unsere stilvolle und lebendige Puffy Green Jacket vor, die
in jeder Umgebung ein Statement setzt. Dieses auffällige Stück ist mit
einem kräftigen Grünton gefertigt, was es zu einer herausragenden
Ergänzung für jede Garderobe macht. Die Jacke zeichnet sich durch ein
zeitgemäßes Design aus und zeichnet sich durch eine schicke, bauschige
Silhouette aus, die Volumen verleiht, ohne auf Komfort zu verzichten.
Der markante Schnitt dieses Stücks sorgt für eine schmeichelhafte
Passform, mit viel Liebe zum Detail in der Konstruktion. Zeigen Sie
Ihr Selbstvertrauen und zeigen Sie Ihren einzigartigen Sinn für Mode
mit dieser unverzichtbaren Puffy Green Jacket.

[Audio: Jacket German, 0:44]

Mastering the magic: Navigating the API

🤖
We’re updating and improving our API all the time! Check the SceneXplain API page for the latest specification.

In the real world, we're not going to spend our time clicking through a web interface to upload images. Rather, we'd use SceneXplain's API. Let's assume we have products.csv in the following format:

"name","image_url"
"green puffy jacket","https://images.pexels.com/photos/2529172/pexels-photo-2529172.jpeg"
"long pink coat","https://images.pexels.com/photos/2043590/pexels-photo-2043590.jpeg"
"white sneakers","https://images.pexels.com/photos/2946956/pexels-photo-2946956.jpeg"
🗃️
In a real-world CSV you'd have a lot more fields and records. We're just keeping it simple for this example.

The CSV contains links to the following images:

Let’s load up that file and specify our question for each image based on a template QUESTION:

import csv
import http.client
import json
from pprint import pprint

CSV_FILE = 'products.csv'
QUESTION = 'Write a detailed product description for the {} in this image, focusing on color, style, cut and fit (if mentioned)'

data = []

# newline='' is what the csv module recommends when opening CSV files
with open(CSV_FILE, newline='') as file:
    # skipinitialspace tolerates a space after each comma in the CSV
    products = csv.DictReader(file, skipinitialspace=True)

    for product in products:
        # add the request fields for SceneXplain to each CSV row
        product['image'] = product['image_url']
        product['question'] = QUESTION.format(product['name'])
        product['features'] = ['tts', 'question_answer']
        product['algorithm'] = 'Comet'
        product['languages'] = ['en']
        product['style'] = 'concise'
        data.append(product)

# wrap the list of rows in the request body
data = {'data': data}
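
If you sell in several markets, the same step can ask for more than one language per image. The sketch below wraps the payload-building logic in a function. `build_payload` and its `languages` parameter are names made up for this illustration, and the `es` and `de` codes are an assumption modeled on the `en` code used in this post, so check the API page for the codes it currently accepts.

```python
import csv
import io

QUESTION = ('Write a detailed product description for the {} in this image, '
            'focusing on color, style, cut and fit (if mentioned)')


def build_payload(csv_file, languages=('en',)):
    """Build a request body from an open CSV file with name and image_url columns.

    build_payload is a hypothetical helper, not part of any SceneXplain
    library. The field names mirror the request fields used in this post.
    """
    items = []
    # skipinitialspace tolerates a space after each comma in the CSV
    for row in csv.DictReader(csv_file, skipinitialspace=True):
        item = dict(row)  # keep name and image_url alongside the request fields
        item.update({
            'image': row['image_url'],
            'question': QUESTION.format(row['name']),
            'features': ['tts', 'question_answer'],
            'algorithm': 'Comet',
            'languages': list(languages),  # assumed: one code per language
            'style': 'concise',
        })
        items.append(item)
    return {'data': items}


# Example with an in-memory CSV instead of a file on disk
sample = io.StringIO(
    '"name","image_url"\n'
    '"green puffy jacket","https://images.pexels.com/photos/2529172/pexels-photo-2529172.jpeg"\n'
)
payload = build_payload(sample, languages=('en', 'es', 'de'))
```

With a real file you would pass the handle from `open(CSV_FILE, newline='')` instead of the `io.StringIO` sample.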

And here’s the code to send that data over to SceneXplain, then parse the JSON response:

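# placeholder: replace with the secret generated for your SceneXplain account
YOUR_GENERATED_SECRET = '<your generated secret>'

headers = {
    'x-api-key': f'token {YOUR_GENERATED_SECRET}',
    'content-type': 'application/json',
}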

connection = http.client.HTTPSConnection('us-central1-causal-diffusion.cloudfunctions.net')
connection.request('POST', '/describe', json.dumps(data), headers)
response = connection.getresponse()

response_data = response.read().decode('utf-8')

connection.close()

data = json.loads(response_data)
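
The snippet above assumes the request succeeds. In a batch job it can help to check the response before using it. Here is a minimal sketch: `parse_response` is a hypothetical helper, and it assumes that a successful call returns HTTP 200 with a JSON object containing a `result` list, which is the key this post reads in the next step.

```python
import json


def parse_response(status, body):
    """Return the list under 'result', or raise if the call did not succeed.

    parse_response is a hypothetical helper. The 200 status and the
    'result' key are assumptions based on the code in this post.
    """
    if status != 200:
        raise RuntimeError(f'Request failed with HTTP {status}: {body[:200]}')
    parsed = json.loads(body)
    if 'result' not in parsed:
        raise RuntimeError('Response has no "result" field')
    return parsed['result']


# Example with a hand-written body rather than a real API response
example_body = json.dumps({'result': [{'name': 'green puffy jacket'}]})
results = parse_response(200, example_body)
```

With `http.client`, you would call it as `parse_response(response.status, response_data)`.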

Finally, we can write the information we want to a new CSV file:

# extract just the info we want
records = []
for product in data['result']:
    record = {}
    record['name'] = product['name']
    record['image_url'] = product['image_url']
    record['desc_en'] = product['i18n']['en']
    record['audio_en'] = product['tts']['en']
    records.append(record)

# write to CSV
field_names = list(records[0].keys())

with open('products-updated.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=field_names)

    writer.writeheader()
    writer.writerows(records)
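
If the request asked for several languages, the extraction step can loop over them and add one description column and one audio column per language. This is a sketch under an assumption: that `i18n` and `tts` in each result are keyed by language code for every requested language, as they are for `en` above. `extract_records` is a hypothetical helper and the sample data is hand-written, not real API output.

```python
def extract_records(results, languages=('en',)):
    """Flatten API results into rows with desc_<lang> and audio_<lang> columns."""
    records = []
    for product in results:
        record = {'name': product['name'], 'image_url': product['image_url']}
        for lang in languages:
            # .get() leaves the cell empty if a language is missing
            record[f'desc_{lang}'] = product.get('i18n', {}).get(lang, '')
            record[f'audio_{lang}'] = product.get('tts', {}).get(lang, '')
        records.append(record)
    return records


# Hand-written sample in the shape this post reads from data['result']
sample_results = [{
    'name': 'green puffy jacket',
    'image_url': 'https://example.com/jacket.jpeg',
    'i18n': {'en': 'English description', 'es': 'Descripción en español'},
    'tts': {'en': 'https://example.com/en.wav'},
}]
records = extract_records(sample_results, languages=('en', 'es'))
```

The resulting rows can be passed to `csv.DictWriter` exactly as above, since `field_names` is derived from the keys of the first record.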

Perfect! Now we have our updated products-updated.csv, containing not just the product name and image URL but also detailed product descriptions and links to the generated audio files:

name,image_url,desc_en,audio_en
green puffy jacket,https://images.pexels.com/photos/2529172/pexels-photo-2529172.jpeg,"The green puffy jacket in this image appears to be a vibrant and eye-catching shade of green. It is described as being a ""jacket,"" which suggests that it is worn as an outer layer. The use of the word ""puffy"" indicates that it is likely a down jacket or has some sort of insulation, which would make it warm and cozy. It is not specified whether it has a specific style or cut, but based on the rest of the woman's outfit, it seems to be a casual, relaxed fit. Overall, the green puffy jacket adds a pop of color to the image and would likely be a functional and stylish choice for cold weather.",https://storage.googleapis.com/causal-diffusion.appspot.com/audios%2Feeff4097-a337-4291-b7e7-d0b2e977589b%2Fen.wav
long pink coat,https://images.pexels.com/photos/2043590/pexels-photo-2043590.jpeg,"The pink coat in the image appears to be a long overcoat, likely made from wool or a wool blend material. The color is a soft shade of pink, which complements the black dress worn by the woman. The style of the coat is slightly oversized and features a classic notch collar. With its lengthy design, the coat would provide ample coverage during colder months. Unfortunately, the description does not provide information on the cut or fit of the coat.",https://storage.googleapis.com/causal-diffusion.appspot.com/audios%2F16f03ec8-7164-4e1e-b63e-0895159a1320%2Fen.wav
white sneakers,https://images.pexels.com/photos/2946956/pexels-photo-2946956.jpeg,"The white sneakers in the image are clean and maintained in pristine condition. They feature a blue star design on top, adding a touch of color to an otherwise minimalist look. The cut and fit of the shoes are not mentioned in the description. However, based on their appearance, they seem to be low-cut and have a narrow profile, suggesting a sleek and modern style. The shoes also bear a logo on the side, which could indicate a particular brand or designer. Overall, these white sneakers are an excellent choice for someone looking for a fresh and stylish addition to their footwear collection.",https://storage.googleapis.com/causal-diffusion.appspot.com/audios%2Ff42db823-5e1c-4dc1-bab2-269cfe2ca51f%2Fen.wav
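
To put these columns to work on a product page, each row can be turned into markup that pairs the image with its description and an audio player. The sketch below shows one possible layout, not a definitive accessibility pattern. `product_to_html` is a hypothetical helper and the sample row uses placeholder values.

```python
from html import escape


def product_to_html(row):
    """Render one CSV row as a figure with alt text, description and audio.

    product_to_html is a hypothetical helper. It expects the columns
    written to products-updated.csv: name, image_url, desc_en, audio_en.
    """
    name = escape(row['name'])
    return (
        '<figure>\n'
        f'  <img src="{escape(row["image_url"])}" alt="{name}">\n'
        f'  <figcaption>{escape(row["desc_en"])}</figcaption>\n'
        f'  <audio controls src="{escape(row["audio_en"])}"></audio>\n'
        '</figure>'
    )


# Placeholder row in the same shape as products-updated.csv
sample_row = {
    'name': 'green puffy jacket',
    'image_url': 'https://example.com/jacket.jpeg',
    'desc_en': 'A "puffy" jacket in a vibrant shade of green.',
    'audio_en': 'https://example.com/en.wav',
}
markup = product_to_html(sample_row)
```

`html.escape` keeps quotes and angle brackets in the generated descriptions from breaking the markup. In practice you would feed it rows from `csv.DictReader` over products-updated.csv.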

And just like that, with a few lines of code and the power of SceneXplain's API, you've transformed your product images into both text and audio narratives. This not only enhances your site's accessibility but also widens the gateway to a more diverse audience. Now, more customers than ever before can experience your products, proving once again that inclusivity is good for business.

Jumpstart your journey with SceneXplain voiceover

Embarking on your SceneXplain journey couldn't be simpler. Just make your way over to scenex.jina.ai, sign up, and you'll be greeted with complimentary credits to get you started.

SceneXplain - Explore image storytelling beyond pixels
Leverage GPT-4 & LLMs for the most advanced image storytelling. Explain visuals for content creators, media, & e-commerce with rich captions, multilingual support, and seamless API integration. Experience the future of image description today.

Got questions? Get in touch on our Discord community and we’ll be happy to help!