Finetuner 0.7.7 Update

Finetuner makes neural network fine-tuning easier and faster by streamlining the workflow and handling all the complexity and infrastructure requirements in the cloud. With Finetuner, one can easily enhance the performance of pre-trained models and make them production-ready without expensive hardware.

This release covers Finetuner version 0.7.7, including dependencies finetuner-api 0.5.9 and finetuner-core 0.13.5.

This release contains 2 new features, 2 refactorings, 3 bug fixes, and 1 documentation improvement.

🆕 Features

Training data synthesis (#715)

In this release of Finetuner, we have introduced a training data synthesis feature. This feature is particularly useful for users in the e-commerce domain, who may have difficulty obtaining enough labeled training data.

This feature allows you to use historical queries collected from your search system, along with your articles, to generate training data:

import finetuner
from finetuner.model import synthesis_model_en

synthesis_run = finetuner.synthesize(
    query_data='finetuner/xmarket_queries_da',
    corpus_data='finetuner/xmarket_corpus_da',
    models=synthesis_model_en,
)

Once the synthesis job is done, you can get the training data with:

train_data_name = synthesis_run.train_data

And then, you can continue fine-tuning your embedding model with the generated training data:

training_run = finetuner.fit(
    model='bert-base-en',
    train_data=synthesis_run.train_data,
    loss='MarginMSELoss',
    ...,
)

Evaluation on multiple datasets in EvaluationCallback

To facilitate training and evaluating large language models (LLMs) with Finetuner, we have made significant changes to EvaluationCallback: it can now evaluate on multiple datasets.

Attach one EvaluationCallback per dataset and use the caption parameter to label which dataset each set of evaluation results corresponds to:

import finetuner
from finetuner.callback import EvaluationCallback

finetuner.fit(
    ...,
    callbacks=[
        EvaluationCallback(
            query_data='query-1',
            index_data='index-1',
            caption='dataset-1',
        ),
        EvaluationCallback(
            query_data='query-2',
            index_data='index-2',
            caption='dataset-2',
        ),
    ]
)

⚙ Refactoring

Display small loss values with higher precision.

To avoid displaying "0.000" for very small loss values, the display precision has been increased.
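The idea can be sketched as a small formatting helper (illustrative only, not Finetuner's actual implementation): values that would round to "0.000" at fixed precision switch to scientific notation instead.

```python
def format_loss(loss: float) -> str:
    """Render a loss value for progress output.

    Very small values would display as "0.000" at three fixed decimal
    places, so switch to scientific notation below that threshold.
    """
    if loss != 0 and abs(loss) < 0.001:
        return f'{loss:.3e}'
    return f'{loss:.3f}'

print(format_loss(0.5234))     # 0.523
print(format_loss(0.0000421))  # 4.210e-05
```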

Filter PIL debugging messages from logging stack.

To make the logs easier to read, debug messages generated by the PIL package are now excluded from the logging stack.
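If you manage logging yourself, the same effect can be achieved with Python's standard logging module by raising the level of the PIL logger hierarchy (a generic Python pattern, independent of Finetuner):

```python
import logging

# Enable verbose logging for your own code...
logging.basicConfig(level=logging.DEBUG)

# ...but silence PIL's chatty DEBUG records: all of its loggers live under
# the "PIL" namespace (e.g. PIL.PngImagePlugin), so raising the parent
# logger's level filters them in one line.
logging.getLogger('PIL').setLevel(logging.WARNING)
```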

🐞 Bug Fixes

No longer overestimate the batch_size for text models.

This release fixes a bug where the batch size finder would overestimate the maximum usable batch size for text models like BERT. The overestimate was most likely to occur when users fine-tuned the bert-base-en model without specifying batch_size.
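A typical batch size finder probes candidate sizes until one triggers an out-of-memory error; for text models, memory use also grows with sequence length, so probing with short sequences is exactly how an overestimate arises. The search itself can be sketched as a binary search (illustrative, not Finetuner's code; try_batch is a hypothetical probe that raises RuntimeError when a size does not fit):

```python
def find_max_batch_size(try_batch, low: int = 1, high: int = 1024) -> int:
    """Binary-search the largest batch size for which `try_batch` succeeds.

    `try_batch(n)` is expected to run one training step at batch size `n`
    and raise RuntimeError (e.g. CUDA out-of-memory) when `n` is too large.
    """
    best = low
    while low <= high:
        mid = (low + high) // 2
        try:
            try_batch(mid)
            best = mid       # mid fits: search the upper half
            low = mid + 1
        except RuntimeError:
            high = mid - 1   # mid does not fit: search the lower half
    return best
```

To avoid the overestimation described above, the probe must use inputs padded to the maximum sequence length the model will actually see during training.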

Fix division by None error in EvaluationCallback.

Runs configured with both automatic batch-size selection and an automatic evaluation callback previously passed None to EvaluationCallback as batch_size, causing a division-by-None error during evaluation. This has been fixed.

Filter out queries that do not have any matches in EvaluationCallback.

Previously, when the evaluation data contained queries without any matches, Finetuner could not calculate metrics for them and raised division-by-zero errors. Such queries are now filtered out before metrics are computed.
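The fix boils down to skipping queries that contribute nothing to the average, so that no denominator is ever zero. A simplified sketch of the idea (illustrative, not Finetuner's metric code):

```python
def mean_precision(retrieved: dict[str, list[str]], relevant: dict[str, set[str]]) -> float:
    """Average per-query precision, skipping queries with no ground-truth
    matches so neither the per-query score nor the overall average
    divides by zero."""
    per_query = []
    for query, docs in retrieved.items():
        rel = relevant.get(query, set())
        if not rel or not docs:
            continue  # nothing to score for this query: filter it out
        hits = sum(1 for d in docs if d in rel)
        per_query.append(hits / len(docs))
    return sum(per_query) / len(per_query) if per_query else 0.0

score = mean_precision(
    {'q1': ['a', 'b'], 'q2': ['c']},
    {'q1': {'a'}, 'q2': set()},  # q2 has no matches and is skipped
)
print(score)  # 0.5
```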

📗 Documentation Improvements

Add a tutorial for data synthesis (#745)

We have provided a tutorial for the new data synthesis module.

🤟 Contributors

We would like to thank all contributors to this release: