The scope of this challenge is to predict the semantic alignment of a given general audio–text pair. For a generative model, evaluating the alignment between its input and output is extremely important. For instance, objective methods have been proposed to evaluate the alignment between audio and text in text-to-audio generation (TTA). However, it has been pointed out that these methods often correlate poorly with human subjective evaluations. In this challenge, our goal is to build a model that automatically predicts the semantic alignment between audio and text for TTA evaluation, i.e., to develop an automatic evaluation method that correlates highly with human subjective evaluations.
The training and validation data consist of the following:
The test data consist of 3,000 audio–text pairs; listener IDs are not included. The test data are evaluated by a set of listeners different from those who rated the training and validation data. The sampling rate, number of channels, and bit depth of the audio match those of the training and validation data. The test data will be released after November 10, 2025.
| Dataset or model name | Type | Added | Link |
| --- | --- | --- | --- |
Submissions will be evaluated on the basis of the correlation coefficients and score error between the predicted and average-semantic-alignment scores. Specifically, the metrics are the linear correlation coefficient (LCC), Spearman's rank correlation coefficient (SRCC), Kendall's rank correlation coefficient (KTAU), and the mean squared error (MSE). When \(y\) represents the average-semantic-alignment scores and \(\hat{y}\) the predicted scores for each audio–text pair, the evaluation metrics are calculated as follows.
\[ LCC = \frac{\sum(y - m_y)(\hat{y} - m_{\hat{y}})}{\sqrt{\sum(y-m_y)^2 \sum(\hat{y}-m_{\hat{y}})^2}}, \]
where \(m_y\) and \(m_{\hat{y}}\) denote the means of \(y\) and \(\hat{y}\), respectively.
\[ SRCC = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}, \qquad d = rank(y) - rank(\hat{y}), \]
where \(n\) denotes the number of samples and \(rank(\cdot)\) denotes the rank obtained by sorting the scores. If there are ties, the average rank is assigned to each of the tied values.
\[ KTAU = \frac{N_c - N_d}{\sqrt{(N_c + N_d + N_{tx})(N_c + N_d + N_{ty})}}, \]
where \(N_c\), \(N_d\), \(N_{tx}\), and \(N_{ty}\) denote the number of concordant pairs (pairs ranked in the same order by the predicted and average-semantic-alignment scores), the number of discordant pairs, the number of pairs tied only in the predicted scores, and the number of pairs tied only in the average-semantic-alignment scores, respectively.
\[ MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2, \]
where \(N\) denotes the number of samples. The purpose of this challenge is to develop an automatic evaluation method that correlates highly with human subjective evaluations; we therefore use metrics that capture both the correlation with, and the deviation from, human evaluation scores. The source code for each evaluation metric will be provided in the GitHub repository of the baseline model.
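For reference, the following is a minimal, unofficial sketch of how these four metrics can be computed with NumPy and SciPy; the function name and example scores are illustrative and do not correspond to the official evaluation code that will be released.

```python
# Minimal, unofficial sketch of the four evaluation metrics using NumPy/SciPy.
# The official implementation will be released with the baseline model on GitHub.
import numpy as np
from scipy import stats

def evaluate(y_true, y_pred):
    """Compute LCC, SRCC, KTAU, and MSE between average and predicted scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    lcc, _ = stats.pearsonr(y_true, y_pred)       # linear correlation coefficient
    srcc, _ = stats.spearmanr(y_true, y_pred)     # ties receive average ranks
    ktau, _ = stats.kendalltau(y_true, y_pred)    # tau-b, accounts for tied pairs
    mse = float(np.mean((y_true - y_pred) ** 2))  # mean squared error

    return {"LCC": lcc, "SRCC": srcc, "KTAU": ktau, "MSE": mse}

# Example: y_true are average-semantic-alignment scores, y_pred are model outputs.
print(evaluate([4.2, 1.5, 3.0, 2.8], [3.9, 1.8, 3.3, 2.5]))
```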
The final ranking is determined by the SRCC metric. If multiple teams are tied, the standings will be determined using the LCC, KTAU, and MSE metrics. The top five ranked submissions will be invited to submit a 2-page paper, and the accepted papers will have the opportunity to be presented at ICASSP 2026.
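For illustration only, this ranking rule can be read as a lexicographic sort over the metrics, assuming ties are broken in the listed order; the team names and scores below are hypothetical.

```python
# Hypothetical illustration of the ranking rule: sort primarily by SRCC (higher is
# better), then break ties by LCC and KTAU (higher is better) and MSE (lower is better).
teams = [
    {"name": "team_a", "SRCC": 0.85, "LCC": 0.84, "KTAU": 0.66, "MSE": 0.31},
    {"name": "team_b", "SRCC": 0.85, "LCC": 0.86, "KTAU": 0.67, "MSE": 0.29},
]
ranking = sorted(teams, key=lambda t: (-t["SRCC"], -t["LCC"], -t["KTAU"], t["MSE"]))
print([t["name"] for t in ranking])  # team_b ranks first via the LCC tie-break
```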
The task organizers will provide a supervised score prediction model similar to the baseline model of RELATE. The baseline model consists of an audio encoder, a text encoder, and an LSTM-based score predictor; pre-trained models are used for both encoders. Since this is the first challenge focusing on audio–text alignment, we have adopted a relatively simple baseline model. We will provide the source code of the baseline model on GitHub. If you use this code, please cite the task description paper, which will be published in November, as well as the following paper.
We will release the baseline model on September 14, 2025.
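As a rough, unofficial sketch of the structure described above (pre-trained audio and text encoders feeding an LSTM-based score predictor), one possible PyTorch formulation is shown below; all module names, feature dimensions, and pooling choices are assumptions and do not represent the official baseline.

```python
# Unofficial sketch of a supervised score predictor with the structure described
# above: features from frozen pre-trained audio/text encoders and an LSTM-based
# score head. All dimensions and design choices are illustrative assumptions.
import torch
import torch.nn as nn

class AlignmentScorePredictor(nn.Module):
    def __init__(self, audio_dim=768, text_dim=768, hidden_dim=256):
        super().__init__()
        # Project frame-level audio features and a pooled text embedding
        # into a common space before the LSTM.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim * 2, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(hidden_dim * 2, 1)  # scalar alignment score

    def forward(self, audio_feats, text_emb):
        # audio_feats: (batch, frames, audio_dim) from a pre-trained audio encoder
        # text_emb:    (batch, text_dim) from a pre-trained text encoder
        a = self.audio_proj(audio_feats)
        t = self.text_proj(text_emb).unsqueeze(1).expand(-1, a.size(1), -1)
        x = torch.cat([a, t], dim=-1)        # fuse the text with every audio frame
        out, _ = self.lstm(x)
        score = self.head(out.mean(dim=1))   # average over time, then regress
        return score.squeeze(-1)

# Dummy forward pass with random encoder outputs.
model = AlignmentScorePredictor()
dummy_audio = torch.randn(2, 100, 768)
dummy_text = torch.randn(2, 768)
print(model(dummy_audio, dummy_text).shape)  # torch.Size([2])
```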
For inquiries or questions, please send an email to contact@xacle.org.