IEEE SLT 2021 Alpha-mini Speech Challenge (ASC)

Introduction

Human-robot speech interaction (HRSI) is an indispensable skill for humanoid robots. Robots produced by UBTECH are equipped with intelligent voice interaction functions. As a global high-tech innovation enterprise integrating artificial intelligence, humanoid robot research and development, platform software development and application, and product sales, UBTECH has always been committed to smooth, efficient and friendly HRSI technology research and development, enabling every robot to listen and speak.

As the first link in the HRSI chain, keyword spotting (KWS) technology (a.k.a. wake-up word detection) directly determines the experience of subsequent interactions. Meanwhile, accurate sound source location (SSL) provides essential cues for subsequent beamforming, speech enhancement and speech recognition algorithms. In home environments, the following interferences pose great challenges to HRSI: 1) various types of noise from TV, radio, other electrical appliances and human talking, 2) echoes from the loudspeaker mounted on the robot, 3) room reverberation and 4) noise from the mechanical movements of the robot (mechanical noise for short). These interferences complicate KWS and SSL to a great extent, so robust algorithms are in high demand.

UBTECH Technology Co., Ltd., Northwestern Polytechnical University, Idiap Research Institute, Peking University and AISHELL Foundation jointly organize the Alpha-mini Speech Challenge (ASC). Alpha-mini is a robot produced by UBTECH, equipped with an intelligent speech interaction module based on a 4-microphone array. As a flagship challenge event of the 2021 IEEE Spoken Language Technology (SLT) Workshop, ASC will provide participants with labelled audio data recorded from Alpha-mini in real room environments, covering abundant indoor noise, echo and reverberation. It aims to promote research in actual HRSI scenarios and provide a common benchmark for KWS, SSL and related speech tasks.

The official challenge description paper can be downloaded from here.


  @inproceedings{ASC2021,
    title={IEEE SLT 2021 Alpha-mini Speech Challenge: Open Datasets, Tracks, Rules and Baselines},
    author={Fu, Yihui and Yao, Zhuoyuan and He, Weipeng and Wu, Jian and Wang, Xiong and Yang, 
            Zhanheng and Zhang, Shimin and Xie, Lei and Huang, Dongyan and Bu, Hui and Motlicek, Petr and Odobez, Jean-Marc},
    booktitle = {{IEEE SLT 2021}},
    address = {Shenzhen, China},
    year = {2021},
    month = January,
  }

Code for the baseline systems can be found at: https://github.com/nwpuaslp/ASC_baseline

Results

Table 1. Results of KWS Track
Ranking	Team ID	Organization	FRR	FAR	Score (The lower the better)
1	ASC_029	MICL, School of Computer Science and Engineering, Nanyang Technological University, Singapore	0.31	0.29	0.59
2	ASC_018	BATC Lab, Department of Electronic Engineering, Shanghai Jiao Tong University	0.32	0.44	0.75
	Baseline(Deep KWS)		0.55	0.25	0.81
3	ASC_020		0.22	0.64	0.86
4	ASC_016		0.14	0.74	0.88
5	ASC_032		0.45	0.46	0.91
6	ASC_014		0.06	0.91	0.97
7	ASC_019		0.07	0.93	1.00

Table 2. Results of SSL Track
Ranking	Team ID	ACC10 (%)	ACC7.5 (%)	ACC5 (%)	MAE (°)	Score (The higher the better)
	Baseline	27.00	18.93	11.45	66.40	18.73
1	ASC_032	16.65	12.32	9.08	64.15	12.52
2	ASC_004	12.65	9.40	6.89	74.13	9.38
3	ASC_015	6.67	4.80	3.50	88.38	4.58
4	ASC_027	6.02	4.29	3.14	88.65	4.07

Datasets

Table 1. Data to Release
Dataset	Subset	Duration(hrs)	Format	Scenario	Mic-Loudspeaker distance(metres)
Training	Keyword-Train	9.4	16kHz, 16bit, single channel wav	/	/
	Speech-Train	146.1
	Noise-Train	60.0
	Echo-Train	28.5
	Echo-Record	3.0	16kHz, 16bit, six-channel wav
	Noise-Mech	8.6	16kHz, 16bit, six-channel wav
Development	KWS-Dev	7.5	16kHz, 16bit, six-channel wav	Keyword Only	[2, 4]
				Keyword+Noise
				Keyword+Echo
				Keyword+Noise+Echo
				Keyword+Echo+Mech
	SSL-Dev	20.0		Speech Only
				Speech+Noise
				Speech+Echo
				Speech+Noise+Echo
				Speech+Echo+Mech
Evaluation	KWS-Eval	TBA	Same as Development	Same as Development	[2,5]
Evaluation	SSL-Eval	TBA	Same as Development	Same as Development	[2,5]
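
The recorded subsets are listed as 16 kHz, 16-bit, six-channel wav files. As a minimal sketch of how such a file could be split into per-channel sample lists, the following uses only the Python standard library. The function name `read_channels` is hypothetical, and the mapping of the six channels to the microphones or any reference signal is not stated here, so the sketch makes no assumption about it.

```python
import array
import sys
import wave


def read_channels(path):
    """Return (sample_rate, channels); channels[c] is a list of int16 samples."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        n_channels = wf.getnchannels()
        sample_rate = wf.getframerate()
        raw = wf.readframes(wf.getnframes())

    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is stored little-endian

    # Samples are interleaved frame by frame: ch0, ch1, ..., ch0, ch1, ...
    channels = [list(samples[c::n_channels]) for c in range(n_channels)]
    return sample_rate, channels
```

For a six-channel recording, `channels` would hold six equally long lists, one per channel.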

Keyword Spotting (KWS) Track

The data used in the KWS Track is shown in Table 2. Participants can use their own room impulse responses (RIRs), either collected or simulated, for data augmentation to train the KWS model. Furthermore, Echo-Record and Noise-Mech are provided as references for the echo time delay and the mechanical noise of Alpha-mini, respectively. Participants can also use these data sets during training. KWS-Dev, SSL-Dev, KWS-Eval and SSL-Eval are six-channel recorded data. Participants can use KWS-Dev and SSL-Dev directly, without any simulation, to optimize the model.
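
One common way to use an RIR together with the single-channel training sets is to convolve a clean keyword utterance with the RIR and then add noise at a chosen signal-to-noise ratio. The sketch below illustrates that idea on plain Python lists. It is not the baseline's augmentation pipeline, and the function names and the SNR parameter are assumptions made for illustration. A real pipeline would typically use an FFT-based convolution from a numerical library instead of the direct loop shown here.

```python
import math


def convolve(signal, rir):
    """Direct (full) convolution of a signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def power(x):
    """Mean squared amplitude of a sample list."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio is snr_db, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def augment(keyword, rir, noise, snr_db):
    """Reverberate a keyword with an RIR, then add noise at snr_db."""
    # Truncating to the input length drops the reverberation tail.
    reverberant = convolve(keyword, rir)[:len(keyword)]
    return mix_at_snr(reverberant, noise, snr_db)
```

The same mixing step could be applied to echo or mechanical-noise recordings; the appropriate SNR ranges are a design choice that is not specified here.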

Data

Table 2: Data for Keyword spotting (KWS) Track
Train	Development	Evaluation
Keyword-Train	KWS-Dev SSL-Dev	KWS-Eval SSL-Eval
Speech-Train
Noise-Train
Echo-Train
Echo-Record
Noise-Mech

Evaluation & Ranking

We use a combination of false reject rate (FRR) and false alarm rate (FAR) on KWS-Eval and SSL-Eval as the criterion of KWS performance. Suppose the evaluation set has `N_{key}` examples with the keyword and `N_{non\-key}` examples without it. We define FRR and FAR as follows:

`FRR=\frac{N_{FR}}{N_{key}}, \quad FAR=\frac{N_{FA}}{N_{non\-key}}`

where `N_{FR}` is the number of examples that contain the keyword but for which the KWS system gives a negative decision, and `N_{FA}` is the number of examples that do not contain the keyword but for which the KWS system gives a positive decision. The final score of KWS is defined as:

`Score^{KWS} = FRR + FAR`

`FRR` and `FAR` are calculated on all examples in KWS-Eval and SSL-Eval, respectively, and the final ranking is determined by `Score^{KWS}` as defined above. A system with a lower `Score^{KWS}` will be ranked higher.
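
The KWS score can be sketched in a few lines of Python. The function name `kws_score` and the list-based inputs are assumptions for illustration; labels and decisions follow the 1 (keyword) and 0 (no keyword) convention used for submissions.

```python
def kws_score(labels, decisions):
    """Return (FRR, FAR, score) for 0/1 ground-truth labels and decisions."""
    n_key = sum(1 for y in labels if y == 1)
    n_non_key = len(labels) - n_key
    # False reject: keyword present, system says 0.
    n_fr = sum(1 for y, d in zip(labels, decisions) if y == 1 and d == 0)
    # False alarm: keyword absent, system says 1.
    n_fa = sum(1 for y, d in zip(labels, decisions) if y == 0 and d == 1)
    frr = n_fr / n_key
    far = n_fa / n_non_key
    return frr, far, frr + far


labels = [1, 1, 1, 1, 0, 0, 0, 0]
decisions = [1, 1, 1, 0, 1, 1, 0, 0]
frr, far, score = kws_score(labels, decisions)
print(frr, far, score)  # 0.25 0.5 0.75
```

Here one of four keyword examples is missed (FRR = 0.25) and two of four non-keyword examples trigger a detection (FAR = 0.5), giving a score of 0.75.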

Rules

The use of any data not provided by the organizers (except for RIRs) is strictly prohibited. Furthermore, KWS-Dev and SSL-Dev may not be used to train the KWS model. The challenge organizers will provide participants with the topology of the microphone array and loudspeaker, as well as the definition of angle. There is no limitation on the KWS model structure or the training techniques used by participants. The KWS model can have a maximum look-ahead of 500 ms: to infer the current frame at time `T` (in ms), the algorithm can access any number of past frames but only future frames up to `T + 500` ms. If submitted systems have the same score, the system with the lower time delay will be given a higher ranking.
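
To make the look-ahead limit concrete, the sketch below computes which frame indices a model may use when inferring frame `t`. The 10 ms frame shift is a hypothetical value chosen for illustration, and the sketch counts frames by their start times only, ignoring the analysis window length, which a real system would also need to account for.

```python
def visible_range(t, num_frames, frame_shift_ms=10, lookahead_ms=500):
    """Return (start, stop) so that frames start..stop-1 are usable for frame t."""
    # Number of whole future frames that fit inside the look-ahead budget.
    future_frames = lookahead_ms // frame_shift_ms
    # All past frames are allowed; future frames are capped by the budget
    # and by the end of the utterance.
    stop = min(num_frames, t + future_frames + 1)
    return 0, stop


print(visible_range(0, 1000))    # (0, 51)
print(visible_range(980, 1000))  # (0, 1000)
```

With a 10 ms shift, the 500 ms budget corresponds to at most 50 future frames.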

Submission

KWS-Eval and SSL-Eval will not be released before the organizers notify the participants about the results. Participants need to provide the organizers with a docker image of a runnable KWS system. The executable file in the image needs to receive the list of data in KWS-Eval and SSL-Eval and output the KWS result, which indicates whether each sample contains the keyword. If the keyword exists, the sample is labeled as 1, and 0 otherwise.

Sound Source Location (SSL) Track

The data that participants can use in the SSL Track is shown in Table 3. Participants can also use their own RIRs, either collected or simulated, for data augmentation to train the SSL model. Furthermore, Echo-Record and Noise-Mech are provided as references for the echo time delay and the mechanical noise of Alpha-mini, respectively. Participants can also use these data sets during training. SSL-Dev and SSL-Eval are six-channel recorded data. Participants can use SSL-Dev directly, without any simulation, to optimize the model.

Data

Table 3: Data for Sound Source Location (SSL) Track
Train	Development	Evaluation
Speech-Train	SSL-Dev	SSL-Eval
Noise-Train
Echo-Train
Echo-Record
Noise-Mech

Evaluation & Ranking

We use a combination of Mean Absolute Error (MAE) and accuracy (ACC) as the criterion of the SSL performance. With the list of absolute errors of angle `e_i, i=1,...,N`, where `N` is the number of examples, we compute the MAE as:

`MAE = \frac{1}{N}\sum_{i=1}^{N}e_{i}`

ACC under different tolerances `\delta` is defined as:

`ACC_{\delta}=\frac{1}{N}\sum_{i=1}^{N}a_{i}, \quad a_{i}=\begin{cases}1 & \text{if } e_{i} \le \delta \\ 0 & \text{otherwise}\end{cases}`

The final score of SSL is defined as:

`Score^{SSL}=0.3\times ACC_{10} + 0.35 \times ACC_{7.5} + 0.35 \times ACC_{5} + \left(1-\frac{MAE}{MAE_{\text{baseline}}}\right)`

The final rank is computed from the ACC under each tolerance and the MAE over all examples in SSL-Eval, using the equation above. The `MAE_{\text{baseline}}` of SSL-Eval will be released by the challenge organizers. A system with a higher score will be ranked higher.
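
The SSL score can be sketched as follows. Two points are assumptions rather than statements from the rules: the angular error is taken on the circle (so 350° versus 10° counts as 20°), and ACC is expressed in percent. The percent convention does reproduce the scores listed in the SSL results table (Table 2. Results of SSL Track) when the baseline MAE of 66.40° is used. The function names are hypothetical.

```python
def angular_error(pred_deg, true_deg):
    """Absolute angle error in degrees, wrapped to [0, 180] (assumed convention)."""
    diff = abs(pred_deg - true_deg) % 360
    return min(diff, 360 - diff)


def combine(acc10, acc7_5, acc5, mae, mae_baseline):
    """Weighted score from ACC values (in percent) and MAE (in degrees)."""
    return (0.3 * acc10 + 0.35 * acc7_5 + 0.35 * acc5
            + (1 - mae / mae_baseline))


def ssl_score(errors, mae_baseline):
    """Compute the SSL score from a list of absolute angle errors."""
    n = len(errors)
    mae = sum(errors) / n

    def acc(delta):
        # Percentage of examples whose error is within the tolerance.
        return 100.0 * sum(1 for e in errors if e <= delta) / n

    return combine(acc(10), acc(7.5), acc(5), mae, mae_baseline)


# Values for ASC_032 from the SSL results table, with the baseline MAE.
print(round(combine(16.65, 12.32, 9.08, 64.15, 66.40), 2))  # 12.52
```

Because the MAE term is not scaled to percent, it contributes at most 1 point to the score, so the ranking is dominated by the three ACC terms.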

Rules

The use of any data not provided by the challenge organizers (except for RIRs) is strictly prohibited. Furthermore, SSL-Dev and Keyword-Train may not be used to train the SSL model. The challenge organizers will provide participants with the topology of the microphone array and loudspeaker, as well as the definition of angle. There is no limitation on the system architecture, models, training techniques or time delay. However, we encourage participants to develop models with better performance and lower time delay. If submitted systems have the same score, the system with the lower time delay will be given a higher ranking.

Submission

SSL-Eval will not be released before the organizers notify the participants about the results. Participants need to provide the organizers with a docker image of a runnable SSL system. The executable file in the image needs to receive the list of data in SSL-Eval and output the SSL result. The output gives the direction of the speech, ranging from 1° to 360°. Detailed technical support on the usage and submission of docker images will be provided later.
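
Since the expected output range is 1° to 360° rather than 0° to 359°, an estimator that produces arbitrary angles needs a final wrapping step. The helper below is a small sketch of that step. It assumes integer-degree output, which is not confirmed here, and the function name is hypothetical.

```python
def to_output_angle(angle_deg):
    """Map any angle in degrees to an integer in 1..360 (0 becomes 360)."""
    wrapped = round(angle_deg) % 360
    return 360 if wrapped == 0 else wrapped


print(to_output_angle(-90))   # 270
print(to_output_angle(361))   # 1
print(to_output_angle(0))     # 360
```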

Organizing Committee

Youjun Xiong, UBTECH Technology Co., Ltd.
Lei Xie, Northwestern Polytechnical University
Huan Tan, UBTECH Technology Co., Ltd.
Dongyan Huang, UBTECH Technology Co., Ltd.
Jean-Marc Odobez, Idiap Research Institute, Switzerland

Petr Motlicek, Idiap Research Institute, Switzerland
Weipeng He, Idiap Research Institute, Switzerland
Yuexian Zou, Peking University
Hui Bu, AISHELL Foundation
Jian Wu, Northwestern Polytechnical University

Important Dates

Table 4. Important Dates
Dates	Events
September 27th, 2020	Registration due
September 30th, 2020	Release of the training and development set
November 22nd, 2020	Deadline for participants to submit docker image
December 6th, 2020	Organizers will notify the participants about the results
December 27th, 2020	Working note report deadline
January 19th-22nd, 2021	2021 IEEE SLT Workshop date

Application

If you are interested in the challenge, please submit the application form below. The registration deadline is September 27th. The organizing committee will review the application and verify the qualification of the teams within 5 working days. The teams that have passed the review are qualified to join the challenge. The application results will be notified via email.

Submit the Application Information Here (Registration is due)

If you have any questions, please contact slt2021_asc@163.com.

The training data will be released on September 30th, and the data downloading method will be provided to the successfully-registered teams.

Statement

Multiple applications from the same team are PROHIBITED.
The use of any other data that is not provided by challenge organizers (except for RIR) is strictly PROHIBITED.
Any unpermitted use of the development sets is strictly PROHIBITED, including but not limited to using the development sets to fine-tune or train a model.
The result of a submitted system is invalid if any cheating is found.

Awards

First prize: An Alpha-mini robot
Second prize: An Iron Man MARK50 robot
Third prize: A Star Wars First Order Stormtrooper robot

First prize: An Alpha-mini robot
Second prize: An Iron Man MARK50 robot
Third prize: A Star Wars First Order Stormtrooper robot

MISC

Participants can choose any track. Participation in both tracks is also welcome. More details on this challenge will be announced soon. The right of interpretation of the challenge belongs to the organizing committee.

Should you have any questions regarding this challenge, please drop an email to: slt2021_asc@163.com.