JukeBox Version 2

Speaker recognition systems rely on ideal audio conditions, such as minimal background noise, a neutral speaking style, and normal vocal effort, to achieve good performance. However, practical application scenarios often deviate from these conditions, leading to poor speaker recognition performance. The singing voice is one such example: it combines intrinsic factors of speech variability, such as speaking style and vocal effort, with extrinsic elements such as background instrumentation and vocals. In this work, we propose a multi-task, domain adaptation-based speaker recognition method that is robust to the non-ideal audio conditions of the singing voice. The proposed method outperforms several state-of-the-art speaker recognition systems on singing voice while maintaining comparable performance on spoken voice. Detailed analysis of the speech embeddings extracted by the proposed method demonstrates their robustness to variation in speaking style (spoken vs. singing). We also extend a publicly available singing voice dataset with corresponding spoken voice data to enable research on cross-domain speaker recognition, i.e., matching a person's singing voice to their spoken voice.
- Speaker ID 13, Original audio file 0_1.wav
- Metadata: Albert West | Dutch | male | Spoken Voice
- Speaker ID 13, Original audio file 1.wav
- Metadata: Albert West | Dutch | male | Singing Voice
- Speaker ID 363, Original audio file 0_1.wav
- Metadata: Amy Millan | English | female | Spoken Voice
- Speaker ID 363, Original audio file 1.wav
- Metadata: Amy Millan | English | female | Singing Voice
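
The sample listing above follows a regular two-line pattern (an ID/file line followed by a `Name | Language | Gender | Domain` metadata line), so it can be read into records and paired across domains. The following is a minimal sketch under that assumption only; the `Sample` class, the helper names `parse_listing` and `cross_domain_pairs`, and the regular expressions are hypothetical illustrations and are not part of the dataset or its tooling.

```python
import re
from dataclasses import dataclass

# Hypothetical record type for one listed audio sample (not part of the dataset).
@dataclass(frozen=True)
class Sample:
    speaker_id: int
    audio_file: str
    name: str
    language: str
    gender: str
    domain: str  # "Spoken Voice" or "Singing Voice", as written in the listing

# Patterns assumed from the listing format shown above.
ID_RE = re.compile(r"Speaker ID (\d+), Original audio file (\S+)")
META_RE = re.compile(r"Metadata: (.+)")

def parse_listing(lines):
    """Parse alternating ID/metadata lines into Sample records."""
    samples = []
    for id_line, meta_line in zip(lines[0::2], lines[1::2]):
        id_match = ID_RE.search(id_line)
        meta_match = META_RE.search(meta_line)
        if id_match is None or meta_match is None:
            raise ValueError(f"Unexpected listing format: {id_line!r}, {meta_line!r}")
        name, language, gender, domain = (
            field.strip() for field in meta_match.group(1).split("|")
        )
        samples.append(
            Sample(int(id_match.group(1)), id_match.group(2),
                   name, language, gender, domain)
        )
    return samples

def cross_domain_pairs(samples):
    """Pair each singing sample with every spoken sample of the same speaker."""
    spoken = {}
    for sample in samples:
        if sample.domain == "Spoken Voice":
            spoken.setdefault(sample.speaker_id, []).append(sample)
    return [
        (spoken_sample, sample)
        for sample in samples
        if sample.domain == "Singing Voice"
        for spoken_sample in spoken.get(sample.speaker_id, [])
    ]

LISTING = [
    "- Speaker ID 13, Original audio file 0_1.wav",
    "- Metadata: Albert West | Dutch | male | Spoken Voice",
    "- Speaker ID 13, Original audio file 1.wav",
    "- Metadata: Albert West | Dutch | male | Singing Voice",
    "- Speaker ID 363, Original audio file 0_1.wav",
    "- Metadata: Amy Millan | English | female | Spoken Voice",
    "- Speaker ID 363, Original audio file 1.wav",
    "- Metadata: Amy Millan | English | female | Singing Voice",
]

samples = parse_listing(LISTING)
pairs = cross_domain_pairs(samples)
for spoken_sample, singing_sample in pairs:
    print(spoken_sample.name, spoken_sample.audio_file, "<->", singing_sample.audio_file)
```

Each resulting pair is a same-speaker spoken/singing trial, which is the kind of match that cross-domain speaker recognition is meant to score; different-speaker trials could be built the same way by pairing across speaker IDs.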