DeepTalk: Vocal Style Encoding for Speaker Recognition and Speech Synthesis

Speaker recognition is the task of determining a person's identity from their voice. The human voice, as a biometric modality, is a combination of physiological and behavioral characteristics. The physical traits of the voice production system determine the physiological characteristics of the voice, while prosodic (pitch, timbre) and high-level (lexicon) traits impart its behavioral characteristics. In this work, we develop deep learning-based models for extracting speaker-dependent behavioral speech characteristics. These behavioral characteristics are then combined with speaker-dependent physiological speech characteristics to improve speaker recognition performance in degraded audio signals. We further design a deep learning-based speech synthesis framework that uses these behavioral speech characteristics to generate highly realistic synthetic speech samples. Some examples of synthetic speech generated by this method are given below. We also make the code for this project available on GitHub.
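This page does not say how the two kinds of speaker characteristics are combined for recognition. As a rough illustration only, the sketch below assumes weighted score-level fusion: each feature type yields its own cosine match score between two utterances, and the two scores are averaged with a tunable weight. The function names, the `weight` parameter, and the toy vectors are hypothetical and are not taken from the DeepTalk code.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def fused_score(behav_a, behav_b, physio_a, physio_b, weight=0.5):
    """Weighted sum of behavioral and physiological match scores.

    Assumption: score-level fusion with a single weight in [0, 1].
    weight=1.0 uses only the behavioral score, weight=0.0 only the
    physiological score.
    """
    s_behav = cosine_similarity(behav_a, behav_b)
    s_physio = cosine_similarity(physio_a, physio_b)
    return weight * s_behav + (1.0 - weight) * s_physio


if __name__ == "__main__":
    # Toy placeholder embeddings, not outputs of any real model.
    behav_enroll, behav_test = [1.0, 0.0], [0.0, 1.0]
    physio_enroll, physio_test = [1.0, 2.0], [1.0, 2.0]
    print(fused_score(behav_enroll, behav_test, physio_enroll, physio_test))
```

In practice the weight would be tuned on held-out data, and a higher fused score would indicate a more likely same-speaker match.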

Speech Synthesis Experiment 1:

The speech samples below demonstrate the ability of the proposed DeepTalk method to generate high-quality, realistic synthetic speech from a target speaker's reference audio and a target text utterance. For comparison, we also provide synthetic speech generated by a baseline Tacotron2 model.

Target Text: In a scene that played out multiple times over the weekend and into Tuesday afternoon, the California National Guard airlifted hundreds of civilians

Target Speaker	Reference Audio	Synthetic Audio (Baseline)	Synthetic Audio (DeepTalk)
Speaker 1 Male
Speaker 2 Female
Speaker 3 Male
Speaker 4 Female

Speech Synthesis Experiment 2:

Copy Synthesis Experiment - In this experiment, the target text is the same as the utterance spoken in the original reference audio. An ideal synthetic speech sample should therefore reproduce both the content and the vocal style of the reference audio.

Example 1:

Target Text: Ingham county had recorded 695 covid-19 cases as of saturday morning and increase of 21 cases since approximately 24 hours before

Reference Audio	Synthetic Audio (Baseline)	Synthetic Audio (DeepTalk)

Example 2:

Target Text: This all comes following a recent nationwide study by retail analytics company first insight found that malls ranked last among locations where consumers say they will feel safe shopping

Reference Audio	Synthetic Audio (Baseline)	Synthetic Audio (DeepTalk)

A. Chowdhury, A. Ross, P. David, “DeepTalk: Vocal Style Encoding for Speaker Recognition and Speech Synthesis,” Proc. of the 46th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), June 2021.