Enhancing McAdams Coefficient-Based Speaker Anonymisation with Cross-Gender Timbre Transfer


Speaker anonymisation, also known as speaker de-identification, modifies original speech so that it resembles the voice of an unspecified speaker while preserving linguistic content and speech quality. This study introduces a speaker anonymisation system built on the second baseline system of the VoicePrivacy2022 Challenge and evaluates it with the evaluation scripts provided by the Challenge. Privacy is enhanced by using a VAE-GAN timbre transfer model to disguise gender identity through a random gender selection strategy. The primary objective utility evaluation indicates room for further improvement. In the secondary utility evaluation, the proposed system achieves favourable voice distinctiveness results, surpassing all baseline systems.

The overall architecture of the proposed system. The original speech is the input, and Stage 1 processes it frame by frame to produce the first anonymised output. In Stage 2, 128x128 mel-spectrograms are extracted from the Stage 1 output, and the pre-trained VAE-GAN model infers new 128x128 mel-spectrograms from them. Finally, the signal is reconstructed using the Fast Griffin-Lim algorithm.
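
To make the frame-by-frame computation in Stage 1 more concrete, the following is a minimal sketch of the pole-angle shift commonly used in McAdams coefficient-based anonymisation. It is not the authors' code or the Challenge baseline implementation. The function name `mcadams_shift_poles`, the example coefficient value `alpha=0.8`, and the numerical tolerance are illustrative assumptions. The sketch assumes that each frame has already been reduced to linear prediction (LPC) coefficients, and it shows only the coefficient transformation, not LPC analysis, residual filtering, or overlap-add resynthesis.

```python
import numpy as np


def mcadams_shift_poles(lpc, alpha):
    """Shift the angles of the complex poles of an all-pole filter.

    lpc   : denominator coefficients [1, a1, ..., ap] for one frame
    alpha : McAdams coefficient (hypothetical value chosen by the caller)

    Each complex pole with angle phi (radians) is moved to angle phi**alpha,
    keeping its radius. Real poles are left unchanged.
    """
    poles = np.roots(lpc)

    # Separate real poles from complex ones (tolerance is an assumption).
    is_complex = np.abs(poles.imag) > 1e-12
    real_poles = poles[~is_complex]

    # Work on one pole of each conjugate pair: the one in the upper half-plane.
    upper = poles[is_complex & (poles.imag > 0)]

    # Raise the angle to the power alpha and keep it inside [0, pi].
    new_angles = np.clip(np.angle(upper) ** alpha, 0.0, np.pi)
    shifted = np.abs(upper) * np.exp(1j * new_angles)

    # Rebuild conjugate pairs so the new filter has real coefficients.
    new_poles = np.concatenate([real_poles, shifted, np.conj(shifted)])
    return np.real(np.poly(new_poles))


# Example: a single resonance at radius 0.9 and angle 0.5 rad.
lpc = np.array([1.0, -2 * 0.9 * np.cos(0.5), 0.9 ** 2])
print(mcadams_shift_poles(lpc, alpha=0.8))
```

With `alpha` below 1, poles at angles below 1 radian move up in frequency and poles above 1 radian move down, which is what alters the formant positions. With `alpha` equal to 1 the coefficients are returned unchanged.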