Man and machine: What makes a voice 'real' in the age of generative AI?
In forensic phonetics, voice presentation attacks have become increasingly challenging in casework practice. Owing to substantial recent developments in generative AI, voice cloning through text-to-speech (TTS) and voice conversion (VC) methods can now achieve high degrees of naturalness and speaker similarity, disrupting a wide variety of industries and posing important questions about genuine (human) creativity. This talk addresses two foundational issues for speech forensics and language science more broadly: (1) given that audio deepfakes reproduce what is easy to average, what linguistic residue is left behind? (2) if generative models can perfectly reproduce speech acoustics, what linguistic properties remain in the signal that still index speaker identity, social meaning, or personhood? Answering these questions about 'fake speech' (or, conversely, about what reifies the bona fide 'voice') has immediate applications across several fields in the near term, as well as enduring implications for linguistic theory, wider societal issues, and the metaphysics of vocal identity.
Daniel Lee is a PhD candidate in Computation, Cognition and Language in the Phonetics Laboratory at Cambridge University. His interdisciplinary research investigates fake speech, including voice impersonation and audio deepfakes, drawing on forensic phonetics, cognitive science, and artificial intelligence. Prior to his doctoral research, he received his BA (Hons) and MA from NTU Singapore, where he empirically corroborated principles of voice quality (Laver, 1980) using real-time, structural, whole-vocal-tract MRI data. The confluence of these experiences raises an overarching question: what makes a 'voice' real/human?