By now, most of us know how to handle a suspicious phone call from a stranger. We are inundated with robocalls and scammers who so earnestly wish to discuss our car’s extended warranty. But what if a close friend or family member called you in distress? Would you be willing to comply if the request was spoken by a voice that you’ve known for years? Could you tell if it was real or fake?
These days, probably not.
Advancements in artificial intelligence (AI) have led to the birth of a new technology that can replicate any person’s voice and use that voice to realistically produce sounds, words, and full sentences.
Enter the audio deepfake.
Audio deepfakes, also known as voice cloning, are audio samples that mimic the speech of a specific person. Unlike the text-to-speech technology of old, audio deepfakes capture the speech patterns, intonation, and cadence unique to an individual. Scammers are now leveraging audio deepfakes to impersonate trusted individuals for personal and financial gain, propaganda, and more.
The recent boom in AI capabilities is what makes the production of audio deepfakes possible. (We have been trying to simulate human speech since well before AI came around, though. And the Voder was doing a pretty good job 80 years ago.) Modern algorithms can now easily replicate not only the tone and pitch of a person's voice, but also subtle nuances such as accent and emotional inflection. This level of realism makes it increasingly challenging — and, in some cases, impossible — for listeners to distinguish between authentic recordings and artificial ones.
Audio deepfake technology employs deep learning models trained on vast datasets of human speech. These models analyze the acoustic features and linguistic patterns of a target voice, learning to replicate them with high fidelity. Advances in neural network architectures and computational power have enabled these systems to generate seamless voice deepfakes, capable of deceiving both human listeners and automated voice recognition systems.
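As a simplified illustration of the "acoustic features" these models learn from, the sketch below converts a short recording into a log-mel spectrogram, the kind of time-frequency representation most voice-cloning pipelines are trained to predict before a vocoder turns it back into audio. It assumes the librosa library is installed and uses a hypothetical reference file, target_voice.wav.

```python
# A minimal sketch (not any vendor's actual pipeline) of the acoustic
# features a voice-cloning model typically learns from.
import librosa
import numpy as np

# Load a short recording of the target speaker (16 kHz mono is common for speech models).
waveform, sample_rate = librosa.load("target_voice.wav", sr=16000, mono=True)

# Convert the raw waveform into a log-mel spectrogram: a time-frequency
# representation that captures pitch, timbre, and cadence far better than
# raw samples do.
mel = librosa.feature.melspectrogram(y=waveform, sr=sample_rate, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)

print(f"Loaded {waveform.size / sample_rate:.1f} s of audio")
print(f"Feature matrix shape (mel bands x frames): {log_mel.shape}")
```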
Originally developed for enhancing human-computer interactions and assisting communication for individuals with speech impediments, audio deepfakes are now being exploited for malicious purposes. By analyzing short audio snippets from various sources—such as social media recordings, podcasts, or public speeches—attackers can synthesize new audio that sounds like a targeted individual and use it for highly convincing scams.
In short, very easy. Creating an audio deepfake has become possible for anyone (not just Ethan Hunt). Numerous AI tools are offered online, with a range of capabilities and features at price points from free to hundreds of dollars. Resemble.ai, Descript, and Realtime Voice Cloning are just a few of the many tools that are out there.
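To show just how low the barrier is, here is a rough sketch using the open-source Coqui TTS library and its XTTS v2 model as one example. Exact model names and arguments vary between library versions, and my_voice_sample.wav is a hypothetical reference recording.

```python
# A rough sketch of how little code voice cloning can take, using the
# open-source Coqui TTS library. Model names and arguments may differ
# between versions of the library.
from TTS.api import TTS

# Download and load a pretrained multilingual voice-cloning model.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Generate speech in the reference speaker's voice from arbitrary text.
tts.tts_to_file(
    text="This is a demonstration of synthetic speech in a cloned voice.",
    speaker_wav="my_voice_sample.wav",  # a short recording of the target's real voice
    language="en",
    file_path="cloned_output.wav",
)
```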
According to a report by McAfee, cybercriminals can generate convincing voice clones with as little as three seconds of genuine audio as a reference. Naturally, the more genuine audio you can provide the tool, the more realistic the deepfake speech will be. In most cases, three minutes of sampled audio is considered the ideal minimum.
One of the most prevalent uses of audio deepfakes is in imposter scams. Scammers impersonate individuals the target trusts—such as family members, colleagues, or authority figures—over phone calls. The scammer plays audio that mimics the purported caller and presents an alarming scenario that requires help from the victim, usually in the form of a financial payment. The audio may claim that the caller is in legal trouble, medical distress, or physical danger, and that immediate payment is the only way to resolve the situation.
By manipulating victims' emotions and trust, the scammer can coerce them into divulging sensitive information or transferring money under these false pretenses. These scams prey on the vulnerability of victims who are caught off guard by the convincing nature of the fake voices.
Audio deepfakes have also been used to impersonate influential politicians and business leaders. Fake messages disseminated from these trusted figures can influence public perception on matters such as war, the stock market, public policy, and elections. Such deepfake attacks aim to change mass public beliefs and are targeted at broad audiences rather than specific individuals.
Audio deepfake scams have been on the rise in the wake of the AI revolution. According to the Federal Trade Commission (FTC), impostor scams, including those involving audio deepfakes, have seen a significant increase in recent years. In 2022, over 36,000 reports were filed, resulting in financial losses exceeding $11 million. McAfee reports that in 2023, one in ten surveyed individuals had been targeted by a voice cloning scam. That number is expected to rise as deepfake tools grow more sophisticated.
To protect yourself from audio deepfake scams, follow the guidance below: