Voice Cloning: The Rising Threat of Audio Deepfakes in Impersonation Scams
Author: Stephen Bondurich
By now, most of us know how to handle a suspicious phone call from a stranger. We are inundated with robocalls and scammers who so earnestly wish to discuss our car’s extended warranty. But what if a close friend or family member called you in distress? Would you be willing to comply with a request spoken by a voice you’ve known for years? Could you tell if it was real or fake?
These days, probably not.
Advancements in artificial intelligence (AI) have given rise to a new technology that can replicate any person’s voice and use it to produce realistic sounds, words, and full sentences.
Enter the audio deepfake.
What are Audio Deepfakes?
Audio deepfakes, also known as voice cloning, are audio samples that mimic the speech of a specific person. Going beyond the text-to-speech technology of old, audio deepfakes capture the speech patterns, intonations, and cadences that are unique to an individual. Scammers are now leveraging audio deepfakes to impersonate trusted individuals for financial gain, propaganda, and more.
How does audio deepfake technology work?
The recent boom in AI capabilities is what makes the production of audio deepfakes possible. (We have been trying to simulate human speech since well before AI came around, though. And the Voder was doing a pretty good job 80 years ago.) Modern algorithms can now easily replicate not only the tone and pitch of a person's voice, but also subtle nuances such as accent and emotional inflections. This level of realism makes it increasingly challenging — and, in some cases, impossible — for listeners to discern between authentic recordings and artificial ones.
Audio deepfake technology employs deep learning models trained on vast datasets of human speech. These models analyze the acoustic features and linguistic patterns of a target voice, learning to replicate them with high fidelity. Advances in neural network architectures and computational power have enabled these systems to generate seamless voice deepfakes, capable of deceiving both human listeners and automated voice recognition systems.
Originally developed for enhancing human-computer interactions and assisting communication for individuals with speech impediments, audio deepfakes are now being exploited for malicious purposes. By analyzing short audio snippets from various sources—such as social media recordings, podcasts, or public speeches—attackers can synthesize new audio that sounds like a targeted individual and use it for highly convincing scams.
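To make the idea of "acoustic features" concrete, here is a minimal sketch, assuming Python and the open-source Resemblyzer library (my example; it is not one of the tools discussed in this article), of how a model condenses a recording into a speaker embedding, the compact voice fingerprint that cloning and voice-verification systems are built on. The file names are placeholders.

    # A minimal sketch of the "analyze acoustic features" step: condense each
    # recording into a fixed-length speaker embedding and compare the two.
    # Requires: pip install resemblyzer  (file names below are placeholders)
    from pathlib import Path

    import numpy as np
    from resemblyzer import VoiceEncoder, preprocess_wav

    # Load and normalize two short speech recordings.
    known_voice = preprocess_wav(Path("known_speaker.wav"))
    unknown_voice = preprocess_wav(Path("unknown_caller.wav"))

    encoder = VoiceEncoder()

    # Each embedding is a fixed-length vector summarizing pitch, timbre, and
    # other speaker-specific acoustic traits learned from large speech datasets.
    embed_known = encoder.embed_utterance(known_voice)
    embed_unknown = encoder.embed_utterance(unknown_voice)

    # The embeddings are L2-normalized, so their dot product is the cosine
    # similarity: values near 1.0 suggest the same speaker.
    similarity = float(np.dot(embed_known, embed_unknown))
    print(f"Speaker similarity: {similarity:.2f}")

Cloning services pair an embedding like this with a speech-synthesis model conditioned on it, which is why only a few seconds of reference audio can be enough to imitate a voice.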
How easy is it to create an audio deepfake?
In short, very easy. Creating an audio deepfake is now possible for anyone (not just Ethan Hunt). Numerous AI tools are offered online, with a range of capabilities and features at price points from free to hundreds of dollars. Resemble.ai, Descript, and Realtime Voice Cloning are just a few of the many options out there.
According to a report by McAfee, cybercriminals can generate convincing voice clones with as little as three seconds of genuine audio as a reference. Naturally, the more genuine audio you can provide the tool, the more realistic the deepfake speech will be. In most cases, three minutes of sample audio is considered the ideal minimum.
What kinds of scams are audio deepfakes used for?
One of the most prevalent uses of audio deepfakes is in impostor scams. Scammers impersonate people the target trusts, such as family members, colleagues, or authority figures, over phone calls. The scammer plays audio that mimics the purported caller and presents an alarming scenario that requires the victim's help, usually in the form of a financial payment. The audio may claim that the caller is in legal trouble, medical distress, or physical danger, and that immediate payment is the only way to resolve the situation.
By manipulating victims' emotions and trust, the scammer can coerce them into divulging sensitive information or transferring money under these false pretenses. These scams prey on the vulnerability of victims who are caught off guard by the convincing nature of the fake voices.
Audio deepfakes have also been used to impersonate influential politicians and business leaders. Fake messages attributed to these trusted figures can influence public perception on matters such as war, the stock market, public policy, and elections. These attacks seek to shift mass public opinion and are directed at broad audiences rather than specific individuals.
How prevalent are audio deepfake scams?
Audio deepfake scams have been on the rise in the wake of the AI revolution. According to the Federal Trade Commission (FTC), impostor scams, including those involving audio deepfakes, have seen a significant increase in recent years. In 2022, over 36,000 reports were filed, resulting in financial losses exceeding $11 million. McAfee reports that in 2023, one in ten surveyed individuals were targeted by voice cloning scams. That number is expected to rise with the increased sophistication of deepfake tools.
How can I protect myself from audio deepfake scams?
To protect yourself from audio deepfake scams, follow the guidance below:
Be skeptical: Scam scenarios are often designed to be jarring, urgent, and stressful, so as to impede rational thinking. If you get such a call, take a deep breath and proceed with skepticism.
Don’t trust caller ID: Caller ID can easily be spoofed; a caller can choose any phone number they wish to appear on your caller ID. If this number is saved to a contact on your phone, that contact’s name will appear as well. Caller ID alone is not sufficient proof of identity.
Hang up and call back: If you suspect caller ID spoofing, hang up and call the person back at a number you already know and trust. A scammer cannot receive calls at a spoofed number.
Devise a code word: Agree on a code word with your family or close friends to protect against impersonation scenarios. You can verify each other’s authenticity on the phone by asking for the established code word that only the trusted group knows. This might sound excessive, but it is simple and extremely effective.
Be wary of the form of payment: Exercise caution if the caller requests payment via Bitcoin, wire transfer, gift cards, or a cash drop-off. These payment methods are preferred by attackers because of the anonymity they afford; most real emergencies allow for some form of traceable payment.