Author: Stephen Bondurich

If Your Employees Handle Sensitive Information, Read This First

Voice cloning attacks are no longer a theoretical threat. They're already showing up in corporate fraud, wire transfer scams, and credential theft campaigns. If your organization operates in healthcare, financial services, higher education, legal services, or manufacturing, your employees are handling exactly the kind of sensitive information that makes voice-based social engineering worth a criminal's time.

Security leaders evaluating their human risk exposure, especially those responsible for compliance with HIPAA, PCI DSS, GLBA, or state-level data protection requirements, need to understand what voice cloning is, how attackers are using it, and what a realistic test of your defenses actually looks like. This post walks through all three.

Introduction

By now, most of us know how to handle a suspicious phone call from a stranger. We are inundated with robocalls and scammers who so earnestly wish to discuss our car’s extended warranty. But what if a close friend or family member called you in distress? Would you be willing to comply if the request was spoken by a voice that you’ve known for years? Could you tell if it was real or fake?

These days, probably not.

Advancements in artificial intelligence (AI) have led to the birth of a new technology that can replicate any person’s voice and use that voice to realistically produce sounds, words, and full sentences.

Enter the audio deepfake.

What are Audio Deepfakes?

Audio deepfakes, also known as voice cloning, are audio samples that mimic the speech of a specific person. Unlike the robotic text-to-speech technology of old, audio deepfakes capture the speech patterns, intonations, and cadences that are unique to the individual.

Scammers are now leveraging audio deepfakes to impersonate the speech of trusted individuals for financial gain, propaganda, and more.

With only a short audio sample, attackers can generate convincing voice replicas that sound like executives, coworkers, or trusted vendors.

These synthetic voices are increasingly used in impersonation scams designed to manipulate employees into revealing sensitive information or transferring funds.

How Does Audio Deepfake Technology Work?

The recent boom in AI capabilities is what makes the production of audio deepfakes possible. (We have been trying to simulate human speech since well before AI came around though. And the Voder was doing a pretty good job 80 years ago.) Modern algorithms can now easily replicate not only the tone and pitch of a person's voice, but also subtle nuances such as accent and emotional inflections. This level of realism makes it increasingly challenging — and, in some cases, impossible — for listeners to discern between authentic recordings and artificial ones.

Audio deepfake technology employs deep learning models trained on vast datasets of human speech. These models analyze the acoustic features and linguistic patterns of a target voice, learning to replicate them with high fidelity. Advances in neural network architectures and computational power have enabled these systems to generate seamless voice deepfakes, capable of deceiving both human listeners and automated voice recognition systems.
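To make "acoustic features" concrete, the toy sketch below (illustrative only, not code from any actual cloning system) estimates one such feature, fundamental frequency (pitch), using simple autocorrelation on a synthetic tone. Real cloning models learn hundreds of features like this, at far higher fidelity, directly from recordings of the target speaker.

```python
import math

def estimate_pitch(samples, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate fundamental frequency (Hz) via autocorrelation.

    Pitch is one of the low-level acoustic features that voice-cloning
    models learn to reproduce from a target speaker's recordings.
    """
    # Only search lags that correspond to plausible speaking pitches.
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_corr = 0, 0.0
    for lag in range(min_lag, max_lag):
        # The signal correlates most strongly with itself when shifted
        # by exactly one pitch period.
        corr = sum(samples[i] * samples[i + lag]
                   for i in range(len(samples) - lag))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sample_rate / best_lag if best_lag else 0.0

# Synthesize a 200 Hz tone as a stand-in for a voiced speech frame.
rate = 8000
tone = [math.sin(2 * math.pi * 200 * t / rate) for t in range(1024)]
print(round(estimate_pitch(tone, rate)))  # prints 200
```

A cloning pipeline chains many such analyses together: a speaker encoder distills features like these into an embedding, and a vocoder then synthesizes new audio conditioned on that embedding.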

Originally developed for enhancing human-computer interactions and assisting communication for individuals with speech impediments, audio deepfakes are now being exploited for malicious purposes. By analyzing short audio snippets from various sources such as social media recordings, podcasts, or public speeches, attackers can synthesize new audio that sounds like a targeted individual and use it for highly convincing scams.

How Easy Is It to Create an Audio Deepfake?

In short, very easy. Creating an audio deepfake has become possible for anyone (not just Ethan Hunt). Numerous AI tools are available online, with capabilities and price points ranging from free to hundreds of dollars. Resemble.ai, Descript, and Realtime Voice Cloning are just a few of the many tools that are out there.

According to a report by McAfee, cybercriminals can generate convincing voice clones with as little as three seconds of genuine audio as reference. Naturally, the more genuine audio you can feed the tool, the more realistic the deepfake speech will be. In most cases, about three minutes of sample audio is considered the practical minimum for a high-quality clone.


What this means for your organization: If your executives present at conferences, appear in webinars, or post video content on LinkedIn, they have already provided attackers with more than enough raw audio to clone their voice. The attack surface is public-facing and largely unmanaged.

What Kinds of Scams are Audio Deepfakes Used For?

One of the most prevalent uses of audio deepfakes is in imposter scams. Scammers impersonate individuals the target trusts, such as family members, colleagues, or authority figures, over phone calls. The scammer plays audio that mimics the purported caller and presents an alarming scenario that requires help from the victim, usually in the form of a financial payment. The audio may claim that the caller is in legal trouble, medical distress, or physical danger, and that immediate payment is the only way to resolve the situation.

By manipulating victims' emotions and trust, the scammer can coerce them into divulging sensitive information or transferring money under these false pretenses. These scams prey on the vulnerability of victims who are caught off guard by the convincing nature of the fake voices.

Audio deepfakes have also been used to impersonate influential politicians and business leaders. Fake messages attributed to these trusted figures can influence public perception on matters such as war, the stock market, public policy, and elections. These attacks aim to shift mass public opinion and are directed at broad audiences rather than specific individuals.

The Corporate Vishing Threat

In a business context, vishing (i.e., voice-based phishing) has become one of the most effective initial access techniques used by attackers. An employee who would recognize and delete a suspicious email may still comply with a phone call that sounds like the CFO authorizing an emergency wire transfer, or IT support requesting credentials to resolve an urgent access issue.

This is not a hypothetical. Vishing attacks have successfully bypassed multi-factor authentication, facilitated fraudulent wire transfers in the millions, and provided initial footholds that led to full ransomware deployments. Organizations in financial services, healthcare, and legal services, where urgent requests and sensitive data exchanges are routine, are disproportionately targeted.

The question security leaders need to ask is not whether their employees could be targeted this way. It's whether they've ever been tested.

How Prevalent are Audio Deepfake Scams?

Audio deepfake scams have been on the rise in the wake of the AI revolution. According to the Federal Trade Commission (FTC), imposter scams, including those involving audio deepfakes, have seen a significant increase in recent years. In 2022, over 36,000 reports were filed, resulting in financial losses exceeding $11 million. McAfee reports that in 2023, one in ten surveyed individuals had been targeted by a voice cloning scam. That number is expected to rise with the increasing sophistication of deepfake tools.

How Can I Protect Myself from Audio Deepfake Scams?

To protect yourself from audio deepfake scams, follow the guidance below:

  1. Be skeptical: Scam scenarios are often designed to be jarring, urgent, and stressful, so as to impede rational thinking. If you get such a call, take a deep breath and proceed with skepticism.
  2. Don’t trust caller ID: Caller ID can easily be spoofed; a caller can choose any phone number they wish to appear on your caller ID. If this number is saved to a contact on your phone, that contact’s name will appear as well. Caller ID alone is not sufficient proof of identity.
  3. Hang up and call back: If you suspect caller ID spoofing, hang up the phone and initiate a phone call to a trusted phone number. A scammer cannot receive phone calls at a spoofed number.
  4. Devise a code word: Agree on a code word among your family or close friends to protect against impersonation scenarios. You can verify each other’s authenticity on the phone by asking for the established code word that only the trusted group knows. This might sound excessive, but it is simple and extremely effective.
  5. Be wary of form of payment: Exercise caution if the caller requests payment via bitcoin, wire transfer, gift cards, or a cash drop-off. These payment forms are preferred among attackers due to the anonymity they afford; most real emergency scenarios will at least offer some form of traceable payment.
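Among families the code word stays verbal, but an organization folding the same idea into a help-desk or callback workflow might script the check. The sketch below is a hypothetical illustration (the phrase, function names, and workflow are assumptions, not any vendor's product); it uses `hmac.compare_digest` so the comparison doesn't leak where a guess diverges:

```python
import hmac

def normalize(phrase: str) -> str:
    # Case-insensitive and whitespace-insensitive, so a spoken
    # "Blue Bird" matches a stored "bluebird".
    return "".join(phrase.casefold().split())

def verify_code_word(spoken: str, expected: str) -> bool:
    # compare_digest runs in constant time, avoiding timing side
    # channels that could help an attacker refine guesses.
    return hmac.compare_digest(normalize(spoken).encode(),
                               normalize(expected).encode())

print(verify_code_word("Blue Bird", "bluebird"))  # prints True
print(verify_code_word("bluebirb", "bluebird"))   # prints False
```

The verbal version works for the same reason this one does: the secret never appears in any public audio an attacker could clone from.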

What This Looks Like When You Test It

For organizations, individual awareness is a starting point, but it's not a security program. The only reliable way to know whether your employees, processes, and controls would hold up against a real vishing attack is to simulate one under controlled conditions.

Vishing penetration testing, also called a voice-based social engineering assessment, replicates the techniques attackers actually use: caller ID spoofing, pretexting, urgency scripting, and in some cases, AI-generated voice samples. The goal isn't to trick your employees for sport. It's to identify exactly where your human layer breaks down before a real attacker does.

A social engineering assessment of this type is commonly scoped as part of a broader external penetration test, or run as a standalone engagement when organizations have specific concerns about employee susceptibility to phone-based manipulation. Results typically feed directly into security awareness training, process controls around wire transfer authorization, and privilege escalation procedures.

For security leaders at regulated organizations, particularly those navigating HIPAA Security Rule requirements, GLBA Safeguards Rule obligations, or SOC 2 audit preparation, social engineering testing is increasingly expected, not optional. If you're in the process of evaluating vendors for this type of engagement, the scope conversation matters as much as the methodology.

Closing Thoughts

Voice cloning technology has matured faster than most organizations' defenses against it. The tools are cheap, the raw audio is freely available, and the human targets are the same employees your security controls are designed to protect.

The good news is that this is a testable problem. You can know where your exposure is before an attacker finds it first.

Interested in understanding your organization's exposure to vishing and social engineering attacks? Learn about SEVN-X's penetration testing and social engineering assessment services.
