As AI-generated voice fraud accelerates - with 1 in 4 calls reviewed by Hiya now containing AI-generated audio - the ability to reliably detect synthetic speech has never been more critical. Hiya's deepfake voice detection technology, built on years of research in speech and signal processing, is now independently validated as the top-performing system on one of the field's most rigorous public benchmarks.
As of February 2026, Hiya's speech deepfake detection system ranks #1 in the Average Result and #2 in the Pool Result on the Speech Deepfake Arena Leaderboard hosted on Hugging Face (see Figure 1). The Pool Result is the leaderboard's primary ranking metric, computed over all evaluation samples combined, while the Average Result - where Hiya holds the top position - reflects balanced generalization across all 14 datasets with equal weight. Our model achieves this with only 1 billion parameters - one-third the size of the next leading system's 3 billion - and operates at approximately 8× real-time speed in streaming mode, demonstrating that efficiency and accuracy are not mutually exclusive, particularly under the real-world telephony conditions central to Hiya's deployment environment.
Equal Error Rate of Top 3 Leading Systems (Lower is Better)
Figure 1. Comparison of Average and Pool Equal Error Rate (EER) for the top three systems on the Hugging Face Speech Deepfake Arena. Hiya achieves the lowest Average EER, demonstrating strong and balanced generalization across all 14 evaluation datasets.
Beyond leaderboard position, it is also important to consider the tradeoff between detection performance and system size. Figure 2 compares the model sizes of Hiya's system and the next leading system on the benchmark.
System Size (Lower is Better)
Figure 2. Model size (number of parameters) for the two highest-ranked systems in the Speech Deepfake Arena. Hiya achieves the best Average EER with a model one-third the size of the next leading system.
Key results at a glance:
| Metric | Hiya's Score |
|---|---|
| Average EER Ranking | #1 |
| Pool EER Ranking | #2 |
| Average EER | 2.113% |
| Average Accuracy | 97.88% |
| Average F1-Score | 0.954 |
| Model Size | ~1B parameters |
| Processing Speed | 8x real-time |
| Datasets Evaluated | 14 |
| #1 rankings (individual datasets) | 4 of 14 |
What Is the Speech Deepfake Arena?
The Hugging Face Speech Deepfake Arena is an open, continuously updated benchmark for evaluating how accurately AI systems distinguish real human speech from synthetic or manipulated audio. For enterprises, carriers, and security teams evaluating deepfake detection vendors, it provides one of the most transparent and reproducible comparison frameworks available today.
Who Built It
The Speech Deepfake (DF) Arena is an academic initiative developed and maintained by an international collaboration of researchers. It is described in the paper "Speech DF Arena: A Leaderboard for Speech DeepFake Detection Models" (arXiv).
The contributing institutions include Tallinn University of Technology (Estonia), Mohamed bin Zayed University of Artificial Intelligence (UAE), Idiap Research Institute (Switzerland), CNRS/IRISA (France), and Validsoft Ltd. (UK).
The Arena is part of a broader movement toward open, standardized AI benchmarking. Similar arena-style leaderboards exist for image deepfake detection, large language models, automatic speech recognition, and text-to-speech systems. These initiatives aim to provide transparent, reproducible evaluation frameworks that foster fair comparison and accelerate progress across the field.
Leaderboard: Speech Deepfake Arena on Hugging Face
The Detection Challenge: Real vs. Synthetic Speech
The core task is binary classification - distinguishing bona fide (real) human speech from synthetic or manipulated speech generated through text-to-speech (TTS), voice conversion (VC), and other signal-level manipulations including neural vocoder re-synthesis and neural codec-based transformations.
This is the same challenge that arises in real-world voice fraud scenarios: a caller claims to be a CEO, a family member, or a bank representative, and the system must determine whether the voice is authentic or AI-generated - often in real time, over degraded telephone channels.
Evaluation Datasets: 14 Benchmarks Spanning Real-World Conditions
The Speech Deepfake Arena brings together 14 datasets designed to represent real-world attack scenarios, established academic benchmarks, and emerging generative speech technologies. In total, the collection comprises more than 2 million audio files.
ASVspoof Challenge Series
ASVspoof2019, ASVspoof2021 LA (Logical Access), ASVspoof2021 DF (DeepFake), and ASVspoof2024 - established anti-spoofing benchmarks covering diverse synthesis methods, environments, transmission channels (including narrowband telephone, VoIP codecs, and digital communication pipelines), compression artifacts, and adversarial attacks.
ADD Challenge Series
ADD2022 Track 1, ADD2022 Track 3, ADD2023 Round 1, and ADD2023 Round 2 - challenging recording conditions with varying levels of noise and audio quality, including digital and narrowband degradation.
Real-World Dataset
In-The-Wild: real-world YouTube samples representing uncontrolled environments and realistic background noise - the conditions most representative of what detection systems face in production.
Vocoder-Based Synthesis
LibriSeVOC: audio processed through modern neural vocoders common in text-to-speech systems.
Academic Datasets: State-of-the-Art TTS and Voice Conversion
DFADD, SONAR, and Fake or Real - research-grade generative systems under clean conditions.
Neural Codec-Based Processing
CodecFake: audio processed through neural codecs used in modern generative and communication systems.
Together, these datasets span controlled and real-world conditions, clean and noisy recordings, classical spoofing attacks, neural generation pipelines, and codec-based processing.
How Detection Performance Is Measured
Speech deepfake detection is fundamentally a binary classification problem: each audio sample must be classified as either bona fide (genuine human speech) or deepfake (synthetic or manipulated speech). In the Arena's evaluation protocol, bona fide speech is defined as the positive class, meaning all metrics are computed with respect to correctly identifying genuine human speech.
Equal Error Rate (EER) - The Primary Metric
Equal Error Rate is the standard metric in anti-spoofing research. It is defined as the operating point where the False Acceptance Rate (FAR) equals the False Rejection Rate (FRR). A lower EER indicates better separation between genuine and synthetic speech score distributions. Graphically, a lower EER corresponds to less overlap between the real and fake distributions.
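To make the definition concrete, here is a minimal, illustrative sketch of how EER can be computed from classifier scores (toy code, not Hiya's production implementation; it assumes higher scores mean "more likely bona fide"):

```python
import numpy as np

def compute_eer(bona_scores, spoof_scores):
    """Find the operating point where the False Acceptance Rate (FAR)
    equals the False Rejection Rate (FRR) and report their midpoint."""
    bona_scores = np.asarray(bona_scores, dtype=float)
    spoof_scores = np.asarray(spoof_scores, dtype=float)
    # Candidate thresholds: every observed score.
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    eer, best_gap = 1.0, np.inf
    for t in thresholds:
        far = np.mean(spoof_scores >= t)  # deepfake accepted as real
        frr = np.mean(bona_scores < t)    # real speech rejected as fake
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With perfectly separated score distributions the EER is 0; as the real and fake distributions overlap, the EER grows toward 50%.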
Accuracy
Accuracy measures the percentage of correctly classified samples across both classes (bona fide and deepfake). While intuitive, accuracy alone does not fully capture the trade-off between false acceptances and false rejections.
Accuracy = Correct Predictions / Total Samples
F1-Score
The F1-score is the harmonic mean of Precision and Recall.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Because bona fide speech is the positive class, the F1-score reflects how well the system balances two critical goals: avoiding false acceptance of deepfakes as real speech, and avoiding rejection of legitimate human speech. A high F1-score indicates strong overall balance between reliability and security.
For completeness, Precision measures how often speech predicted as real is truly real (Precision = TP / (TP + FP)), while Recall measures how effectively legitimate speech is correctly accepted (Recall = TP / (TP + FN)). These component metrics are not separately reported on the leaderboard.
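The definitions above can be summarized in a short, self-contained sketch (illustrative only), with bona fide speech encoded as the positive class exactly as in the Arena's protocol:

```python
def precision_recall_f1(y_true, y_pred):
    """y_true / y_pred use 1 = bona fide (positive class), 0 = deepfake."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # deepfake accepted as real
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # real speech rejected
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Note that false positives here are deepfakes accepted as real (a security failure), while false negatives are legitimate callers rejected (a reliability failure) - the two costs the F1-score balances.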
Average Result vs. Pool Result
The leaderboard reports two aggregate metrics that together provide complementary perspectives on robustness:
- Average Result averages performance across datasets with equal weight per dataset, reflecting balanced generalization regardless of dataset size. Hiya ranks #1.
- Pool Result is computed over all evaluation samples combined, where larger datasets contribute more samples. This is the leaderboard's primary ranking metric. Hiya ranks #2.
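The distinction is easiest to see with invented numbers. In this sketch the Pool Result is approximated as a sample-weighted mean of per-dataset EERs (the leaderboard actually pools all scores before computing one EER, which is not identical but behaves similarly): a large, easy dataset pulls the pooled figure down while the unweighted average treats every dataset the same.

```python
# Hypothetical per-dataset EERs (percent) and sample counts -- illustrative only.
datasets = {"A": (1.0, 100_000), "B": (5.0, 10_000), "C": (3.0, 1_000)}

# Average Result: every dataset counts equally, regardless of size.
average = sum(eer for eer, _ in datasets.values()) / len(datasets)

# Pool Result (approximated as a size-weighted mean): larger datasets
# contribute proportionally more samples to the combined evaluation.
total = sum(n for _, n in datasets.values())
pooled = sum(eer * n for eer, n in datasets.values()) / total
```

Here the average is 3.0% while the pooled figure sits near 1.4%, dominated by the large dataset - which is why reporting both metrics gives a more complete picture of robustness.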
Hiya's Detailed Results Across All 14 Datasets
To provide a transparent and comprehensive view of performance, we report results at three levels: overall aggregated metrics, per-category averages, and individual dataset results. These demonstrate not only strong average performance but also consistent robustness across diverse recording conditions, synthesis methods, and processing pipelines.
Overall Aggregated Performance
| Metric | Pooled | Average |
|---|---|---|
| EER (%) | 2.324 | 2.113 |
| Accuracy (%) | 97.68 | 97.88 |
| F1-Score | 0.950 | 0.954 |
An Average EER of 2.11% across 14 datasets demonstrates stable, well-balanced discrimination performance across diverse attack types and recording conditions. Accuracy above 97.5% for both Average and Pool evaluations is an exceptional result for a benchmark that includes noisy environments, telephone and digital transmission channels, compression artifacts, and adversarial attacks designed specifically to challenge detection systems. Together with an F1 above 0.95, these metrics confirm strong performance and robustness across highly heterogeneous evaluation scenarios.
Performance by Dataset Category
| Category | Average EER (%) | Accuracy (%) | F1-Score |
|---|---|---|---|
| ASVspoof | 0.853 | 99.145 | 0.933 |
| ADD | 4.820 | 95.183 | 0.933 |
| Real World | 0.667 | 99.330 | 0.990 |
| Academic Datasets | 0.166 | 99.763 | 1.000 |
| Vocoders | 0.003 | 99.990 | 1.000 |
| Neural Codecs | 5.733 | 94.270 | 0.910 |
Average EER remains below 1% in all categories except ADD and Neural Codecs, demonstrating strong discrimination across most evaluation scenarios.
Performance is particularly strong in the Real-World (In-The-Wild) category, where the system achieves an EER of 0.667%, accuracy above 99%, and an F1-score above 0.99. This is especially significant because real-world data includes uncontrolled recording conditions, diverse microphones, and unpredictable background noise - conditions that closely resemble operational deployment environments.
The ADD datasets are considerably more challenging due to noisier and more degraded recording conditions. As audio quality deteriorates, performance naturally degrades - an expected behavior for any detection system operating under adverse acoustic environments. In real-world deployments, Hiya integrates strict quality filtering mechanisms that prevent unreliable decisions by removing extremely noisy samples, non-speech segments, only-noise recordings, and very low-quality audio where confident classification cannot be made.
For the Neural Codecs category, the results reflect a deliberate design decision. The system is intentionally not trained to treat certain neural codec transformations as inherently synthetic, since these codecs are increasingly used in modern commercial communication platforms. In real-world scenarios, genuine speech encoded with such codecs must still be recognized as bona fide. Even under these design constraints, accuracy exceeds 94% and F1-score exceeds 0.90 across all categories.
Telephony-Specific Robustness
Many datasets in the Arena include telephone transmission channels, compression artifacts, and bandwidth limitations - conditions that closely mirror real-world voice communication. This is reflected in both the ASVspoof and ADD challenges, where Hiya achieves leading results.
Hiya's system is specifically optimized for telephony use cases. Through targeted training data selection and extensive telephone channel simulation, the model is designed to perform reliably under narrowband and wideband telephone conditions, low-bitrate codecs commonly used in VoIP and cellular communications, and the transmission artifacts and signal degradation typical of live calls. Strong performance under these conditions is particularly relevant for fraud prevention, call authentication, and voice security in live communication systems.
Individual Dataset Results (EER %)
To provide deeper insight into per-dataset behavior, Figure 3 compares Equal Error Rate (EER) for the top three systems on a representative subset of seven datasets spanning real-world conditions, telephony-like channels, noisy environments, and modern generative pipelines. Full results across all 14 datasets are reported in the table below.
Figure 3. Per-dataset Equal Error Rate (EER) for the top three systems on a representative 7-dataset subset of the Speech Deepfake Arena. Source: Hugging Face Speech Deepfake Arena leaderboard (snapshot: February 2026). Visualization: Hiya.
Across these datasets, Hiya consistently ranks within the top tier and leads in several scenarios. Differences between the top systems are often narrow, reinforcing the competitiveness of the benchmark, while Hiya’s #1 Average Result reflects the strongest overall balanced generalization across all 14 datasets.
| Dataset | EER (%) | Accuracy (%) | F1-Score | #1 in EER |
|---|---|---|---|---|
| In the Wild | 0.667 | 99.33 | 0.99 | |
| ASVspoof2019 | 0.301 | 99.70 | 0.99 | |
| ASVspoof2021 LA | 1.006 | 98.99 | 0.95 | |
| ASVspoof2021 DF | 1.318 | 98.68 | 0.81 | |
| ASVspoof2024 Eval | 0.787 | 99.21 | 0.98 | |
| Fake or Real | 0.000 | 99.80 | 1.00 | ✅ |
| Codecfake | 5.733 | 94.27 | 0.91 | |
| ADD 2022 Track 1 | 12.099 | 87.90 | 0.81 | |
| ADD 2022 Track 3 | 1.188 | 98.81 | 0.96 | ✅ |
| ADD 2023 R1 | 1.976 | 98.02 | 0.99 | ✅ |
| ADD 2023 R2 | 4.006 | 96.00 | 0.97 | ✅ |
| DFADD | 0.017 | 99.95 | 1.00 | |
| LibriSeVoc | 0.003 | 99.99 | 1.00 | |
| SONAR | 0.481 | 99.54 | 1.00 |
The most challenging scenario is ADD 2022 Track 1 (12.099% EER), followed by ADD 2023 R2, both characterized by significantly degraded and noisy recording conditions. These datasets represent extreme acoustic variability and signal degradation, making them particularly demanding benchmarks for any detection system.
Despite these difficult cases, Hiya achieves strong and consistent performance across the majority of datasets. Near-perfect detection is achieved on multiple benchmarks, including Fake or Real, DFADD, and LibriSeVoc, where EER approaches zero and both Accuracy and F1-score reach their maximum values.
Hiya achieves the #1 ranking in EER on four individual datasets. Across the remaining benchmarks, performance remains highly competitive and firmly within the top tier of submitted systems.
Together, these results demonstrate consistent generalization across heterogeneous synthesis methods, transmission conditions, and real-world acoustic environments.
How Hiya Achieves These Results: Methodology and Trust Framework
Hiya's performance across diverse datasets is the result of a deliberate, research-driven development strategy. Our robustness is built on three pillars.
Continuous Architectural Innovation
Our research team continuously updates and refines system architecture to incorporate state-of-the-art advances in speech modeling and deepfake detection. Rather than relying on static models, we evolve the system in response to emerging generative techniques and newly observed attack vectors. This ensures resilience as synthesis technologies rapidly advance.
Careful Training Data Selection
High-performance detection depends critically on data quality. We apply rigorous selection criteria to ensure training data represents diverse attack types, varied and realistic recording conditions, and genuine speech from real-world communication environments.
Beyond publicly available benchmarks, Hiya leverages proprietary datasets specifically curated for telephony environments, covering narrowband and wideband telephone channels, VoIP infrastructures, and a broad variety of commercial codecs and compression schemes. This setup inherently captures key telecommunication effects - packet loss, compression artifacts, channel filtering, bandwidth limitations, and transmission distortions.
By combining public benchmarks with internally developed telephone-focused datasets, we reduce overfitting to narrow laboratory scenarios and strengthen generalization across operational environments.
Sophisticated Data Augmentation Pipeline
A key differentiator is our advanced data augmentation framework, designed to generalize to unseen conditions - particularly those encountered in telephony environments.
Our augmentation pipeline includes telephone channel simulation (with packet loss, jitter, bandwidth limitations, bit-rate variability, codec distortion, and channel filtering), a wide variety of audio codecs and compression schemes, variable recording qualities and transmission channels, background noise and music overlays, impulsive noise injection, replay attack simulation, and adversarial perturbation scenarios.
This telephony-aware augmentation strategy ensures that performance remains stable even in low-quality, bandwidth-constrained, or transmission-impaired environments - conditions that frequently arise in live communication systems.
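To illustrate the idea (this is a minimal sketch, not Hiya's pipeline), a basic narrowband telephone-channel augmentation can be built from a band-pass filter, a downsample to 8 kHz, and additive noise at a target SNR; a production system would also model codecs, packet loss, and jitter:

```python
import numpy as np
from scipy.signal import butter, lfilter, resample_poly

def simulate_narrowband_channel(audio, sr=16_000, snr_db=20.0, seed=0):
    """Toy telephone-channel augmentation: band-limit to roughly
    300-3400 Hz, downsample to 8 kHz, and add white noise at snr_db."""
    # Band-pass to the classic narrowband telephone band.
    b, a = butter(4, [300, 3400], btype="band", fs=sr)
    band_limited = lfilter(b, a, audio)
    # Downsample 16 kHz -> 8 kHz, as in PSTN transmission.
    narrowband = resample_poly(band_limited, up=1, down=2)
    # Add white noise scaled so the signal-to-noise ratio is snr_db.
    rng = np.random.default_rng(seed)
    signal_power = np.mean(narrowband**2) + 1e-12
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), narrowband.shape)
    return narrowband + noise
```

Training on clean audio plus many such degraded variants is what lets a detector keep discriminating real from synthetic speech after the signal has passed through a phone network.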
Trustworthiness as a Core Design Principle
At Hiya, we believe the most important characteristic of any AI system is trustworthiness. Beyond raw model performance, we emphasize careful threshold selection, operational stability, and customer-adjustable detection sensitivity.
Our customers can configure the system according to their specific risk tolerance - choosing the appropriate balance between minimizing false acceptance of deepfakes and minimizing false rejection of legitimate speech.
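Conceptually, tuning sensitivity means picking the decision threshold from the score distributions. The sketch below (illustrative only; the function name and quantile approach are our own, not Hiya's API) chooses a threshold so that at most a target fraction of known deepfake samples would be accepted as real:

```python
import numpy as np

def threshold_for_target_far(spoof_scores, target_far=0.01):
    """Pick the decision threshold so roughly `target_far` of deepfake
    samples score above it (i.e., would be accepted as bona fide)."""
    # Scores above the threshold are accepted as real, so take a high
    # quantile of the spoof-score distribution.
    return float(np.quantile(np.asarray(spoof_scores), 1.0 - target_far))
```

A security-critical deployment would pick a small `target_far` (few deepfakes slip through, more legitimate calls flagged); a convenience-oriented one would relax it.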
In real-world deployment, our system includes strict quality control filters that remove extremely noisy samples, non-speech audio, only-noise recordings, and severely degraded samples where reliable classification is not possible. By abstaining from low-confidence decisions, the system preserves reliability and avoids misleading outputs.
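A simple abstention gate of this kind might look like the following sketch (hypothetical thresholds and checks; Hiya's production filters for noise, non-speech, and quality are more involved):

```python
import numpy as np

def should_abstain(audio, sr, min_duration_s=1.0, min_rms=1e-3):
    """Toy quality gate: abstain on clips that are too short or too
    quiet/noise-floor-level to classify reliably."""
    if len(audio) / sr < min_duration_s:
        return True   # too little audio to judge
    rms = np.sqrt(np.mean(np.asarray(audio, dtype=float) ** 2))
    if rms < min_rms:
        return True   # near-silent or only-noise recording
    return False
```

Samples that pass the gate proceed to classification; samples that fail it produce no verdict at all, which is what keeps low-confidence outputs from reaching downstream decisions.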
Efficiency Meets Robustness: Why Model Size Matters
Hiya's top-tier Arena performance - including the #1 Average ranking and #2 Pool ranking - is achieved with a model one-third the size of the next leading system, demonstrating that efficiency and robustness are not mutually exclusive. Our approach combines state-of-the-art detection performance with a low computational footprint and deployment-ready architectural design, alongside operational trust safeguards and adaptability to rapidly evolving generative threats.
In production environments, the system operates at approximately 8× real-time in streaming mode, meaning it can process eight seconds of audio in one second while analyzing live calls. This enables low-latency detection, horizontal scalability, and reduced infrastructure costs for enterprises and carriers deploying voice security at scale.
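The real-time factor (RTF) claim is simple arithmetic, shown here with illustrative numbers:

```python
def real_time_factor(audio_seconds, processing_seconds):
    """Seconds of audio analyzed per second of compute; a value
    greater than 1 means faster than real time."""
    return audio_seconds / processing_seconds

# At ~8x real time, one second of compute covers eight seconds of audio,
# so a 30-second call segment needs about 30 / 8 = 3.75 s of compute.
rtf = real_time_factor(8.0, 1.0)
segment_compute = 30.0 / rtf
```

In streaming mode this headroom is what allows a single instance to analyze many concurrent calls with low added latency.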
We recognize and appreciate the outstanding work being carried out by other research and industrial teams contributing to the Speech Deepfake Arena and the broader speech security community. The rapid progress in this field is the result of a collaborative ecosystem of researchers, engineers, and practitioners pushing the boundaries of detection robustness. Open benchmarks like the Speech Deepfake Arena play a critical role in accelerating innovation and establishing transparent performance standards. We are proud to contribute to this effort.
As deepfake technologies continue to advance, Hiya remains committed to delivering secure, trustworthy, efficient, and future-proof voice protection solutions.
Frequently Asked Questions
What is the best speech deepfake detection system in 2026?
As of February 2026, Hiya's speech deepfake detection system ranks #2 (Pool) and #1 (Average) on the Hugging Face Speech Deepfake Arena, an independent academic benchmark evaluating detection accuracy across 14 diverse datasets. The Pool Result is the leaderboard's primary ranking metric, while the Average Result reflects balanced generalization across all datasets. Hiya achieves these results with a 1B-parameter model - one-third the size of the next leading system.
How accurate is AI voice deepfake detection?
Hiya's detection system achieves over 97% accuracy across 14 benchmark datasets that include clean recordings, noisy real-world conditions, telephone channels, and adversarial attacks. On multiple individual datasets, accuracy exceeds 99% with near-zero error rates.
Can deepfake voice detection work over phone calls?
Yes. Hiya's system is specifically optimized for telephony environments, including narrowband and wideband telephone channels, VoIP, and mobile networks. The model is trained on proprietary telephony datasets and uses extensive telephone channel simulation to maintain reliability under the compression artifacts, packet loss, and signal degradation typical of live calls.
How does the Hugging Face Speech Deepfake Arena work?
The Speech Deepfake Arena is an open academic benchmark hosted on Hugging Face that evaluates how well AI systems distinguish real human speech from synthetic or manipulated audio. It aggregates 14 datasets spanning controlled lab conditions, real-world recordings, telephone channels, and cutting-edge generative techniques. Systems are ranked by Equal Error Rate (EER), Accuracy, and F1-score.
What is Equal Error Rate (EER) in deepfake detection?
Equal Error Rate is the primary metric in speech anti-spoofing research. It represents the operating point where the false acceptance rate (mistakenly accepting a deepfake as real) equals the false rejection rate (mistakenly rejecting real speech as fake). A lower EER indicates better separation between genuine and synthetic speech - Hiya achieves an Average EER of 2.113% across 14 datasets.
Learn more about Hiya's deepfake voice detection capabilities: Hiya AI Voice Detection | Contact Us