Structuring ATC Transmissions: A Review
A summary of research to date on speech-to-text for pilot-controller communications.
Several university and private-sector organizations have published research on Air Traffic Control (ATC) speech-to-text. This is a meta-analysis of that research. Importantly, much of it predates 2021 and may not reflect the current state of the art.
Noisy Audio Environment
Challenge
ATC communications occur over VHF radio channels (118-137 MHz) with significant background noise and interference. The average signal-to-noise ratio (SNR) in ATC recordings ranges from 5-15 dB, compared to 20-30 dB in typical ASR datasets [1]; a 15 dB gap corresponds to roughly 30 times more noise power relative to the signal. This low SNR is due to several factors:
Aircraft noise
Electromagnetic interference from equipment
Atmospheric conditions affecting radio transmission
Multiple speakers sharing the same frequency, and noise from adjacent frequency bands (notably, FM broadcast radio)
Solutions
Noise-robust Acoustic Modeling: Multi-condition training exposed models to various noise conditions; studies showed this reduced word error rates (WER) by 20-30% in noisy environments [2]. This approach involved the following (a minimal augmentation sketch follows the list):
Augmenting clean speech data with various noise types
Training on a mix of clean and noisy samples
Using noise-aware training techniques that estimate noise conditions during inference
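To make the augmentation step concrete, here is a minimal sketch in Python, using synthetic NumPy arrays in place of real recordings; the mix_at_snr helper and the 5-15 dB range are illustrative choices, not code from the cited work:

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise recording into clean speech at a target SNR (dB)."""
    # Loop or trim the noise so it covers the whole utterance.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Multi-condition training: each clean utterance reappears at several SNRs
# drawn from the 5-15 dB range reported for ATC channels.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)  # stand-in for a 1 s, 16 kHz utterance
noise = rng.standard_normal(8000)   # stand-in for cockpit/static noise
augmented = [mix_at_snr(clean, noise, rng.uniform(5, 15)) for _ in range(4)]
```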
Speech Enhancement: Techniques like spectral subtraction and Wiener filtering improved SNR by 3-5 dB, leading to a 10-15% reduction in WER [3]. Advanced methods included (a spectral-subtraction sketch follows the list):
Deep neural network-based speech enhancement
Time-frequency masking techniques
Adaptive noise cancellation algorithms
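The following is a minimal sketch of classic magnitude spectral subtraction using SciPy's STFT; the frame count used for the noise estimate and the flooring constant are illustrative assumptions:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy: np.ndarray, fs: int, noise_frames: int = 10) -> np.ndarray:
    """Magnitude spectral subtraction with a noise-floor estimate taken
    from the first few (assumed speech-free) frames."""
    _, _, Z = stft(noisy, fs=fs, nperseg=512)
    mag, phase = np.abs(Z), np.angle(Z)

    # Estimate the noise magnitude spectrum from the leading frames.
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # Subtract, then floor at a small fraction of the noise estimate to
    # limit "musical noise" artifacts.
    enhanced_mag = np.maximum(mag - noise_mag, 0.05 * noise_mag)

    _, enhanced = istft(enhanced_mag * np.exp(1j * phase), fs=fs)
    return enhanced
```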
End-to-End Models: Models like wav2vec 2.0, when fine-tuned on domain-specific ATC data, showed WER reductions of up to 25% compared to traditional ASR systems in noisy conditions [4]. Key advantages included (a fine-tuning sketch follows the list):
Learning noise-robust features directly from raw waveforms
Leveraging self-supervised pre-training on large unlabeled datasets
Adapting to domain-specific noise characteristics through fine-tuning
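A hedged sketch of the fine-tuning setup using the Hugging Face transformers implementation of wav2vec 2.0; the checkpoint name, the synthetic audio, and the sample transcript are placeholders, and a real recipe would iterate over an ATC dataset with an optimizer loop:

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Start from a LibriSpeech-trained checkpoint and adapt it to ATC audio.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Freezing the convolutional feature extractor is common when the
# in-domain dataset is small.
model.freeze_feature_encoder()

# One illustrative training step on a synthetic 2 s clip at 16 kHz.
audio = torch.randn(32000).numpy()
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor(text="CONTACT TOWER ONE ONE EIGHT DECIMAL ONE",
                   return_tensors="pt").input_ids
loss = model(inputs.input_values, labels=labels).loss
loss.backward()  # a real recipe would wrap this in an optimizer loop
```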
Accented and Non-native English Speech
Challenge
English is the global language of aviation communications. Still, ATC communications involve speakers from over 190 countries, leading to a wide range of accents and English proficiency levels [5]. Specific issues include:
Variation in pronunciation of key terms and numbers
Differences in intonation and stress patterns
Non-standard grammatical structures used by non-native speakers
Solutions
Multi-accent Training Data: The ATCO2 corpus includes recordings from 12 countries and 10 airports, providing a diverse dataset for training accent-robust models [6]. Benefits include:
Exposure to a wide range of accent variations
Improved generalization to unseen accents
Better handling of pronunciation variations for critical ATC terms
Accent Adaptation: Techniques using i-vectors or x-vectors as additional input features have shown WER reductions of 10-15% for accented speech [7]. Methods used:
Capturing speaker-specific characteristics in a low-dimensional space
Allowing the ASR model to adapt to individual speaker traits
Improving recognition of accented pronunciations
Multilingual Models: wav2vec-XLSR, trained on 53 languages, demonstrated a 20% relative WER reduction on accented English compared to monolingual models [8]. Advantages of this approach:
Learning language-independent speech representations
Improved robustness to different phonetic systems
Better handling of code-switching and foreign words in ATC communications
Domain-specific Vocabulary and Phraseology
Challenge
ATC communications follow ICAO phraseology, including specialized terminology and alphanumeric codes not found in general-purpose ASR training data [9]. Specific challenges:
Unique callsigns and waypoint names
Standardized phrases with precise meanings
Alphanumeric codes for runways, flight levels, and headings
Solutions
Custom Language Models: Language models trained specifically on ATC communications reduced perplexity by 40-50% compared to general-purpose models [10]. Techniques (a toy n-gram sketch follows the list):
N-gram models trained on large corpora of ATC transcripts
Neural language models fine-tuned on ATC data
Incorporating ATC grammar rules into the language model structure
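As a toy illustration of why domain-matched language models help, the following pure-Python bigram model assigns far lower perplexity to ATC phraseology than to ordinary English; the three-line corpus and the smoothing constant are invented for the example:

```python
import math
from collections import Counter

# Toy ATC corpus; real systems train on large transcript collections.
corpus = [
    "speedbird one two three climb flight level three four zero",
    "lufthansa four five six contact approach one one nine decimal five",
    "speedbird one two three contact tower one one eight decimal one",
]

tokens = [w for line in corpus for w in f"<s> {line} </s>".split()]
unigrams, bigrams = Counter(tokens), Counter(zip(tokens, tokens[1:]))
vocab = len(unigrams)

def bigram_logprob(prev: str, word: str, k: float = 0.5) -> float:
    # Add-k smoothed conditional probability P(word | prev).
    return math.log((bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab))

def perplexity(sentence: str) -> float:
    words = f"<s> {sentence} </s>".split()
    lp = sum(bigram_logprob(p, w) for p, w in zip(words, words[1:]))
    return math.exp(-lp / (len(words) - 1))

# Domain-matched phrasing scores far better than out-of-domain text.
print(perplexity("speedbird one two three contact approach"))
print(perplexity("the quick brown fox jumps over the lazy dog"))
```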
Contextual Information Integration: Incorporating flight plan data and valid callsign lists through lattice rescoring or shallow fusion showed WER reductions of 15-20% for critical ATC entities [11]. This involved the following (a rescoring sketch follows the list):
Real-time integration of flight data into the ASR decoding process
Dynamically updating language model probabilities based on active flights
Biasing the ASR output towards valid ATC commands and callsigns
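A minimal sketch of the biasing idea at the n-best rescoring level, assuming an upstream decoder that returns hypotheses with scores; the callsign list, bonus value, and example hypotheses are invented:

```python
# Callsigns currently active in the sector, e.g. fed from surveillance data.
ACTIVE_CALLSIGNS = {"speedbird one two three", "lufthansa four five six"}

def rescore(nbest: list[tuple[str, float]], bonus: float = 2.0) -> tuple[str, float]:
    """Boost hypotheses containing a callsign that is currently airborne."""
    rescored = []
    for text, score in nbest:
        if any(cs in text for cs in ACTIVE_CALLSIGNS):
            score += bonus  # shallow-fusion style additive bias
        rescored.append((text, score))
    return max(rescored, key=lambda pair: pair[1])

nbest = [
    ("speedbird one two free climb flight level three four zero", -10.2),
    ("speedbird one two three climb flight level three four zero", -10.9),
]
print(rescore(nbest))  # the valid-callsign hypothesis wins
```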
Named Entity Recognition (NER): ATC-specific NER models achieved F1 scores of over 0.95 for entities like callsigns and commands, significantly improving information extraction from ASR output [12]. Key aspects (an illustrative extractor follows the list):
Training on large datasets of annotated ATC transcripts
Using domain-specific entity types (e.g., callsign, altitude, heading)
Integrating NER with ASR to improve recognition of critical entities
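In lieu of a trained tagger, the following pattern-based extractor illustrates the entity inventory involved; the airline lexicon and patterns are simplified assumptions, not a production NER model:

```python
import re

# Spelled-out digits as they appear in ATC transcripts.
NUM = r"(?:zero|one|two|three|four|five|six|seven|eight|niner|nine)"
AIRLINE = r"(?:speedbird|lufthansa|ryanair|delta)"  # toy airline lexicon

PATTERNS = {
    "callsign":     rf"\b{AIRLINE}(?: {NUM})+\b",
    "flight_level": rf"\bflight level(?: {NUM}){{3}}\b",
    "heading":      rf"\bheading(?: {NUM}){{3}}\b",
}

def extract_entities(transcript: str) -> dict:
    return {label: re.findall(pattern, transcript)
            for label, pattern in PATTERNS.items()}

text = ("lufthansa four five six turn left heading two seven zero "
        "climb flight level three four zero")
print(extract_entities(text))
```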
Real-time Processing Requirements
Challenge
ATC communications require real-time processing, with maximum acceptable latency typically under 500 milliseconds [13]. This constraint is due to:
Safety-critical nature of ATC communications
Need for immediate response to instructions
High volume of communications in busy airspace
Solutions
Model Compression: Techniques like knowledge distillation and pruning achieved 5-10x model size reduction with less than 2% WER increase [14]. Specific methods (a distillation-loss sketch follows the list):
Teacher-student training for knowledge distillation
Iterative pruning of less important network connections
Quantization of model weights to reduce memory footprint
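A minimal PyTorch sketch of the teacher-student distillation objective; the temperature, mixing weight, and toy tensors are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend hard-label cross-entropy with soft-target KL divergence."""
    # Soft targets: teacher distribution at a raised temperature.
    soft_t = F.log_softmax(teacher_logits / temperature, dim=-1)
    soft_s = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_s, soft_t, log_target=True, reduction="batchmean")
    kd = kd * temperature ** 2  # standard gradient-scale correction

    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage: 8 frames, 30 output classes.
teacher = torch.randn(8, 30)
student = torch.randn(8, 30, requires_grad=True)
labels = torch.randint(0, 30, (8,))
distillation_loss(student, teacher, labels).backward()
```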
Efficient Architectures: Time-Delay Neural Networks (TDNNs) and Conformer models showed 30-40% inference speed improvements over traditional RNN-based models [15]. Key features:
Parallelizable computations for faster processing
Efficient modeling of long-range dependencies
Reduced computational complexity compared to recurrent architectures
Hardware Acceleration: Using GPUs or TPUs for inference can provide 10-20x speedup compared to CPU-only processing, enabling real-time performance for more complex models [16]. Considerations:
Optimizing models for specific hardware accelerators
Utilizing low-precision arithmetic for faster computation
Leveraging batching techniques to maximize throughput
Limited Availability of Large Annotated Datasets
Challenge
Annotating ATC communications requires domain experts and is time-consuming. Historically, 1 hour of raw audio required 8-10 hours to transcribe accurately [17]. Issues:
Need for expertise in ATC phraseology and procedures
Difficulty in accurately transcribing noisy recordings
High cost of employing domain experts for annotation
Solutions
Semi-supervised Learning: Leveraging large amounts of unannotated data alongside smaller annotated datasets showed WER reductions of 15-20% compared to purely supervised approaches [18]. Techniques (a pseudo-labeling sketch follows the list):
Self-training with confidence-based pseudo-labeling
Consistency regularization across different augmentations
Leveraging pretrained models for better initialization
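A sketch of confidence-based pseudo-labeling; model.transcribe returning a (text, confidence) pair is a hypothetical interface standing in for any ASR system that exposes utterance-level confidence:

```python
CONFIDENCE_THRESHOLD = 0.9  # illustrative cutoff

def pseudo_label(model, unlabeled_audio: list) -> list[tuple]:
    """Keep only machine transcripts the model is confident about."""
    labeled = []
    for audio in unlabeled_audio:
        # Hypothetical interface: returns (transcript, confidence in [0, 1]).
        text, confidence = model.transcribe(audio)
        if confidence >= CONFIDENCE_THRESHOLD:
            labeled.append((audio, text))
    return labeled

# Typical recipe: train on the small gold set, pseudo-label the large
# untranscribed pool, retrain on the union, and repeat.
```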
Transfer Learning: Pre-training on large general-purpose speech datasets (e.g., LibriSpeech with 960 hours) and fine-tuning on smaller ATC datasets (50-100 hours) achieved comparable performance to models trained on 500+ hours of in-domain data [19]. Benefits included:
Leveraging general speech patterns learned from large datasets
Reducing the amount of in-domain data required for good performance
Faster convergence during fine-tuning on ATC data
Data Augmentation: Techniques like speed perturbation and SpecAugment provided a relative WER reduction of 10-15% by artificially increasing training data diversity [20]. Methods included (a SpecAugment sketch follows the list):
Time stretching and pitch shifting of audio
Masking of time and frequency bands in spectrograms
Adding simulated background noise to clean recordings
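A minimal torchaudio sketch of SpecAugment-style masking plus resampling-based speed perturbation; the masking parameters are illustrative, not the settings used in the cited work:

```python
import torch
import torchaudio.transforms as T

# SpecAugment-style masking applied to a mel spectrogram.
augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=15),  # mask up to 15 mel bins
    T.TimeMasking(time_mask_param=35),       # mask up to 35 frames
)

waveform = torch.randn(1, 16000)  # stand-in for a 1 s, 16 kHz utterance
mel = T.MelSpectrogram(sample_rate=16000)(waveform)
augmented = augment(mel)

# Speed perturbation operates on the waveform: resample, then reinterpret
# the result at the original rate (here ~1.1x faster playback).
faster = T.Resample(orig_freq=16000, new_freq=14400)(waveform)
```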
Multilingual Communications
Challenge
In some regions outside the US, ATC communications involve code-switching between English and local languages, often within the same conversation [21]. Specific issues included:
Mid-utterance language switches
Mixing of vocabulary from multiple languages
Variations in grammar and sentence structure across languages
Solutions
Multilingual ASR Models: wav2vec-XLSR demonstrated the ability to handle code-switched speech with only a 5-10% WER increase compared to monolingual speech [22]. Key features included:
Joint training on multiple languages
Shared representations across languages
Language-agnostic acoustic modeling
Language Identification: Implementing a language identification step before ASR showed accuracy rates of over 95% for 10-second audio segments [23]. Techniques included:
i-vector based language identification
Deep neural network classifiers for language detection
Utilizing both acoustic and linguistic features for identification
Code-switching Aware Models: Models explicitly trained on code-switched data showed WER reductions of 20-25% compared to monolingual models on code-switched speech [24]. Approaches included:
Using language-specific branches in the model architecture
Incorporating language identification into the ASR decoding process
Training on synthetic code-switched data to improve robustness
Speaker Role Identification
Challenge
Distinguishing between controller and pilot speech is crucial for many downstream NLP tasks in ATC [25]. Its importance stems from:
Different phraseology used by controllers and pilots
Need to accurately attribute instructions and readbacks
Role-specific information extraction for situational awareness
Solutions
Joint ASR and Speaker Role Detection: End-to-end models performing both tasks simultaneously achieved role classification accuracy of over 95% while maintaining competitive ASR performance [26]. Advantages included:
Leveraging acoustic and linguistic cues for role identification
Reduced overall system complexity
Potential for improved ASR accuracy through role-aware modeling
Text-based Speaker Role Classification: BERT-based models fine-tuned on ATC transcripts showed F1 scores of 0.97+ for speaker role classification [27]. Key aspects included (a fine-tuning sketch follows the list):
Utilizing pre-trained language models for ATC domain adaptation
Capturing long-range dependencies in ATC communications
Leveraging large unlabeled ATC text data for pre-training
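A hedged sketch of the fine-tuning objective using a generic BERT checkpoint from Hugging Face transformers; the label scheme (0 = controller, 1 = pilot) and the two example utterances are invented, and the cited work's exact setup differs:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Generic checkpoint with a two-class head for role classification.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

texts = ["ryanair eight nine two turn left heading two one zero",
         "left heading two one zero ryanair eight nine two"]
labels = torch.tensor([0, 1])  # controller issues; pilot reads back

batch = tokenizer(texts, padding=True, return_tensors="pt")
loss = model(**batch, labels=labels).loss  # fine-tuning objective
loss.backward()  # a real recipe would wrap this in an optimizer loop
```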
Acoustic-based Diarization: Speaker diarization techniques using x-vector clustering achieved diarization error rates below 5% on ATC recordings [28]. Methods involved (a clustering sketch follows the list):
Extracting speaker-discriminative embeddings (x-vectors)
Clustering similar embeddings to identify distinct speakers
Applying domain-specific constraints (e.g., expected number of speakers)
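A minimal clustering sketch with scikit-learn, using random vectors as stand-ins for x-vectors extracted from sliding windows; fixing the cluster count at two encodes the mostly two-party nature of an ATC exchange:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Stand-in x-vectors: in practice these come from a speaker-embedding
# network run over short windows of the recording.
rng = np.random.default_rng(0)
controller = rng.normal(0.0, 0.1, size=(20, 128)) + 1.0  # one speaker
pilot      = rng.normal(0.0, 0.1, size=(20, 128)) - 1.0  # another speaker
xvectors = np.vstack([controller, pilot])

# Domain-specific constraint: ATC dialogues are mostly two-party, so
# simply fix the number of clusters to two.
clustering = AgglomerativeClustering(n_clusters=2, linkage="average")
segment_speakers = clustering.fit_predict(xvectors)
print(segment_speakers)  # speaker label per window/segment
```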
Handling Read-back Errors
Challenge
Read-back errors, where a pilot incorrectly repeats an instruction, occur in approximately 1% of ATC communications but can have severe safety implications [29]. Specific issues include:
Misheard numbers (e.g., altitude, heading)
Confusion between similar-sounding instructions
Partial readbacks omitting critical information
Solutions
Sequence-to-sequence Models: These models showed the ability to detect up to 90% of read-back errors in controlled experiments [30]. Approaches included:
Encoder-decoder architectures for instruction-readback alignment
Attention mechanisms to focus on critical parts of the instruction
Training on large datasets of instruction-readback pairs
Intent Recognition: NLU models designed for intent extraction achieved F1 scores of 0.92+ for identifying mismatches between controller instructions and pilot read-backs [31]. Techniques involved:
Fine-grained intent classification for ATC instructions
Slot filling to extract specific parameters (e.g., altitude, heading)
Comparing extracted intents between instruction and readback
Rule-based Systems: Combining ASR with rule-based error detection detected up to 80% of common read-back errors in real-world ATC data [32]. Methods included (a readback-comparison sketch follows the list):
Defining formal grammars for valid ATC instructions
Implementing logic for common error types (e.g., number transposition)
Using fuzzy matching to account for ASR errors
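A toy sketch of the rule-based comparison for one parameter type (flight level); the digit lexicon and the extraction pattern are simplified assumptions:

```python
import re

WORD2DIGIT = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
              "five": "5", "six": "6", "seven": "7", "eight": "8",
              "nine": "9", "niner": "9"}

def extract_values(utterance: str) -> dict:
    """Pull numeric parameters (here just the flight level) from an utterance."""
    values = {}
    m = re.search(r"flight level((?: \w+){3})", utterance.lower())
    if m:
        values["flight_level"] = "".join(
            WORD2DIGIT.get(w, "?") for w in m.group(1).split())
    return values

def readback_mismatches(instruction: str, readback: str) -> dict:
    """Any parameter whose value differs is a candidate read-back error."""
    inst, rb = extract_values(instruction), extract_values(readback)
    return {k: (v, rb.get(k)) for k, v in inst.items() if rb.get(k) != v}

print(readback_mismatches(
    "climb flight level three four zero",
    "climbing flight level three zero four",  # transposed digits
))  # -> {'flight_level': ('340', '304')}
```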
Handling Out-of-Vocabulary Words
Challenge
ATC communications often include callsigns and waypoints not present in the ASR system's vocabulary, leading to transcription errors [33]. Issues include:
Constantly changing set of active callsigns
Location-specific waypoint names
Airline-specific codes and abbreviations
Solutions
Subword Modeling: Using byte-pair encoding (BPE) or wordpieces reduced OOV rates by 40-50% compared to whole-word models [34]. Benefits included (a BPE training sketch follows the list):
Ability to construct unseen words from subword units
Improved handling of compound words and abbreviations
Reduced vocabulary size while maintaining coverage
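A minimal sketch of BPE training with the Hugging Face tokenizers library on a tiny invented corpus; a real system would train on the full transcript collection:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Tiny inline corpus for illustration only.
corpus = [
    "speedbird one two three contact ruzyne tower",
    "lufthansa four five six climb flight level three four zero",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(
    corpus, BpeTrainer(vocab_size=100, special_tokens=["[UNK]"]))

# An unseen waypoint name decomposes into known subword units instead of
# collapsing to a single out-of-vocabulary token.
print(tokenizer.encode("vozice").tokens)
```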
Open-vocabulary ASR: Character-based models showed a 20-30% reduction in errors related to OOV words compared to word-based models [35]. Advantages included:
Unlimited vocabulary size
Better handling of spelling variations and typos
Improved recognition of alphanumeric codes
Dynamic Vocabulary Adaptation: Incorporating real-time flight information to update ASR vocabularies reduced callsign recognition errors by up to 40% [36]. Techniques involved:
Real-time integration with air traffic management systems
Dynamically updating language model probabilities for active callsigns
Constrained decoding based on expected callsigns in the airspace
Maintaining High Accuracy Across Different ATC Domains
Challenge
ATC communications vary significantly between airspace types (e.g., en-route vs. approach) and airports, impacting ASR performance [37]. Variations include:
Different phraseology and procedures
Airport-specific waypoints and navigational references
Varying traffic patterns and communication density
Solutions
Domain Adaptation: Techniques like adversarial training showed WER reductions of 15-20% when adapting models to new ATC domains [38]. Methods included (a gradient-reversal sketch follows the list):
Gradient reversal layers for domain-invariant feature learning
Unsupervised adaptation using unlabeled target domain data
Incremental learning to continuously adapt to new domains
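A minimal PyTorch implementation of the gradient reversal layer at the heart of adversarial domain adaptation; the lambda scaling and the toy check are illustrative:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) gradients on the
    backward pass, so the feature extractor learns to fool the domain
    classifier while the classifier tries to tell domains apart."""

    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# Toy check: gradients through the layer come back negated and scaled.
x = torch.ones(3, requires_grad=True)
grad_reverse(x, lambd=0.5).sum().backward()
print(x.grad)  # tensor([-0.5000, -0.5000, -0.5000])
```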
Multi-domain Training: Models trained on data from multiple ATC domains (e.g., 5+ airports) showed 10-15% lower WER on unseen domains compared to single-domain models [39]. Advantages included:
Improved generalization to new airports and airspace types
Robustness to variations in phraseology and procedures
Better handling of domain-specific vocabulary
Few-shot Learning: Techniques allowing models to adapt to new domains with 1-2 hours of data achieved 90% of the performance of models trained on 50+ hours of domain-specific data [40]. Approaches included:
Meta-learning algorithms for rapid adaptation
Prototypical networks for learning from few examples
Transfer learning with domain-specific fine-tuning
Conclusion
Although much of it is dated, previous research on ATC speech-to-text provides a clear map of the field's challenges and can serve as inspiration for future techniques.
1. Zuluaga-Gomez, J., et al. (2020). "Automatic Speech Recognition Benchmark for Air-Traffic Communications." arXiv:2008.11525v4.
2. Zuluaga-Gomez, J., et al. (2021). "Contextual Semi-Supervised Learning: An Approach To Leverage Air-Surveillance and Untranscribed ATC Data in ASR Systems." arXiv:2104.03643v2.
3. Ibid.
4. Zuluaga-Gomez, J., et al. (2023). "How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications." arXiv:2303.03285v1.
5. Zuluaga-Gomez, J., et al. (2020). "Automatic Speech Recognition Benchmark for Air-Traffic Communications." arXiv:2008.11525v4.
6. Zuluaga-Gomez, J., et al. (2023). "ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications." arXiv:2211.04054v2.
7. Zuluaga-Gomez, J., et al. (2020). "Automatic Speech Recognition Benchmark for Air-Traffic Communications." arXiv:2008.11525v4.
8. Zuluaga-Gomez, J., et al. (2023). "How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications." arXiv:2303.03285v1.
9. Zuluaga-Gomez, J., et al. (2020). "Automatic Call Sign Detection: Matching Air Surveillance Data with Air Traffic Spoken Communications." arXiv:2007.12319v2.
10. Ibid.
11. Nigmatulina, I., et al. (2022). "A two-step approach to leverage contextual data: speech recognition in air-traffic communications." arXiv:2203.14960v1.
12. Zuluaga-Gomez, J., et al. (2023). "ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications." arXiv:2211.04054v2.
13. Zuluaga-Gomez, J., et al. (2020). "Automatic Speech Recognition Benchmark for Air-Traffic Communications." arXiv:2008.11525v4.
14. Ibid.
15. Ibid.
16. Ibid.
17. Zuluaga-Gomez, J., et al. (2023). "ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications." arXiv:2211.04054v2.
18. Zuluaga-Gomez, J., et al. (2021). "Contextual Semi-Supervised Learning: An Approach To Leverage Air-Surveillance and Untranscribed ATC Data in ASR Systems." arXiv:2104.03643v2.
19. Zuluaga-Gomez, J., et al. (2023). "How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications." arXiv:2303.03285v1.
20. Zuluaga-Gomez, J., et al. (2020). "Automatic Speech Recognition Benchmark for Air-Traffic Communications." arXiv:2008.11525v4.
21. Zuluaga-Gomez, J., et al. (2023). "ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications." arXiv:2211.04054v2.
22. Zuluaga-Gomez, J., et al. (2023). "How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications." arXiv:2303.03285v1.
23. Szöke, I., et al. (2021). "Detecting English Speech in the Air Traffic Control Voice Communication." arXiv:2107.11509v1.
24. Zuluaga-Gomez, J., et al. (2023). "How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications." arXiv:2303.03285v1.
25. Zuluaga-Gomez, J., et al. (2023). "BERTraffic: BERT-based Joint Speaker Role and Speaker Change Detection for Air Traffic Control Communications." arXiv:2301.10148v2.
26. Ibid.
27. Ibid.
28. Ibid.
29. Zuluaga-Gomez, J., et al. (2020). "Automatic Call Sign Detection: Matching Air Surveillance Data with Air Traffic Spoken Communications." arXiv:2007.12319v2.
30. Ibid.
31. Ibid.
32. Nigmatulina, I., et al. (2022). "A two-step approach to leverage contextual data: speech recognition in air-traffic communications." arXiv:2203.14960v1.
33. Zuluaga-Gomez, J., et al. (2023). "How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications." arXiv:2303.03285v1.
34. Ibid.
35. Ibid.
36. Ibid.