Structuring ATC Transmissions: A Review
A summary of research to date on speech-to-text for pilot-controller communications.
Several university and private-sector organizations have published research on Air Traffic Control (ATC) speech-to-text. This is a meta-analysis of that research. Importantly, much of it predates 2021 and may not reflect the current state of the art.
Noisy Audio Environment
Challenge
ATC communications occur over VHF radio channels (118-137 MHz) with significant background noise and interference. The average signal-to-noise ratio (SNR) in ATC recordings ranges from 5-15 dB, compared to 20-30 dB in typical ASR datasets [1]; a 15 dB gap corresponds to roughly 30 times more noise power relative to the signal. This low SNR is due to several factors:
Aircraft noise
Electromagnetic interference from equipment
Atmospheric conditions affecting radio transmission
Multiple speakers sharing the same frequency, and noise from adjacent frequency bands (notably, FM broadcast radio)
Solutions
Noise-robust Acoustic Modeling: Multi-condition training exposed models to various noise conditions; studies showed this reduced word error rates (WER) by 20-30% in noisy environments [2]. This approach involved the following (a minimal augmentation sketch follows the list):
Augmenting clean speech data with various noise types
Training on a mix of clean and noisy samples
Using noise-aware training techniques that estimate noise conditions during inference
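To make the augmentation step concrete, here is a minimal sketch in Python, using synthetic NumPy arrays in place of real recordings; the mix_at_snr helper and the 5-15 dB range are illustrative choices, not code from the cited work:

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise recording into clean speech at a target SNR (dB)."""
    # Loop or trim the noise so it covers the whole utterance.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Multi-condition training: each clean utterance reappears at several SNRs
# drawn from the 5-15 dB range reported for ATC channels.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)  # stand-in for a 1 s, 16 kHz utterance
noise = rng.standard_normal(8000)   # stand-in for cockpit/static noise
augmented = [mix_at_snr(clean, noise, rng.uniform(5, 15)) for _ in range(4)]
```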
Speech Enhancement: Techniques like spectral subtraction and Wiener filtering improved SNR by 3-5 dB, leading to a 10-15% reduction in WER [3]. Advanced methods included (a spectral-subtraction sketch follows the list):
Deep neural network-based speech enhancement
Time-frequency masking techniques
Adaptive noise cancellation algorithms
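The following is a minimal sketch of classic magnitude spectral subtraction using SciPy's STFT; the frame count used for the noise estimate and the flooring constant are illustrative assumptions:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy: np.ndarray, fs: int, noise_frames: int = 10) -> np.ndarray:
    """Magnitude spectral subtraction with a noise-floor estimate taken
    from the first few (assumed speech-free) frames."""
    _, _, Z = stft(noisy, fs=fs, nperseg=512)
    mag, phase = np.abs(Z), np.angle(Z)

    # Estimate the noise magnitude spectrum from the leading frames.
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # Subtract, then floor at a small fraction of the noise estimate to
    # limit "musical noise" artifacts.
    enhanced_mag = np.maximum(mag - noise_mag, 0.05 * noise_mag)

    _, enhanced = istft(enhanced_mag * np.exp(1j * phase), fs=fs)
    return enhanced
```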
End-to-End Models: Models like wav2vec 2.0, when fine-tuned on domain-specific ATC data, showed WER reductions of up to 25% compared to traditional ASR systems in noisy conditions [4]. Key advantages included (a fine-tuning sketch follows the list):
Learning noise-robust features directly from raw waveforms
Leveraging self-supervised pre-training on large unlabeled datasets
Adapting to domain-specific noise characteristics through fine-tuning
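A hedged sketch of the fine-tuning setup using the Hugging Face transformers implementation of wav2vec 2.0; the checkpoint name, the synthetic audio, and the sample transcript are placeholders, and a real recipe would iterate over an ATC dataset with an optimizer loop:

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Start from a LibriSpeech-trained checkpoint and adapt it to ATC audio.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Freezing the convolutional feature extractor is common when the
# in-domain dataset is small.
model.freeze_feature_encoder()

# One illustrative training step on a synthetic 2 s clip at 16 kHz.
audio = torch.randn(32000).numpy()
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor(text="CONTACT TOWER ONE ONE EIGHT DECIMAL ONE",
                   return_tensors="pt").input_ids
loss = model(inputs.input_values, labels=labels).loss
loss.backward()  # a real recipe would wrap this in an optimizer loop
```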
Accented and Non-native English Speech
Challenge
English is the global language of aviation communications. Still, ATC communications involve speakers from over 190 countries, leading to a wide range of accents and English proficiency levels [5]. Specific issues include:
Variation in pronunciation of key terms and numbers
Differences in intonation and stress patterns
Non-standard grammatical structures used by non-native speakers
Solutions
Multi-accent Training Data: The ATCO2 corpus includes recordings from 12 countries and 10 airports, providing a diverse dataset for training accent-robust models [6]. Benefits include:
Exposure to a wide range of accent variations
Improved generalization to unseen accents
Better handling of pronunciation variations for critical ATC terms
Accent Adaptation: Techniques using i-vectors or x-vectors as additional input features have shown WER reductions of 10-15% for accented speech [7]. Methods used:
Capturing speaker-specific characteristics in a low-dimensional space
Allowing the ASR model to adapt to individual speaker traits
Improving recognition of accented pronunciations
Multilingual Models: wav2vec-XLSR, trained on 53 languages, demonstrated a 20% relative WER reduction on accented English compared to monolingual models [8]. Advantages of this approach:
Learning language-independent speech representations
Improved robustness to different phonetic systems
Better handling of code-switching and foreign words in ATC communications
Domain-specific Vocabulary and Phraseology
Challenge
ATC communications follow ICAO phraseology, including specialized terminology and alphanumeric codes not found in general-purpose ASR training data [9]. Specific challenges:
Unique callsigns and waypoint names
Standardized phrases with precise meanings
Alphanumeric codes for runways, flight levels, and headings
Solutions
Custom Language Models: Language models trained specifically on ATC communications reduced perplexity by 40-50% compared to general-purpose models [10]. Techniques (a toy n-gram sketch follows the list):
N-gram models trained on large corpora of ATC transcripts
Neural language models fine-tuned on ATC data
Incorporating ATC grammar rules into the language model structure
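As a toy illustration of why domain-matched language models help, the following pure-Python bigram model assigns far lower perplexity to ATC phraseology than to ordinary English; the three-line corpus and the smoothing constant are invented for the example:

```python
import math
from collections import Counter

# Toy ATC corpus; real systems train on large transcript collections.
corpus = [
    "speedbird one two three climb flight level three four zero",
    "lufthansa four five six contact approach one one nine decimal five",
    "speedbird one two three contact tower one one eight decimal one",
]

tokens = [w for line in corpus for w in f"<s> {line} </s>".split()]
unigrams, bigrams = Counter(tokens), Counter(zip(tokens, tokens[1:]))
vocab = len(unigrams)

def bigram_logprob(prev: str, word: str, k: float = 0.5) -> float:
    # Add-k smoothed conditional probability P(word | prev).
    return math.log((bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab))

def perplexity(sentence: str) -> float:
    words = f"<s> {sentence} </s>".split()
    lp = sum(bigram_logprob(p, w) for p, w in zip(words, words[1:]))
    return math.exp(-lp / (len(words) - 1))

# Domain-matched phrasing scores far better than out-of-domain text.
print(perplexity("speedbird one two three contact approach"))
print(perplexity("the quick brown fox jumps over the lazy dog"))
```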
Contextual Information Integration: Incorporating flight plan data and valid callsign lists through lattice rescoring or shallow fusion showed WER reductions of 15-20% for critical ATC entities [11]. This involved the following (a rescoring sketch follows the list):
Real-time integration of flight data into the ASR decoding process
Dynamically updating language model probabilities based on active flights
Biasing the ASR output towards valid ATC commands and callsigns
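A minimal sketch of the biasing idea at the n-best rescoring level, assuming an upstream decoder that returns hypotheses with scores; the callsign list, bonus value, and example hypotheses are invented:

```python
# Callsigns currently active in the sector, e.g. fed from surveillance data.
ACTIVE_CALLSIGNS = {"speedbird one two three", "lufthansa four five six"}

def rescore(nbest: list[tuple[str, float]], bonus: float = 2.0) -> tuple[str, float]:
    """Boost hypotheses containing a callsign that is currently airborne."""
    rescored = []
    for text, score in nbest:
        if any(cs in text for cs in ACTIVE_CALLSIGNS):
            score += bonus  # shallow-fusion style additive bias
        rescored.append((text, score))
    return max(rescored, key=lambda pair: pair[1])

nbest = [
    ("speedbird one two free climb flight level three four zero", -10.2),
    ("speedbird one two three climb flight level three four zero", -10.9),
]
print(rescore(nbest))  # the valid-callsign hypothesis wins
```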
Named Entity Recognition (NER): ATC-specific NER models achieved F1 scores of over 0.95 for entities like callsigns and commands, significantly improving information extraction from ASR output [12]. Key aspects (an illustrative extractor follows the list):
Training on large datasets of annotated ATC transcripts
Using domain-specific entity types (e.g., callsign, altitude, heading)
Integrating NER with ASR to improve recognition of critical entities
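In lieu of a trained tagger, the following pattern-based extractor illustrates the entity inventory involved; the airline lexicon and patterns are simplified assumptions, not a production NER model:

```python
import re

# Spelled-out digits as they appear in ATC transcripts.
NUM = r"(?:zero|one|two|three|four|five|six|seven|eight|niner|nine)"
AIRLINE = r"(?:speedbird|lufthansa|ryanair|delta)"  # toy airline lexicon

PATTERNS = {
    "callsign":     rf"\b{AIRLINE}(?: {NUM})+\b",
    "flight_level": rf"\bflight level(?: {NUM}){{3}}\b",
    "heading":      rf"\bheading(?: {NUM}){{3}}\b",
}

def extract_entities(transcript: str) -> dict:
    return {label: re.findall(pattern, transcript)
            for label, pattern in PATTERNS.items()}

text = ("lufthansa four five six turn left heading two seven zero "
        "climb flight level three four zero")
print(extract_entities(text))
```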
Real-time Processing Requirements
Challenge
ATC communications require real-time processing, with maximum acceptable latency typically under 500 milliseconds [13]. This constraint is due to:
Safety-critical nature of ATC communications
Need for immediate response to instructions
High volume of communications in busy airspace
Solutions
Model Compression: Techniques like knowledge distillation and pruning achieved 5-10x model size reduction with less than 2% WER increase [14]. Specific methods (a distillation-loss sketch follows the list):
Teacher-student training for knowledge distillation
Iterative pruning of less important network connections
Quantization of model weights to reduce memory footprint
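A minimal PyTorch sketch of the teacher-student distillation objective; the temperature, mixing weight, and toy tensors are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend hard-label cross-entropy with soft-target KL divergence."""
    # Soft targets: teacher distribution at a raised temperature.
    soft_t = F.log_softmax(teacher_logits / temperature, dim=-1)
    soft_s = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_s, soft_t, log_target=True, reduction="batchmean")
    kd = kd * temperature ** 2  # standard gradient-scale correction

    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage: 8 frames, 30 output classes.
teacher = torch.randn(8, 30)
student = torch.randn(8, 30, requires_grad=True)
labels = torch.randint(0, 30, (8,))
distillation_loss(student, teacher, labels).backward()
```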
Efficient Architectures: Time-Delay Neural Networks (TDNNs) and Conformer models showed 30-40% inference speed improvements over traditional RNN-based models [15]. Key features:
Parallelizable computations for faster processing
Efficient modeling of long-range dependencies
Reduced computational complexity compared to recurrent architectures
Hardware Acceleration: Using GPUs or TPUs for inference can provide 10-20x speedup compared to CPU-only processing, enabling real-time performance for more complex models [16]. Considerations:
Optimizing models for specific hardware accelerators
Utilizing low-precision arithmetic for faster computation
Leveraging batching techniques to maximize throughput
Limited Availability of Large Annotated Datasets
Challenge
Annotating ATC communications requires domain experts and is time-consuming. Historically, 1 hour of raw audio required 8-10 hours to transcribe accurately [17]. Issues:
Need for expertise in ATC phraseology and procedures
Difficulty in accurately transcribing noisy recordings
High cost of employing domain experts for annotation
Solutions
Semi-supervised Learning: Leveraging large amounts of unannotated data alongside smaller annotated datasets showed WER reductions of 15-20% compared to purely supervised approaches [18]. Techniques (a pseudo-labeling sketch follows the list):
Self-training with confidence-based pseudo-labeling
Consistency regularization across different augmentations
Leveraging pretrained models for better initialization
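A sketch of confidence-based pseudo-labeling; model.transcribe returning a (text, confidence) pair is a hypothetical interface standing in for any ASR system that exposes utterance-level confidence:

```python
CONFIDENCE_THRESHOLD = 0.9  # illustrative cutoff

def pseudo_label(model, unlabeled_audio: list) -> list[tuple]:
    """Keep only machine transcripts the model is confident about."""
    labeled = []
    for audio in unlabeled_audio:
        # Hypothetical interface: returns (transcript, confidence in [0, 1]).
        text, confidence = model.transcribe(audio)
        if confidence >= CONFIDENCE_THRESHOLD:
            labeled.append((audio, text))
    return labeled

# Typical recipe: train on the small gold set, pseudo-label the large
# untranscribed pool, retrain on the union, and repeat.
```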
Transfer Learning: Pre-training on large general-purpose speech datasets (e.g., LibriSpeech with 960 hours) and fine-tuning on smaller ATC datasets (50-100 hours) achieved comparable performance to models trained on 500+ hours of in-domain data [19]. Benefits included:
Leveraging general speech patterns learned from large datasets
Reducing the amount of in-domain data required for good performance
Faster convergence during fine-tuning on ATC data
Data Augmentation: Techniques like speed perturbation and SpecAugment provided a relative WER reduction of 10-15% by artificially increasing training data diversity [20]. Methods included (a SpecAugment sketch follows the list):
Time stretching and pitch shifting of audio
Masking of time and frequency bands in spectrograms
Adding simulated background noise to clean recordings
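A minimal torchaudio sketch of SpecAugment-style masking plus resampling-based speed perturbation; the masking parameters are illustrative, not the settings used in the cited work:

```python
import torch
import torchaudio.transforms as T

# SpecAugment-style masking applied to a mel spectrogram.
augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=15),  # mask up to 15 mel bins
    T.TimeMasking(time_mask_param=35),       # mask up to 35 frames
)

waveform = torch.randn(1, 16000)  # stand-in for a 1 s, 16 kHz utterance
mel = T.MelSpectrogram(sample_rate=16000)(waveform)
augmented = augment(mel)

# Speed perturbation operates on the waveform: resample, then reinterpret
# the result at the original rate (here ~1.1x faster playback).
faster = T.Resample(orig_freq=16000, new_freq=14400)(waveform)
```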
Multilingual Communications
Challenge
In some regions outside the US, ATC communications involve code-switching between English and local languages, often within the same conversation [21]. Specific issues included:
Mid-utterance language switches
Mixing of vocabulary from multiple languages
Variations in grammar and sentence structure across languages
Solutions
Multilingual ASR Models: wav2vec-XLSR demonstrated the ability to handle code-switched speech with only a 5-10% WER increase compared to monolingual speech [22]. Key features included:
Joint training on multiple languages
Shared representations across languages
Language-agnostic acoustic modeling
Language Identification: Implementing a language identification step before ASR showed accuracy rates of over 95% for 10-second audio segments [23]. Techniques included:
i-vector based language identification
Deep neural network classifiers for language detection
Utilizing both acoustic and linguistic features for identification
Code-switching Aware Models: Models explicitly trained on code-switched data showed WER reductions of 20-25% compared to monolingual models on code-switched speech [24]. Approaches included:
Using language-specific branches in the model architecture
Incorporating language identification into the ASR decoding process
Training on synthetic code-switched data to improve robustness
Speaker Role Identification
Challenge
Distinguishing between controller and pilot speech is crucial for many downstream NLP tasks in ATC [25]. Its importance stems from:
Different phraseology used by controllers and pilots
Need to accurately attribute instructions and readbacks
Role-specific information extraction for situational awareness
Solutions
Joint ASR and Speaker Role Detection: End-to-end models performing both tasks simultaneously achieved role classification accuracy of over 95% while maintaining competitive ASR performance [26]. Advantages included:
Leveraging acoustic and linguistic cues for role identification
Reduced overall system complexity
Potential for improved ASR accuracy through role-aware modeling
Text-based Speaker Role Classification: BERT-based models fine-tuned on ATC transcripts showed F1 scores of 0.97+ for speaker role classification [27]. Key aspects included (a fine-tuning sketch follows the list):
Utilizing pre-trained language models for ATC domain adaptation
Capturing long-range dependencies in ATC communications
Leveraging large unlabeled ATC text data for pre-training
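A hedged sketch of the fine-tuning objective using a generic BERT checkpoint from Hugging Face transformers; the label scheme (0 = controller, 1 = pilot) and the two example utterances are invented, and the cited work's exact setup differs:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Generic checkpoint with a two-class head for role classification.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

texts = ["ryanair eight nine two turn left heading two one zero",
         "left heading two one zero ryanair eight nine two"]
labels = torch.tensor([0, 1])  # controller issues; pilot reads back

batch = tokenizer(texts, padding=True, return_tensors="pt")
loss = model(**batch, labels=labels).loss  # fine-tuning objective
loss.backward()  # a real recipe would wrap this in an optimizer loop
```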
Acoustic-based Diarization: Speaker diarization techniques using x-vector clustering achieved diarization error rates below 5% on ATC recordings [28]. Methods involved (a clustering sketch follows the list):
Extracting speaker-discriminative embeddings (x-vectors)
Clustering similar embeddings to identify distinct speakers
Applying domain-specific constraints (e.g., expected number of speakers)
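A minimal clustering sketch with scikit-learn, using random vectors as stand-ins for x-vectors extracted from sliding windows; fixing the cluster count at two encodes the mostly two-party nature of an ATC exchange:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Stand-in x-vectors: in practice these come from a speaker-embedding
# network run over short windows of the recording.
rng = np.random.default_rng(0)
controller = rng.normal(0.0, 0.1, size=(20, 128)) + 1.0  # one speaker
pilot      = rng.normal(0.0, 0.1, size=(20, 128)) - 1.0  # another speaker
xvectors = np.vstack([controller, pilot])

# Domain-specific constraint: ATC dialogues are mostly two-party, so
# simply fix the number of clusters to two.
clustering = AgglomerativeClustering(n_clusters=2, linkage="average")
segment_speakers = clustering.fit_predict(xvectors)
print(segment_speakers)  # speaker label per window/segment
```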
Handling Read-back Errors
Challenge
Read-back errors, where a pilot incorrectly repeats an instruction, occur in approximately 1% of ATC communications but can have severe safety implications [29]. Specific issues include:
Misheard numbers (e.g., altitude, heading)
Confusion between similar-sounding instructions
Partial readbacks omitting critical information
Solutions
Sequence-to-sequence Models: These models showed the ability to detect up to 90% of read-back errors in controlled experiments [30]. Approaches included:
Encoder-decoder architectures for instruction-readback alignment
Attention mechanisms to focus on critical parts of the instruction
Training on large datasets of instruction-readback pairs
Intent Recognition: NLU models designed for intent extraction achieved F1 scores of 0.92+ for identifying mismatches between controller instructions and pilot read-backs [31]. Techniques involved:
Fine-grained intent classification for ATC instructions
Slot filling to extract specific parameters (e.g., altitude, heading)
Comparing extracted intents between instruction and readback
Rule-based Systems: Combining ASR with rule-based error detection detected up to 80% of common read-back errors in real-world ATC data [32]. Methods included (a readback-comparison sketch follows the list):
Defining formal grammars for valid ATC instructions
Implementing logic for common error types (e.g., number transposition)
Using fuzzy matching to account for ASR errors
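A toy sketch of the rule-based comparison for one parameter type (flight level); the digit lexicon and the extraction pattern are simplified assumptions:

```python
import re

WORD2DIGIT = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
              "five": "5", "six": "6", "seven": "7", "eight": "8",
              "nine": "9", "niner": "9"}

def extract_values(utterance: str) -> dict:
    """Pull numeric parameters (here just the flight level) from an utterance."""
    values = {}
    m = re.search(r"flight level((?: \w+){3})", utterance.lower())
    if m:
        values["flight_level"] = "".join(
            WORD2DIGIT.get(w, "?") for w in m.group(1).split())
    return values

def readback_mismatches(instruction: str, readback: str) -> dict:
    """Any parameter whose value differs is a candidate read-back error."""
    inst, rb = extract_values(instruction), extract_values(readback)
    return {k: (v, rb.get(k)) for k, v in inst.items() if rb.get(k) != v}

print(readback_mismatches(
    "climb flight level three four zero",
    "climbing flight level three zero four",  # transposed digits
))  # -> {'flight_level': ('340', '304')}
```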
Handling Out-of-Vocabulary Words
Challenge
ATC communications often include callsigns and waypoints not present in the ASR system's vocabulary, leading to transcription errors [33]. Issues include:
Constantly changing set of active callsigns
Location-specific waypoint names
Airline-specific codes and abbreviations
Solutions
Subword Modeling: Using byte-pair encoding (BPE) or wordpieces reduced OOV rates by 40-50% compared to whole-word models [34]. Benefits included (a BPE training sketch follows the list):
Ability to construct unseen words from subword units
Improved handling of compound words and abbreviations
Reduced vocabulary size while maintaining coverage
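A minimal sketch of BPE training with the Hugging Face tokenizers library on a tiny invented corpus; a real system would train on the full transcript collection:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Tiny inline corpus for illustration only.
corpus = [
    "speedbird one two three contact ruzyne tower",
    "lufthansa four five six climb flight level three four zero",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(
    corpus, BpeTrainer(vocab_size=100, special_tokens=["[UNK]"]))

# An unseen waypoint name decomposes into known subword units instead of
# collapsing to a single out-of-vocabulary token.
print(tokenizer.encode("vozice").tokens)
```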
Open-vocabulary ASR: Character-based models showed a 20-30% reduction in errors related to OOV words compared to word-based models [35]. Advantages included:
Unlimited vocabulary size
Better handling of spelling variations and typos
Improved recognition of alphanumeric codes
Dynamic Vocabulary Adaptation: Incorporating real-time flight information to update ASR vocabularies reduced callsign recognition errors by up to 40% [36]. Techniques involved:
Real-time integration with air traffic management systems
Dynamically updating language model probabilities for active callsigns
Constrained decoding based on expected callsigns in the airspace
Maintaining High Accuracy Across Different ATC Domains
Challenge
ATC communications vary significantly between airspace types (e.g., en-route vs. approach) and airports, impacting ASR performance [37]. Variations include:
Different phraseology and procedures
Airport-specific waypoints and navigational references
Varying traffic patterns and communication density
Solutions
Domain Adaptation: Techniques like adversarial training showed WER reductions of 15-20% when adapting models to new ATC domains [38]. Methods included (a gradient-reversal sketch follows the list):
Gradient reversal layers for domain-invariant feature learning
Unsupervised adaptation using unlabeled target domain data
Incremental learning to continuously adapt to new domains
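A minimal PyTorch implementation of the gradient reversal layer at the heart of adversarial domain adaptation; the lambda scaling and the toy check are illustrative:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) gradients on the
    backward pass, so the feature extractor learns to fool the domain
    classifier while the classifier tries to tell domains apart."""

    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# Toy check: gradients through the layer come back negated and scaled.
x = torch.ones(3, requires_grad=True)
grad_reverse(x, lambd=0.5).sum().backward()
print(x.grad)  # tensor([-0.5000, -0.5000, -0.5000])
```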
Multi-domain Training: Models trained on data from multiple ATC domains (e.g., 5+ airports) showed 10-15% lower WER on unseen domains compared to single-domain models [39]. Advantages included:
Improved generalization to new airports and airspace types
Robustness to variations in phraseology and procedures
Better handling of domain-specific vocabulary
Few-shot Learning: Techniques allowing models to adapt to new domains with 1-2 hours of data achieved 90% of the performance of models trained on 50+ hours of domain-specific data [40]. Approaches included:
Meta-learning algorithms for rapid adaptation
Prototypical networks for learning from few examples
Transfer learning with domain-specific fine-tuning
Conclusion
Although much of it is dated, previous research on ATC speech-to-text provides a clear map of the field's challenges and can serve as inspiration for future techniques.
1. Zuluaga-Gomez, J., et al. (2020). "Automatic Speech Recognition Benchmark for Air-Traffic Communications." arXiv:2008.11525v4.
2. Zuluaga-Gomez, J., et al. (2021). "Contextual Semi-Supervised Learning: An Approach To Leverage Air-Surveillance and Untranscribed ATC Data in ASR Systems." arXiv:2104.03643v2.
3. Ibid.
4. Zuluaga-Gomez, J., et al. (2023). "How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications." arXiv:2303.03285v1.
5. Zuluaga-Gomez, J., et al. (2020). "Automatic Speech Recognition Benchmark for Air-Traffic Communications." arXiv:2008.11525v4.
6. Zuluaga-Gomez, J., et al. (2023). "ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications." arXiv:2211.04054v2.
7. Zuluaga-Gomez, J., et al. (2020). "Automatic Speech Recognition Benchmark for Air-Traffic Communications." arXiv:2008.11525v4.
8. Zuluaga-Gomez, J., et al. (2023). "How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications." arXiv:2303.03285v1.
9. Zuluaga-Gomez, J., et al. (2020). "Automatic Call Sign Detection: Matching Air Surveillance Data with Air Traffic Spoken Communications." arXiv:2007.12319v2.
10. Ibid.
11. Nigmatulina, I., et al. (2022). "A two-step approach to leverage contextual data: speech recognition in air-traffic communications." arXiv:2203.14960v1.
12. Zuluaga-Gomez, J., et al. (2023). "ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications." arXiv:2211.04054v2.
13. Zuluaga-Gomez, J., et al. (2020). "Automatic Speech Recognition Benchmark for Air-Traffic Communications." arXiv:2008.11525v4.
14. Ibid.
15. Ibid.
16. Ibid.
17. Zuluaga-Gomez, J., et al. (2023). "ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications." arXiv:2211.04054v2.
18. Zuluaga-Gomez, J., et al. (2021). "Contextual Semi-Supervised Learning: An Approach To Leverage Air-Surveillance and Untranscribed ATC Data in ASR Systems." arXiv:2104.03643v2.
19. Zuluaga-Gomez, J., et al. (2023). "How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications." arXiv:2303.03285v1.
20. Zuluaga-Gomez, J., et al. (2020). "Automatic Speech Recognition Benchmark for Air-Traffic Communications." arXiv:2008.11525v4.
21. Zuluaga-Gomez, J., et al. (2023). "ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications." arXiv:2211.04054v2.
22. Zuluaga-Gomez, J., et al. (2023). "How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications." arXiv:2303.03285v1.
23. Szöke, I., et al. (2021). "Detecting English Speech in the Air Traffic Control Voice Communication." arXiv:2107.11509v1.
24. Zuluaga-Gomez, J., et al. (2023). "How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications." arXiv:2303.03285v1.
25. Zuluaga-Gomez, J., et al. (2023). "BERTraffic: BERT-based Joint Speaker Role and Speaker Change Detection for Air Traffic Control Communications." arXiv:2301.10148v2.
26. Ibid.
27. Ibid.
28. Ibid.
29. Zuluaga-Gomez, J., et al. (2020). "Automatic Call Sign Detection: Matching Air Surveillance Data with Air Traffic Spoken Communications." arXiv:2007.12319v2.
30. Ibid.
31. Ibid.
32. Nigmatulina, I., et al. (2022). "A two-step approach to leverage contextual data: speech recognition in air-traffic communications." arXiv:2203.14960v1.
33. Zuluaga-Gomez, J., et al. (2023). "How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications." arXiv:2303.03285v1.
34. Ibid.
35. Ibid.
36. Ibid.