While current Speech Language Models (SLMs) possess strong semantic understanding, their generated speech often sounds flat and fails to convey the underlying communicative intent. We term this the semantic understanding–acoustic realization gap—a systematic failure to translate what the model thinks into how it speaks.
To bridge this gap, we introduce SA-SLM (Self-Aware Speech Language Model). SA-SLM is built on the principle that the model is aware of what it thinks during generation and how it speaks during training, thereby continuously aligning its intent with its realization through two core innovations: (1) Intent-Aware Bridging via a Variational Information Bottleneck (VIB) to extract stable expressive intent, and (2) Realization-Aware Alignment, a closed-loop self-reward mechanism where the model acts as its own critic. Trained on just 800 hours of data, our 3B model approaches GPT-4o-Audio in expressiveness.

- VIB-driven Modulation & Self-Reward Alignment: Uses a Variational Information Bottleneck (VIB) to distill stable, utterance-level expressive intent that directly steers speech generation, and employs a label-free self-reward mechanism to critique its own emotion and prosody outputs, enabling closed-loop alignment without external annotations.
- Resource-Efficient & Accessible: Achieves rich, natural expressiveness with only 3B parameters and 800 hours of expressive speech data, which keeps training costs low and makes the model easy to reproduce.
- High Expressive Quality, Approaching GPT-4o-Audio: Expressive quality encompasses contextually appropriate emotion (accurate affect with strong intensity), prosody (humanlike pausing, stress, and rhythm), and overall naturalness. Our model surpasses all open-source baselines, including models 10× larger (e.g., 30B), with +10.58pt emotion alignment and richer pitch variation (F0-Var: 63.44 vs. 49.76), while closing to within 0.08pt of GPT-4o-Audio in Overall Performance (4.33 vs. 4.41).
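
For readers unfamiliar with the Variational Information Bottleneck, the following is a minimal sketch of the generic technique, not SA-SLM's actual implementation. It assumes the common formulation with a diagonal Gaussian posterior and a standard normal prior; the function names (`vib_sample`, `vib_kl`, `vib_loss`) and the weight `beta` are hypothetical and chosen only for illustration.

```python
import math
import random


def vib_sample(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, I).

    mu and log_var are per-dimension lists that an encoder would produce
    for one utterance (assumed here; the source does not specify shapes).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def vib_kl(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def vib_loss(task_loss, mu, log_var, beta=1e-3):
    """Task loss plus a beta-weighted KL penalty that compresses the latent.

    beta is an illustrative default, not a value reported by the authors.
    """
    return task_loss + beta * vib_kl(mu, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.5, -0.2, 0.1], [-1.0, -1.0, -1.0]
    z = vib_sample(mu, log_var, rng)  # stochastic utterance-level latent
    print(len(z), vib_loss(1.0, mu, log_var))
```

The KL term penalizes information carried by the latent, so under these assumptions only the features most useful to the downstream objective survive. That is the usual intuition for why a bottleneck can yield a compact, stable intent representation.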
Evaluation on EchoMind Benchmark: This section presents the experimental results from our paper evaluated on the EchoMind benchmark. To illustrate the model's expressive performance, we sampled one representative data point for each emotion dimension as a demonstration.
Note: CosyVoice serves as a TTS baseline: it synthesizes the reference text conditioned on the gold emotion labels. SA-SLM (Ours) is highlighted.
| Emotion | Question Speech / Text | SA-SLM (Ours) | GPT-4o-Audio | Qwen3-Omni-30B | CosyVoice(TTS) |
|---|---|---|---|---|---|
| Angry | |||||
| Fearful | |||||
| Sad | |||||
| Disgusted | |||||
| Surprised | |||||
| Neutral | |||||
| Happy | |||||
Open-Domain Generalization & Practicality: This section showcases SA-SLM's practical utility across diverse daily topics, including samples with complex speech instructions. Without targeted training, SA-SLM demonstrates strong zero-shot generalization—driven by its Semantic-Acoustic Alignment, it naturally produces vivid and contextually appropriate speech. This highlights its potential for real-world applications such as voice acting, advanced conversational TTS, and spoken dialogue dataset construction. SA-SLM (Ours) is highlighted.
| Scenario | User Question | SA-SLM (Ours) | Qwen3-Omni-30B |
|---|---|---|---|
| 🙏 Sincere Apology | |||
| 😠 Expressing Anger | |||
| 🗓️ Weekend Planning | |||
| 😢 Telling a Sad Story | |||
| 💪 Encouragement After Failure | |||
| 💔 Comforting a Friend | |||