Fish Audio Open-Sources S2: Fine-Grained Control Meets Production Streaming

Mar 9, 2026

Open SourceResearch

Fish Audio Open-Sources S2: Fine-Grained Control Meets Production Streaming

S2 open source is available via the project's GitHub repository and HuggingFace.

Fish Audio has open-sourced S2, a text-to-speech model that supports fine-grained inline control of prosody and emotion using natural-language tags like [laugh], [whispers], and [super happy]. Trained on over 10 million hours of audio across approximately 50 languages, the system combines reinforcement learning alignment with a dual-autoregressive architecture. The release includes model weights, fine-tuning code, and an SGLang-based streaming inference engine.

Fine-Grained Inline Control via Natural Language

S2 enables Inline control over speech generation by embedding natural-language instructions directly at specific word or phrase positions within the text. Rather than relying on a fixed set of predefined tags, S2 accepts free-form textual descriptions — such as [whisper in small voice], [professional broadcast tone], or [pitch up] — allowing open-ended expression control at the word level.

On the Audio Turing Test, S2 achieves a posterior mean of 0.515 with instruction rewriting, compared to 0.417 for Seed-TTS and 0.387 for MiniMax-Speech. On EmergentTTS-Eval, it reaches an overall win rate of 81.88% against a gpt-4o-mini-tts baseline — the highest among all evaluated models, including closed-source systems from Google and OpenAI.

Example of S2 input format Example of S2 input format showing multi-speaker dialogue with free-form natural-language inline tags for fine-grained control.

A Unified Recipe: Data Curation and RL Rewards from the Same Models

A core architectural decision in S2 is that the same models used to filter and annotate training data are directly reused as reward models during reinforcement learning:

Speech quality model scores audio across dimensions like SNR, speaker consistency, and intelligibility during data filtering — then serves as the acoustic preference reward during RL.
Rich-transcription ASR model (continue pretrained from Qwen3-Omni-30B-A3B) generates caption-augmented transcripts with inline paralinguistic annotations during data curation — then provides the intelligibility and instruction-following reward by re-transcribing generated audio and comparing it against the original prompt.

This dual-purpose design eliminates the distribution mismatch between pre-training data and post-training objectives by construction — a problem that remains unaddressed in other TTS systems that train reward models separately from their data pipelines.

Inside the Model: Dual-AR Architecture

S2 builds on a decoder-only transformer combined with an RVQ-based audio codec (10 codebooks, ~21 Hz frame rate). Flattening all codebooks along time would cause a 10× sequence-length explosion. S2 addresses this with a Dual-Autoregressive (Dual-AR) architecture:

Slow AR operates along the time axis and predicts the primary semantic codebook.
Fast AR generates the remaining 9 residual codebooks at each time step, reconstructing fine-grained acoustic detail.

This asymmetric design — 4B parameters along the time axis, 400M parameters along the depth axis — keeps inference efficient while preserving audio fidelity.

Reinforcement Learning Alignment for Speech

For post-training, S2 uses Group Relative Policy Optimization (GRPO), chosen to avoid the memory overhead of PPO-style value models in long audio contexts. The reward signal combines multiple dimensions, including:

Semantic accuracy and instruction adherence
Acoustic preference scoring
Timbre similarity

Benchmark Results

S2 achieves leading results across multiple public benchmarks:

Benchmark	Fish Audio S2
Seed-TTS Eval — WER (Chinese)	0.54% (best overall)
Seed-TTS Eval — WER (English)	0.99% (best overall)
Audio Turing Test (with instruction)	0.515 posterior mean
EmergentTTS-Eval — Win Rate	81.88% (highest overall)
Fish Instruction Benchmark — TAR	93.3%
Fish Instruction Benchmark — Quality	4.51 / 5.0
Multilingual (MiniMax Testset) — Best WER	11 of 24 languages
Multilingual (MiniMax Testset) — Best SIM	17 of 24 languages

On Seed-TTS Eval, S2 achieves the lowest WER among all evaluated models including closed-source systems: Qwen3-TTS (0.77/1.24), MiniMax Speech-02 (0.99/1.90), Seed-TTS (1.12/2.25). On the Audio Turing Test, 0.515 surpasses Seed-TTS (0.417) by 24% and MiniMax-Speech (0.387) by 33%. On EmergentTTS-Eval, S2 achieves particularly strong results in paralinguistics (91.61% win rate), questions (84.41%), and syntactic complexity (83.39%).

Production Streaming via SGLang

Because S2's Dual-AR architecture is structurally isomorphic to standard autoregressive LLMs, it can directly inherit all LLM-native serving optimizations from SGLang with minimal modification — including continuous batching, paged KV cache, CUDA graph replay, and RadixAttention-based prefix caching.

For voice cloning, S2 places reference audio tokens in the system prompt. SGLang's RadixAttention automatically caches these KV states, achieving an average prefix-cache hit rate of 86.4% (over 90% at peak) when the same voice is reused across requests — making reference-audio prefill overhead nearly negligible.

On a single NVIDIA H200 GPU:

Real-Time Factor (RTF): 0.195
Time-to-first-audio: approximately 100 ms
Throughput: 3,000+ acoustic tokens/s while maintaining RTF below 0.5

Why This Release Matters

S2 is released not as a model checkpoint alone, but as a complete system: model weights, fine-tuning code, and a production-ready inference stack.

Two design choices stand out. First, the unified data-and-reward pipeline eliminates a structural problem — distribution mismatch between pre-training and RL — that other TTS systems have not addressed at the architectural level. Second, the structural isomorphism between the Dual-AR architecture and standard LLMs means S2 can leverage the full ecosystem of LLM serving optimizations, rather than requiring custom inference infrastructure.

S2 is available via the project's GitHub repository, SGLang-Omni, HuggingFace, and interactive demo at fish.audio.

Frequently Asked Questions

S2 supports multi-speaker, multi-turn generation natively. By embedding speaker tags and natural-language inline instructions directly in the input, S2 weaves multiple voices together while maintaining consistent timbre, matching prosody, and honoring per-speaker emotional cues — making it suitable for generating full conversations, not just single-voice narration.

Yes. S2 is available via the Fish Audio API at fish.audio. The full model weights, fine-tuning code, and SGLang-based inference engine are also open-sourced on GitHub (github.com/fishaudio/fish-speech) and HuggingFace (huggingface.co/fishaudio/s2-pro) for self-hosting.

S2 accepts free-form natural-language inline tags rather than a fixed predefined set — so you're not limited to a closed vocabulary. Tags like [laugh], [whispers], [super happy], [professional broadcast tone], or [pitch up] can be inserted at any word or phrase position. Because the system is trained on open-ended descriptions, novel tags generalize well beyond the examples seen during training.

S2 is trained on over 10 million hours of audio spanning approximately 80 languages. On the MiniMax multilingual testset — covering 24 languages — S2 achieves the best WER in 11 languages and the best speaker similarity in 17 languages, outperforming both MiniMax and ElevenLabs across the majority of the benchmark: Arabic, Cantonese, Chinese, Czech, Dutch, English, Finnish, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Polish, Portuguese, Romanian, Russian, Spanish, Thai, Turkish, Ukrainian, Vietnamese