MOSS-Audio


MOSS-Audio is an open-source audio understanding model from MOSI.AI, the OpenMOSS team, and Shanghai Innovation Institute. It performs unified modeling over complex real-world audio, supporting speech understanding, environmental sound understanding, music understanding, audio captioning, time-aware QA, and complex reasoning. In this release, we provide four models: MOSS-Audio-4B-Instruct, MOSS-Audio-4B-Thinking, MOSS-Audio-8B-Instruct, and MOSS-Audio-8B-Thinking. The Instruct variants are optimized for direct instruction following, while the Thinking variants provide stronger chain-of-thought reasoning capabilities.

News

  • 2026.4.20: We have added the MOSS-Audio fine-tuning code and documentation. See finetune/FINETUNE.md for LoRA and full-parameter training examples.
  • 2026.4.13: 🎉🎉🎉 We have released MOSS-Audio. Blog and paper coming soon!

Introduction

Understanding audio requires more than simply transcribing words — it demands the ability to perceive acoustic cues, recognize speakers and emotions, interpret environmental sounds, reason over temporal context, and handle complex multi-step inference. MOSS-Audio is built to unify these capabilities within a single model.

  • Speech & Content Understanding: Accurately recognizes and transcribes spoken content from audio inputs, producing clean and well-structured text outputs. Supports both word-level and sentence-level timestamp alignment.
  • Speaker, Emotion & Event Analysis: Identifies speaker characteristics, analyzes emotional states based on tone, timbre, and context, and detects key acoustic events within the audio.
  • Scene & Sound Cue Extraction: Extracts meaningful cues from background sounds, environmental noise, music, and non-speech signals to infer scene context and atmosphere.
  • Music Understanding: Analyzes musical style, emotional progression, instrumentation, and salient acoustic features in music segments.
  • Audio Question Answering & Summarization: Answers questions and generates summaries about speech, podcasts, meetings, interviews, and environmental recordings, helping users efficiently extract key information.
  • Time-Aware QA: Supports time-aware questions, including word-level and sentence-level timestamp ASR.
  • Complex Reasoning: Performs multi-hop reasoning over audio content, powered by chain-of-thought training and reinforcement learning.

Model Architecture

MOSS-Audio follows a modular design comprising three components: an audio encoder, a modality adapter, and a large language model. Raw audio is first encoded by MOSS-Audio-Encoder into continuous temporal representations at 12.5 Hz, which are then projected into the language model's embedding space through the adapter and finally consumed by the LLM for auto-regressive text generation.

Rather than relying on off-the-shelf audio frontends, we train a dedicated encoder from scratch to obtain more robust speech representations, tighter temporal alignment, and better extensibility across acoustic domains.
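The 12.5 Hz frame rate directly determines how many embeddings the LLM must consume per clip. A back-of-envelope sketch (the frame rate comes from the description above; everything else here is illustrative):

```python
# Sequence length implied by the 12.5 Hz encoder frame rate described above.

ENCODER_FRAME_RATE_HZ = 12.5  # continuous representations per second

def num_audio_frames(duration_s: float) -> int:
    """How many encoder frames the LLM consumes for a clip of this length."""
    return int(duration_s * ENCODER_FRAME_RATE_HZ)

# A one-minute clip becomes 750 frame embeddings (before any time markers).
print(num_audio_frames(60.0))  # -> 750
```

This low frame rate keeps long recordings within practical LLM context lengths while still preserving temporal resolution.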

DeepStack Cross-Layer Feature Injection

Using only the encoder's top-layer features tends to lose low-level prosody, transient events, and local time-frequency structure. To address this, we design a DeepStack-inspired cross-layer injection module between the encoder and the language model: in addition to the encoder's final-layer output, features from earlier and intermediate layers are selected, independently projected, and injected into the language model's early layers, preserving multi-granularity information from low-level acoustic details to high-level semantic abstractions.

This design is especially well-suited for audio understanding tasks, as it helps retain rhythm, timbre, transients, and background structure — information that a single high-level representation cannot fully capture.
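The data flow can be illustrated with a toy sketch. This is not the actual implementation: the tapped layer indices and the scalar "projections" below are placeholders; in the real model the projections are learned linear layers feeding specific early transformer blocks.

```python
# Toy illustration of DeepStack-style cross-layer injection: features from
# selected intermediate encoder layers are independently projected and added
# into the LLM's early hidden states. Layer indices and weights are placeholders.

def project(features, weight):
    """Stand-in for a learned per-layer linear projection."""
    return [weight * x for x in features]

def inject_deepstack(encoder_layers, llm_hidden, tap_layers=(4, 8), weights=(0.5, 0.5)):
    """Sum projected features from tapped encoder layers into early LLM states."""
    out = list(llm_hidden)
    for layer_idx, w in zip(tap_layers, weights):
        for t, p in enumerate(project(encoder_layers[layer_idx], w)):
            out[t] += p  # element-wise addition at each time step
    return out
```

The key property shown here is that each tapped layer keeps its own projection, so low-level and high-level features reach the LLM without being mixed into a single top-layer representation first.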

Time-Aware Representation

Time is a critical dimension in audio understanding. To enhance explicit temporal awareness, we adopt a time-marker insertion strategy during pretraining: explicit time tokens are inserted between audio frame representations at fixed time intervals to indicate temporal positions. This design enables the model to learn "what happened when" within a unified text generation framework, naturally supporting timestamp ASR, event localization, time-based QA, and long-audio retrospection.
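The insertion strategy can be sketched as follows; the marker string format and the 2-second interval are assumptions for illustration, since the actual tokens are defined by the model's tokenizer.

```python
# Illustrative sketch of the time-marker insertion strategy described above.

FRAME_RATE_HZ = 12.5  # encoder frame rate from the architecture description

def insert_time_markers(frames, interval_s=2.0):
    """Interleave explicit time tokens into a frame sequence at fixed intervals."""
    frames_per_marker = int(FRAME_RATE_HZ * interval_s)  # 25 frames per 2 s
    out = []
    for i, frame in enumerate(frames):
        if i % frames_per_marker == 0:
            out.append(f"<|time:{i / FRAME_RATE_HZ:.1f}s|>")
        out.append(frame)
    return out

# 50 frames (4 seconds of audio) gain markers at 0.0 s and 2.0 s.
print(insert_time_markers([f"f{i}" for i in range(50)])[:3])  # -> ['<|time:0.0s|>', 'f0', 'f1']
```

Because the markers live in the same token stream as the audio frames, timestamp prediction reduces to ordinary next-token generation.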

Released Models

| Model | Audio Encoder | LLM Backbone | Total Size | Hugging Face | ModelScope |
|---|---|---|---|---|---|
| MOSS-Audio-4B-Instruct | MOSS-Audio-Encoder | Qwen3-4B | ~4.6B | Hugging Face | ModelScope |
| MOSS-Audio-4B-Thinking | MOSS-Audio-Encoder | Qwen3-4B | ~4.6B | Hugging Face | ModelScope |
| MOSS-Audio-8B-Instruct | MOSS-Audio-Encoder | Qwen3-8B | ~8.6B | Hugging Face | ModelScope |
| MOSS-Audio-8B-Thinking | MOSS-Audio-Encoder | Qwen3-8B | ~8.6B | Hugging Face | ModelScope |

More model families, sizes, and variants will be released in the future. Stay tuned!

Evaluation

We evaluate MOSS-Audio on a comprehensive set of audio understanding benchmarks. Key results:

  • General Audio Understanding: MOSS-Audio-8B-Thinking achieves an average accuracy of 71.08, with 77.33 on MMAU, 64.92 on MMAU-Pro, 66.53 on MMAR, and 75.52 on MMSU, outperforming all open-source models.
  • Speech Captioning: MOSS-Audio-Instruct variants lead across 11 out of 13 fine-grained speech description dimensions, with MOSS-Audio-8B-Instruct achieving the best overall average score (3.7252).
  • ASR: On a diverse ASR benchmark suite spanning 12 evaluation dimensions, MOSS-Audio achieves the lowest overall CER (11.30), with particular strength in health-condition, code-switching, dialect, singing, and non-speech scenarios.
  • Timestamp ASR: MOSS-Audio-8B-Instruct achieves 35.77 AAS on AISHELL-1 and 131.61 AAS on LibriSpeech (lower is better), dramatically outperforming Qwen3-Omni (833.66 on AISHELL-1) and Gemini-3.1-Pro (708.24) in timestamp ASR accuracy.
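For reference, the CER metric behind the ASR tables below can be sketched as follows. This is the standard textbook formulation, not the project's evaluation script:

```python
# CER (character error rate): Levenshtein edit distance between hypothesis
# and reference, divided by reference length, reported as a percentage.

def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate as a percentage."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)

print(round(cer("hello world", "helo world"), 2))  # one deletion over 11 chars -> 9.09
```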

General Audio Understanding (Accuracy↑)

| Model | Model Size | MMAU | MMAU-Pro | MMAR | MMSU | Avg |
|---|---|---|---|---|---|---|
| **Open Source (small)** | | | | | | |
| Kimi-Audio | 7B | 72.41 | 56.58 | 60.82 | 54.74 | 61.14 |
| Qwen2.5-Omni | 7B | 65.60 | 52.20 | 56.70 | 61.32 | 58.96 |
| Audio Flamingo 3 | 7B | 61.23 | 51.70 | 57.96 | 60.04 | 57.73 |
| MiMo-Audio-7B | 7B | 74.90 | 53.35 | 61.70 | 61.94 | 62.97 |
| MiniCPM-o-4.5 | 9B | 70.97 | 39.65 | 55.75 | 60.96 | 56.83 |
| MOSS-Audio-4B-Instruct | 4B | 75.79 | 58.16 | 59.68 | 59.68 | 64.04 |
| MOSS-Audio-4B-Thinking | 4B | 77.64 | 60.75 | 63.91 | 71.20 | 68.37 |
| MOSS-Audio-8B-Instruct | 8B | 77.03 | 57.48 | 64.42 | 66.36 | 66.32 |
| MOSS-Audio-8B-Thinking | 8B | 77.33 | 64.92 | 66.53 | 75.52 | 71.08 |
| **Open Source (large)** | | | | | | |
| Qwen3-Omni-30B-A3B-Instruct | 30B | 75.00 | 61.22 | 66.40 | 69.00 | 67.91 |
| Step-Audio-R1.1 | 33B | 72.18 | 60.80 | 68.75 | 64.18 | 66.48 |
| Step-Audio-R1 | 33B | 78.67 | 59.68 | 69.15 | 75.18 | 70.67 |
| **Closed Source** | | | | | | |
| GPT4o-Audio | - | 65.66 | 52.30 | 59.78 | 58.76 | 59.13 |
| Gemini-3-Pro | - | 80.15 | 68.28 | 81.73 | 81.28 | 77.86 |
| Gemini-3.1-Pro | - | 81.10 | 73.47 | 83.70 | 81.30 | 79.89 |

Speech Captioning (LLM-as-a-Judge Score↑)

| Model | Gender | Age | Accent | Pitch | Volume | Speed | Texture | Clarity | Fluency | Emotion | Tone | Personality | Summary | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-Omni-30B-A3B-Instruct | 4.436 | 3.936 | 4.356 | 3.590 | 3.682 | 3.614 | 3.093 | 3.521 | 3.531 | 3.328 | 3.224 | 3.292 | 3.179 | 3.5986 |
| Qwen3-Omni-30B-A3B-Thinking | 4.419 | 4.026 | 4.327 | 3.610 | 3.577 | 3.610 | 3.179 | 3.403 | 3.526 | 3.232 | 3.154 | 3.197 | 3.107 | 3.5667 |
| Gemini-3-Pro | 4.191 | 3.835 | 4.181 | 3.392 | 3.254 | 3.320 | 2.998 | 3.347 | 3.524 | 3.055 | 2.997 | 3.023 | 2.775 | 3.3763 |
| Gemini-3.1-Pro | 4.436 | 3.936 | 4.356 | 3.590 | 3.682 | 3.614 | 3.093 | 3.521 | 3.531 | 3.328 | 3.224 | 3.292 | 3.179 | 3.5986 |
| MOSS-Audio-4B-Instruct | 4.697 | 3.980 | 4.497 | 3.628 | 3.722 | 3.564 | 3.407 | 3.841 | 3.744 | 3.311 | 3.282 | 3.305 | 3.259 | 3.7105 |
| MOSS-Audio-8B-Instruct | 4.683 | 3.979 | 4.572 | 3.682 | 3.709 | 3.638 | 3.403 | 3.869 | 3.747 | 3.314 | 3.253 | 3.272 | 3.307 | 3.7252 |

ASR (CER↓)

| Model | Overall | Health Condition | Dialect | Singing | Non-Speech Vocalizations | Code-Switching | Acoustic Env. (Clean) | Acoustic Env. (Noisy) | Whisper | Far-/Near-Field | Multi-Speaker | Age | Semantic Content |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Paraformer-Large | 15.77 | 22.18 | 43.45 | 32.34 | 4.95 | 12.65 | 3.11 | 4.67 | 5.02 | 17.46 | 20.33 | 14.96 | 7.14 |
| GLM-ASR-Nano | 17.29 | 24.49 | 22.39 | 51.95 | 4.65 | 11.88 | 3.68 | 5.02 | 4.94 | 27.51 | 28.02 | 17.19 | 7.32 |
| Fun-ASR-Nano | 12.04 | 21.99 | 7.80 | 19.35 | 4.76 | 11.23 | 2.98 | 3.46 | 3.78 | 18.38 | 19.82 | 14.95 | 6.08 |
| SenseVoice-Small | 14.50 | 24.04 | 8.89 | 23.79 | 4.92 | 13.90 | 4.13 | 4.93 | 5.57 | 26.66 | 24.06 | 17.63 | 7.55 |
| Kimi-Audio-7B-Instruct | 14.12 | 21.11 | 29.34 | 21.76 | 4.68 | 16.38 | 2.20 | 2.15 | 2.66 | 21.02 | 20.61 | 16.74 | 6.12 |
| Qwen2.5-Omni-3B | 15.26 | 24.65 | 33.87 | 24.24 | 5.54 | 11.66 | 2.76 | 3.56 | 4.32 | 22.15 | 22.91 | 15.17 | 7.24 |
| Qwen2.5-Omni-7B | 15.05 | 23.85 | 31.91 | 22.69 | 4.56 | 12.97 | 2.52 | 3.16 | 3.64 | 25.38 | 21.01 | 16.13 | 6.78 |
| Qwen3-Omni-30B-A3B-Instruct | 11.39 | 20.73 | 15.63 | 16.01 | 4.73 | 11.30 | 2.23 | 2.47 | 1.90 | 17.08 | 18.15 | 11.46 | 5.74 |
| MOSS-Audio-4B-Instruct | 11.58 | 21.11 | 11.84 | 10.79 | 4.01 | 10.11 | 3.11 | 3.72 | 3.29 | 18.48 | 20.33 | 15.09 | 8.15 |
| MOSS-Audio-8B-Instruct | 11.30 | 19.18 | 8.76 | 9.81 | 4.31 | 10.18 | 2.70 | 3.20 | 2.75 | 24.04 | 24.36 | 15.26 | 7.69 |
Detailed ASR Results

Per-dataset CER. Cells with multiple values follow the subset order given in the column header; datasets map to the same evaluation dimensions as the summary table above.

| Model | AISHELL-1 (test) | AISHELL-2 (Android/IOS/Mic) | THCHS-30 (test) | MAGICDATA-READ (test) | AISHELL6-Whisper (normal/whisper) | AliMeeting (Test_Ali_far/Test_Ali_near) | AISHELL-4 (test) | SeniorTalk (sentence) | ChildMandarin (test) | AISHELL-6A (mild/moderate/severe/StutteringSpeech) | AISHELL_6B (LRDWWS/Uncontrol) | WenetSpeech (test-meeting) | Fleurs (cmn_hans_cn) | CS-Dialogue (test) | TALCS (test) | ASCEND (test) | KeSpeech (test) | WSYue-ASR-eval (short) | MIR-1K (test) | openc-pop (test) | MNV_17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Paraformer-Large | 1.98 | 3.28/3.21/3.00 | 4.07 | 4.67 | 1.11/8.92 | 25.64/9.27 | 20.33 | 17.31 | 12.60 | 6.98/9.30/13.34/10.74 | 47.59/45.08 | 7.88 | 6.40 | 10.64 | 10.77 | 16.55 | 11.48 | 75.42 | 57.70 | 6.98 | 4.95 |
| GLM-ASR-Nano | 2.89 | 3.75/3.73/3.78 | 4.23 | 5.02 | 0.83/9.06 | 40.27/14.76 | 28.02 | 20.33 | 14.06 | 8.74/12.11/14.38/12.29 | 50.34/49.09 | 9.70 | 4.94 | 11.06 | 11.07 | 13.50 | 9.72 | 35.07 | 95.87 | 8.03 | 4.65 |
| Fun-ASR-Nano | 2.16 | 3.04/2.99/3.07 | 3.65 | 3.46 | 0.81/6.76 | 27.21/9.55 | 19.82 | 16.96 | 12.94 | 6.60/8.81/12.98/10.30 | 47.42/45.84 | 7.39 | 4.76 | 10.47 | 8.09 | 15.13 | 7.43 | 8.17 | 35.85 | 2.84 | 4.76 |
| SenseVoice-Small | 3.23 | 4.16/4.02/3.96 | 5.26 | 4.93 | 1.25/9.88 | 37.01/16.31 | 24.06 | 21.07 | 14.18 | 7.62/9.85/14.39/11.47 | 52.92/47.97 | 8.35 | 6.75 | 12.81 | 10.52 | 18.38 | 10.45 | 7.34 | 39.51 | 8.07 | 4.92 |
| Kimi-Audio-7B-Instruct | 0.79 | 2.91/3.03/2.88 | 1.39 | 2.15 | 0.69/4.63 | 28.22/13.82 | 20.61 | 19.70 | 13.79 | 7.00/9.34/12.56/10.75 | 44.44/42.57 | 7.15 | 5.10 | 14.56 | 12.74 | 21.83 | 5.51 | 53.17 | 38.35 | 5.17 | 4.68 |
| Qwen2.5-Omni-3B | 1.51 | 3.10/2.94/2.93 | 3.32 | 3.56 | 0.82/7.82 | 32.14/12.16 | 22.91 | 17.38 | 12.96 | 6.87/10.55/14.57/11.33 | 54.54/50.03 | 9.04 | 5.45 | 10.78 | 10.94 | 13.25 | 7.67 | 60.06 | 45.00 | 3.47 | 5.54 |
| Qwen2.5-Omni-7B | 1.16 | 2.88/2.77/2.73 | 3.06 | 3.16 | 0.71/6.57 | 32.03/18.73 | 21.01 | 19.96 | 12.29 | 7.27/10.94/12.92/10.53 | 51.99/49.45 | 8.43 | 5.13 | 14.02 | 10.46 | 14.42 | 6.40 | 57.43 | 42.62 | 2.75 | 4.56 |
| Qwen3-Omni-30B-A3B-Instruct | 0.95 | 2.70/2.72/2.57 | 2.21 | 2.47 | 0.59/3.22 | 25.72/8.44 | 18.15 | 14.13 | 8.79 | 6.20/8.88/11.59/10.25 | 45.80/41.65 | 6.64 | 4.84 | 12.94 | 8.33 | 12.64 | 5.87 | 25.39 | 30.81 | 1.21 | 4.73 |
| MOSS-Audio-4B-Instruct | 2.26 | 3.22/3.20/3.33 | 3.53 | 3.72 | 0.73/5.86 | 27.27/9.68 | 20.33 | 16.93 | 13.25 | 6.36/9.77/12.68/10.28 | 43.35/44.25 | 8.17 | 8.13 | 9.14 | 8.37 | 12.83 | 14.65 | 9.04 | 18.47 | 3.10 | 4.01 |
| MOSS-Audio-8B-Instruct | 1.82 | 2.97/2.95/2.91 | 2.82 | 3.20 | 0.69/4.80 | 36.82/11.25 | 24.36 | 17.42 | 13.10 | 5.84/8.94/11.52/9.72 | 39.76/39.27 | 7.86 | 7.52 | 9.07 | 8.22 | 13.26 | 9.18 | 8.33 | 17.24 | 2.39 | 4.31 |

Timestamp ASR (AAS↓)

| Model | AISHELL-1 (zh) | LibriSpeech (en) |
|---|---|---|
| Qwen3-Omni-30B-A3B-Instruct | 833.66 | 646.95 |
| Gemini-3.1-Pro | 708.24 | 871.19 |
| MOSS-Audio-4B-Instruct | 76.96 | 358.13 |
| MOSS-Audio-8B-Instruct | 35.77 | 131.61 |

Quickstart

Environment Setup

We recommend Python 3.12 with a clean Conda environment. The commands below are enough for local inference.

Recommended setup

git clone https://github.com/OpenMOSS/MOSS-Audio.git
cd MOSS-Audio

conda create -n moss-audio python=3.12 -y
conda activate moss-audio

conda install -c conda-forge "ffmpeg=7" -y
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"

Optional: FlashAttention 2

If your GPU supports FlashAttention 2, you can replace the last install command with:

pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,flash-attn]"

Basic Usage

Download the model first:

hf download OpenMOSS-Team/MOSS-Audio-4B-Instruct --local-dir ./weights/MOSS-Audio-4B-Instruct
hf download OpenMOSS-Team/MOSS-Audio-4B-Thinking --local-dir ./weights/MOSS-Audio-4B-Thinking
hf download OpenMOSS-Team/MOSS-Audio-8B-Instruct --local-dir ./weights/MOSS-Audio-8B-Instruct
hf download OpenMOSS-Team/MOSS-Audio-8B-Thinking --local-dir ./weights/MOSS-Audio-8B-Thinking

Then edit MODEL_PATH / AUDIO_PATH in infer.py as needed, and run:

python infer.py

The default prompt in infer.py is "Describe this audio." You can directly edit that line if you want to try transcription, audio QA, or speech captioning.

Fine-tuning

We now provide an official fine-tuning script in finetune/finetune.py, with full instructions in finetune/FINETUNE.md.

Install the extra dependencies needed for training:

pip install librosa peft

Minimal example for LoRA fine-tuning:

accelerate launch finetune/finetune.py \
    --model_dir ./weights/MOSS-Audio-4B-Instruct \
    --data_path train.jsonl \
    --output_dir ./output/lora \
    --use_lora \
    --bf16

The training data should be a JSONL file containing audio-text conversations. For data format, supported arguments, multi-GPU examples, and full-parameter fine-tuning, see finetune/FINETUNE.md.
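As a purely hypothetical illustration of the "one JSON conversation per line" shape, here is how such a record might be written. The field names below are NOT confirmed by this README; finetune/FINETUNE.md defines the actual schema:

```python
# Hypothetical JSONL training record. Field names ("audio", "conversations",
# "role", "content") are illustrative assumptions, not the documented format.
import json

record = {
    "audio": "data/sample_0001.wav",  # placeholder audio path
    "conversations": [                # placeholder conversation schema
        {"role": "user", "content": "What is the speaker's emotion?"},
        {"role": "assistant", "content": "The speaker sounds calm and friendly."},
    ],
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```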

Gradio App

Start the Gradio demo with:

python app.py

SGLang Serving

If you want to serve MOSS-Audio with SGLang, see the full guide in moss_audio_usage_guide.md.

The shortest setup is:

git clone -b moss-audio https://github.com/OpenMOSS/sglang.git
cd sglang
pip install -e "python[all]"
pip install nvidia-cudnn-cu12==9.16.0.29
cd ..
sglang serve --model-path ./weights/MOSS-Audio --trust-remote-code

If you use the default torch==2.9.1+cu128 runtime, installing nvidia-cudnn-cu12==9.16.0.29 is recommended before starting sglang serve.

More Information

LICENSE

Models in MOSS-Audio are licensed under the Apache License 2.0.

Citation

@misc{mossaudio2026,
      title={MOSS-Audio Technical Report},
      author={OpenMOSS Team},
      year={2026},
      howpublished={\url{https://github.com/OpenMOSS/MOSS-Audio}},
      note={GitHub repository}
}

