Xiaomi open-sources its first native end-to-end speech model
On September 19th, Xiaomi officially open-sourced its first native end-to-end speech model, Xiaomi-MiMo-Audio. Built on an innovative pre-training architecture and hundreds of millions of hours of training data, it is the first model to achieve few-shot generalization in the speech domain through in-context learning (ICL), and clear "emergent" behavior was observed during pre-training. MiMo-Audio significantly outperformed open-source models with the same number of parameters on multiple standard evaluation benchmarks, including general speech understanding and conversation, achieving the best performance among 7B-parameter models. On the standard test set of the audio understanding benchmark MMAU, MiMo-Audio surpassed Google's closed-source speech model, Gemini-2.5-Flash. On the Big Bench Audio S2T task, a benchmark for complex audio reasoning, MiMo-Audio also surpassed OpenAI's closed-source speech model, GPT-4o-Audio-Preview.