China AI Native Industry Insights – 20250702 – Alibaba | Zhipu AI | more

Explore the expansion of Qwen-TTS with Chinese dialect support, Zhipu AI’s launch of the GLM-4.1V-Thinking Model and new agent platform, and ThinkSound’s breakthrough AI system that generates audio from multiple modalities. Discover more in Today’s China AI Native Industry Insights.

1. Qwen-TTS Expands with Chinese Dialect Support

🔑 Key Details:
– Latest Update: Qwen-TTS now supports three Chinese dialects (Beijing, Shanghai, and Sichuan) alongside standard Mandarin and English.
– Voice Options: Seven bilingual voices are available, including three dialect-specific ones: Dylan (Beijing), Jada (Shanghai), and Sunny (Sichuan).
– Advanced Technology: Trained on more than three million hours of audio, with automatic adjustment of rhythm, tone, and emotion.
– Performance: Reaches human-level synthesis quality on the SeedTTS-Eval benchmark.

💡 How It Helps:
– Content Creators: Greater linguistic diversity for more authentic and targeted regional content development.
– Localization Teams: Enhanced ability to produce regionally relevant voice content that resonates with specific Chinese audiences.
– Developers: Simple API integration for dialect-specific speech synthesis, with code snippets provided (see the sketch after this list).
– Media Producers: Access to natural-sounding regional Chinese dialects for more authentic audio productions and narratives.
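
For developers, here is a minimal sketch of dialect-specific synthesis through Alibaba Cloud's DashScope SDK. The model name, voice ID, and response fields follow the announcement's examples, but treat them as assumptions that may differ across SDK versions:

```python
# Minimal sketch: dialect-specific speech synthesis with Qwen-TTS via the
# DashScope SDK. Model name, voice ID, and response fields are assumptions
# drawn from the announcement; check your SDK version's documentation.
import os

import dashscope
import requests

response = dashscope.audio.qwen_tts.SpeechSynthesizer.call(
    model="qwen-tts-latest",                 # assumed model identifier
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # credential from your environment
    text="您猜怎么着？今儿个我看到一个特别有意思的事儿。",  # Beijing-dialect sample line
    voice="Dylan",                           # Beijing-dialect voice
)

# The service returns a URL to the synthesized clip; download and save it.
audio_url = response.output.audio["url"]
with open("beijing_dialect.wav", "wb") as f:
    f.write(requests.get(audio_url).content)
```

Swapping the voice to "Jada" or "Sunny" should produce Shanghai- or Sichuan-dialect speech from the same call.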

🌟 Why It Matters:
This expansion represents a significant advancement in linguistic inclusivity for AI voice technology, moving beyond standard Mandarin to embrace China’s rich dialectal diversity. By offering authentic regional speech patterns, Qwen-TTS opens new possibilities for more personalized and culturally resonant applications across entertainment, education, and customer service sectors. The improvement in speech synthesis quality further narrows the gap between artificial and human voices.

Original Chinese article: https://mp.weixin.qq.com/s/-VDOJrDgVzC6JI4CVTHe4w

English translation via free online service: https://translate.google.com/translate?hl=en&sl=zh-CN&tl=en&u=https%3A%2F%2Fmp.weixin.qq.com%2Fs%2F-VDOJrDgVzC6JI4CVTHe4w

Video Credit: Qwen (@Alibaba_Qwen on X)

2. Zhipu AI Launches GLM-4.1V-Thinking Model and New Agent Platform

🔑 Key Details:
– New VLM: GLM-4.1V-Thinking, a 9B-parameter vision-language model that outperforms much larger models on 23 of 28 benchmarks.
– Agent Platform: Released “Application Space”, an AI Agent aggregation platform with plugins for enterprise users.
– Developer Support: Launched “Agent Explorers Program” with billions in funding for AI Agent startups.
– Open Source: Released both GLM-4.1V-9B-Base and GLM-4.1V-9B-Thinking on Hugging Face and ModelScope (see the loading sketch below).
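
For researchers picking up the open-sourced checkpoints, here is a minimal loading sketch with Hugging Face transformers. The repo ID follows the release announcement; the Auto classes and chat-template call are assumptions that depend on your installed transformers version, so consult the model card for the official quickstart:

```python
# Minimal sketch: running the open-sourced GLM-4.1V-9B-Thinking checkpoint
# with Hugging Face transformers. The Auto classes and chat-template call
# are assumptions that depend on the installed transformers version.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "THUDM/GLM-4.1V-9B-Thinking"  # repo ID as published at release

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # 9B weights fit on a single high-memory GPU
    device_map="auto",
)

# Chat-style multimodal prompt: one image plus a question about it.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},  # placeholder image
        {"type": "text", "text": "What trend does this chart show?"},
    ],
}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# The "Thinking" variant emits its reasoning before the final answer.
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```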

💡 How It Helps:
– AI Researchers: Access to an open-source 9B-parameter model that delivers 72B-level performance.
– Enterprise Users: Low-barrier access to mature Agent capabilities without building in-house AI teams.
– Developers: One-stop development toolchain with flexible application combination mechanisms.
– Entrepreneurs: Financial support through dedicated funding for AI-native startups.

🌟 Why It Matters:
Zhipu AI’s dual approach of advancing model capabilities while building ecosystem infrastructure marks a significant shift in China’s AI landscape. By pushing reasoning ability into a smaller model while simultaneously launching a marketplace for AI applications, Zhipu is working to make advanced AI something enterprises can adopt directly, not just a set of benchmark numbers.

Original Chinese article: https://mp.weixin.qq.com/s/h-rOdWC-lRZF5Fft11vb9A

English translation via free online service: https://translate.google.com/translate?hl=en&sl=zh-CN&tl=en&u=https%3A%2F%2Fmp.weixin.qq.com%2Fs%2Fh-rOdWC-lRZF5Fft11vb9A

Video Credit: The original article

3. ThinkSound: Breakthrough AI System Generates Audio from Multiple Modalities

🔑 Key Details:
– Chain-of-Thought Reasoning: ThinkSound uses multimodal large language model (MLLM) reasoning to generate audio from video, text, or audio inputs.
– State-of-the-Art Performance: Achieves top results on Video-to-Audio benchmarks with a unified framework.
– Interactive Editing: Supports object-centric refinement and targeted audio modifications through clicks or language instructions.
– Python Implementation: Released with inference scripts, plus Hugging Face and ModelScope demos (an illustrative invocation follows).
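
For anyone trying the released inference scripts, here is a purely illustrative sketch of a video-to-audio run. The script name and every flag below are hypothetical placeholders, not the repository's documented CLI; the README at the GitHub link below has the real entry point:

```python
# Purely illustrative: invoking ThinkSound's inference scripts for
# video-to-audio generation. The script name and flags are hypothetical
# stand-ins; see the repository README for the actual interface.
import subprocess

subprocess.run(
    [
        "python", "infer.py",                # hypothetical script name
        "--video", "scene.mp4",              # video to generate sound for
        "--caption", "footsteps on gravel",  # optional text guidance
        "--output", "scene_foley.wav",       # where to write the audio
    ],
    check=True,  # raise if the inference run fails
)
```

The same pipeline also accepts text- or audio-conditioned inputs, and the Hugging Face and ModelScope demos offer a no-install way to test it.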

💡 How It Helps:
– Content Creators: Simplifies audio generation workflow with intuitive object selection and natural language editing.
– AI Researchers: Access to SOTA multimodal audio generation architecture with flow matching techniques.
– Media Producers: Enables precise Foley sound creation and refinement for specific visual elements.

🌟 Why It Matters:
ThinkSound represents a significant advance in AI-generated audio by making the process both more intuitive and compositional. By decomposing audio generation into reasoning-guided stages, it bridges the gap between human creative intent and machine execution. This research-focused tool could transform media production workflows while establishing a new approach to multimodal generation through reasoning.

Original article: https://github.com/liuhuadai/ThinkSound

Video Credit: Alibaba WeChat Channel

 

That’s all for today’s China AI Native Industry Insights. Join us at the AI Native Foundation Membership Dashboard for the latest insights on AI Native, or follow our LinkedIn account at AI Native Foundation and our X (Twitter) account at AINativeF.
