China AI Native Industry Insights - 20260601 - Alibaba | StepFun

Explore Alibaba’s Qwen-VLA, Bailian’s CLI tool, and StepFun’s Step 3.7 Flash. Discover more in today’s China AI Native Industry Insights.

1. Alibaba introduces Qwen-VLA unified Vision-Language-Action model for embodied intelligence across 11 robot platforms

Alibaba’s Tongyi Lab announced Qwen-VLA, a unified embodied foundation model that extends Qwen3.5-4B with a 1.15B DiT decoder to combine manipulation, navigation, and trajectory prediction in a single framework. The model uses embodiment-aware prompts to operate across 11 robot embodiments, including single-arm, dual-arm, and humanoid platforms, without requiring task-specific architecture modifications or separate policy heads. Reported results include 97.9% on LIBERO, 73.7% on Simpler-WidowX, and a 76.9% average out-of-distribution success rate in real-world experiments.

Video Credit: @Ali_TongyiLab on X

2. Alibaba Cloud Bailian releases open-source CLI tool for AI Agent integration with 150+ models

Alibaba Cloud announced the open-source release of Bailian CLI, a command-line tool designed for AI Agent integration. The tool enables agents to access over 150 models, more than 10 applications, and capabilities including knowledge bases, memory, and web search through a single command. It natively supports mainstream AI Agent frameworks such as Claude Code, Qoder, OpenClaw, and Hermes Agent. The source is available on GitHub in the modelstudioai/cli repository, giving developers unified access to Alibaba Cloud Bailian platform capabilities through structured commands and multi-modal task processing.

Video Credit: The original article

3. StepFun releases and open-sources Step 3.7 Flash, a sparse MoE model optimized for production-grade AI agents

StepFun has officially released and open-sourced Step 3.7 Flash, a new-generation Flash model designed for production-grade AI agents. The model uses a sparse MoE architecture with 196B parameters plus 1.8B parameters for vision, activates only 11B of them, and generates up to 400 tokens per second. It provides native multimodal understanding, enhanced web and visual search, reliable tool calling and orchestration across agent workflows, and compatibility with mainstream agent frameworks including Claude Code, KiloCode, OpenClaw, and Hermes Agent. The model is available via API platforms, GitHub, Hugging Face, and ModelScope, and supports both cloud and local deployment with optimized quantization formats.
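To put the quoted figures in perspective, here is a rough back-of-the-envelope sketch. It assumes the 11B active parameters are counted against the combined 196B + 1.8B total, and it treats 400 tokens per second as a sustained peak rate; the announcement confirms neither reading. The function names are illustrative only.

```python
# Figures quoted in the announcement, in billions of parameters.
TOTAL_LM_PARAMS_B = 196.0
VISION_PARAMS_B = 1.8
ACTIVE_PARAMS_B = 11.0
PEAK_TOKENS_PER_SEC = 400.0  # "up to" figure, so this is a best case


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of parameters that are activated (assumed: out of the combined total)."""
    return active_b / total_b


def min_generation_seconds(num_tokens: int,
                           tokens_per_sec: float = PEAK_TOKENS_PER_SEC) -> float:
    """Lower bound on generation time if the peak rate were sustained."""
    return num_tokens / tokens_per_sec


total_b = TOTAL_LM_PARAMS_B + VISION_PARAMS_B
print(f"Active share: {active_fraction(ACTIVE_PARAMS_B, total_b):.1%}")
print(f"2,000 tokens take at least {min_generation_seconds(2000):.1f} s")
```

Under these assumptions, roughly one parameter in eighteen is activated, which is what lets a model of this size generate at Flash-class speeds.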

Video Credit: The original article

That’s all for today’s China AI Native Industry Insights. Join us at AI Native Foundation Membership Dashboard for the latest insights on AI Native, or follow our LinkedIn account at AI Native Foundation and our Twitter account at AINativeF.
