AI Native Foundation

1. Are We Ready For An Agent-Native Memory System?

🔑 Keywords: Memory systems, Data management, Qwen/Qwen2.5-Coder-32B-Instruct, Retrieval precision, Cost-performance trade-offs

💡 Category: AI Systems and Tools

🌟 Research Objective:

– The study aims to systematically evaluate memory systems for large language model agents from a data management perspective, decomposing agent memory into core modules to understand performance characteristics and trade-offs.

🛠️ Research Methods:

– An analytical framework is proposed to evaluate 12 memory systems and two baselines across five workloads using 11 datasets, incorporating fine-grained ablation studies on multiple dimensions.

💬 Research Conclusions:

– No single memory architecture dominates across all scenarios; effectiveness depends on alignment with workload bottlenecks. Localized maintenance is found to be more cost-efficient. The study highlights directions for developing agent-native memory systems.

👉 Paper link: https://huggingface.co/papers/2606.24775

2. Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

🔑 Keywords: Real-time Audio-Visual Interaction, Causal Attention, Low-latency, Multimodal Model, Transformer

💡 Category: Multi-Modal Learning

🌟 Research Objective:

– To design Wan-Streamer, a unified and interactive multimodal model that facilitates real-time audio-visual interaction through innovative methods such as causal attention.

🛠️ Research Methods:

– Development of an end-to-end Transformer model integrating visual, audio, and text modalities with causal encoders and decoders, coordinated by block-causal attention for incremental streaming.

💬 Research Conclusions:

– Wan-Streamer successfully achieves low latency in audio-visual interactions, with approximately 550 ms total interaction latency, positioning it as a strong contender for sub-second duplex communication in multimodal applications.

👉 Paper link: https://huggingface.co/papers/2606.25041

3. Improved Large Language Diffusion Models

🔑 Keywords: masked diffusion, bidirectional attention, language models, non-autoregressive, efficiency

💡 Category: Natural Language Processing

🌟 Research Objective:

– To present iLLaDA, a novel 8B masked diffusion language model trained with fully bidirectional attention to improve performance on general, mathematical, and code benchmarks.

🛠️ Research Methods:

– The model employs a masked diffusion objective throughout pre-training and supervised fine-tuning with a large-scale token corpus and introduces variable-length generation and confidence-based scoring for efficiency and evaluation.

💬 Research Conclusions:

– iLLaDA demonstrates significant improvements across various benchmarks such as BBH, ARC-Challenge, MATH, and HumanEval, highlighting the competitiveness of fully bidirectional diffusion training as a potent approach for developing strong language models.

👉 Paper link: https://huggingface.co/papers/2606.25331

4. MVTrack4Gen: Multi-View Point Tracking as Geometric Supervision for 4D Video Generation

🔑 Keywords: Motion-Aware Diffusion Models, Multi-View Point Tracking, Geometric Consistency, Motion Fidelity

💡 Category: Computer Vision

🌟 Research Objective:

– To synthesize a novel-view video from a monocular reference, ensuring geometric consistency and motion fidelity.

🛠️ Research Methods:

– Introduced MVTrack4Gen, leveraging multi-view point tracking as geometric and motion supervision in camera-conditioning-only diffusion models.

💬 Research Conclusions:

– MVTrack4Gen improves motion-aware correspondences and maintains cross-view geometric consistency, achieving state-of-the-art results across benchmarks.

👉 Paper link: https://huggingface.co/papers/2606.26087

5. UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating

🔑 Keywords: UnityShots, Multi-shot audio-video generation, Long-term memory, Short-term memory, Cross-shot coherence

💡 Category: Generative Models

🌟 Research Objective:

– To develop UnityShots, a memory-driven multi-shot audio-video generation system ensuring consistent subject appearance and audio across video cuts.

🛠️ Research Methods:

– Utilized a system with fixed-size long-term and short-term memory slots updated by boundary-conditioned gates, incorporating visual cut probability and beat-tracker signals.

– Injected reference speaker tokens to maintain vocal timbre, leveraging discrete cut-type priors as a control mechanism during inference.

💬 Research Conclusions:

– UnityShots demonstrates superior cross-shot coherence metrics compared to open-source baselines and rivals closed-source systems in multi-shot scenarios.

👉 Paper link: https://huggingface.co/papers/2606.21661

6. Causal-rCM: A Unified Teacher-Forcing and Self-Forcing Open Recipe for Autoregressive Diffusion Distillation in Streaming Video Generation and
Interactive World Models

🔑 Keywords: Autoregressive video diffusion, Causal diffusion transformers, Diffusion distillation, Teacher-forcing, Self-forcing

💡 Category: Generative Models

🌟 Research Objective:

– Extend the diffusion distillation framework to autoregressive video diffusion for real-time streaming generation and interactive world modeling.

🛠️ Research Methods:

– Using teacher-forcing for offline, forward-divergence causal training and self-forcing for on-policy, reverse-divergence refinement.

– Implementing continuous-time consistency models with custom-mask FlashAttention-2 for 10x faster convergence.

💬 Research Conclusions:

– Demonstrated state-of-the-art performance in streaming video generation with synthetic data.

– Achieved a VBench-T2V score of 84.63 with minimal sampling steps using the distilled causal model.

– Applied the framework to Cosmos 3 for enhanced interactive world modeling.

👉 Paper link: https://huggingface.co/papers/2606.25473

7. V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

🔑 Keywords: V-Zero, Fine-grained Visual Reasoning, Contrastive Evidence Gating, On-Policy Distillation, Multimodal Large Language Models

💡 Category: Multi-Modal Learning

🌟 Research Objective:

– The objective of the study is to improve fine-grained visual reasoning without the need for annotated answer labels through the introduction of a new framework called V-Zero.

🛠️ Research Methods:

– The research introduces a label-free framework, V-Zero, which employs contrastive evidence gating. It specifically avoids external answer labels, utilizing a method of pairing a question-relevant regional crop with a negative visual view for token-level distillation.

💬 Research Conclusions:

– V-Zero showcases significant improvements in fine-grained visual reasoning while maintaining strong generalization. It is notably more than 5 times faster than previous supervised fine-tuning methods and over 10 times faster compared to reinforcement learning baselines.

👉 Paper link: https://huggingface.co/papers/2606.25319

8. Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence

🔑 Keywords: Multimodal Code Intelligence, Visual Perception, Executable Programs, Verification-Centered Research

💡 Category: Multi-Modal Learning

🌟 Research Objective:

– The survey aims to explore systems that generate and reason with code based on visual inputs, while identifying verification-centered research directions in various domains.

🛠️ Research Methods:

– The paper categorizes approaches across four main domains: Graphical User Interface, Scientific Visualization, Structured Graphics, and Frontier Tasks and Frameworks, examining how code serves different roles across these tasks.

💬 Research Conclusions:

– Suggests four verification-centered research directions like multi-signal validation and cross-task transfer testing to improve evidence-grounded executable systems and enhance the connection between visual perception and executable programs.

👉 Paper link: https://huggingface.co/papers/2606.15932

9. ShutterMuse: Capture-Time Photography Guidance with MLLMs

🔑 Keywords: Photography Assistance, Multimodal Models, Capture-Time Guidance, Composition Guidance, Pose Recommendations

💡 Category: Multi-Modal Learning

🌟 Research Objective:

– The research focuses on developing a new benchmark and dataset to enhance photography assistance, particularly in providing capture-time composition and pose recommendations.

🛠️ Research Methods:

– The study introduces CaptureGuide-Bench with tasks for both photographer-side composition and subject-side pose recommendation, and constructs CaptureGuide-Dataset with extensive samples and annotations, alongside developing a unified multimodal model, ShutterMuse, using supervised and reinforcement fine-tuning.

💬 Research Conclusions:

– ShutterMuse exhibits superior performance in photographer-side tasks and competitive pose recommendations with lower inference costs, showcasing the potential of multimodal large language models as interactive photography assistants.

👉 Paper link: https://huggingface.co/papers/2606.25763

10. DomainShuttle: Freeform Open Domain Subject-driven Text-to-video Generation

🔑 Keywords: open domain S2V, DomainShuttle, domain-aware modeling, high fidelity, generative flexibility

💡 Category: Generative Models

🌟 Research Objective:

– Develop a method, DomainShuttle, to enable high fidelity and flexibility in open domain subject-driven text-to-video generation across in-domain and cross-domain scenarios.

🛠️ Research Methods:

– Introduce Domain-MoT to decouple videos and reference features, employing domain-aware AdaLN for domain-specific modeling of reference images.

– Implement the Video-Reference DualRoPE scheme for precise subject-level spatial modeling, using separate RoPE spaces.

– Utilize Cross-Pair Consistent Loss for extracting intrinsic subject features unaffected by irrelevant ones.

💬 Research Conclusions:

– DomainShuttle outperforms existing methods, achieving significant performance improvements in subject fidelity and generative flexibility in diverse application scenarios.

👉 Paper link: https://huggingface.co/papers/2606.26058

AI Native Daily Paper Digest – 20260625

1. Are We Ready For An Agent-Native Memory System?

2. Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

3. Improved Large Language Diffusion Models

4. MVTrack4Gen: Multi-View Point Tracking as Geometric Supervision for 4D Video Generation

5. UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating

6. Causal-rCM: A Unified Teacher-Forcing and Self-Forcing Open Recipe for Autoregressive Diffusion Distillation in Streaming Video Generation and
Interactive World Models

7. V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

8. Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence

9. ShutterMuse: Capture-Time Photography Guidance with MLLMs

10. DomainShuttle: Freeform Open Domain Subject-driven Text-to-video Generation

About

Insights

Case Study

Legal

1. Are We Ready For An Agent-Native Memory System?

2. Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

3. Improved Large Language Diffusion Models

4. MVTrack4Gen: Multi-View Point Tracking as Geometric Supervision for 4D Video Generation

5. UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating

6. Causal-rCM: A Unified Teacher-Forcing and Self-Forcing Open Recipe for Autoregressive Diffusion Distillation in Streaming Video Generation and Interactive World Models

7. V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

8. Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence

9. ShutterMuse: Capture-Time Photography Guidance with MLLMs

10. DomainShuttle: Freeform Open Domain Subject-driven Text-to-video Generation

6. Causal-rCM: A Unified Teacher-Forcing and Self-Forcing Open Recipe for Autoregressive Diffusion Distillation in Streaming Video Generation and
Interactive World Models