AI Native Daily Paper Digest – 20260326

1. SpectralSplats: Robust Differentiable Tracking via Spectral Moment Supervision

🔑 Keywords: 3D Gaussian Splatting, SpectralSplats, vanishing gradient, frequency domain, Novel View Synthesis

💡 Category: Computer Vision

🌟 Research Objective:

– To address the vanishing gradient issue in 3D Gaussian Splatting tracking by transforming the optimization objective to the frequency domain using spectral moments and implementing a frequency annealing schedule.

🛠️ Research Methods:

– The methodology involves supervising the rendered image using global complex sinusoidal features, known as Spectral Moments, and crafting a Frequency Annealing schedule to transition from global convexity to spatial alignment seamlessly.
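As a rough illustration of why a frequency-domain objective can help, here is a one-dimensional toy rather than the paper's implementation. It assumes a spectral moment is a global sum of the signal against a complex sinusoid, and that annealing simply raises the highest supervised frequency over time. The names `spectral_moment`, `annealed_max_freq`, and `blob`, and the linear schedule, are hypothetical.

```python
import cmath
import math


def spectral_moment(signal, freq):
    """Global complex sinusoidal feature: sum_x signal[x] * exp(-2*pi*i*freq*x/N)."""
    n = len(signal)
    return sum(v * cmath.exp(-2j * math.pi * freq * x / n)
               for x, v in enumerate(signal))


def spectral_loss(rendered, target, max_freq):
    """Squared distance between spectral moments for frequencies 1..max_freq."""
    return sum(abs(spectral_moment(rendered, k) - spectral_moment(target, k)) ** 2
               for k in range(1, max_freq + 1))


def pixel_loss(rendered, target):
    """Ordinary spatial L2 loss, for comparison."""
    return sum((r - t) ** 2 for r, t in zip(rendered, target))


def blob(center, n=64, width=2):
    """Toy 1D 'rendering': a box of ones around `center` (hypothetical helper)."""
    return [1.0 if abs(x - center) <= width else 0.0 for x in range(n)]


def annealed_max_freq(step, total_steps, top_freq):
    """Assumed linear schedule: supervise only frequency 1 at first, up to top_freq at the end."""
    return 1 + (top_freq - 1) * step // max(total_steps - 1, 1)


target = blob(40)
far, near = blob(10), blob(25)
# Neither candidate overlaps the target, so the spatial loss cannot tell them apart.
print(pixel_loss(far, target), pixel_loss(near, target))
# The lowest-frequency spectral loss still ranks the nearer candidate as better.
print(spectral_loss(far, target, 1) > spectral_loss(near, target, 1))
```

In this toy, the two misaligned candidates have identical pixel loss, so a spatial objective gives no direction to move in, while the low-frequency spectral loss still decreases as the blob approaches the target. Higher frequencies would then be added to sharpen the alignment.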

💬 Research Conclusions:

– The proposed SpectralSplats framework successfully provides a global basin of attraction, eliminating the vanishing gradient issue, and can seamlessly replace spatial losses across various deformation parameterizations, ensuring effective tracking even with severely misaligned initial conditions.

👉 Paper link: https://huggingface.co/papers/2603.24036

2. LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis

🔑 Keywords: Neural networks, Novel View Synthesis, 3D reconstruction, Encoder-decoder architecture, Real-time rendering

💡 Category: Computer Vision

🌟 Research Objective:

– To demonstrate the improved performance of neural networks on 3D tasks like Novel View Synthesis by incorporating 3D inductive biases through encoder-decoder architectures.

🛠️ Research Methods:

– Introduction of LagerNVS, an encoder-decoder neural network that utilizes pre-trained 3D-aware latent features, paired with a lightweight decoder and trained end-to-end with photometric losses.

💬 Research Conclusions:

– LagerNVS achieves state-of-the-art deterministic feed-forward performance in Novel View Synthesis, renders in real time, generalizes to diverse data, and can be extended for generative extrapolation.

👉 Paper link: https://huggingface.co/papers/2603.20176

3. GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

🔑 Keywords: Multimodal LLMs, GameplayQA, Cognitive Complexity, Agentic Perception, World Modeling

💡 Category: Multi-Modal Learning

🌟 Research Objective:

– The study aims to evaluate the perception and reasoning capabilities of multimodal large language models in 3D environments using a new framework, GameplayQA.

🛠️ Research Methods:

– The study introduces dense annotations of multiplayer 3D gameplay videos at a rate of 1.22 labels per second, structuring data around a triadic system for Self, Other Agents, and the World. It refines 2.4K diagnostic QA pairs organized into cognitive complexity levels.

💬 Research Conclusions:

– Evaluations using GameplayQA indicate a significant gap between current MLLMs and human performance, particularly in temporal grounding, agent-role attribution, and decision density management in gameplay scenarios.

👉 Paper link: https://huggingface.co/papers/2603.24329

4. 6Bit-Diffusion: Inference-Time Mixed-Precision Quantization for Video Diffusion Models

🔑 Keywords: Mixed-Precision Quantization, Video Diffusion Transformers, Temporal Delta Cache, NVFP4, INT8

💡 Category: Generative Models

🌟 Research Objective:

– To reduce memory usage and computational cost in video diffusion transformers while maintaining generation quality.

🛠️ Research Methods:

– Proposing an inference-time NVFP4/INT8 Mixed-Precision Quantization framework that uses a lightweight predictor for dynamic precision allocation.

– Introducing Temporal Delta Cache to skip computations for temporally invariant blocks.
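The caching idea can be sketched as a wrapper that reuses a block's previous output when its input has barely changed between steps. This is a minimal sketch under stated assumptions, not the paper's mechanism: the relative-change test, the `threshold` value, and the names `DeltaCachedBlock` and `relative_change` are all hypothetical.

```python
def relative_change(current, previous, eps=1e-8):
    """L1 change between two inputs, relative to the magnitude of the previous one."""
    diff = sum(abs(c - p) for c, p in zip(current, previous))
    scale = sum(abs(p) for p in previous) + eps
    return diff / scale


class DeltaCachedBlock:
    """Wraps a block function and skips recomputation for near-identical inputs."""

    def __init__(self, block_fn, threshold=0.05):
        self.block_fn = block_fn
        self.threshold = threshold  # assumed tolerance for "temporally invariant"
        self.last_input = None
        self.last_output = None
        self.skipped = 0

    def __call__(self, x):
        # Compare against the input of the last real computation, so that
        # many small changes cannot accumulate unnoticed.
        if (self.last_input is not None
                and relative_change(x, self.last_input) < self.threshold):
            self.skipped += 1
            return self.last_output
        self.last_input = list(x)
        self.last_output = self.block_fn(x)
        return self.last_output


block = DeltaCachedBlock(lambda x: [2 * v for v in x])
out1 = block([1.0, 2.0])    # computed
out2 = block([1.01, 2.0])   # nearly unchanged input, cached output reused
out3 = block([2.0, 2.0])    # large change, recomputed
print(out1, out2, out3, block.skipped)
```

The trade-off is the usual one for caching: a looser threshold skips more work but lets the reused output drift further from what the block would have produced.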

💬 Research Conclusions:

– Achieved 1.92 times end-to-end acceleration and 3.32 times memory reduction, establishing a new baseline for efficient inference in Video DiTs.

👉 Paper link: https://huggingface.co/papers/2603.18742

5. 4DGS360: 360° Gaussian Reconstruction of Dynamic Objects from a Single Video

🔑 Keywords: 360° dynamic object reconstruction, diffusion-free, 3D-native initialization, 3D tracker, AnchorTAP3D

💡 Category: Computer Vision

🌟 Research Objective:

– The objective is to develop a diffusion-free framework, 4DGS360, for 360° dynamic object reconstruction from casual monocular video, addressing the challenge of maintaining geometric consistency, particularly in occluded regions.

🛠️ Research Methods:

– Employ a 3D-native initialization approach that reduces geometric ambiguity in occluded regions.

– Utilize the 3D tracker AnchorTAP3D to establish reinforced 3D point trajectories by using confident 2D track points as anchors.

💬 Research Conclusions:

– 4DGS360 achieves state-of-the-art performance on the iPhone360, iPhone, and DAVIS datasets, demonstrating both qualitative and quantitative improvements in 360° 4D reconstructions.

– Introduced a new benchmark, iPhone360, which allows for comprehensive 360° evaluations previously unavailable with existing datasets.

👉 Paper link: https://huggingface.co/papers/2603.21618

6. UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience

🔑 Keywords: mobile GUI agent, Multimodal Large Language Models, Rejection Fine-Tuning, Group Relative Self-Distillation

💡 Category: Robotics and Autonomous Systems

🌟 Research Objective:

– To improve the efficiency and performance of mobile GUI automation tasks through a novel two-stage self-evolving approach.

🛠️ Research Methods:

– Utilization of Rejection Fine-Tuning (RFT) for co-evolution of data and models in an autonomous loop.

– Implementation of Group Relative Self-Distillation (GRSD) to identify critical fork points and construct dense supervision from successful trajectories.
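One way to picture a "fork point" is the first step at which a failed rollout departs from a successful rollout of the same task. The sketch below is an assumption-laden illustration, not the paper's GRSD procedure: it compares plain action strings (a real agent would likely compare screen states as well), and the names `find_fork_point` and `build_correction` are hypothetical.

```python
def find_fork_point(success_actions, failed_actions):
    """Index of the first step where the failed rollout diverges, or None."""
    for i, (good, bad) in enumerate(zip(success_actions, failed_actions)):
        if good != bad:
            return i
    return None  # no divergence within the shared length


def build_correction(success_actions, failed_actions):
    """Turn a (success, failure) pair into one supervised example at the fork."""
    i = find_fork_point(success_actions, failed_actions)
    if i is None:
        return None
    return {
        "step": i,
        "history": success_actions[:i],   # shared prefix before the fork
        "rejected": failed_actions[i],    # what the failed rollout did
        "target": success_actions[i],     # what the successful rollout did
    }


success = ["open_settings", "tap_wifi", "toggle_on"]
failed = ["open_settings", "tap_bluetooth", "toggle_on"]
print(build_correction(success, failed))
```

Supervising at the fork gives a training signal at the exact step where the failure began, which is denser than a single pass/fail label for the whole trajectory.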

💬 Research Conclusions:

– The UI-Voyager agent achieves an 81.0% Pass@1 success rate, surpassing recent baselines and human-level performance.

– Ablation and case studies confirm the effectiveness of GRSD in enhancing mobile GUI automation.

👉 Paper link: https://huggingface.co/papers/2603.24533

7. Can LLM Agents Be CFOs? A Benchmark for Resource Allocation in Dynamic Enterprise Environments

🔑 Keywords: Large Language Models, Resource Allocation, Uncertainty, Enterprise Simulator, AI Systems

💡 Category: AI Systems and Tools

🌟 Research Objective:

– To evaluate the ability of large language models to perform long-horizon enterprise resource allocation under conditions of uncertainty using the EnterpriseArena benchmark.

🛠️ Research Methods:

– Introduced EnterpriseArena, a benchmark that simulates CFO-style decision-making over 132 months using firm-level financial data, anonymized business documents, macroeconomic and industry signals, and expert-validated operating rules.

– Conducted experiments on eleven advanced large language models to assess their performance in a partially observable environment.

💬 Research Conclusions:

– Current LLM agents face significant challenges in long-horizon resource allocation under uncertainty, as evidenced by only 16% of experiment runs surviving the full horizon.

– Larger models do not consistently outperform smaller ones, highlighting a capability gap in managing resource allocation over extended periods.

👉 Paper link: https://huggingface.co/papers/2603.23638

8. T-MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search

🔑 Keywords: T-MAP, Adversarial Prompts, Trajectory-aware, Safety Guardrails, Autonomous LLM Agents

💡 Category: Natural Language Processing

🌟 Research Objective:

– The objective is to uncover agent-specific vulnerabilities in large language models that arise through multi-step tool execution, specifically within the Model Context Protocol (MCP) ecosystem.

🛠️ Research Methods:

– The paper employs T-MAP, a trajectory-aware evolutionary search method, to guide the discovery of adversarial prompts. This method uses execution trajectories to automate the generation of attacks that can bypass safety guardrails and accomplish harmful objectives through tool interactions.

💬 Research Conclusions:

– Empirical evaluations demonstrate that T-MAP significantly outperforms baseline methods in attack realization rates and remains effective against cutting-edge models such as GPT-5.2, revealing underexplored vulnerabilities in autonomous LLM agents.

👉 Paper link: https://huggingface.co/papers/2603.22341

9. UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation

🔑 Keywords: 3D Scenes, Multimodal Large Language Model, Joint Reasoning, Task Decomposition, Active Spatial-Temporal Grounding

💡 Category: Computer Vision

🌟 Research Objective:

– To enable 3D scene functionality segmentation using a unified framework treating multimodal large language models as active observers.

🛠️ Research Methods:

– Introduces UniFunc3D, a training-free framework for semantic, temporal, and spatial reasoning to ground task decomposition in visual evidence using adaptive frame selection.

💬 Research Conclusions:

– UniFunc3D achieves state-of-the-art performance on SceneFun3D without task-specific training, outperforming existing methods with a 59.9% mIoU improvement.

👉 Paper link: https://huggingface.co/papers/2603.23478

10. OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning

🔑 Keywords: OmniWeaving, AI-generated video, multimodal composition, open-source, intelligent agent

💡 Category: Generative Models

🌟 Research Objective:

– The primary objective of this research is to introduce OmniWeaving, a model designed to unify multimodal inputs and complex reasoning capabilities for omni-capable video generation.

🛠️ Research Methods:

– Large-scale pretraining with a diverse dataset and the development of a novel benchmark called IntelligentVBench to assess unified video generation models.

💬 Research Conclusions:

– OmniWeaving achieves state-of-the-art performance among open-source unified models in intelligent video generation, demonstrating its capacity to seamlessly integrate various inputs and act as an intelligent agent.

👉 Paper link: https://huggingface.co/papers/2603.24458

11.

👉 Paper link:

12. Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning

🔑 Keywords: TRACE, Multimodal Large Language Models, 3D spatial reasoning, text-based representations, Egocentric Video

💡 Category: Multi-Modal Learning

🌟 Research Objective:

– The study aims to enhance Multimodal Large Language Models (MLLMs) to perform effective 3D spatial reasoning through the introduction of the TRACE method. This method focuses on generating text-based representations from video to facilitate better spatial understanding.

🛠️ Research Methods:

– The TRACE approach introduces a novel prompting method that leverages text-based representations of 3D environments to improve spatial question answering. It encodes meta-context, camera trajectories, and detailed object entities to enable structured reasoning over egocentric videos.
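To make the idea of a text-based scene representation concrete, the sketch below serializes the three kinds of information named above (meta-context, camera trajectory, object entities) into a prompt. The field names, coordinate convention, and text layout are hypothetical and are not the paper's schema. In the paper's setting the model derives this representation from video, whereas here it is filled in by hand.

```python
from dataclasses import dataclass


@dataclass
class ObjectEntity:
    name: str
    position: tuple          # (x, y, z), assumed room coordinates in metres
    first_seen_frame: int


@dataclass
class SceneTrace:
    meta_context: str
    camera_trajectory: list  # [(frame, x, y, z), ...]
    objects: list            # [ObjectEntity, ...]

    def to_text(self):
        """Render the structured scene as plain text a language model can read."""
        lines = ["[Meta] " + self.meta_context, "[Camera trajectory]"]
        for frame, x, y, z in self.camera_trajectory:
            lines.append(f"  frame {frame}: ({x:.1f}, {y:.1f}, {z:.1f})")
        lines.append("[Objects]")
        for obj in self.objects:
            x, y, z = obj.position
            lines.append(f"  {obj.name} at ({x:.1f}, {y:.1f}, {z:.1f}), "
                         f"first seen at frame {obj.first_seen_frame}")
        return "\n".join(lines)


def build_prompt(trace, question):
    """Prepend the textual scene description to the spatial question."""
    return trace.to_text() + "\n[Question] " + question


trace = SceneTrace(
    meta_context="indoor living room, egocentric walk-through",
    camera_trajectory=[(0, 0.0, 0.0, 1.5), (30, 1.0, 0.0, 1.5)],
    objects=[ObjectEntity("sofa", (2.0, 1.0, 0.0), 12)],
)
print(build_prompt(trace, "How far is the sofa from the starting position?"))
```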

💬 Research Conclusions:

– The TRACE method yields notable improvements in spatial reasoning performance across a range of MLLMs and architectures. Extensive experiments on VSI-Bench and OST-Bench show consistent gains over previous prompting strategies, supported by ablation studies and detailed analyses of 3D spatial reasoning bottlenecks.

👉 Paper link: https://huggingface.co/papers/2603.23404

13. StreamingClaw Technical Report

🔑 Keywords: Streaming Video Understanding, Embodied Intelligence, Real-time Reasoning, Proactive Interaction, Multimodal Long-term Memory

💡 Category: Robotics and Autonomous Systems

🌟 Research Objective:

– The objective is to propose StreamingClaw, a unified framework to overcome fragmentation in current agents and enable real-time streaming video understanding and embodied intelligence.

🛠️ Research Methods:

– The framework integrates five core capabilities including real-time streaming reasoning, multimodal long-term storage, and a perception-decision-action closed loop. It is also compatible with the OpenClaw framework for enhanced support.

💬 Research Conclusions:

– StreamingClaw successfully combines online real-time reasoning, multimodal long-term memory, and proactive interaction, facilitating direct control over the physical world and supporting practical deployment in real-world environments.

👉 Paper link: https://huggingface.co/papers/2603.22120

14. Qworld: Question-Specific Evaluation Criteria for LLMs

🔑 Keywords: large language models, open-ended questions, evaluation criteria, recursive expansion tree, HealthBench

💡 Category: AI in Healthcare

🌟 Research Objective:

– To develop a method called Qworld that generates question-specific evaluation criteria for better assessment of large language model capabilities on health-related questions.

🛠️ Research Methods:

– Utilized a recursive expansion tree to decompose questions into scenarios and fine-grained criteria, ensuring structured hierarchical and horizontal expansion.

💬 Research Conclusions:

– Qworld covers 89% of expert-authored criteria and creates 79% novel criteria validated by experts, offering greater insight and granularity.

– It reveals capability differences in large language models across dimensions like long-term impact, equity, error handling, and interdisciplinary reasoning, which are not distinguished by traditional rubrics.

👉 Paper link: https://huggingface.co/papers/2603.23522

15. The Pulse of Motion: Measuring Physical Frame Rate from Visual Dynamics

🔑 Keywords: Generative video models, Visual Chronometer, Physical Frames Per Second, temporal ambiguity, chronometric hallucination

💡 Category: Generative Models

🌟 Research Objective:

– Address temporal ambiguity in generative video models by developing a Visual Chronometer to estimate real-world frame rates from visual dynamics.

🛠️ Research Methods:

– Introduced a method to predict Physical Frames Per Second (PhyFPS) using controlled temporal resampling.

– Established benchmarks PhyFPS-Bench-Real and PhyFPS-Bench-Gen to quantify temporal scale issues.
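Controlled temporal resampling can be pictured as follows: keeping every `stride`-th frame of a clip recorded at a known rate yields a clip that covers the same real-world time with fewer frames, so its physical frame rate is the source rate divided by the stride. The sketch below illustrates how such labeled training pairs might be produced. It assumes simple frame dropping, and `resample_clip` and `make_training_pairs` are hypothetical names, not the paper's pipeline.

```python
def resample_clip(frames, source_fps, stride):
    """Keep every `stride`-th frame and return the clip with its physical FPS label."""
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return frames[::stride], source_fps / stride


def make_training_pairs(frames, source_fps, strides=(1, 2, 4)):
    """Build (clip, physical_fps) pairs from one source video at several strides."""
    return [resample_clip(frames, source_fps, s) for s in strides]


frames = list(range(8))  # stand-in for decoded video frames
for clip, phys_fps in make_training_pairs(frames, 120.0):
    print(len(clip), phys_fps)
```

A predictor trained on such pairs sees the same motion at several known temporal scales, which is what lets it estimate the scale of an unlabeled or generated clip.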

💬 Research Conclusions:

– Current state-of-the-art video generators are misaligned with real-world temporal scales, leading to unnatural perceived motion.

– Applying PhyFPS corrections enhances the naturalness of AI-generated videos.

👉 Paper link: https://huggingface.co/papers/2603.14375

16. Understanding the Challenges in Iterative Generative Optimization with LLMs

🔑 Keywords: Generative optimization, Large language models, Self-improving agents, Execution feedback, Learning loops

💡 Category: Generative Models

🌟 Research Objective:

– To investigate the challenges in generative optimization using large language models, focusing on implicit design decisions impacting success across different applications.

🛠️ Research Methods:

– The study examines three critical factors through case studies on MLAgentBench, Atari, and BigBench Extra Hard: the starting artifact, the credit horizon for execution traces, and how trials are batched into learning evidence.

💬 Research Conclusions:

– Implicit design choices in setting up learning loops can significantly affect generative optimization success. The paper highlights the need for practical guidance that makes these decisions explicit, and identifies the lack of a universal method as a major barrier to adoption.

👉 Paper link: https://huggingface.co/papers/2603.23994

17.

👉 Paper link:

18. PLDR-LLMs Reason At Self-Organized Criticality

🔑 Keywords: PLDR-LLMs, self-organized criticality, reasoning capabilities, second-order phase transitions, metastable steady state

💡 Category: Foundations of AI

🌟 Research Objective:

– The study aims to explore the reasoning capabilities of PLDR-LLMs pretrained at self-organized criticality, highlighting their ability to generalize reasoning without traditional benchmark evaluations.

🛠️ Research Methods:

– The research analyzes the characteristics of deductive outputs, comparing them to second-order phase transitions, and examines correlation lengths and metastable steady states to understand the model’s learning and representation at criticality.

💬 Research Conclusions:

– It concludes that PLDR-LLMs can generalize reasoning capabilities, which can be quantified solely by deducing global model parameter values at a metastable steady state, reducing reliance on curated benchmark datasets.

👉 Paper link: https://huggingface.co/papers/2603.23539

19. CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

🔑 Keywords: Computer-use agents, expert video demonstrations, continuous video, CUA-Suite, multimodal corpus

💡 Category: AI Systems and Tools

🌟 Research Objective:

– The research introduces CUA-Suite, aimed at advancing desktop automation through a large-scale ecosystem of expert video demonstrations and annotations.

🛠️ Research Methods:

– Utilization of VideoCUA to provide approximately 10,000 human-demonstrated tasks with continuous recordings and comprehensive annotations, forming a rich dataset for professional desktop applications.

💬 Research Conclusions:

– Current action models struggle with desktop applications, as indicated by a ~60% task failure rate, emphasizing the importance of continuous video streams and the potential for CUA-Suite to support diverse research avenues.

👉 Paper link: https://huggingface.co/papers/2603.24440

20. Toward Physically Consistent Driving Video World Models under Challenging Trajectories

🔑 Keywords: PhyGenesis, world models, autonomous driving simulation, physical consistency, CARLA simulator

💡 Category: Generative Models

🌟 Research Objective:

– The research aims to develop PhyGenesis, a world model that generates high-fidelity driving videos with strong physical consistency by transforming invalid trajectories into plausible conditions.

🛠️ Research Methods:

– The approach comprises two components: a physical condition generator and a physics-enhanced video generator, trained using a large-scale, physics-rich dataset including both real-world and simulated driving scenarios.

💬 Research Conclusions:

– PhyGenesis outperforms existing state-of-the-art methods, especially in generating videos from challenging trajectories with superior physical consistency.

👉 Paper link: https://huggingface.co/papers/2603.24506

21. EVA: Efficient Reinforcement Learning for End-to-End Video Agent

🔑 Keywords: EVA, video understanding, adaptive reasoning, reinforcement learning, multimodal large language models

💡 Category: Reinforcement Learning

🌟 Research Objective:

– Develop an efficient reinforcement learning framework named EVA for enhancing video understanding with adaptive reasoning.

🛠️ Research Methods:

– Implement iterative summary-plan-action-reflection reasoning enabling EVA to autonomously decide on video analysis aspects.

– Design a three-stage learning pipeline involving supervised fine-tuning, Kahneman-Tversky Optimization, and Generalized Reward Policy Optimization.

💬 Research Conclusions:

– EVA improves on traditional MLLM baselines by 6-12% and on previous adaptive agent methods by 1-3% across video understanding benchmarks.

👉 Paper link: https://huggingface.co/papers/2603.22918

22. CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare

🔑 Keywords: Long-horizon automation, Multimodal agent, Healthcare, AI-generated summary, State-of-the-art performance

💡 Category: AI in Healthcare

🌟 Research Objective:

– The research aims to address the unexplored long-horizon automation in healthcare by introducing CareFlow, a benchmark designed for complex medical environments.

🛠️ Research Methods:

– The study employs a multimodal agent framework called CarePilot, utilizing actor-critic methods with dual-memory mechanisms to enhance task execution and reasoning.

💬 Research Conclusions:

– CarePilot significantly outperforms existing multimodal baselines, achieving state-of-the-art results and improving execution performance by approximately 15.26% over closed-source and 3.38% over open-source models on the CareFlow benchmark.

👉 Paper link: https://huggingface.co/papers/2603.24157

23. When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning

🔑 Keywords: Multi-modal reasoning, Self-evolution training, Unsupervised learning, Policy optimization, Mathematical reasoning

💡 Category: Multi-Modal Learning

🌟 Research Objective:

– To improve performance in multimodal reasoning tasks without using costly annotated data or teacher-model distillation by utilizing a self-evolution training framework.

🛠️ Research Methods:

– Implementation of an unsupervised self-evolution training framework using self-consistency signals and group-relative policy optimization.

– The approach employs multiple reasoning trajectories and bounded Judge modulation for reweighting quality trajectories, aiming for robust policy updates.
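The two ingredients named above can be sketched in a few lines: a self-consistency signal (agreement with the majority answer among sampled trajectories stands in for a label) and group-relative advantages (each reward normalized against its own group). This is a simplified sketch under stated assumptions. It uses exact string matching on final answers, omits the bounded Judge modulation entirely, and the function names are hypothetical.

```python
import math
from collections import Counter


def self_consistency_rewards(answers):
    """Reward 1.0 for trajectories whose final answer matches the majority vote."""
    majority, _ = Counter(answers).most_common(1)[0]
    return majority, [1.0 if a == majority else 0.0 for a in answers]


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within the group: (reward - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# Final answers from four sampled reasoning trajectories for one question.
answers = ["4", "4", "5", "4"]
majority, rewards = self_consistency_rewards(answers)
advantages = group_relative_advantages(rewards)
print(majority, rewards, [round(a, 3) for a in advantages])
```

Because no ground-truth label is involved, a confidently wrong majority would be reinforced, which is presumably the kind of failure an additional judging signal is meant to temper.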

💬 Research Conclusions:

– The proposed method consistently enhances reasoning performance and generalization across five mathematical reasoning benchmarks, demonstrating a scalable path for self-evolving multimodal models.

👉 Paper link: https://huggingface.co/papers/2603.21289

24. Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

🔑 Keywords: Self-distillation, Mathematical Reasoning, Uncertainty Expression, Out-of-distribution, Reasoning Behavior

💡 Category: Natural Language Processing

🌟 Research Objective:

– Investigate the impact of self-distillation on mathematical reasoning performance in large language models, specifically focusing on the expression of uncertainty and its effect on out-of-distribution tasks.

🛠️ Research Methods:

– Conduct controlled experiments varying the conditioning context richness and task coverage to observe the effects on uncertainty expression and performance across different models like Qwen3-8B, obtaining insights into changes in reasoning behavior.

💬 Research Conclusions:

– Self-distillation can degrade mathematical reasoning by suppressing uncertainty expression, which hurts out-of-distribution performance, with observed drops of up to 40%. The findings emphasize the importance of optimizing reasoning behavior beyond merely reinforcing correct answers.

👉 Paper link: https://huggingface.co/papers/2603.24472

Copyright 2026 AI Native Foundation©. All rights reserved.