AI Native Daily Paper Digest – 20251226

1. Latent Implicit Visual Reasoning
🔑 Keywords: Large Multimodal Models, visual reasoning, task-agnostic, supervised learning, state-of-the-art
💡 Category: Multi-Modal Learning
🌟 Research Objective:
– Address the limitations of Large Multimodal Models on predominantly visual reasoning tasks by introducing a task-agnostic mechanism.
🛠️ Research Methods:
– Train LMMs with visual reasoning tokens, without explicit supervision, to enable task-adaptive image re-encoding.
🔬 Research Conclusions:
– The approach outperforms direct fine-tuning, achieves state-of-the-art results across a range of vision-centric tasks, and generalizes well under multi-task instruction tuning.
👉 Paper link: https://huggingface.co/papers/2512.21218

2. Spatia: Video Generation with Updatable Spatial Memory
🔑 Keywords: Spatia, Spatial Memory, 3D Scene Point Cloud, Visual SLAM, Video Generation
💡 Category: Generative Models
🌟 Research Objective:
– Introduce Spatia, a spatial-memory-aware framework that maintains long-term spatial and temporal consistency in video generation.
🛠️ Research Methods:
– Maintains a 3D scene point cloud as persistent spatial memory, updated via visual SLAM with dynamic-static disentanglement to keep the map consistent.
🔬 Research Conclusions:
– Enables realistic video generation with explicit camera control and 3D-aware interactive editing, providing a geometrically grounded approach to scalable video production.
👉 Paper link: https://huggingface.co/papers/2512.15716
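The memory-update idea in the summary can be sketched in a few lines. The actual Spatia pipeline (SLAM tracking, dynamic-static disentanglement) is far more involved; the voxel-grid deduplication below is an illustrative assumption, not the paper's method.

```python
import numpy as np

def update_memory(memory: np.ndarray, new_points: np.ndarray,
                  voxel: float = 0.05) -> np.ndarray:
    """Merge newly reconstructed static points into the persistent map.

    Points are deduplicated on a coarse voxel grid, so the memory grows
    only where the camera observes previously unseen geometry.
    """
    combined = np.vstack([memory, new_points]) if memory.size else new_points
    keys = np.floor(combined / voxel).astype(np.int64)
    # keep the first point that lands in each voxel cell
    _, idx = np.unique(keys, axis=0, return_index=True)
    return combined[np.sort(idx)]
```

In a full system, each generated clip would contribute new static points while dynamic content is filtered out before the merge.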

3. How Much 3D Do Video Foundation Models Encode?
🔑 Keywords: Video Foundation Models, 3D awareness, state-of-the-art video generation models, 3D objects and scenes, scalable 3D models
💡 Category: Computer Vision
🌟 Research Objective:
– Quantify the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast amounts of video data.
🛠️ Research Methods:
– Propose a model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their frozen features via shallow read-outs.
🔬 Research Conclusions:
– State-of-the-art video generation models show strong 3D understanding of objects and scenes, surpassing expert models trained specifically for 3D tasks, despite never being trained on 3D data.
👉 Paper link: https://huggingface.co/papers/2512.19949
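The "shallow read-out" idea can be illustrated with a linear probe: fit a small regressor from frozen backbone features to a 3D property and score its accuracy. The ridge-regression form and the depth target below are assumptions for the sketch; the paper's probes and target properties may differ.

```python
import numpy as np

def shallow_readout(features: np.ndarray, targets: np.ndarray,
                    lam: float = 1e-3) -> np.ndarray:
    """Fit a linear probe (ridge regression) from frozen backbone features
    to a 3D property (e.g. per-patch depth). Only the probe is trained;
    its accuracy on held-out data measures the backbone's 3D awareness."""
    d = features.shape[1]
    return np.linalg.solve(features.T @ features + lam * np.eye(d),
                           features.T @ targets)
```

Probe error on held-out frames, not the probe weights themselves, would be the reported 3D-awareness score.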

4. GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training
🔑 Keywords: Multi-turn Reinforcement Learning, Vision-Language Models, GTR-Turbo
💡 Category: Reinforcement Learning
🌟 Research Objective:
– Develop GTR-Turbo, an efficient reinforcement learning method that improves upon Guided Thought Reinforcement (GTR) without relying on expensive teacher models.
🛠️ Research Methods:
– Merge weights from checkpoints produced during RL training and use the merged model as a free teacher, guiding training via supervised fine-tuning or soft logit distillation.
🔬 Research Conclusions:
– GTR-Turbo improves baseline model accuracy by 10-30% while cutting training time by 50% and compute cost by 60% relative to GTR, all without privileged models.
👉 Paper link: https://huggingface.co/papers/2512.13043
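The "merged checkpoint as free teacher" recipe can be sketched as weight averaging plus a soft-logit distillation loss. The uniform averaging scheme and the KL direction/temperature below are assumptions; the summary does not specify them.

```python
import numpy as np

def merge_checkpoints(checkpoints: list[dict]) -> dict:
    """Average parameters across checkpoints saved during RL training.
    The merged model is the 'free teacher' -- no external privileged model."""
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in checkpoints[0]}

def soft_distill_loss(student_logits: np.ndarray, teacher_logits: np.ndarray,
                      tau: float = 2.0) -> float:
    """KL divergence from the teacher's temperature-softened distribution
    to the student's, used to guide the student during training."""
    def softmax(z):
        z = z / tau
        e = np.exp(z - z.max())
        return e / e.sum()
    p, q = softmax(teacher_logits), softmax(student_logits)
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

The loss is zero when student and teacher agree, so the teacher signal fades as the student catches up to the merged checkpoint.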

6. VA-π: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
🔑 Keywords: VA-π, Autoregressive (AR) visual generation, tokenizers, intrinsic reward, reinforcement-based alignment
💡 Category: Generative Models
🌟 Research Objective:
– Improve the image quality and performance of autoregressive visual generators with VA-π, without retraining tokenizers or relying on external rewards.
🛠️ Research Methods:
– A post-training framework that directly optimizes AR models with a pixel-space objective, using a reinforcement-based alignment strategy driven by an intrinsic reward.
🔬 Research Conclusions:
– VA-π substantially improves image generation quality, reducing FID and improving IS on LlamaGen-XXL, and yields notable gains on text-to-image tasks.
👉 Paper link: https://huggingface.co/papers/2512.19680
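One plausible instantiation of a pixel-space intrinsic reward with reinforcement-based alignment is sketched below. Both pieces are assumptions for illustration: the summary does not spell out VA-π's exact reward or update rule, so the round-trip reconstruction reward and the REINFORCE-with-baseline surrogate here stand in for them.

```python
import numpy as np

def pixel_intrinsic_reward(decoded: np.ndarray, roundtrip: np.ndarray) -> float:
    """Intrinsic reward in pixel space: negative reconstruction error between
    an image decoded from sampled tokens and its tokenize-decode round trip.
    No external reward model is consulted."""
    return -float(np.mean((decoded - roundtrip) ** 2))

def reinforce_surrogate(log_probs: np.ndarray, rewards: np.ndarray) -> np.ndarray:
    """Per-sample REINFORCE surrogate loss: negative log-probability weighted
    by the advantage (reward minus batch-mean baseline)."""
    advantages = rewards - rewards.mean()
    return -advantages * log_probs
```

Samples whose pixels survive the round trip better than the batch average get their log-probabilities pushed up; worse samples get pushed down.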

7. Schoenfeld’s Anatomy of Mathematical Reasoning by Language Models
🔑 Keywords: Large language models, Reasoning traces, ThinkARM, Schoenfeld’s Episode Theory
💡 Category: Knowledge Representation and Reasoning
🌟 Research Objective:
– Identify and analyze the cognitive structure of reasoning in large language models, beyond surface-level statistics, via the ThinkARM framework.
🛠️ Research Methods:
– Use Schoenfeld’s Episode Theory as a lens to abstract reasoning traces into functional steps such as Analysis, Explore, Implement, and Verify, and apply this abstraction to mathematical problem solving across various models.
🔬 Research Conclusions:
– The abstraction reveals structural differences between reasoning and non-reasoning models, highlights exploration as a critical branching step for correctness, and shows that efficiency-oriented methods selectively suppress evaluative feedback rather than uniformly shortening responses.
👉 Paper link: https://huggingface.co/papers/2512.19995
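As a toy illustration of the episode abstraction, a labeler can map each reasoning step to a Schoenfeld-style episode by cue phrases. ThinkARM's actual annotation is far more sophisticated; the episode set is from the summary, but the cue phrases below are invented for the sketch.

```python
# Cue phrases per episode are illustrative assumptions, not ThinkARM's rules.
EPISODES = {
    "Analysis": ("the problem asks", "we need to find"),
    "Explore": ("what if", "alternatively", "let me try"),
    "Implement": ("compute", "substituting", "plugging in"),
    "Verify": ("check", "confirm", "sanity"),
}

def label_step(step: str) -> str:
    """Assign one reasoning step to a Schoenfeld-style episode by cue phrase."""
    s = step.lower()
    for episode, cues in EPISODES.items():
        if any(cue in s for cue in cues):
            return episode
    return "Other"
```

Aggregating such labels over a full trace yields the episode sequences whose structure the paper compares across models.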

8. Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning
🔑 Keywords: Autoregressive models, Reinforcement Learning, Hierarchical structure, Internal RL
💡 Category: Reinforcement Learning
🌟 Research Objective:
– Investigate how exploring within the internal representations of large-scale autoregressive models can make reinforcement learning more efficient when rewards are sparse.
🛠️ Research Methods:
– Introduce a higher-order, non-causal sequence model that controls the residual-stream activations of a base autoregressive model, enabling the discovery of temporally abstract actions and latent action generation.
🔬 Research Conclusions:
– Internal reinforcement learning (“internal RL”) enables efficient learning from sparse rewards and shows promise for implementing hierarchical reinforcement learning inside foundation models.
👉 Paper link: https://huggingface.co/papers/2512.20605
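A minimal sketch of the steering mechanism: one latent action chosen by the higher-order controller is applied to a window of the base model's residual-stream activations, so a single controller decision spans many tokens. The additive form and the window parameterization are assumptions; the paper's controller may intervene differently.

```python
import numpy as np

def apply_latent_action(residual: np.ndarray, action: np.ndarray,
                        start: int, length: int) -> np.ndarray:
    """Steer the base model by adding one controller-chosen latent action
    to a contiguous window of residual-stream activations (seq_len, d_model).
    One action covers `length` tokens, making it a temporally abstract
    action rather than a per-token decision."""
    steered = residual.copy()
    steered[start:start + length] += action
    return steered
```

The RL problem then plays out over these coarse latent actions instead of individual tokens, which is what makes sparse rewards tractable.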
