AI Native Daily Paper Digest – 20251226

1. Latent Implicit Visual Reasoning

🔑 Keywords: Large Multimodal Models, visual reasoning, task-agnostic, supervised learning, state-of-the-art

💡 Category: Multi-Modal Learning

🌟 Research Objective:

– Address the limitations of Large Multimodal Models on predominantly visual reasoning tasks by introducing a task-agnostic mechanism.

🛠️ Research Methods:

– Train LMMs with visual reasoning tokens, without explicit supervision, to enable task-adaptive image re-encoding.

💬 Research Conclusions:

– The proposed approach outperforms direct fine-tuning, achieves state-of-the-art results across a range of vision-centric tasks, and generalizes well under multi-task instruction tuning.

👉 Paper link: https://huggingface.co/papers/2512.21218

2. Spatia: Video Generation with Updatable Spatial Memory

🔑 Keywords: Spatia, Spatial Memory, 3D Scene Point Cloud, Visual SLAM, Video Generation

💡 Category: Generative Models

🌟 Research Objective:

– To introduce Spatia, a spatial-memory-aware video generation framework that maintains long-term spatial and temporal consistency.

🛠️ Research Methods:

– Utilizes a 3D scene point cloud as persistent spatial memory and employs visual SLAM and dynamic-static disentanglement to maintain spatial consistency.

💬 Research Conclusions:

– The framework enables realistic video generation with explicit camera control and 3D-aware interactive editing, providing a geometrically grounded approach to scalable video production.

👉 Paper link: https://huggingface.co/papers/2512.15716
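The persistent point-cloud memory can be pictured with a minimal sketch: points reconstructed from each new frame are merged into the stored cloud, with voxel-grid deduplication to bound memory growth. The voxel size and merge policy here are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def update_spatial_memory(memory: np.ndarray, new_points: np.ndarray,
                          voxel_size: float = 0.05) -> np.ndarray:
    """Merge newly reconstructed 3D points into a persistent point-cloud
    memory, keeping at most one point per voxel to bound memory growth.

    memory, new_points: (N, 3) arrays of XYZ coordinates.
    """
    combined = np.vstack([memory, new_points]) if memory.size else new_points
    # Quantize coordinates to voxel indices; duplicates share an index.
    voxels = np.floor(combined / voxel_size).astype(np.int64)
    # Keep the first point observed in each voxel (older memory wins,
    # since memory rows come first in `combined`).
    _, keep = np.unique(voxels, axis=0, return_index=True)
    return combined[np.sort(keep)]
```

A point within the same 5 cm voxel as a stored point is treated as a re-observation rather than new geometry, which is what keeps the memory "updatable" instead of ever-growing.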

3. How Much 3D Do Video Foundation Models Encode?

🔑 Keywords: Video Foundation Models, 3D awareness, state-of-the-art video generation models, 3D objects and scenes, scalable 3D models

💡 Category: Computer Vision

🌟 Research Objective:

– To quantify the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data.

🛠️ Research Methods:

– Proposes a model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow read-outs.

💬 Research Conclusions:

– State-of-the-art video generation models demonstrate a strong 3D understanding of objects and scenes, even surpassing expert models specifically trained for 3D tasks, despite never training on 3D data.

👉 Paper link: https://huggingface.co/papers/2512.19949
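A "shallow read-out" in this sense is a small probe fit on frozen backbone features. A minimal sketch, using synthetic stand-ins for VidFM features and a per-patch depth target, with a linear ridge-regression probe (the paper's actual probes and 3D properties may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen VidFM features and one 3D property (per-patch depth).
# Real features would come from a video model's intermediate activations.
n, d = 200, 32
features = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
depth = features @ true_w + 0.01 * rng.normal(size=n)  # noisy linear target

# Shallow read-out: a single linear layer fit by ridge regression,
# leaving the backbone entirely untouched.
lam = 1e-3
w = np.linalg.solve(features.T @ features + lam * np.eye(d),
                    features.T @ depth)

pred = features @ w
r2 = 1.0 - np.sum((depth - pred) ** 2) / np.sum((depth - depth.mean()) ** 2)
```

If such a cheap probe recovers the property accurately, the information was already linearly accessible in the frozen features, which is the operational definition of "3D awareness" used here.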

4. GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training

🔑 Keywords: Multi-turn Reinforcement Learning, Vision-Language Models, GTR-Turbo

💡 Category: Reinforcement Learning

🌟 Research Objective:

– To develop GTR-Turbo, an efficient reinforcement learning method that improves upon Guided Thought Reinforcement (GTR) without relying on expensive teacher models.

🛠️ Research Methods:

– GTR-Turbo merges weights from checkpoints produced during RL training and uses the merged model to guide training via supervised fine-tuning or soft logit distillation.

💬 Research Conclusions:

– GTR-Turbo improves baseline model accuracy by 10-30% while cutting training time by 50% and compute cost by 60% compared to GTR, all without privileged models.

👉 Paper link: https://huggingface.co/papers/2512.13043
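The two ingredients, checkpoint weight merging and soft logit distillation, can be sketched generically. This is a hedged illustration with uniform averaging and a temperature-softened KL term; the paper's exact merging schedule and loss weighting are not reproduced here.

```python
import numpy as np

def merge_checkpoints(checkpoints):
    """Uniformly average parameter dicts saved along the RL trajectory.
    The merged model then acts as a free teacher: no external model needed."""
    keys = checkpoints[0].keys()
    return {k: np.mean([c[k] for c in checkpoints], axis=0) for k in keys}

def soft_distill_loss(student_logits, teacher_logits, tau=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    p_t = softmax(teacher_logits / tau)
    log_p_s = np.log(softmax(student_logits / tau))
    log_p_t = np.log(p_t)
    return float(np.sum(p_t * (log_p_t - log_p_s)))
```

The appeal of the design is that the "teacher" is derived for free from the student's own training history, so guidance costs no extra forward passes through a privileged model.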

6. VA-π: Variational Policy Alignment for Pixel-Aware Autoregressive Generation

🔑 Keywords: VA-π, Autoregressive (AR) visual generation, tokenizers, intrinsic reward, reinforcement-based alignment

💡 Category: Generative Models

🌟 Research Objective:

– To improve the image quality and performance of autoregressive visual generators using VA-π, without retraining tokenizers or employing external rewards.

🛠️ Research Methods:

– A post-training framework that directly optimizes AR models with a pixel-space objective, using a reinforcement-based alignment strategy driven by an intrinsic reward.

💬 Research Conclusions:

– VA-π significantly improves image generation quality, as evidenced by reduced FID and improved IS on LlamaGen-XXL, and achieves notable gains on text-to-image tasks.

👉 Paper link: https://huggingface.co/papers/2512.19680
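The idea of an intrinsic, pixel-space reward can be illustrated with a toy: score each candidate token by the reconstruction error of its decoded pixels, then push the AR model's logits with a REINFORCE-style gradient. The codebook, patch size, and baseline below are illustrative assumptions, not VA-π's actual objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a 4-entry "tokenizer codebook" mapping tokens to 2x2
# pixel patches, and AR-model logits over the next token.
codebook = rng.uniform(size=(4, 2, 2))
target_patch = codebook[2]          # ground-truth pixels at this position
logits = rng.normal(size=4)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

probs = softmax(logits)
# Intrinsic reward: negative pixel-space reconstruction error for each
# candidate token. No external reward model is involved.
rewards = np.array([-np.mean((codebook[t] - target_patch) ** 2)
                    for t in range(4)])

# REINFORCE-style gradient of expected reward w.r.t. logits:
# d/dl_j E[r] = p_j * (r_j - E[r]), with E[r] as the baseline.
baseline = probs @ rewards
grad = probs * (rewards - baseline)
```

The gradient pushes probability toward tokens whose decoded pixels match the target, which is the sense in which the tokenizer itself supplies the reward.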

7. Schoenfeld's Anatomy of Mathematical Reasoning by Language Models

🔑 Keywords: Large language models, Reasoning traces, ThinkARM, Schoenfeld's Episode Theory

💡 Category: Knowledge Representation and Reasoning

🌟 Research Objective:

– To identify and analyze the cognitive structure of reasoning in large language models, beyond surface-level statistics, through the ThinkARM framework.

🛠️ Research Methods:

– Uses Schoenfeld's Episode Theory as a lens to abstract reasoning traces into functional reasoning steps such as Analysis, Explore, Implement, and Verify, and applies this abstraction to mathematical problem solving across various models.

💬 Research Conclusions:

– The abstraction reveals structural differences between reasoning and non-reasoning models, highlights exploration as a critical branching step for correctness, and shows that efficiency-oriented methods selectively suppress evaluative feedback rather than uniformly shortening responses.

👉 Paper link: https://huggingface.co/papers/2512.19995
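The episode abstraction maps each step of a reasoning trace to a functional label. A minimal keyword-based sketch follows; the cue lists are hypothetical, and ThinkARM's actual annotation procedure is more sophisticated than string matching.

```python
# Hypothetical cue phrases for tagging reasoning-trace sentences with
# Schoenfeld-style episode labels. A real system would use a learned or
# prompted classifier rather than keyword lookup.
EPISODE_CUES = {
    "Analysis":  ["the problem asks", "we need to find", "given that"],
    "Explore":   ["what if", "alternatively", "let me try"],
    "Implement": ["compute", "substituting", "therefore"],
    "Verify":    ["check", "confirm", "sanity"],
}

def label_step(sentence: str) -> str:
    """Return the first episode whose cue phrase appears in the sentence."""
    s = sentence.lower()
    for episode, cues in EPISODE_CUES.items():
        if any(c in s for c in cues):
            return episode
    return "Other"
```

Once every step carries a label, trace-level statistics (e.g. how often Explore precedes a correct Implement) become simple counts, which is what makes the abstraction analyzable at scale.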

8. Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning

🔑 Keywords: Autoregressive models, Reinforcement Learning, Hierarchical structure, Internal RL

💡 Category: Reinforcement Learning

🌟 Research Objective:

– Investigates how to improve reinforcement learning efficiency by exploring within the internal representations of large-scale autoregressive models, focusing on overcoming inefficiencies when rewards are sparse.

🛠️ Research Methods:

– Introduces a higher-order, non-causal sequence model that controls the residual-stream activations of a base autoregressive model, enabling the discovery of temporally abstract actions and latent action generation.

💬 Research Conclusions:

– Demonstrates that internal reinforcement learning ("internal RL") enables efficient learning from sparse rewards and shows potential for implementing hierarchical reinforcement learning within foundation models.

👉 Paper link: https://huggingface.co/papers/2512.20605
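The notion of a temporally abstract latent action can be made concrete with a toy: a high-level controller picks one latent offset to the base model's residual stream and holds it for k steps, so sparse-reward credit assignment happens over a handful of options rather than per-step choices. The dynamics, option set, and goal below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

# Toy sketch: a frozen "base model" whose 2-D hidden state evolves
# linearly, plus a small set of latent options (residual-stream offsets).
A = 0.9 * np.eye(2)                                        # frozen dynamics
options = np.array([[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0]])  # latent actions

def rollout(option_idx: int, k: int = 5) -> np.ndarray:
    """Apply one latent option to the residual stream for k steps."""
    h = np.zeros(2)
    for _ in range(k):
        h = A @ h + options[option_idx]   # controller adds a fixed offset
    return h

# Sparse reward: 1 only if the final state lands near the goal region.
goal = np.array([0.4, 0.0])
returns = [float(np.linalg.norm(rollout(i) - goal) < 0.1)
           for i in range(len(options))]
best = int(np.argmax(returns))
```

Because the high-level choice is made once per k steps, the search space under sparse reward shrinks from per-token actions to a few temporally extended options, which is the efficiency argument in miniature.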


Copyright © 2025 AI Native Foundation. All rights reserved.