AI Native Daily Paper Digest – 20251212

1. T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground
Keywords: T-pro 2.0, hybrid reasoning, efficient inference, Cyrillic-dense tokenizer, EAGLE speculative-decoding
Category: Natural Language Processing
Research Objective:
– Introduce T-pro 2.0, a Russian LLM optimized for hybrid reasoning and efficient inference.
Research Methods:
– The model uses a Cyrillic-dense tokenizer and an EAGLE speculative-decoding pipeline; resources such as model weights and instruction corpora are released on Hugging Face.
Research Conclusions:
– T-pro 2.0 provides an accessible system for extending and evaluating Russian language models, with demonstrated speed improvements in both reasoning and non-reasoning modes.
Paper link: https://huggingface.co/papers/2512.10430

2. Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
Keywords: Reinforcement Learning, Verification, Iterative Active Learning, Rejection Fine-Tuning, Large Language Models
Category: Knowledge Representation and Reasoning
Research Objective:
– To enhance the verification of long reasoning chains in large language models using the proposed Outcome-based Process Verifier (OPV).
Research Methods:
– Uses an iterative active learning framework with expert annotations and Rejection Fine-Tuning to improve OPV’s verification capability efficiently.
Research Conclusions:
– OPV achieves state-of-the-art results, improving accuracy, effectively detecting false positives, and outperforming larger models.
Paper link: https://huggingface.co/papers/2512.10739

3. Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
Keywords: Reinforcement Learning, Text-to-3D Generation, Reward Designs, RL Algorithms, Hierarchical 3D Generation
Category: Reinforcement Learning
Research Objective:
– Investigate the application of reinforcement learning (RL) to text-to-3D generation, addressing challenges related to spatial complexity and reward sensitivity.
Research Methods:
– The study evaluates reward designs, RL algorithms (including GRPO variants), and benchmarks (introducing MME-3DR) to improve the 3D generation process. It also proposes Hi-GRPO for hierarchical optimization.
Research Conclusions:
– The authors develop AR3D-R1, the first RL-enhanced text-to-3D model, which demonstrates expert performance in refining shapes and textures. The research provides insights into RL-driven reasoning for 3D generation, and the source code is released for broader application.
Paper link: https://huggingface.co/papers/2512.10949

4. OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
Keywords: Outcome-based Process Verifier, iterative active learning, Rejection Fine-Tuning, state-of-the-art performance, Reinforcement Learning with Verifiable Rewards
Category: Knowledge Representation and Reasoning
Research Objective:
– To enhance the verification of complex reasoning chains in large language models by developing an Outcome-based Process Verifier (OPV).
Research Methods:
– Employs an iterative active learning framework combined with expert annotations and Rejection Fine-Tuning to improve OPV performance while reducing annotation costs.
Research Conclusions:
– OPV significantly outperforms existing models, reaching an F1 score of 83.1 on OPV-Bench, and effectively detects false positives. It also works well alongside policy models, increasing accuracy as the compute budget scales.
Paper link: https://huggingface.co/papers/2512.10756
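For readers less familiar with the metric, the F1 score quoted above is the harmonic mean of precision and recall. The following is a minimal sketch of that standard definition; the counts are hypothetical and are not taken from OPV-Bench.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
print(round(100 * f1_score(tp=80, fp=20, fn=20), 1))
```

Because it is a harmonic mean, F1 stays low unless both precision and recall are high, which is why it suits a verifier that must flag errors without over-flagging correct steps.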

5. Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning
Keywords: LLM agents, AI for geometry, InternGeometry, Complexity-Boosting Reinforcement Learning, symbolic engine
Category: Knowledge Representation and Reasoning
Research Objective:
– Develop a medalist-level LLM agent for geometry that surpasses human performance on International Mathematical Olympiad (IMO) geometry problems using heuristic-driven propositions and verification.
Research Methods:
– Implement iterative proposition verification with a symbolic engine and a dynamic memory mechanism.
– Introduce Complexity-Boosting Reinforcement Learning, which increases problem complexity across training stages.
Research Conclusions:
– InternGeometry, based on InternThinker-32B, solves 44 out of 50 IMO geometry problems, outperforming AlphaGeometry 2 with minimal training data.
– The results demonstrate the potential of LLM agents on expert-level geometry tasks, including novel auxiliary constructions that do not appear in human solutions.
Paper link: https://huggingface.co/papers/2512.10534

6. MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
Keywords: MoCapAnything, Category-Agnostic Motion Capture (CAMoCap), 3D joint trajectories, cross-species retargeting, skeletal animations
Category: Computer Vision
Research Objective:
– To develop MoCapAnything, a reference-guided framework that reconstructs rotation-based animations from monocular videos for arbitrary rigged 3D assets, enabling Category-Agnostic Motion Capture.
Research Methods:
– MoCapAnything uses a factorized framework with three learnable modules (a Reference Prompt Encoder, a Video Feature Extractor, and a Unified Motion Decoder) followed by a lightweight inverse kinematics stage.
Research Conclusions:
– MoCapAnything produces high-quality skeletal animations and meaningful cross-species retargeting, enabling scalable, prompt-driven 3D motion capture across diverse assets.
Paper link: https://huggingface.co/papers/2512.10881

7. BEAVER: An Efficient Deterministic LLM Verifier
Keywords: BEAVER, Large Language Models, Sound Probability Bounds, Constraint Verification
Category: Natural Language Processing
Research Objective:
– To provide BEAVER, a deterministic and sound framework for verifying constraints in large language models.
Research Methods:
– BEAVER systematically explores the generation space with novel data structures, maintaining provably sound bounds.
Research Conclusions:
– BEAVER achieves 6 to 8 times tighter probability bounds and identifies 3 to 4 times more high-risk instances than baseline methods, enabling precise risk assessment.
Paper link: https://huggingface.co/papers/2512.05439

8. From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models
Keywords: Vision-Language Models, Microscopic Spatial Intelligence, scientific AGI
Category: Multi-Modal Learning
Research Objective:
– The paper introduces Microscopic Spatial Intelligence (MiSI) as a new concept for evaluating Vision-Language Models’ ability to understand spatial relationships among microscopic entities.
Research Methods:
– A systematic benchmark framework, MiSI-Bench, was created, featuring over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures.
Research Conclusions:
– Current state-of-the-art VLMs perform below human level on most tasks, but a fine-tuned 7B model shows promise, outperforming humans on spatial transformation tasks while still struggling with scientifically grounded tasks.
Paper link: https://huggingface.co/papers/2512.10867

9. VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
Keywords: VQRAE, Vector Quantization, Unified Tokenizer, Multimodal Understanding, Discrete Tokens
Category: Multi-Modal Learning
Research Objective:
– To develop VQRAE, a Vector Quantization Representation AutoEncoder that unifies multimodal understanding, generation, and reconstruction.
Research Methods:
– Uses a unified tokenizer that produces continuous semantic features and discrete tokens, with a symmetric ViT decoder and a two-stage training strategy involving a high-dimensional semantic VQ codebook.
Research Conclusions:
– VQRAE demonstrates competitive performance on visual understanding, generation, and reconstruction benchmarks, achieving a 100% utilization ratio in the semantic VQ codebook and promising scaling properties in the autoregressive paradigm.
Paper link: https://huggingface.co/papers/2511.23386
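To make the "utilization ratio" concrete: in vector quantization, each feature vector is mapped to its nearest codebook entry, and utilization is the fraction of entries that are selected at least once. The sketch below illustrates that generic idea in plain Python; the tiny codebook, the feature vectors, and the function names are hypothetical and do not reflect VQRAE's actual implementation.

```python
def quantize(vec, codebook):
    """Return the index of the nearest codebook entry (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])),
    )


def utilization_ratio(vectors, codebook):
    """Fraction of codebook entries selected by at least one input vector."""
    used = {quantize(v, codebook) for v in vectors}
    return len(used) / len(codebook)


# Hypothetical 2-D codebook and features, for illustration only.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
features = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.0)]
print(utilization_ratio(features, codebook))  # the third entry is never selected
```

A ratio below 1.0 means some codes are never used ("dead" entries), which is the failure mode a 100% utilization figure rules out.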

10. Thinking with Images via Self-Calling Agent
Keywords: Self-Calling Chain-of-Thought, visual reasoning, language-only CoT, group-relative policy optimization, parameter-sharing subagents
Category: Knowledge Representation and Reasoning
Research Objective:
– Introduce Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm, to enhance performance and efficiency in visual reasoning through language-only CoT with self-calling subagents.
Research Methods:
– Employ sCoT by decomposing complex tasks into atomic subtasks with self-calling subagents in an isolated context, utilizing group-relative policy optimization to reinforce reasoning behaviors.
Research Conclusions:
– sCoT improves the overall reasoning performance by up to 1.9% and reduces GPU hours by approximately 75% compared to existing strong baseline approaches.
Paper link: https://huggingface.co/papers/2512.08511
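The core of group-relative policy optimization is that each sampled response is scored relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows the commonly used normalization (subtract the group mean, divide by the group standard deviation). It is a generic illustration; the function name, the `eps` term, and the reward values are assumptions, not details from the sCoT paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses to one prompt (1.0 = correct answer).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses above the group average receive positive advantages and are reinforced; when all responses earn the same reward, every advantage is zero and the group contributes no gradient signal.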

11. Evaluating Gemini Robotics Policies in a Veo World Simulator
Keywords: Generative Evaluation System, Frontier Video Model, Out-of-Distribution Generalization, Safety Checks
Category: Robotics and Autonomous Systems
Research Objective:
– To demonstrate the use of a frontier video model (Veo) for evaluating robotics policies across nominal performance, out-of-distribution generalization, and safety.
Research Methods:
– Development of a generative evaluation system built on a frontier video foundation model to simulate realistic scenes and evaluate robot policies under varied conditions, with support for multi-view consistency and generative image editing.
Research Conclusions:
– The system effectively predicts the performance of robotic policies under varied conditions and supports safety validation, as confirmed through 1600+ real-world evaluations, demonstrating its potential to improve policy assessment in both safety-critical and novel scenarios.
Paper link: https://huggingface.co/papers/2512.10675

12. StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space
Keywords: StereoSpace, viewpoint-conditioned diffusion, perceptual comfort, geometric consistency
Category: Generative Models
Research Objective:
– Introduce StereoSpace, a diffusion-based framework for generating stereo images without explicit depth or warping.
Research Methods:
– Use viewpoint conditioning to model geometry and infer correspondences within a canonical rectified space.
– Implement an end-to-end evaluation protocol that does not rely on ground-truth geometry, to ensure fair testing.
Research Conclusions:
– StereoSpace generates sharp parallax and robust stereo images, surpassing existing warp-and-inpaint and latent-warping methods.
– The framework establishes viewpoint-conditioned diffusion as a scalable, depth-free solution for stereo generation.
Paper link: https://huggingface.co/papers/2512.10959

13. Stronger Normalization-Free Transformers
Keywords: Derf, Dynamic Tanh, normalization
Category: Machine Learning
Research Objective:
– To explore the design of point-wise normalization functions, with a focus on surpassing the performance of Dynamic Tanh (DyT) while improving generalization.
Research Methods:
– Study intrinsic properties of point-wise functions and perform a large-scale search to discover effective function designs.
– Introduce Derf, an optimized point-wise normalization function based on the rescaled Gaussian cumulative distribution function.
Research Conclusions:
– Derf outperforms LayerNorm, RMSNorm, and DyT across various domains, including vision, speech representation, and DNA sequence modeling.
– The performance gains are primarily due to improved generalization, making Derf a practical option for normalization-free Transformer architectures.
Paper link: https://huggingface.co/papers/2512.10938
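The phrase "rescaled Gaussian cumulative distribution function" refers to the error function: erf(x) = 2Φ(√2·x) − 1, where Φ is the standard normal CDF. The sketch below contrasts a tanh-based point-wise function with an erf-based one on scalars. The `alpha` and `shift` parameters and the exact functional forms are assumptions for illustration; the paper's parameterization may differ.

```python
import math
from statistics import NormalDist


def dyt(x, alpha=1.0):
    """Tanh-based point-wise squashing, in the spirit of Dynamic Tanh (assumed form)."""
    return math.tanh(alpha * x)


def derf(x, alpha=1.0, shift=0.0):
    """Erf-based point-wise squashing; alpha and shift are illustrative parameters."""
    return math.erf(alpha * x + shift)


def erf_via_gaussian_cdf(x):
    """erf expressed as a rescaled standard normal CDF: 2 * Phi(sqrt(2) * x) - 1."""
    return 2.0 * NormalDist().cdf(math.sqrt(2.0) * x) - 1.0


# Both functions are bounded in (-1, 1) and element-wise, unlike LayerNorm,
# which needs per-token mean and variance statistics.
for x in (-2.0, 0.0, 0.7, 2.0):
    print(x, dyt(x), derf(x), erf_via_gaussian_cdf(x))
```

Both functions squash activations into a bounded range without computing any statistics over the hidden dimension, which is what allows them to stand in for a normalization layer.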

14. MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification
Keywords: 4D Gaussian Splatting, MoRel, Anchor Relay-based Bidirectional Blending, Memory-efficient, Temporal coherence
Category: Computer Vision
Research Objective:
– The study introduces MoRel, a novel 4D Gaussian Splatting framework that aims to improve dynamic video rendering through efficient memory usage and better handling of long-range motion.
Research Methods:
– The research employs an Anchor Relay-based Bidirectional Blending mechanism for temporally consistent modeling and a Feature-variance-guided Hierarchical Densification scheme to enhance rendering quality and manage feature-variance levels.
Research Conclusions:
– MoRel achieves temporally coherent, flicker-free long-range 4D reconstruction with bounded memory usage, demonstrating improved scalability and efficiency in Gaussian-based dynamic scene representations.
Paper link: https://huggingface.co/papers/2512.09270

15. H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos
Keywords: video-to-video translation, generative model, unpaired robot videos, inpainting, temporal coherence
Category: Robotics and Autonomous Systems
Research Objective:
– To develop a framework that translates human-object interaction videos into realistic robot manipulation videos using a generative model and unpaired data.
Research Methods:
– Implements a transferable representation with inpainting and visual cues to generate motion-consistent robot videos.
– Fine-tunes a SOTA video diffusion model with in-context learning to achieve temporal coherence and leverage its rich prior knowledge.
Research Conclusions:
– The proposed framework achieves significantly more realistic and physically grounded robot motions than baselines, offering a promising direction for advancing robot learning from unlabeled human videos.
Paper link: https://huggingface.co/papers/2512.09406

16. Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task
Keywords: Spatiotemporal Reasoning, Multimodal Large Language Models, Video Question Answering, Video Toolkit, STAR Framework
Category: Multi-Modal Learning
Research Objective:
– To enhance multimodal large language models with a spatiotemporal reasoning framework for improved video question answering.
Research Methods:
– Implementation of a Spatiotemporal Reasoning Framework (STAR) to strategically schedule tools for spatial and temporal understanding.
– Integration of a comprehensive Video Toolkit to strengthen the reasoning capabilities of foundation models within the video domain.
Research Conclusions:
– The STAR framework improves the performance of GPT-4o by 8.2% on VideoMME and by 4.6% on LongVideoBench.
– The approach marks a critical step towards developing autonomous and intelligent video analysis assistants.
Paper link: https://huggingface.co/papers/2512.10359

17. The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality
Keywords: FACTS Leaderboard, factual accuracy, language models, automated judge models
Category: Natural Language Processing
Research Objective:
– Introduce the FACTS Leaderboard suite to evaluate the factual accuracy of language models across diverse scenarios.
Research Methods:
– Uses four sub-leaderboards to assess different aspects: image-based questions, closed-book factoid questions, information-seeking with a search API, and document-grounded long-form responses.
Research Conclusions:
– Provides a comprehensive and robust measure of a model’s factuality by using automated judge models and maintaining both public and private leaderboard splits for integrity and external participation.
Paper link: https://huggingface.co/papers/2512.10791

18. Fed-SE: Federated Self-Evolution for Privacy-Constrained Multi-Environment LLM Agents
Keywords: Federated Self-Evolution, Privacy Constraints, Low-Rank Subspace, Parameter-Efficient Fine-Tuning
Category: Reinforcement Learning
Research Objective:
– To enhance LLM agents in privacy-constrained environments using a Federated Self-Evolution framework.
Research Methods:
– Introduces a local evolution-global aggregation paradigm: parameter-efficient fine-tuning is performed locally, and global aggregation takes place within a low-rank subspace to overcome federated learning challenges.
Research Conclusions:
– Experiments demonstrate approximately 18% improvement in task success rates over federated baselines, validating the framework's effectiveness in cross-environment knowledge transfer.
Paper link: https://huggingface.co/papers/2512.08870

19. ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning
Keywords: ReViSE, reasoning-informed video editing, visual fidelity, vision-language models, intrinsic feedback
Category: Computer Vision
Research Objective:
– The study aims to bridge the gap between reasoning and visual editing in video models, integrating the two to improve editing accuracy and fidelity.
Research Methods:
– The methods involve developing the ReViSE framework, using a self-reflective reasoning mechanism, and introducing the Reason-Informed Video Editing (RVE) task supported by the comprehensive RVE-Bench benchmark.
Research Conclusions:
– Extensive experiments show that ReViSE achieves a 32% improvement in editing accuracy and visual fidelity compared to state-of-the-art methods, demonstrating significant advancements in reasoning-informed video editing.
Paper link: https://huggingface.co/papers/2512.09924

20. Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale
Keywords: Confucius Code Agent, AI software engineering, coding agents, Confucius SDK, industrial scale
Category: AI Systems and Tools
Research Objective:
– The paper aims to develop the Confucius Code Agent (CCA), an open-source AI software engineer designed to operate at an industrial scale, addressing transparency, extensibility, and robust performance requirements.
Research Methods:
– CCA is built using the Confucius SDK, which integrates Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK utilizes a unified orchestrator with hierarchical working memory and a meta-agent for continuous improvement through a build-test-improve loop.
Research Conclusions:
– The Confucius Code Agent demonstrates significant performance improvements on real-world software engineering tasks, achieving a Resolve@1 performance of 54.3% on SWE-Bench-Pro, thereby bridging the gap between research and production-grade AI systems.
Paper link: https://huggingface.co/papers/2512.10398

21. Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
Keywords: Omni-Attribute, image attribute encoder, attribute-specific representations, high-fidelity, compositional generation
Category: Computer Vision
Research Objective:
– To introduce Omni-Attribute, an open-vocabulary image attribute encoder for precise visual concept personalization and compositional generation.
Research Methods:
– Curating semantically linked image pairs annotated with attributes.
– Utilizing a dual-objective training paradigm balancing generative fidelity with contrastive disentanglement.
Research Conclusions:
– The resulting embeddings are effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.
Paper link: https://huggingface.co/papers/2512.10955

22. X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale
Keywords: Embodied AI, Generative Video Editing, Human-to-Humanoid Translation, Video-to-Video Structure, Unreal Engine
Category: Robotics and Autonomous Systems
Research Objective:
– Develop X-Humanoid to create extensive datasets for training embodied AI models via generative video editing, translating human actions into humanoid robot actions.
Research Methods:
– Adapt the Wan 2.2 model to a video-to-video structure and fine-tune it for human-to-humanoid translation. Implement a data creation pipeline with Unreal Engine to generate paired synthetic videos.
Research Conclusions:
– Created over 3.6 million “robotized” humanoid video frames, with 69% of users rating the method superior in motion consistency and 62.1% in embodiment correctness compared to existing baselines.
Paper link: https://huggingface.co/papers/2512.04537

23. MOA: Multi-Objective Alignment for Role-Playing Agents
Keywords: MOA, Reinforcement Learning, Multi-Objective Alignment, Thought-Augmented Rollout
Category: Reinforcement Learning
Research Objective:
– The study aims to optimize multiple dimensions of role-playing agents (RPAs) using a novel reinforcement-learning framework called MOA (Multi-Objective Alignment), enhancing performance across diverse scenarios and complex conversations.
Research Methods:
– Implementation of a multi-objective optimization strategy that trains on multiple fine-grained rubrics simultaneously.
– Integration of thought-augmented rollout with off-policy guidance to address diversity and quality issues in model outputs.
Research Conclusions:
– MOA allows an 8B model to match or surpass strong baselines such as GPT-4o and Claude in various dimensions, showing its potential for building RPAs that effectively handle role knowledge, persona style, diverse scenarios, and complex conversations.
Paper link: https://huggingface.co/papers/2512.09756

24. DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance
Keywords: DuetSVG, multimodal model, SVG generation, image tokens, test-time scaling strategy
Category: Multi-Modal Learning
Research Objective:
– Introduce DuetSVG, a unified multimodal model that generates both image and SVG tokens to enhance SVG quality.
Research Methods:
– DuetSVG is trained on image and SVG datasets and utilizes a novel test-time scaling strategy for improved SVG decoding.
Research Conclusions:
– DuetSVG outperforms existing methods by producing visually faithful, semantically aligned, and syntactically clean SVGs.
Paper link: https://huggingface.co/papers/2512.10894

25.
