AI Native Daily Paper Digest – 20251104

1. Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation

🔑 Keywords: Ling 2.0, Mixture-of-Experts, sparse activation, reasoning accuracy, computational efficiency

💡 Category: Knowledge Representation and Reasoning

🌟 Research Objective:

– Present Ling 2.0, a reasoning-oriented language model family scaled from billions to a trillion parameters, emphasizing high sparsity and computational efficiency.

🛠️ Research Methods:

– Utilize the Mixture-of-Experts paradigm along with innovative training techniques, including high-sparsity MoE, reinforcement-based fine-tuning, and full-scale FP8 training (a routing sketch follows this entry).

💬 Research Conclusions:

– Ling 2.0 establishes a new Pareto frontier by aligning sparse activation with reasoning objectives, thus providing an efficient foundation for future reasoning and thinking models.

👉 Paper link: https://huggingface.co/papers/2510.22115

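To make the high-sparsity MoE idea concrete, the following is a minimal sketch of token-level top-k expert routing: each token activates only its top-k experts, so most parameters stay idle at every step. The expert count, top-2 routing, and single-layer "experts" are illustrative toys, not Ling 2.0's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, router_w, expert_ws, top_k=2):
    """Route each token to its top-k experts; only those experts run (sparse activation)."""
    gates = softmax(tokens @ router_w)              # [n_tokens, n_experts]
    topk = np.argsort(-gates, axis=-1)[:, :top_k]   # indices of the k highest-gate experts
    out = np.zeros_like(tokens)
    for t, tok in enumerate(tokens):
        sel = topk[t]
        w = gates[t, sel] / gates[t, sel].sum()     # renormalize the selected gates
        for weight, e in zip(w, sel):
            out[t] += weight * np.tanh(tok @ expert_ws[e])   # toy expert: one dense layer
    return out

rng = np.random.default_rng(0)
d, n_experts, n_tokens = 16, 8, 4
y = moe_layer(rng.normal(size=(n_tokens, d)),
              rng.normal(size=(d, n_experts)),
              rng.normal(size=(n_experts, d, d)))
print(y.shape)  # (4, 16): each token touched only 2 of the 8 experts
```
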
2. Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph

🔑 Keywords: Agent-REINFORCE, multi-LLM collaboration graph, Test-Time Scaling, probabilistic graph optimization, textual gradient

💡 Category: Reinforcement Learning

🌟 Research Objective:

– Optimize multi-LLM collaboration graphs for better sample efficiency and search performance under accuracy and latency constraints.

🛠️ Research Methods:

– Reformulates the problem as probabilistic graph optimization and develops the Agent-REINFORCE framework, which mirrors the REINFORCE pipeline with a sampling-feedback-update loop (sketched after this entry).

💬 Research Conclusions:

– Agent-REINFORCE outperforms traditional and LLM-based baselines in sample efficiency and search performance, successfully identifying optimal graphs under joint accuracy and inference-latency objectives.

👉 Paper link: https://huggingface.co/papers/2511.00086

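The sampling-feedback-update loop can be pictured with a numeric toy: edges of the collaboration graph are Bernoulli variables, sampled graphs are scored, and the score reinforces the sampled edges. The actual framework replaces this numeric gradient with a textual gradient produced by an LLM and uses real task accuracy and latency as feedback; the scoring function below is purely a stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents = 4
edge_logits = np.zeros((n_agents, n_agents))        # parameters of the Bernoulli edge distribution

def score(adj):
    """Stand-in for task feedback: reward an accuracy-like signal, penalize latency (edge count)."""
    return float(adj.sum(axis=0).max()) - 0.3 * adj.sum()

lr = 0.5
for step in range(200):
    p = 1.0 / (1.0 + np.exp(-edge_logits))
    adj = (rng.random(p.shape) < p).astype(float)    # sample a collaboration graph
    np.fill_diagonal(adj, 0)
    r = score(adj)                                   # feedback on the sampled graph
    # REINFORCE update: d log P(adj) / d logits = adj - p
    edge_logits += lr * r * (adj - p)

print(np.round(1 / (1 + np.exp(-edge_logits)), 2))   # learned edge probabilities
```
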
3. The Underappreciated Power of Vision Models for Graph Structural Understanding

🔑 Keywords: Vision models, Graph Neural Networks, GraphAbstract, scale-invariant reasoning, global structural understanding

💡 Category: Foundations of AI

🌟 Research Objective:

– To explore the potential of vision models for understanding global graph structures compared to Graph Neural Networks (GNNs).

🛠️ Research Methods:

– Introduction of the GraphAbstract benchmark to evaluate models’ ability to perceive global graph properties as humans do.

💬 Research Conclusions:

– Vision models outperform GNNs on tasks requiring holistic structural understanding and maintain generalizability across varying graph scales.

– GNNs struggle with global pattern abstraction and degrade with increasing graph size, highlighting the underutilized capabilities of vision models for tasks that involve global topological awareness.

👉 Paper link: https://huggingface.co/papers/2510.24788

4. UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback

🔑 Keywords: AI Native, RGB-space, flow matching, physical consistency, relighting quality

💡 Category: Computer Vision

🌟 Research Objective:

– The paper introduces UniLumos, a unified relighting framework that aims to enhance physical plausibility in image and video relighting by integrating RGB-space geometry feedback into a flow matching backbone.

🛠️ Research Methods:

– Utilizes depth and normal maps to supervise the model so that lighting effects align with scene structure, while employing path consistency learning to reduce computational costs (a loss sketch follows this entry).

– Develops a six-dimensional annotation protocol for capturing illumination attributes and proposes LumosBench for evaluating lighting controllability via large vision-language models.

💬 Research Conclusions:

– UniLumos demonstrates state-of-the-art relighting quality with improved physical consistency and achieves a 20x speedup in both image and video relighting.

👉 Paper link: https://huggingface.co/papers/2511.01678

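A rough sketch of how a flow-matching relighting objective could be combined with RGB-space geometry feedback, assuming toy tensors and stand-in networks; UniLumos's actual losses, geometry extractors, and path-consistency mechanism are more involved than this.

```python
import torch

def flow_matching_loss(velocity_net, x_src, x_relit):
    """Flow-matching objective along the straight path from source to relit frame (toy form)."""
    t = torch.rand(x_src.shape[0], 1, 1, 1)
    x_t = (1 - t) * x_src + t * x_relit            # point on the interpolation path
    target_v = x_relit - x_src                     # velocity of that path
    return ((velocity_net(x_t, t) - target_v) ** 2).mean()

def geometry_feedback(extract_geom, pred_relit, src):
    """Physics-plausibility feedback: relighting should not change scene depth/normals."""
    d_pred, n_pred = extract_geom(pred_relit)
    d_src, n_src = extract_geom(src)
    return (d_pred - d_src).abs().mean() + (1.0 - (n_pred * n_src).sum(dim=1)).mean()

# Stand-ins so the sketch runs: a 1x1-conv "velocity net" and a fake depth/normal extractor.
net = torch.nn.Conv2d(3, 3, 1)
velocity_net = lambda x, t: net(x)
extract_geom = lambda img: (img.mean(dim=1, keepdim=True),
                            torch.nn.functional.normalize(img, dim=1))

src, relit = torch.rand(2, 3, 8, 8), torch.rand(2, 3, 8, 8)
loss = flow_matching_loss(velocity_net, src, relit) + 0.1 * geometry_feedback(extract_geom, net(src), src)
loss.backward()
print(float(loss))
```
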
5. ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation

🔑 Keywords: Reciprocal cross-modal reasoning, Unified multimodal models, Visual generation quality, Symbolic reasoning, Verbal prompts

💡 Category: Multi-Modal Learning

🌟 Research Objective:

– To evaluate reciprocal cross-modal reasoning in unified multimodal models using a new benchmark called ROVER, which assesses how effectively these models integrate text and image understanding.

🛠️ Research Methods:

– The study employs ROVER, a human-annotated benchmark with 1312 tasks and 1876 images, targeting reciprocal cross-modal reasoning through verbally and visually augmented tasks for image and verbal generation.

💬 Research Conclusions:

– Cross-modal reasoning plays a crucial role in determining visual generation quality, with interleaved models outperforming non-interleaved ones. Additionally, models struggle with symbolic reasoning, showing a gap between interpreting perceptual concepts and generating visual abstractions for symbolic tasks.

👉 Paper link: https://huggingface.co/papers/2511.01163

6. PHUMA: Physically-Grounded Humanoid Locomotion Dataset

🔑 Keywords: Motion Imitation, Humanoid Locomotion, Motion Capture Datasets, Physical Artifacts, Data Curation

💡 Category: Robotics and Autonomous Systems

🌟 Research Objective:

– The research aims to improve motion imitation for humanoid robotics by creating PHUMA, a large-scale and physically reliable locomotion dataset that addresses issues found in raw human video data, such as physical artifacts.

🛠️ Research Methods:

– Large-scale human video data is retargeted under physics constraints and carefully curated to remove physical artifacts: PHUMA enforces joint limits, maintains ground contact, and eliminates foot skating, yielding physically reliable motions (a curation-style check is sketched after this entry).

💬 Research Conclusions:

– PHUMA significantly outperforms existing datasets such as Humanoid-X and AMASS in imitating a variety of humanlike behaviors; policies trained on PHUMA show substantial improvements on imitation of unseen motions and on path following with pelvis-only guidance.

👉 Paper link: https://huggingface.co/papers/2510.26236

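The kind of physics-aware curation described above can be illustrated with two simple per-clip checks: joint-limit violation and foot skating during ground contact. The thresholds, joint conventions, and random toy data below are illustrative only, not PHUMA's actual pipeline.

```python
import numpy as np

def violates_joint_limits(joint_angles, lower, upper):
    """Flag frames where any joint leaves its allowed range."""
    return np.any((joint_angles < lower) | (joint_angles > upper), axis=-1)

def foot_skate_score(foot_pos, foot_height, contact_thresh=0.05):
    """Average horizontal foot speed while the foot is (nominally) in ground contact."""
    vel = np.linalg.norm(np.diff(foot_pos[:, :2], axis=0), axis=-1)   # per-frame XY displacement
    in_contact = foot_height[1:] < contact_thresh
    return float(vel[in_contact].mean()) if in_contact.any() else 0.0

# Toy clip: 100 frames, 20 joints, one foot trajectory.
rng = np.random.default_rng(0)
angles = rng.uniform(-1.0, 1.0, size=(100, 20))
ok = ~violates_joint_limits(angles, lower=-1.5, upper=1.5)
skate = foot_skate_score(foot_pos=rng.normal(size=(100, 3)) * 0.01,
                         foot_height=np.abs(rng.normal(size=100)) * 0.02)
keep_clip = ok.all() and skate < 0.02
print(keep_clip, round(skate, 4))
```
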
7. UniREditBench: A Unified Reasoning-based Image Editing Benchmark

🔑 Keywords: reasoning-based image editing, multi-object interactions, game-world scenarios, multimodal dual-reference evaluation, synthetic dataset

💡 Category: Computer Vision

🌟 Research Objective:

– To address limitations of existing benchmarks in image editing by proposing UniREditBench, focusing on reasoning-based evaluation in both real-life and game-world scenarios.

🛠️ Research Methods:

– Development of a comprehensive benchmark with 2,700 curated samples covering diverse scenarios and dimensions; introduction of multimodal dual-reference evaluation for improved reliability; construction of a large-scale synthetic dataset, UniREdit-Data-100K, for fine-tuning and evaluating models.

💬 Research Conclusions:

– Evaluations of open-source and closed-source models on UniREditBench provide a robust assessment of reasoning-based editing, and fine-tuning on UniREdit-Data-100K improves performance on complex editing tasks that require implicit reasoning.

👉 Paper link: https://huggingface.co/papers/2511.01295

8. World Simulation with Video Foundation Models for Physical AI

🔑 Keywords: Physical AI, Cosmos-Predict2.5, Cosmos-Transfer2.5, Sim2Real, Text2World

💡 Category: Generative Models

🌟 Research Objective:

– Introduce advanced Physical AI models, Cosmos-Predict2.5 and Cosmos-Transfer2.5, that unify text, image, and video generation and enhance video quality and instruction alignment.

🛠️ Research Methods:

– Used a flow-based architecture and a Physical AI vision-language model, trained on 200 million curated video clips with reinforcement learning-based post-training.

💬 Research Conclusions:

– Cosmos-Predict2.5 and Cosmos-Transfer2.5 establish themselves as versatile tools for scaling embodied intelligence, with open resources provided to lower adoption barriers and foster innovation.

👉 Paper link: https://huggingface.co/papers/2511.00062

9. ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use

🔑 Keywords: ToolScope, Multimodal Large Language Models, Visual Question Answering, Perceive tool

💡 Category: Multi-Modal Learning

🌟 Research Objective:

– The study aims to enhance multimodal large language models (MLLMs) for visual question answering by integrating external tools through the ToolScope framework.

🛠️ Research Methods:

– ToolScope incorporates three primary components: a Global Navigator for strategic guidance, an Agentic Executor that integrates external tools such as Search, Code, and Perceive for local perception, and a Response Synthesizer for organizing reasoning outputs (a control-flow sketch follows this entry).

💬 Research Conclusions:

– ToolScope exhibits strong generalization capabilities, with up to +6.69% performance improvement across four VQA benchmarks in diverse domains.

👉 Paper link: https://huggingface.co/papers/2510.27363

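A minimal sketch of the Navigator–Executor–Synthesizer control flow, with hypothetical stand-ins for the LLM and the Search/Code/Perceive tools; ToolScope's actual prompting and tool interfaces are not reproduced here.

```python
def toolscope_answer(question, image, llm, tools, max_steps=5):
    """ToolScope-style loop (sketch): plan globally, call tools locally, then synthesize."""
    plan = llm(f"Global Navigator: outline a strategy for: {question}")
    trace = [f"PLAN: {plan}"]
    for _ in range(max_steps):
        decision = llm("Agentic Executor: given the trace so far, reply with "
                       "'search: <q>', 'code: <py>', 'perceive: <region>' or 'final'.\n"
                       + "\n".join(trace))
        if decision.startswith("final"):
            break
        name, _, arg = decision.partition(":")
        result = tools[name.strip()](arg.strip(), image)   # Search / Code / Perceive
        trace.append(f"{decision} -> {result}")
    return llm("Response Synthesizer: answer the question from this trace.\n" + "\n".join(trace))

# Stand-ins so the sketch runs without a real MLLM or tools.
fake_llm = lambda prompt: "final" if "Executor" in prompt else "stub response"
tools = {"search": lambda q, img: "[]", "code": lambda c, img: "0", "perceive": lambda r, img: "crop"}
print(toolscope_answer("How many chairs are visible?", image=None, llm=fake_llm, tools=tools))
```
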
10. MR-Align: Meta-Reasoning Informed Factuality Alignment for Large Reasoning Models

🔑 Keywords: MR-ALIGN, Meta-Reasoning, large reasoning models, factuality, coherent reasoning trajectories

💡 Category: Knowledge Representation and Reasoning

🌟 Research Objective:

– Enhance the factuality of large reasoning models by aligning the reasoning process itself with MR-ALIGN, improving accuracy and reducing misleading outputs.

🛠️ Research Methods:

– MR-ALIGN uses a Meta-Reasoning informed alignment framework to quantify state-transition probabilities and construct a transition-aware implicit reward that promotes beneficial reasoning patterns while suppressing defective ones (sketched after this entry).

💬 Research Conclusions:

– Empirical evaluations show that MR-ALIGN consistently improves accuracy and truthfulness across factual QA datasets and benchmarks, highlighting the importance of aligning the reasoning process itself in advancing factuality in large reasoning models.

👉 Paper link: https://huggingface.co/papers/2510.24794

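One way to picture a transition-aware implicit reward: label reasoning steps with discrete states, estimate transition probabilities from traces, and score trajectories so that beneficial transitions are promoted and defective ones suppressed. The state taxonomy and weights below are invented for illustration and are not MR-ALIGN's.

```python
import numpy as np

STATES = ["claim", "verify", "revise", "conclude"]          # illustrative reasoning states
# Illustrative per-transition weights: verification-oriented moves are "beneficial".
WEIGHT = {("claim", "verify"): 1.0, ("verify", "revise"): 0.5,
          ("verify", "conclude"): 1.0, ("claim", "conclude"): -1.0}

def transition_matrix(trajectories):
    """Estimate empirical state-transition probabilities from labeled reasoning traces."""
    idx = {s: i for i, s in enumerate(STATES)}
    counts = np.ones((len(STATES), len(STATES)))            # add-one smoothing
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):
            counts[idx[a], idx[b]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def implicit_reward(traj, P):
    """Transition-aware reward: weight each observed transition by its estimated probability."""
    idx = {s: i for i, s in enumerate(STATES)}
    return sum(WEIGHT.get((a, b), 0.0) * P[idx[a], idx[b]] for a, b in zip(traj, traj[1:]))

P = transition_matrix([["claim", "verify", "conclude"], ["claim", "conclude"]])
print(implicit_reward(["claim", "verify", "conclude"], P))   # rewarded trajectory
print(implicit_reward(["claim", "conclude"], P))             # penalized trajectory
```
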
11. OpenSIR: Open-Ended Self-Improving Reasoner

🔑 Keywords: OpenSIR, Self-Play, Large Language Models, Mathematical Discovery, Open-Ended Learning

💡 Category: Knowledge Representation and Reasoning

🌟 Research Objective:

– To enhance reasoning abilities in Large Language Models through a self-play framework that allows for open-ended problem generation and solving without external supervision.

🛠️ Research Methods:

– Implementing OpenSIR, a self-play framework where LLMs alternate between teacher and student roles to generate and solve novel problems by focusing on difficulty and diversity (a toy loop is sketched after this entry).

💬 Research Conclusions:

– OpenSIR substantially improves instruction-tuned models, learning autonomously from basic to advanced mathematics and advancing open-ended mathematical discovery.

– It demonstrates notable performance improvements for models such as Llama-3.2-3B-Instruct and Gemma-2-2B-Instruct on benchmarks including GSM8K and College Math.

👉 Paper link: https://huggingface.co/papers/2511.00602

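A toy rendering of the teacher–student self-play loop, with a difficulty reward that peaks when the student solves a proposed problem about half the time and a crude novelty bonus. The arithmetic "teacher" and "student" below are stand-ins for the single LLM alternating roles in OpenSIR; real rewards would drive RL updates of both roles.

```python
import random

def difficulty_reward(solve_rate, target=0.5):
    """Reward problems the student solves about half the time (neither trivial nor impossible)."""
    return 1.0 - abs(solve_rate - target) * 2.0

def self_play_round(teacher, student, verifier, archive, n_attempts=8):
    problem, reference = teacher(archive)                  # teacher proposes a new problem + answer
    solves = sum(verifier(student(problem), reference) for _ in range(n_attempts))
    r_teacher = difficulty_reward(solves / n_attempts)
    r_teacher += 0.5 if problem not in archive else -0.5   # crude diversity bonus
    archive.append(problem)
    return r_teacher, solves / n_attempts

# Stand-ins so the sketch runs: an arithmetic "teacher", a noisy "student", an exact-match verifier.
teacher = lambda arch: (f"{len(arch) + 2} + {len(arch) + 3}", (len(arch) + 2) + (len(arch) + 3))
student = lambda prob: eval(prob) + random.choice([0, 0, 1])   # solves roughly 2/3 of the time
verifier = lambda ans, ref: ans == ref
archive = []
for _ in range(3):
    print(self_play_round(teacher, student, verifier, archive))
```
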
12. Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum

🔑 Keywords: Video Retrieval, Zero-Shot Generalization, Universal Video Retrieval Benchmark, Modality Pyramid

💡 Category: Computer Vision

🌟 Research Objective:

– To develop a framework with diagnostic benchmarks and data synthesis for achieving state-of-the-art zero-shot generalization in video retrieval.

🛠️ Research Methods:

– Introduction of the Universal Video Retrieval Benchmark (UVRB) to diagnose capability gaps.

– Development of a scalable synthesis workflow generating 1.55 million high-quality data pairs.

– Creation of the Modality Pyramid curriculum for training the General Video Embedder.

💬 Research Conclusions:

– The framework enables superior zero-shot generalization in video retrieval.

– Existing benchmarks are poor predictors of general retrieval ability, and UVRB highlights retrieval scenarios they overlook.

👉 Paper link: https://huggingface.co/papers/2510.27571

13. LongCat-Flash-Omni Technical Report

🔑 Keywords: LongCat-Flash-Omni, Omni-modal Model, Audio-visual Interaction, Multimodal Perception

💡 Category: Multi-Modal Learning

🌟 Research Objective:

– To introduce and evaluate LongCat-Flash-Omni, a 560 billion parameter omni-modal model designed for real-time audio-visual interactions.

🛠️ Research Methods:

– Curriculum-inspired progressive training strategy and modality-decoupled parallelism scheme to enhance multimodal capabilities.

💬 Research Conclusions:

– LongCat-Flash-Omni demonstrates state-of-the-art performance on omni-modal benchmarks and excels in various modality-specific tasks, achieving low-latency real-time audio-visual interaction.

👉 Paper link: https://huggingface.co/papers/2511.00279

14. TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning

🔑 Keywords: Visual Reasoning, Multimodal Models, TIR-Bench, Chain-of-Thought, Thinking-with-Images

💡 Category: Multi-Modal Learning

🌟 Research Objective:

– To develop and introduce TIR-Bench, a benchmark for assessing advanced visual reasoning in multimodal models, focusing on tasks that require tool use and thought processes involving images.

🛠️ Research Methods:

– Evaluation of 22 multimodal large language models using the TIR-Bench benchmark across 13 diverse tasks, focusing on the integration of novel tool use in visual processing.

💬 Research Conclusions:

– TIR-Bench proves universally challenging: strong performance requires genuine visual reasoning and thinking-with-images capabilities.

– A pilot study compares the effectiveness of direct versus agentic fine-tuning in enhancing model performance.

👉 Paper link: https://huggingface.co/papers/2511.01833

15. EBT-Policy: Energy Unlocks Emergent Physical Reasoning Capabilities

🔑 Keywords: EBT-Policy, Energy-Based Models, Robotics, Zero-Shot Recovery, Uncertainty-Aware Inference

💡 Category: Robotics and Autonomous Systems

🌟 Research Objective:

– Introduce EBT-Policy, an energy-based policy architecture that addresses core challenges in robotic and other real-world settings by improving robustness and reducing computational cost.

🛠️ Research Methods:

– EBT-Policy leverages energy landscapes to enhance robustness and reduce exposure bias, and is demonstrated in both high-dimensional simulated spaces and real-world tasks (an inference sketch follows this entry).

💬 Research Conclusions:

– EBT-Policy consistently outperforms diffusion-based policies in robotic tasks, requiring less training and computation, and shows emergent capabilities such as zero-shot recovery and dynamic compute allocation.

👉 Paper link: https://huggingface.co/papers/2510.27545

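Energy-based policies infer actions by descending an energy landscape rather than through a single forward pass, which is where properties like dynamic compute allocation and uncertainty-aware inference come from (more optimization steps for harder inputs, final energy as a confidence signal). The sketch below uses a toy MLP energy and plain gradient descent; it is not the EBT-Policy architecture.

```python
import torch

class EnergyPolicy(torch.nn.Module):
    """Toy energy model E(obs, action); low energy = good action."""
    def __init__(self, obs_dim=8, act_dim=2):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(obs_dim + act_dim, 64), torch.nn.SiLU(), torch.nn.Linear(64, 1))

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def infer_action(model, obs, steps=20, lr=0.1):
    """Inference = minimizing the energy landscape over actions; the number of steps
    can vary per input, which enables dynamic compute allocation."""
    act = torch.zeros(obs.shape[0], 2, requires_grad=True)
    opt = torch.optim.SGD([act], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        energy = model(obs, act).sum()
        energy.backward()
        opt.step()
    return act.detach(), energy.item()      # the final energy can double as an uncertainty signal

model, obs = EnergyPolicy(), torch.randn(4, 8)
action, final_energy = infer_action(model, obs)
print(action.shape, round(final_energy, 3))
```
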
16. NaviTrace: Evaluating Embodied Navigation of Vision-Language Models

🔑 Keywords: NaviTrace, Visual Question Answering, robotic navigation, Vision-language models, semantic-aware trace score

💡 Category: Robotics and Autonomous Systems

🌟 Research Objective:

– The paper introduces NaviTrace, a benchmark designed to evaluate robotic navigation capabilities through a semantic-aware trace score.

🛠️ Research Methods:

– Evaluation of eight state-of-the-art Vision-language models using a newly introduced semantic-aware trace score across 1000 scenarios and more than 3000 expert traces.

💬 Research Conclusions:

– The study finds a significant gap between model and human performance due to poor spatial grounding and goal localization, establishing NaviTrace as a scalable benchmark for real-world robotic navigation.

👉 Paper link: https://huggingface.co/papers/2510.26909

17. Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench

🔑 Keywords: MeasureBench, Vision-Language Models, Indicator Localization, Reinforcement Learning, Fine-Grained Spatial Grounding

💡 Category: Multi-Modal Learning

🌟 Research Objective:

– The paper introduces MeasureBench to evaluate the ability of vision-language models to read measurements from images, both real-world and synthesized, aiming to identify challenges in this domain.

🛠️ Research Methods:

– They utilize a data synthesis pipeline to procedurally generate various types of gauges with controllable visual aspects and conduct evaluations on popular proprietary and open-weight vision-language models.

💬 Research Conclusions:

– Vision-language models struggle with indicator localization, leading to significant errors despite correct textual reasoning. Preliminary experiments with reinforcement learning yielded positive results on synthetic data, but not on real-world images, highlighting a limitation in fine-grained spatial grounding.

👉 Paper link: https://huggingface.co/papers/2510.26865

18. |↻BUS|: A Large and Diverse Multimodal Benchmark for evaluating the ability of Vision-Language Models to understand Rebus Puzzles

🔑 Keywords: Rebus Puzzles, Vision-Language Models, structured reasoning, in-context example selection

💡 Category: Multi-Modal Learning

🌟 Research Objective:

– To enhance the performance of Vision-Language Models in solving Rebus Puzzles through structured reasoning and optimal example selection.

🛠️ Research Methods:

– Development of a benchmark with 1,333 diverse Rebus Puzzles covering 18 categories.

– Proposal of RebusDescProgICE, a model-agnostic framework incorporating unstructured descriptions and code-based structured reasoning.

💬 Research Conclusions:

– The proposed framework improves performance by 2.1-4.1% for closed-source and 20-30% for open-source Vision-Language Models compared to existing methods.

👉 Paper link: https://huggingface.co/papers/2511.01340

19. Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models

🔑 Keywords: Viewpoint Learning, Multimodal Large Language Models, 3D reasoning, Supervised Fine-Tuning, Reinforcement Learning

💡 Category: Multi-Modal Learning

🌟 Research Objective:

– The paper aims to enhance the spatial reasoning capabilities of Multimodal Large Language Models (MLLMs) for robust 3D reasoning tasks.

🛠️ Research Methods:

– Introduces Viewpoint Learning with a two-stage fine-tuning strategy and a hybrid cold-start initialization method.

– Utilizes a dataset, Viewpoint-100K, comprising diverse object-centric image pairs and question-answer sets.

– Employs Supervised Fine-Tuning (SFT) for foundational knowledge injection followed by generalization through Reinforcement Learning with the Group Relative Policy Optimization (GRPO) algorithm (a GRPO advantage sketch follows this entry).

💬 Research Conclusions:

– The methods substantially improve the spatial reasoning ability of MLLMs, benefiting both in-domain and out-of-domain tasks.

– Suggests the development of foundational spatial skills in MLLMs to support advancements in robotics, autonomous systems, and 3D scene understanding.

👉 Paper link: https://huggingface.co/papers/2511.01618

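For the GRPO stage, the core idea is that advantages come from normalizing rewards within a group of responses sampled for the same question, with a PPO-style clipped update and no learned value function. A minimal sketch with toy 0/1 rewards, assuming precomputed log-probabilities, is shown below.

```python
import torch

def grpo_advantages(rewards):
    """Group Relative Policy Optimization: advantages are rewards normalized within
    the group of responses sampled for the same prompt (no learned value function)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def grpo_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate using the group-relative advantages."""
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()

# One prompt, a group of 6 sampled answers with verifiable 0/1 rewards (e.g., answer correctness).
rewards = torch.tensor([1., 0., 0., 1., 1., 0.])
adv = grpo_advantages(rewards)
loss = grpo_loss(logp_new=torch.randn(6, requires_grad=True),
                 logp_old=torch.randn(6), advantages=adv)
loss.backward()
print(adv, loss.item())
```
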
20. Trove: A Flexible Toolkit for Dense Retrieval

🔑 Keywords: Trove, retrieval toolkit, data management, multi-node execution, inference pipeline

💡 Category: AI Systems and Tools

🌟 Research Objective:

– Introduce Trove, an open-source retrieval toolkit designed to streamline data management and simplify research experimentation through customizable, efficient on-the-fly processing.

🛠️ Research Methods:

– Trove utilizes efficient data management techniques to handle and process retrieval datasets dynamically with minimal code. It supports customizable components and offers a low-code pipeline for evaluation and hard negative mining.

💬 Research Conclusions:

– Trove significantly reduces memory consumption and delivers inference speed that scales with the number of processing nodes. It facilitates exploratory research by simplifying retrieval experiments and supporting extensive customization.

👉 Paper link: https://huggingface.co/papers/2511.01857

21. Data-Efficient RLVR via Off-Policy Influence Guidance

🔑 Keywords: Reinforcement Learning, Verifiable Rewards, Influence Functions, Off-Policy Estimation, Sparse Random Projection

💡 Category: Reinforcement Learning

🌟 Research Objective:

– To enhance data selection in Reinforcement Learning with Verifiable Rewards (RLVR) and improve the efficiency of training large language models.

🛠️ Research Methods:

– Proposes a theoretically grounded approach that leverages influence functions to estimate each example's contribution, with off-policy influence estimation and sparse random projection for efficient computation (sketched after this entry).

💬 Research Conclusions:

– Develops Curriculum RL with Off-Policy Influence guidance (CROPI), which significantly accelerates training while using less data, demonstrating the potential of influence-based data selection for RLVR.

👉 Paper link: https://huggingface.co/papers/2510.26491

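A compact sketch of influence-guided data selection: compress per-example gradients with a sparse random projection and rank candidates by how well their projected gradients align with the gradient of a target objective. The sizes and dense random "gradients" are toy values, and CROPI's off-policy estimation details are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_random_projection(dim, k, density=0.1):
    """Sparse sign matrix for compressing high-dimensional gradients into k dimensions."""
    mask = rng.random((dim, k)) < density
    signs = rng.choice([-1.0, 1.0], size=(dim, k))
    return (mask * signs) / np.sqrt(k * density)

def influence_scores(train_grads, target_grad, proj):
    """First-order influence: how much each candidate's (projected) gradient
    aligns with the gradient of the target objective."""
    return (train_grads @ proj) @ (proj.T @ target_grad)

dim, k, n = 10_000, 128, 50                      # toy sizes; real LLM gradients are far larger
proj = sparse_random_projection(dim, k)
train_grads = rng.normal(size=(n, dim))          # stand-in per-example gradients
target_grad = rng.normal(size=dim)               # stand-in gradient of the target objective
scores = influence_scores(train_grads, target_grad, proj)
selected = np.argsort(-scores)[:10]              # keep the most influential prompts for training
print(selected)
```
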
22. Towards Robust Mathematical Reasoning

🔑 Keywords: IMO-Bench, Mathematical Reasoning, Foundation Models, Gemini Deep Think

💡 Category: Knowledge Representation and Reasoning

🌟 Research Objective:

– The objective of the study is to advance the mathematical reasoning capabilities of foundation models using a specially designed suite of benchmarks called IMO-Bench, focused on IMO-level problems.

🛠️ Research Methods:

– IMO-Bench, comprising IMO-AnswerBench and IMO-ProofBench along with detailed grading guidelines, is used to evaluate models’ performance on mathematical reasoning tasks covering both short answers and proof writing.

💬 Research Conclusions:

– Gemini Deep Think achieved gold-level performance at IMO 2025 and significantly outperforms previous models on both IMO-AnswerBench and IMO-ProofBench; its automatic evaluations correlate strongly with human gradings, and IMO-GradingBench is introduced to support further progress on evaluating long-form answers.

👉 Paper link: https://huggingface.co/papers/2511.01846

23. MotionStream: Real-Time Video Generation with Interactive Motion Controls

🔑 Keywords: MotionStream, text-to-video model, motion control, real-time streaming, sliding-window causal attention

💡 Category: Generative Models

🌟 Research Objective:

– Enable real-time, motion-controllable video generation with sub-second latency at up to 29 frames per second (FPS) by incorporating motion control into text-to-video models through MotionStream.

🛠️ Research Methods:

– The approach distills a bidirectional text-to-video model with motion control into a causal student using Self Forcing with Distribution Matching Distillation, combined with sliding-window causal attention and attention sinks (a mask sketch follows this entry).

💬 Research Conclusions:

– MotionStream delivers state-of-the-art results in motion following and video quality with significantly reduced latency, uniquely supporting infinite-length streaming. It allows for an interactive experience where users can control video elements in real-time.

👉 Paper link: https://huggingface.co/papers/2511.01266

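Sliding-window causal attention with attention sinks can be summarized as a mask: each position attends to a few global "sink" tokens plus a recent window of tokens, which keeps memory bounded and makes infinite-length streaming possible. A minimal mask construction, with illustrative window and sink sizes, follows.

```python
import torch

def sink_window_mask(seq_len, window=16, n_sink=4):
    """Causal attention mask with attention sinks: every position may attend to the first
    n_sink tokens plus the most recent `window` tokens (True = attention allowed)."""
    i = torch.arange(seq_len).unsqueeze(1)       # query positions
    j = torch.arange(seq_len).unsqueeze(0)       # key positions
    causal = j <= i
    recent = (i - j) < window
    sink = j < n_sink
    return causal & (recent | sink)

mask = sink_window_mask(seq_len=64)
print(mask.shape, int(mask[63].sum()))           # last position sees 4 sink + 16 recent tokens = 20
```
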
24. How Far Are Surgeons from Surgical World Models? A Pilot Study on Zero-shot Surgical Video Generation with Expert Assessment

🔑 Keywords: Video Generation, Surgical AI, Surgical Plausibility Pyramid, Zero-shot Prediction, Visual Perceptual Plausibility

💡 Category: AI in Healthcare

🌟 Research Objective:

– To address the gap between visual plausibility and causal understanding in surgical AI models by introducing the SurgVeo benchmark and the Surgical Plausibility Pyramid.

🛠️ Research Methods:

– The study presented SurgVeo, an expert-curated benchmark, and evaluated the Veo-3 model on a zero-shot prediction task using surgical clips assessed by surgeons through the four-tiered SPP framework.

💬 Research Conclusions:

– The analysis underscores a “plausibility gap” where the Veo-3 model excels in visual mimicry but fails in demonstrating causal understanding, particularly at higher tiers of the SPP, highlighting directions for future AI model improvements in healthcare.

👉 Paper link: https://huggingface.co/papers/2511.01775

25. UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings

🔑 Keywords: generative multimodal embedding, enhancement, reasoning-driven generation, reinforcement learning, inference-time scalability

💡 Category: Generative Models

🌟 Research Objective:

– To pioneer the exploration of generative embeddings by unifying embedding tasks within a generative paradigm, enhancing performance through reasoning-driven generation and reinforcement learning.

🛠️ Research Methods:

– Introduces UME-R1, a framework with a two-stage training strategy: cold-start supervised fine-tuning to equip the model with reasoning capabilities, and reinforcement learning to optimize generative embedding quality.

💬 Research Conclusions:

– Generative embeddings achieve substantial performance gains over discriminative models.

– The combination of discriminative and generative embeddings surpasses the performance of either type alone.

– Reinforcement learning effectively enhances generative embeddings, establishing a scalable optimization paradigm.

– Repeated sampling at inference boosts task coverage, highlighting scalability potential.

👉 Paper link: https://huggingface.co/papers/2511.00405

26. Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process

🔑 Keywords: Unified Diffusion VLA, Joint Discrete Denoising Diffusion Process, vision-language-action, faster inference, hybrid attention mechanism

💡 Category: Multi-Modal Learning

🌟 Research Objective:

– The research aims to enhance vision-language-action tasks through a synchronous denoising process that integrates multiple modalities for joint understanding, generation, and action.

🛠️ Research Methods:

– The study introduces the Unified Diffusion VLA model and JD3P, the Joint Discrete Denoising Diffusion Process, employing a unified tokenized space and a hybrid attention mechanism, along with a two-stage training pipeline to optimize performance and efficiency.

💬 Research Conclusions:

– The proposed approach achieves state-of-the-art results on several benchmarks with significantly faster inference compared to autoregressive methods, demonstrating effectiveness in both simulated and real-world evaluations.

👉 Paper link: https://huggingface.co/papers/2511.01718

27. GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding

🔑 Keywords: GUI Grounding, MLLMs, Attention-based, Coordinate-free, Data Efficiency

💡 Category: Multi-Modal Learning

🌟 Research Objective:

– Develop an attention-based and coordinate-free framework, GUI-AIMA, to enhance GUI grounding with minimal training data.

🛠️ Research Methods:

– Utilize alignment of MLLM attention with patch-wise signals and employ multi-head aggregation on query-visual attention matrices for adaptive signal calculation (an aggregation sketch follows this entry).

💬 Research Conclusions:

– GUI-AIMA achieves state-of-the-art performance with exceptional data efficiency, demonstrating an average accuracy of 58.6% on ScreenSpot-Pro and 62.2% on OSWorld-G using only 85k screenshots.

👉 Paper link: https://huggingface.co/papers/2511.00810

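A sketch of coordinate-free grounding from attention: aggregate query-to-visual-patch attention across heads and take the center of the winning patch as the click point, rather than decoding coordinates as text. The uniform head weighting here is a simplification of GUI-AIMA's adaptive multi-head aggregation, and the attention values are random placeholders.

```python
import numpy as np

def ground_from_attention(attn, patch_grid, head_weights=None):
    """Coordinate-free grounding (sketch): aggregate query->visual-patch attention over heads
    and return the normalized center of the highest-attention patch."""
    n_heads, n_patches = attn.shape
    w = np.ones(n_heads) / n_heads if head_weights is None else head_weights
    patch_scores = w @ attn                          # weighted multi-head aggregation
    best = int(patch_scores.argmax())
    rows, cols = patch_grid
    r, c = divmod(best, cols)
    return ((c + 0.5) / cols, (r + 0.5) / rows)      # normalized (x, y) click point

rng = np.random.default_rng(0)
attn = rng.random((12, 24 * 24))                     # 12 heads, 24x24 visual patches (toy values)
print(ground_from_attention(attn, patch_grid=(24, 24)))
```
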
28. Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers

🔑 Keywords: Vote-in-Context, Vision-Language Models, zero-shot retrieval, fusion, list-wise reranking

💡 Category: Multi-Modal Learning

🌟 Research Objective:

– Introduce Vote-in-Context, a training-free framework for zero-shot reranking and fusion in cross-modal video retrieval using Vision-Language Models.

🛠️ Research Methods:

– Serialize content evidence and retriever metadata within Vision-Language Model prompts.

– Utilize a compact serialization map called S-Grid, representing each video as an image grid, optionally paired with subtitles (sketched after this entry).

– Evaluate ViC as both a single-list reranker and an ensemble fuser across benchmarks such as ActivityNet and VATEX.

💬 Research Conclusions:

– ViC outperforms previous baselines in zero-shot retrieval scenarios, notably improving Recall@1 scores and establishing new state-of-the-art performance.

– The framework effectively handles complex visual, temporal, and textual signals, demonstrating its adaptability and robustness.

👉 Paper link: https://huggingface.co/papers/2511.01617

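The S-Grid serialization can be pictured as tiling sampled frames into a single image, so the VLM can judge a whole candidate video within one prompt (optionally alongside its subtitles). The grid shape, cell size, and frame sampling below are illustrative, not the paper's exact settings.

```python
from PIL import Image

def s_grid(frames, cols=3, cell=(224, 224)):
    """S-Grid (sketch): tile sampled frames into one image so a VLM can inspect a whole
    video candidate inside a single prompt."""
    rows = -(-len(frames) // cols)                    # ceiling division
    grid = Image.new("RGB", (cols * cell[0], rows * cell[1]))
    for i, frame in enumerate(frames):
        r, c = divmod(i, cols)
        grid.paste(frame.resize(cell), (c * cell[0], r * cell[1]))
    return grid

# Toy frames; in practice these would be uniformly sampled from each candidate video.
frames = [Image.new("RGB", (320, 240), color=(i * 40, 0, 0)) for i in range(6)]
grid = s_grid(frames)
print(grid.size)    # (672, 448): a 3x2 grid of 224x224 cells
```
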
29. AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence

🔑 Keywords: Large Language Models, Cyber Threat Intelligence, AthenaBench, Risk Mitigation, Reasoning Capabilities

💡 Category: Natural Language Processing

🌟 Research Objective:

– The research aims to enhance the benchmarking of Large Language Models in the context of Cyber Threat Intelligence by developing AthenaBench, which improves dataset quality and evaluation metrics.

🛠️ Research Methods:

– The study extends CTIBench into AthenaBench and evaluates twelve LLMs, including high-end proprietary and open-source models, across various CTI tasks, with a focus on reasoning-intensive tasks such as threat actor attribution and risk mitigation.

💬 Research Conclusions:

– Proprietary LLMs showcase stronger capabilities yet fall short in reasoning-intensive tasks, while open-source models perform even worse, highlighting the need for models specifically designed for CTI automation.

👉 Paper link: https://huggingface.co/papers/2511.01144

30. Multi-Step Knowledge Interaction Analysis via Rank-2 Subspace Disentanglement

🔑 Keywords: Parametric Knowledge, Context Knowledge, Natural Language Explanations, Large Language Models, knowledge interactions

💡 Category: Natural Language Processing

🌟 Research Objective:

– This study proposes a novel rank-2 projection subspace to analyze the multi-step interactions between Parametric Knowledge and Context Knowledge in Large Language Models.

🛠️ Research Methods:

– The research utilizes a rank-2 subspace approach for the first time to disentangle and assess the contributions of PK and CK across longer NLE sequences. Experiments were conducted on four QA datasets using three instruction-tuned Large Language Models (a projection sketch follows this entry).

💬 Research Conclusions:

– The findings reveal that diverse knowledge interactions are inadequately represented in a rank-1 subspace but are effectively captured using the rank-2 projection. Hallucinated NLEs tend to align with PK, while context-faithful NLEs balance both PK and CK. Additionally, Chain-of-Thought prompting shifts NLEs towards CK by reducing reliance on PK.

👉 Paper link: https://huggingface.co/papers/2511.01706

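A small sketch of reading rank-2 coordinates: orthonormalize a parametric-knowledge (PK) direction and a context-knowledge (CK) direction, then project each hidden state of the generated explanation onto that plane. How the two directions are estimated, which is the core of the paper, is not shown; the vectors below are random placeholders.

```python
import numpy as np

def rank2_coordinates(hidden_states, pk_direction, ck_direction):
    """Project hidden states onto the plane spanned by a PK direction and a CK direction;
    the two coordinates track how much each knowledge source contributes at every step."""
    basis, _ = np.linalg.qr(np.stack([pk_direction, ck_direction], axis=1))  # orthonormal rank-2 basis
    return hidden_states @ basis                       # [n_steps, 2]

rng = np.random.default_rng(0)
d = 256
pk_dir, ck_dir = rng.normal(size=d), rng.normal(size=d)
steps = rng.normal(size=(10, d))                       # hidden states along a generated explanation
coords = rank2_coordinates(steps, pk_dir, ck_dir)
print(coords.shape, coords[0])                         # per-step (PK, CK) contributions
```
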

Copyright 2025 AI Native Foundation©. All rights reserved.