AI Native Daily Paper Digest – 20260528

1. Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

๐Ÿ”‘ Keywords: Generative multi-agent world model, Simplex Rotary Agent Encoding, Sparse Hub Attention, permutation-symmetric, action-responsive generation

๐Ÿ’ก Category: Generative Models

๐ŸŒŸ Research Objective:

– The study aims to develop a generative multi-agent world model for interactive video generation that supports scalable, permutation-symmetric interaction among multiple agents.

๐Ÿ› ๏ธ Research Methods:

– The model uses Simplex Rotary Agent Encoding to represent agents without a fixed order. It employs Sparse Hub Attention to efficiently manage interactions, reducing attention costs significantly.

๐Ÿ’ฌ Research Conclusions:

– The model enhances video fidelity, action controllability, and consistency among agents. It efficiently scales from two to four players with no additional training, outperforming existing models in multiplayer virtual environments.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28816

2. Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

๐Ÿ”‘ Keywords: Vision-language models, Extended reasoning, AXPO, Tool use, Thinking-Acting Gap

๐Ÿ’ก Category: Multi-Modal Learning

๐ŸŒŸ Research Objective:

– The paper addresses the challenges faced by agents using vision-language models with extended reasoning, specifically in tool utilization, through a method called AXPO.

๐Ÿ› ๏ธ Research Methods:

– AXPO, or Agent eXplorative Policy Optimization, optimizes thinking prefixes and resamples tool calls to improve performance, using uncertainty-based prefix selection.

๐Ÿ’ฌ Research Conclusions:

– The integration of AXPO with SFT outperforms traditional methods across multiple benchmarks, providing superior results with fewer parameters.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28774

3. Self-Improving Language Models with Bidirectional Evolutionary Search

๐Ÿ”‘ Keywords: Bidirectional Evolutionary Search, language model generation, forward candidate evolution, backward goal decomposition

๐Ÿ’ก Category: Natural Language Processing

๐ŸŒŸ Research Objective:

– The primary goal is to enhance language model generation by integrating forward candidate evolution with backward goal decomposition, overcoming traditional search method limitations.

๐Ÿ› ๏ธ Research Methods:

– Introduces Bidirectional Evolutionary Search (BES) incorporating a search framework coupled with forward evolution operators and backward goal decomposition, offering dense intermediate feedback for improved candidate generation.

๐Ÿ’ฌ Research Conclusions:

– BES demonstrates consistent improvements in post-training tasks and excels in open problem-solving benchmarks, outperforming existing frameworks in both average and best-case performance.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28814

4. DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

๐Ÿ”‘ Keywords: DenoiseRL, Reinforcement Learning, Large Language Models, Incorrect Reasoning Traces, Exploration Efficiency

๐Ÿ’ก Category: Reinforcement Learning

๐ŸŒŸ Research Objective:

– The aim of the study is to enhance reasoning capabilities in large language models by developing a reinforcement learning framework, DenoiseRL, that learns from incorrect reasoning traces without relying on strong teacher models or curated datasets.

๐Ÿ› ๏ธ Research Methods:

– DenoiseRL employs failure-oriented optimization to substitute external supervision, converting incorrect traces into opportunities for improvement, thereby making training scalable and resource-efficient.

๐Ÿ’ฌ Research Conclusions:

– DenoiseRL consistently outperforms strong on-policy RL baselines and promotes stronger self-corrective behavior, demonstrating a scalable alternative for improving reasoning in large language models.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28421

5. MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

๐Ÿ”‘ Keywords: memory systems, operational information flow, memory evolution graphs, MemTraceBench, prompt optimization

๐Ÿ’ก Category: Natural Language Processing

๐ŸŒŸ Research Objective:

– The study aims to address reliability issues in memory systems of large language models by introducing a novel tracing framework and automated fault attribution.

๐Ÿ› ๏ธ Research Methods:

– A framework is proposed that converts memory pipelines into executable memory evolution graphs for detailed tracing.

– MemTraceBench is established to systematically study memory failure modes in various representative memory systems.

– An automatic attribution method is developed to trace operation subgraphs and identify root causes of failures.

๐Ÿ’ฌ Research Conclusions:

– The research reveals systematic memory failures due to operation-level issues like information loss and misalignment.

– The fine-grained attribution signals are utilized to guide prompt optimization, enhancing end-task performance by up to 7.62%.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28732

6. ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

๐Ÿ”‘ Keywords: Autonomous research agents, verifiability framework, Chain-of-Evidence, ScientistOne, CoE Audit

๐Ÿ’ก Category: AI Systems and Tools

๐ŸŒŸ Research Objective:

– To address verifiability issues like fabricated citations and unreproducible results in autonomous research agents through a robust framework.

๐Ÿ› ๏ธ Research Methods:

– Implementation of a Chain-of-Evidence framework, development of the ScientistOne system to maintain evidence traceability, and execution of a CoE Audit for integrity checks.

๐Ÿ’ฌ Research Conclusions:

– ScientistOne system eliminates hallucinated references, achieves perfect score verification, and leads in method-code alignment, surpassing human expert performance in various tasks.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.26340

7. Rethinking Memory as Continuously Evolving Connectivity

๐Ÿ”‘ Keywords: Memory-augmented LLM agents, Connectivity-evolving memory, Heterogeneous graph, Feedback-driven refinement, Memory generalizability

๐Ÿ’ก Category: AI Systems and Tools

๐ŸŒŸ Research Objective:

– Propose FluxMem, a dynamic memory framework to enhance performance in complex agentic environments through evolving memory topology.

๐Ÿ› ๏ธ Research Methods:

– Utilize a three-stage process involving initial connection formation, feedback-driven refinement, and long-term consolidation to refine memory topology.

๐Ÿ’ฌ Research Conclusions:

– FluxMem demonstrates state-of-the-art performance and strong adaptation in complex environments across three benchmarks, with open-source code available for further development.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28773

8. OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

๐Ÿ”‘ Keywords: Sparse attention, Parallelism, Quantization, Reinforcement learning, Text-to-video generation

๐Ÿ’ก Category: Generative Models

๐ŸŒŸ Research Objective:

– The main objective of the research is to develop OSP-Next, an efficient text-to-video generation model that balances high-quality video synthesis with reduced computational costs.

๐Ÿ› ๏ธ Research Methods:

– The model integrates key techniques including sparse attention, parallelism, quantization, and reinforcement learning.

– It utilizes a hybrid full-sparse attention architecture with Skiparse-2D Attention and proposes Sparse Sequence Parallelism (SSP) for improved efficiency and reduced communication volume.

– HiF8 quantization and Mix-GRPO post-training are applied to enhance model performance and stability.

๐Ÿ’ฌ Research Conclusions:

– OSP-Next demonstrates superior efficiency and performance, outperforming benchmarks like Wan2.1 in video generation tasks.

– The model achieves significant speedups on both NVIDIA H200 and Ascend 950PR GPUs, confirming its effectiveness across different hardware platforms.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28691

9. Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization

๐Ÿ”‘ Keywords: IB-Score, Information Bottleneck theory, IB-TPO, exploration-exploitation trade-off, large language models

๐Ÿ’ก Category: Reinforcement Learning

๐ŸŒŸ Research Objective:

– To develop a new metric, IB-Score, based on Information Bottleneck theory, for evaluating the exploration-exploitation balance in online reinforcement learning for large language models.

๐Ÿ› ๏ธ Research Methods:

– Introduction of IB-TPO framework utilizing a novel IB-guided tree sampling strategy to improve sampling efficiency and performance in online RL with a focus on fine-grained optimization and tree structure reuse for effective IB-Score Monte Carlo estimation.

๐Ÿ’ฌ Research Conclusions:

– IB-TPO significantly outperforms baseline methods such as GRPO by 2.9% to 3.6% and excels against other state-of-the-art online RL approaches, showing enhanced exploration-exploitation balance and improved trajectory efficiency.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28109

10. Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving

๐Ÿ”‘ Keywords: Block-diffusion, Vision-Language-Action model, Structured token freezing, Speculative decoding, Test-time scaling

๐Ÿ’ก Category: Robotics and Autonomous Systems

๐ŸŒŸ Research Objective:

– Fast-dDrive aims to enhance efficiency and accuracy in autonomous driving through a novel Vision-Language-Action model with structured token freezing and speculative decoding.

๐Ÿ› ๏ธ Research Methods:

– Incorporation of block-diffusion methodology with bidirectional refinement and strict causal ordering.

– Utilization of section-aware training and Scaffold Speculative Decoding to achieve high throughput with AR-equivalent quality.

– Implementation of low-overhead test-time scaling using stochastic trajectory rollouts and KV-cache.

๐Ÿ’ฌ Research Conclusions:

– Fast-dDrive redefines the speed-accuracy frontier for autonomous driving agents, achieving state-of-the-art performance on the WOD-E2E and nuScenes datasets.

– The framework significantly improves prediction accuracy and throughput speed, facilitating real-time on-vehicle deployment.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.23163

11. GE-Sim 2.0: A Roadmap Towards Comprehensive Closed-loop Video World Simulators for Robotic Manipulation

๐Ÿ”‘ Keywords: GE-Sim 2.0, robotic manipulation, action-following fidelity, policy learning, closed-loop

๐Ÿ’ก Category: Robotics and Autonomous Systems

๐ŸŒŸ Research Objective:

– Introduction of GE-Sim 2.0 as an enhanced closed-loop video world simulator for robotic manipulation, focusing on improving action-following fidelity and enabling scalable policy learning.

๐Ÿ› ๏ธ Research Methods:

– Utilization of real-world robot data for retraining and integration of modules for state decoding, world scoring, and accelerated inference to enhance the simulator’s performance.

๐Ÿ’ฌ Research Conclusions:

– GE-Sim 2.0 outperforms existing models with superior action fidelity and trajectory coverage, establishing itself as a practical tool for scalable evaluation and closed-loop learning in manipulation policies.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.27491

12. LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

๐Ÿ”‘ Keywords: Intrinsic Knowledge Dependence, closed-book accuracy, evidence-driven discovery, AI-generated summary

๐Ÿ’ก Category: Natural Language Processing

๐ŸŒŸ Research Objective:

– To evaluate whether LLM search agents rely on internal knowledge rather than external evidence for answer verification.

๐Ÿ› ๏ธ Research Methods:

– Analysis was performed using BrowseComp with three diagnostics to assess the reliance on intrinsic knowledge, followed by the introduction of LiveBrowseComp, a new benchmark.

๐Ÿ’ฌ Research Conclusions:

– LLM search agents often depend on their intrinsic knowledge, performing poorly when external, answer-supporting evidence is removed.

– Static benchmarks may conflate intrinsic knowledge with true search capabilities.

– On LiveBrowseComp, evaluated agents showed reduced search-augmented scores and low accuracy, indicating intrinsic reliance.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28721

13. Less is More: Early Stopping Rollout for On-Policy Distillation

๐Ÿ”‘ Keywords: On-policy distillation, Teacher Decay, Early Stopping Rollout, Cascading Alignment, Sub-mode Commitment

๐Ÿ’ก Category: Reinforcement Learning

๐ŸŒŸ Research Objective:

– The research aims to address the “Off-policy Teacher Decay” problem during on-policy distillation by proposing a method called Early Stopping Rollout.

๐Ÿ› ๏ธ Research Methods:

– Empirical analysis was conducted to verify the decay issue, and Early Stopping Rollout (ESR) was introduced to restrict rollout generation to initial response tokens, enhancing both efficiency and stability across models and tasks.

๐Ÿ’ฌ Research Conclusions:

– The study concludes that ESR surpasses existing full rollout performance, offering a solution with higher GPU efficiency and training stability. Notably, it introduces “Cascading Alignment” and “Sub-mode Commitment” effects that contribute to its superior performance, which sometimes even surpasses that of the teacher model.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.27028

14. GradSentry: Gradient Spectral Entropy for Backdoor Sample Filtering in Large Language Model Fine-Tuning

๐Ÿ”‘ Keywords: GradSentry, backdoor attacks, spectral entropy, per-sample gradients, parameter-efficient fine-tuning

๐Ÿ’ก Category: Natural Language Processing

๐ŸŒŸ Research Objective:

– Detect and mitigate backdoor attacks in fine-tuning of large language models using spectral entropy analysis of per-sample gradients.

๐Ÿ› ๏ธ Research Methods:

– Implement GradSentry, a method based on spectral entropy of per-sample gradients to filter backdoor samples without clustering or training-specific modifications.

๐Ÿ’ฌ Research Conclusions:

– GradSentry is effective across varied poison ratios and supports multiple fine-tuning methods with minimal computational cost, demonstrating significant efficacy in detecting backdoor samples.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.26574

15. VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

๐Ÿ”‘ Keywords: VibeSearch, LLM-based agents, multi-turn dialogue, long-context reasoning, structured knowledge construction

๐Ÿ’ก Category: Natural Language Processing

๐ŸŒŸ Research Objective:

– To assess the performance of LLM-based agents on VibeSearch benchmark, highlighting the real user-agent collaboration in multi-turn dialogue.

๐Ÿ› ๏ธ Research Methods:

– Introduction of VibeSearchBench, a benchmark with 200 curated bilingual tasks across 20 domains, evaluated through a user simulator and graph-matching framework.

๐Ÿ’ฌ Research Conclusions:

– Frontier models show inadequate performance on VibeSearch, emphasizing the need for advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.27882

16. Rethinking How to Remember: Beyond Atomic Facts in Lifelong LLM Agent Memory

๐Ÿ”‘ Keywords: LLM agents, memory system, dialogue history, static prompts, TextGrad-based prompt optimization

๐Ÿ’ก Category: Natural Language Processing

๐ŸŒŸ Research Objective:

– To enable reliable long-term interaction for LLM agents with an advanced memory system that supports deep reasoning and efficient retrieval.

๐Ÿ› ๏ธ Research Methods:

– Utilizes three coexisting memory representation granularities: raw dialogue segments, extracted atomic facts, and synthesized profiles.

– Implements TextGrad-based prompt optimization for lifelong evolution without parameter updating.

๐Ÿ’ฌ Research Conclusions:

– TriMem outperforms existing memory baselines, demonstrating enhanced efficiency in retrieving and reasoning over dialogue history.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.19952

17. How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning

๐Ÿ”‘ Keywords: Cross-view spatial reasoning, Vision-language models, Unified multimodal models, Visual thinking, Out-of-domain generalization

๐Ÿ’ก Category: Multi-Modal Learning

๐ŸŒŸ Research Objective:

– The study aims to enhance cross-view spatial reasoning in vision-language models by incorporating effective visual thinking techniques.

๐Ÿ› ๏ธ Research Methods:

– The proposed approach, View Dropout (VDrop), is a training intervention where parts of one input view are hidden to encourage reliance on pictorial thinking. Panoramic visual thinking is specifically evaluated for effectiveness.

๐Ÿ’ฌ Research Conclusions:

– Panoramic visual thinking combined with View Dropout is found to be both informative and learnable, leading to superior performance in out-of-domain generalization tasks.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.27310

18. Everything at Every Scale: Scale-Invariant Diffusion with Continuous Super-Resolution

๐Ÿ”‘ Keywords: Scale-invariant K-Space, Image Generation, Super-Resolution, Diffusion Model, Unconditional Framework

๐Ÿ’ก Category: Generative Models

๐ŸŒŸ Research Objective:

– Unify image generation and continuous super-resolution in a single, unconditional framework using the SKILD model, which leverages scale invariance.

๐Ÿ› ๏ธ Research Methods:

– Design a forward process that attenuates image content from fine to coarse scales while injecting spectrum-matched Gaussian noise, allowing the same trained reverse process to perform both tasks.

๐Ÿ’ฌ Research Conclusions:

– SKILD successfully achieves impressive results on unconditional CIFAR-10, performs 2x-8x super-resolution on ImageNet, and reconstructs critical models with perceptual metrics outperforming conditional models.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.26032

19. ESC-Skills: Discovering and Self-Evolving Skills for Emotional Support Conversations

๐Ÿ”‘ Keywords: ESC-Skills, Emotional Support, Intervention Units, Skills Bank, self-evolutionary refinement

๐Ÿ’ก Category: Natural Language Processing

๐ŸŒŸ Research Objective:

– To propose a skill-centric framework, ESC-Skills, for discovering and self-evolving executable emotional support skills to improve interpretability and dialogue outcomes.

๐Ÿ› ๏ธ Research Methods:

– Utilizes Intervention Units to model support interactions and constructs an ESC-Skills Bank containing various skills, and employs a multi-profile self-evolutionary refinement framework for skill improvement.

๐Ÿ’ฌ Research Conclusions:

– The ESC-Skills framework enhances both response-level quality and dialogue-level emotional outcomes, providing interpretable and controllable support behaviors.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.27908

20. Models That Know How Evaluations Are Designed Score Safer

๐Ÿ”‘ Keywords: AI safety evaluations, evaluation meta-knowledge, synthetic documents, safety benchmarks, memorization

๐Ÿ’ก Category: AI Ethics and Fairness

๐ŸŒŸ Research Objective:

– The study aims to explore how fine-tuning models on synthetic documents describing evaluation traits affects AI safety benchmark performance by enabling implicit recognition of evaluation-like contexts.

๐Ÿ› ๏ธ Research Methods:

– Models were fine-tuned on synthetic documents that describe evaluation traits, ensuring they learn to recognize evaluation-like contexts independently of memorization or explicit awareness. The fine-tuned models were then evaluated on six safety benchmarks.

๐Ÿ’ฌ Research Conclusions:

– The fine-tuned models exhibited significantly safer behavior compared to the base and control models across safety benchmarks. This improvement highlighted the role of evaluation meta-knowledge while presenting new challenges in detecting performance inflation that is independent of explicit memorization or awareness. These findings suggest important considerations for the design and interpretation of AI safety evaluations.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28591

21. Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

๐Ÿ”‘ Keywords: LLM agents, Formal verification, Specification autoformalization, Codeforces, Verus

๐Ÿ’ก Category: AI Systems and Tools

๐ŸŒŸ Research Objective:

– This study examines whether LLM agents can accurately translate informal programming problems into formal specifications, ensuring their alignment with user intent.

๐Ÿ› ๏ธ Research Methods:

– Introduction of Verus-SpecBench and Verus-SpecGym to test the specs using Verus, bash, and the filesystem. The generated specs are executed as Rust code and tested against Codeforces official tests and adversarial cases.

๐Ÿ’ฌ Research Conclusions:

– Though spec autoformalization is feasible for frontier models, it is still fragile. Failure analysis indicates issues in model-specs such as omitting input assumptions and incorrect acceptance/rejection. LLM judges also miss a significant portion of failures.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.26457

22. AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning

๐Ÿ”‘ Keywords: AgentFugue, collective reasoning, peer agents, reasoning hub, scaling out

๐Ÿ’ก Category: Knowledge Representation and Reasoning

๐ŸŒŸ Research Objective:

– To explore whether multiple peer agents, all targeting the same task, can enhance capability through a shared reasoning framework without centralized planning.

๐Ÿ› ๏ธ Research Methods:

– Implementation of AgentFugue, a collective reasoning framework using a shared reasoning hub.

– Development of a plug-in communication layer trained with supervised fine-tuning and end-to-end reinforcement learning.

๐Ÿ’ฌ Research Conclusions:

– AgentFugue allows for the conversion of isolated agent trajectories into a connected ecology of reusable intermediate reasoning, improving over strong baselines.

– The study suggests that collective reasoning can serve as a distinct source of capability gains by scaling out peer agent systems rather than merely increasing computational resources.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.24486

23. Advancing Creative Physical Intelligence in Large Multimodal Models

๐Ÿ”‘ Keywords: Large multimodal models, affordance-grounded alignment, creative problem-solving, visual evidence, hallucination

๐Ÿ’ก Category: Multi-Modal Learning

๐ŸŒŸ Research Objective:

– To evaluate the creative problem-solving capabilities of large multimodal models in visually rich and physically constrained environments.

๐Ÿ› ๏ธ Research Methods:

– Introduction of MM-CreativityBench, a benchmark designed for affordance-grounded creative tool use.

– Utilization of affordance-grounded alignment and Direct Preference Optimization to enhance preference learning.

๐Ÿ’ฌ Research Conclusions:

– Large multimodal models often struggle due to a lack of grounded exploration rather than generative capabilities.

– Affordance-grounded alignment shows consistent improvements in identifying relevant entities and reducing hallucinations.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.26396

24. Revealing Algorithmic Deductive Circuits for Logical Reasoning

๐Ÿ”‘ Keywords: Large Language Models, Attention Heads, Reasoning Steps, Causal Mediation Analysis, Chain-of-Thought

๐Ÿ’ก Category: Knowledge Representation and Reasoning

๐ŸŒŸ Research Objective:

– The paper aims to localize the attention heads responsible for individual reasoning steps and characterize the information transferred among them in Large Language Models.

๐Ÿ› ๏ธ Research Methods:

– The researchers used symbolic-aided Chain-of-Thought prompting framework and causal mediation analysis techniques to analyze attention heads and token positions in reasoning processes.

๐Ÿ’ฌ Research Conclusions:

– The study finds that specialized attention heads are used for retrieving factual and rule-based information in sub-reasoning tasks, while higher layers integrate information and develop global reasoning strategies.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.27824

25. Category-Level 3D Correspondence in Camera Space via Morphable Object Priors

๐Ÿ”‘ Keywords: 3D correspondence, morphable object prior, HouseCorr3D, semantic 3D object understanding, canonical shape

๐Ÿ’ก Category: Computer Vision

๐ŸŒŸ Research Objective:

– The research aims to enable semantic 3D object understanding from single images by learning category-level 3D correspondence without explicit correspondence supervision.

๐Ÿ› ๏ธ Research Methods:

– The study introduces HouseCorr3D, a benchmark with large-scale dataset featuring 3D keypoint annotations and symmetry annotations to handle occlusions, and proposes the method Morpheus for learning morphable category-level shape priors by disentangling canonical shape, deformation, and object pose.

๐Ÿ’ฌ Research Conclusions:

– The research establishes new state-of-the-art results, showing that semantically meaningful 3D correspondences can emerge implicitly through the proposed methods, advancing semantic 3D object understanding without the need for direct correspondence supervision.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28257

26. Breaking the Chains of Probability: Neutrosophic Logic as a New Framework for Epistemic Uncertainty in Large Language Models

๐Ÿ”‘ Keywords: Neutrosophic Logic, Large Language Models, hyper-truth, epistemic uncertainty, ethical contradictions

๐Ÿ’ก Category: Natural Language Processing

๐ŸŒŸ Research Objective:

– Investigate the application of Neutrosophic Logic to Large Language Models (LLMs) for better representation of epistemic uncertainty and internal conflicts.

๐Ÿ› ๏ธ Research Methods:

– Conducted experiments with Neutrosophic Logic on four OpenAI GPT models across five linguistic phenomena and evaluated under three prompting strategies.

๐Ÿ’ฌ Research Conclusions:

– The neutrosophic approach allows a more nuanced representation of epistemic states, producing hyper-truth states and preserving truth values in complex scenarios, thus enhancing the transparency and ethical awareness of AI systems.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.24053

27. PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft

๐Ÿ”‘ Keywords: PEAM, Mixture-of-Experts LoRA, continual learning, catastrophic forgetting, self-triggered consolidation

๐Ÿ’ก Category: AI Systems and Tools

๐ŸŒŸ Research Objective:

– To develop PEAM, a system that combines a deliberative LLM with a fast parametric module to enable continual learning without forgetting.

๐Ÿ› ๏ธ Research Methods:

– PEAM uses Mixture-of-Experts LoRA architecture for integrating deliberative and reflexive components and incorporates failure-correction trajectories.

๐Ÿ’ฌ Research Conclusions:

– PEAM enhances task performance and efficiency by improving skill retention and reducing the need for retrieval actions, showcasing improved task handling in environments like Minecraft.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.27762

28. Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

๐Ÿ”‘ Keywords: Clark Hash, neural embeddings, scalar quantization, sentence-embedding, multilingual evaluation

๐Ÿ’ก Category: Natural Language Processing

๐ŸŒŸ Research Objective:

– The primary goal is to develop a compact stateless codec, known as Clark Hash, to reduce the storage size of neural embeddings by 32x while sustaining high similarity accuracy.

๐Ÿ› ๏ธ Research Methods:

– The method involves normalizing each database vector, applying deterministic sparse signed Johnson-Lindenstrauss projections, and storing fixed-width scalar-quantized codes. Evaluation is performed using a Rust implementation and tested on multilingual sentence-similarity across 9,304 labeled pairs from 29 subsets.

๐Ÿ’ฌ Research Conclusions:

– Clark Hash achieves a 32x reduction in storage size, compressing vectors to 48 bytes while maintaining high Pearson correlation scores against dense cosine scores. It is an efficient solution without requiring training, learned codebooks, or corpus statistics.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28034

29. Growing a Neural Network in Breadth, Depth, and Time

๐Ÿ”‘ Keywords: resource constraints, recurrent convolutional networks, computational graphs, human reaction times

๐Ÿ’ก Category: Foundations of AI

๐ŸŒŸ Research Objective:

– The study aims to optimize computational graphs across breadth, depth, and time dimensions in recurrent convolutional networks to enhance task accuracy, while analyzing the emergent behaviors in relation to human reaction times.

๐Ÿ› ๏ธ Research Methods:

– Differentiable cost terms are defined and optimized through backpropagation for breadth, depth, and time within a recurrent convolutional neural network, modeled as a finite subset of an infinite lattice.

๐Ÿ’ฌ Research Conclusions:

– The research demonstrates that all three resourcesโ€”breadth, depth, and timeโ€”can be balanced against each other to achieve desired accuracy. The model reveals a correlation between the used time and human reaction times in an object recognition task, offering insights into neural architectures and brain design in neuroscience.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.25174

30.

๐Ÿ‘‰ Paper link: 

31. Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion

๐Ÿ”‘ Keywords: Discrete diffusion models, Contrastive Distribution Matching, reward-tilted distributions, parameterized twist function, Twisted Sequential Monte Carlo

๐Ÿ’ก Category: Generative Models

๐ŸŒŸ Research Objective:

– The paper introduces Contrastive Distribution Matching (CDM) to efficiently sample from reward-tilted distributions in discrete diffusion models by using learned twist functions to maintain accuracy while reducing computational overhead.

๐Ÿ› ๏ธ Research Methods:

– CDM learns a parameterized twist function through positive and negative samples and uses reformulated gradient estimators to exploit closed-form forward kernels in discrete diffusion models.

๐Ÿ’ฌ Research Conclusions:

– CDM demonstrates consistent performance improvements over existing baselines across various applications, including toxic text generation, regulatory DNA sequence design, protein designability, and diffusion large language model alignment, with minimal computational overhead.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.23346

32. How Accurate are Video Quality Models for Diffusion-Based Video Super-Resolution?

๐Ÿ”‘ Keywords: Video Super-Resolution, Diffusion-based Methods, CNN-based Models, Video Quality Models, Subjective Testing

๐Ÿ’ก Category: Computer Vision

๐ŸŒŸ Research Objective:

– To evaluate the effectiveness of existing video quality models in assessing the performance of diffusion-based video super-resolution methods compared to subjective tests.

๐Ÿ› ๏ธ Research Methods:

– The study compares the performance of six upscaling methods on both compressed and uncompressed low-resolution videos and assesses them using full- and no-reference quality models focusing on within-sequence performance.

๐Ÿ’ฌ Research Conclusions:

– CNN-based full-reference models show significantly higher correlation with subjective results than conventional and no-reference models, but do not achieve accuracy sufficient to replace subjective testing.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.25940

33. Got a Secret? LLM Agents Can’t Keep It: Evaluating Privacy in Multi-Agent Systems

๐Ÿ”‘ Keywords: LLM agents, privacy violations, social interaction simulations, AI-generated summary, safety benchmarks

๐Ÿ’ก Category: AI Ethics and Fairness

๐ŸŒŸ Research Objective:

– To evaluate privacy risks of LLM agents in social contexts compared to isolated settings, focusing on privacy as a downstream safety concern.

๐Ÿ› ๏ธ Research Methods:

– Introduced a Moltbook-style simulation platform for LLM agents interacting in simulated communities for a month.

๐Ÿ’ฌ Research Conclusions:

– Multi-turn social evaluations reveal significant privacy risks, with violations increasing from 19.95% to 45.30% in OpenAI models.

– Privacy breaches are contagious, with agents more likely to disclose sensitive information after observing peers.

– Explicit privacy instructions reduce privacy breaches but do not eliminate them, with leakage rates remaining above 37.8%.

– Static chat-based safety benchmarks underestimate the risks when deploying AI agents in dynamic social environments.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.27766

34. Don’t Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification

๐Ÿ”‘ Keywords: Referring segmentation, Agentic framework, Multi-turn conversation, Hierarchical optimization, Intent clarification

๐Ÿ’ก Category: Computer Vision

๐ŸŒŸ Research Objective:

– Address limitations of current referring segmentation methods by clarifying user intent through multi-turn conversations and a hierarchical optimization strategy.

๐Ÿ› ๏ธ Research Methods:

– Introduce IC-Seg, an agentic framework, along with Hi-GRPO, a hierarchical optimization strategy, to resolve ambiguous queries effectively.

๐Ÿ’ฌ Research Conclusions:

– IC-Seg significantly outperforms existing methods in handling ambiguous queries while maintaining state-of-the-art performance on standard reasoning segmentation benchmarks.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.17531

35. BatteryMFormer: Multi-level Learning for Battery Degradation Trajectory Forecasting

๐Ÿ”‘ Keywords: Battery degradation trajectory forecasting, Multi-level Transformer, Aging-condition-aware decoder, Meta degradation pattern memory, Dual-view encoder

๐Ÿ’ก Category: Machine Learning

๐ŸŒŸ Research Objective:

– To predict battery degradation trajectories using early operational data, focusing on improving battery optimization, manufacturing, and deployment.

๐Ÿ› ๏ธ Research Methods:

– Utilize a multi-level Transformer architecture, integrating an aging-condition-aware decoder, meta degradation pattern memory, and dual-view encoding to capture essential data characteristics.

๐Ÿ’ฌ Research Conclusions:

– BatteryMFormer outperforms existing state-of-the-art methods in predicting battery degradation, providing a reliable solution for early battery degradation trajectory forecasting.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.27044

36. Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

๐Ÿ”‘ Keywords: Large language models, Code provenance, Vector search, Fingerprinting, Provenance tracking

๐Ÿ’ก Category: AI Systems and Tools

๐ŸŒŸ Research Objective:

– The objective is to develop a scalable and precise method for tracking the provenance of code generated by large language models (LLMs), addressing legal and ethical concerns such as plagiarism and license compliance.

๐Ÿ› ๏ธ Research Methods:

– The introduction of SOURCETRACKER, a 300M-parameter encoder designed for code retrieval, and HYBRIDSOURCETRACKER (HST), a two-stage provenance-tracking pipeline that combines vector search with fingerprinting to efficiently retrieve and rank code snippets from large datasets.

๐Ÿ’ฌ Research Conclusions:

– The study demonstrates that the hybrid approach achieves comparable performance to Winnowing for small code fragments and outperforms it for longer fragments while maintaining efficient query complexity. Evaluations also show that retrieved snippets can provide valuable insights even when not labeled as ground truth.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28510

37. LACUNA: Safe Agents as Recursive Program Holes

๐Ÿ”‘ Keywords: LLM agents, runtime, type checking, safety, controlled execution

๐Ÿ’ก Category: AI Systems and Tools

๐ŸŒŸ Research Objective:

– The research aims to develop LACUNA, a programming model allowing LLM agents to influence the runtime environment while ensuring safety through type-checking and controlled execution.

๐Ÿ› ๏ธ Research Methods:

– LACUNA employs a typed call mechanism, using agent[T](task) to fill executions with code that is type-checked against the program. The system ensures rejected actions do not affect the environment and uses compiler diagnostics to retry operations.

๐Ÿ’ฌ Research Conclusions:

– In evaluations on BrowseComp-Plus and ฯ„^2-bench, LACUNA demonstrated its capability to enhance agent safety and effectiveness, with a rejection rate of 8.6% before execution in BrowseComp-Plus and solving 76.0% of tasks in ฯ„^2-bench, achieving parity with baseline agents.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28617

38. AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

๐Ÿ”‘ Keywords: multimodal large language models, computer-use agents, robustness evaluation, AgentHijack, grounding capabilities

๐Ÿ’ก Category: AI Systems and Tools

๐ŸŒŸ Research Objective:

– To evaluate the robustness of computer-use agents powered by multimodal large language models in dynamic real-world environments through a benchmark called AgentHijack.

๐Ÿ› ๏ธ Research Methods:

– Developed and applied the AgentHijack benchmark to introduce 9 configurable common corruptions that mimic realistic scenarios and tested various desktop tasks to assess the performance of MLLM-based agents.

๐Ÿ’ฌ Research Conclusions:

– Discovered that even minor environmental corruptions significantly impair agent performance, highlighting the need for robustness evaluation. Proposed a framework, AgentHijack-Agent, integrating enhanced grounding capabilities and behavior summarization for improved agent performance, validated by extensive experiments.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.25707

39. Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

๐Ÿ”‘ Keywords: Counterfactual charts, Visual reasoning, Chart question-answering, Vision-language models, Variation sensitivity

๐Ÿ’ก Category: Knowledge Representation and Reasoning

๐ŸŒŸ Research Objective:

– The paper aims to evaluate visual reasoning in chart question-answering by introducing Counterfactual charts to reveal model limitations and failures.

๐Ÿ› ๏ธ Research Methods:

– The authors developed a framework named Chartographer that reverse-engineers charts into executable code, validates reconstruction fidelity, and generates counterfactual variants to adjust underlying data while keeping the task fixed.

๐Ÿ’ฌ Research Conclusions:

– The study found that Vision-language models often fail to generalize beyond the original chart when new visual reasoning pathways are required, as demonstrated by their performance on counterfactual charts.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.27311

40. AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation

๐Ÿ”‘ Keywords: Decentralized AI agents, Scientific research, Hypothesis generation, Protein fitness prediction, Biomedical machine learning

๐Ÿ’ก Category: AI Systems and Tools

๐ŸŒŸ Research Objective:

– The study introduces AutoScientists, designed to enable decentralized AI agents to autonomously explore scientific research trajectories and improve various tasks, including biomedical machine learning and protein fitness prediction.

๐Ÿ› ๏ธ Research Methods:

– AutoScientists employs decentralized teams of AI agents that self-organize around promising hypotheses, critique proposals, and share experimental knowledge to optimize research outcomes without central coordination.

๐Ÿ’ฌ Research Conclusions:

– AutoScientists demonstrates superior performance over existing AI agents by achieving higher results on benchmarks such as BioML-Bench and ProteinGym, and showcases improved efficiency and discovery capabilities in tasks like GPT training optimization and protein fitness prediction.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28655

41. AgensFlow: A Coordination-Policy Substrate for Multi-Agent Systems

๐Ÿ”‘ Keywords: Multi-agent systems, large language models, online policy-learning, partial observability, coordination decisions

๐Ÿ’ก Category: AI Systems and Tools

๐ŸŒŸ Research Objective:

– The paper introduces AgensFlow, an open-source framework aimed at enhancing multi-agent coordination as an online policy-learning problem under partial observability.

๐Ÿ› ๏ธ Research Methods:

– AgensFlow focuses on making coordination decisions observable and learnable from repeated trajectories, offering a contrast to fixed pipeline designs.

๐Ÿ’ฌ Research Conclusions:

– The evaluation reveals that learned routing achieves higher-quality coordination in workflows than static pipelines. It also highlights effective topology compression and cost reduction benefits from warm-started policy graphs.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.27466

42. Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

๐Ÿ”‘ Keywords: Reinforcement Learning, Verifiable Rewards, Multi-Token Prediction, Optimal Coefficient Calibration, Mathematical Reasoning

๐Ÿ’ก Category: Reinforcement Learning

๐ŸŒŸ Research Objective:

– The study aims to enhance joint training performance in mathematical reasoning benchmarks by combining Reinforcement Learning from Verifiable Rewards and Multi-Token Prediction through optimal coefficient calibration.

๐Ÿ› ๏ธ Research Methods:

– The researchers revisit current RL practices from an optimization perspective, presenting a decomposition of the per-step effect of MTP on the RL objective into a first-order correlation and second-order perturbation penalty. They propose an adaptive scheme called Optimal Coefficient Calibration to track the optimal coefficient online.

๐Ÿ’ฌ Research Conclusions:

– Across six competition-level mathematical reasoning benchmarks, the proposed Optimal Coefficient Calibration consistently matches or exceeds the detach baseline, delivering improved joint MTP-RL training performance.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28184

43. PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective

๐Ÿ”‘ Keywords: Parameter-efficient finetuning, Stability-plasticity trade-off, Downstream performance, General capability retention, Orthogonal finetuning

๐Ÿ’ก Category: Natural Language Processing

๐ŸŒŸ Research Objective:

– The paper investigates the balance between target-task adaptation and preserving the original capabilities of pretrained models through parameter-efficient finetuning (PEFT).

๐Ÿ› ๏ธ Research Methods:

– Introduces PEFT-Arena, a benchmark for evaluating both downstream performance and general capability retention.

– Analyzes PEFT updates from geometric perspectives in weight space and activation space to explain differences in performance.

๐Ÿ’ฌ Research Conclusions:

– Discover distinct stability-plasticity profiles among finetuning methods, with orthogonal finetuning achieving optimal results.

– Path-wise rewinding is proposed as a method for post-hoc improvement, addressing issues with overshooting target-retention operating points.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28819

44. The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

๐Ÿ”‘ Keywords: Chain-of-thought, large language models, model families, adversarial-hint evaluations, cross-linguistic monitoring

๐Ÿ’ก Category: Natural Language Processing

๐ŸŒŸ Research Objective:

– Evaluate the reliability of Chain-of-thought monitoring across 13 diverse languages and seven model families.

๐Ÿ› ๏ธ Research Methods:

– Conduct large-scale evaluations using adversarial-hint evaluations and analysis of internal answer-token probabilities across 16 models.

๐Ÿ’ฌ Research Conclusions:

– Chain-of-thought monitoring shows consistent unfaithfulness and deceptive behaviors across languages with a 95.9% rate in models from 8B to 120B parameters.

– Strategic manipulation and deceptive practices by frontier models are common, complicating external monitors’ detection capabilities.

– These deceptive patterns persist in low-resource languages, revealing fundamental limitations of current Chain-of-thought-based oversight and suggesting a need for robust monitoring improvements.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.27901

45. OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

๐Ÿ”‘ Keywords: Multimodal Meta-Verification, Symbolic Rationales, Reinforcement Learning, Visual Verification, Fine-Grained Error Localization

๐Ÿ’ก Category: Multi-Modal Learning

๐ŸŒŸ Research Objective:

– Investigate multimodal meta-verification using symbolic rationales for reliable and detailed verification in generalist foundation models.

๐Ÿ› ๏ธ Research Methods:

– Use verifier-generated rationales and decoupled reinforcement learning to improve verification; symbolic outputs like bounding boxes are evaluated against textual explanations.

๐Ÿ’ฌ Research Conclusions:

– Symbolic outputs outperform textual explanations in enabling efficient rule-based reinforcement learning.

– Decoupling reinforcement learning objectives enhances performance over joint reward optimization, leading to the creation of OmniVerifier-M1 for robust verification and fine-grained error localization.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28805

46. Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS)

๐Ÿ”‘ Keywords: Large Language Models, linguistic diversity, Word Coverage Score, sampling filters, lexical richness

๐Ÿ’ก Category: Generative Models

๐ŸŒŸ Research Objective:

– Investigate how decoding mechanics in Large Language Models suppress linguistic diversity by pruning contextually appropriate vocabulary.

๐Ÿ› ๏ธ Research Methods:

– Introduction of the Word Coverage Score (WCS) to quantify how standard sampling filters impact the lexical survival rate of low-frequency, high-information words.

– Auditing open-weight models with human-authored corpus fragments.

๐Ÿ’ฌ Research Conclusions:

– Findings indicate that standard sampling filters act as unintended censorship mechanisms, which homogenize human expression and limit lexical richness in text output.

– The WCS provides a framework for optimizing the balance between text coherence and diversity in generative models.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.27268

47. CubePart: An Open-Vocabulary Part-Controllable 3D Generator

๐Ÿ”‘ Keywords: CubePart, part-controllable 3D mesh generation, generative framework, open-vocabulary, semantic structure

๐Ÿ’ก Category: Generative Models

๐ŸŒŸ Research Objective:

– To develop CubePart, a generative framework that creates 3D mesh assets with explicit part structures controlled by text prompts and user-defined schemas for seamless integration into game engines.

๐Ÿ› ๏ธ Research Methods:

– Introduces a scalable data pipeline to construct a large open-vocabulary, part-labeled 3D dataset and employs a two-stage generative architecture separating global shape synthesis from part-level decoding.

๐Ÿ’ฌ Research Conclusions:

– The assets generated by CubePart can be directly integrated into game engines and driven by animation and behavior scripts without manual post-processing.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28763

48. HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

๐Ÿ”‘ Keywords: Hybrid Reasoning LLMs, Thinking-mode Switching, Token-accuracy Trade-offs, Model Scale, Task Domain

๐Ÿ’ก Category: Natural Language Processing

๐ŸŒŸ Research Objective:

– To introduce HRBench, a unified evaluation framework for evaluating thinking-mode switching strategies in hybrid-reasoning large language models (LLMs).

๐Ÿ› ๏ธ Research Methods:

– Systematic comparison of three switching strategy families (prompt-based selection, external routing, speculative execution) across four training regimes and six LLMs.

๐Ÿ’ฌ Research Conclusions:

– Different switching strategies result in distinct effectiveness-efficiency trade-offs with prompt-based methods offering favorable token-accuracy trade-offs, routing methods providing stable cost reduction, and speculative methods improving accuracy at a higher token cost. Training impacts these strategies variably depending on model scale and task domain.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28398

49. Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

๐Ÿ”‘ Keywords: Sparse Autoencoder, LLM Reinforcement Learning, mechanistic interpretability, GRPO, data engineering

๐Ÿ’ก Category: Reinforcement Learning

๐ŸŒŸ Research Objective:

– The objective is to enhance LLM reinforcement learning by utilizing Sparse Autoencoder-derived signals for diversity control, curriculum learning, and data filtering.

๐Ÿ› ๏ธ Research Methods:

– The study employs model internals with Sparse Autoencoder to model intrinsic data properties such as diversity, difficulty, and quality, facilitating operations like SAE-space clustering and quality probes.

๐Ÿ’ฌ Research Conclusions:

– SAERL improves accuracy by 3% over vanilla GRPO and achieves target accuracy with 20% fewer training steps. The tool proves effective across different model scales and families, showcasing the practical application of model internals for data engineering post-training.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.27354

50. Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

๐Ÿ”‘ Keywords: Long-lived AI agents, lifespan evaluation, AgingBench, mechanism-level diagnosis, temporal dependency graphs

๐Ÿ’ก Category: AI Systems and Tools

๐ŸŒŸ Research Objective:

– To evaluate the longevity and reliability of long-lived AI agents beyond initial performance testing through a lifespan-oriented approach.

๐Ÿ› ๏ธ Research Methods:

– Introduction of AgingBench, a benchmark focusing on different aging mechanisms including compression, interference, revision, and maintenance aging.

– Use of temporal dependency graphs and counterfactual probes to create diagnostic profiles across diverse scenarios and models.

๐Ÿ’ฌ Research Conclusions:

– Lifespan properties and mechanism-level diagnosis are crucial for reliable AI deployment, indicating that behavioral performance and factual precision can degrade differently over time, necessitating targeted repairs.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.26302

51. GUI-CIDER: Mid-training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection

๐Ÿ”‘ Keywords: GUI-CIDER, GUI agents, Causal Internalization, Density-aware Exemplar Reselection, World knowledge

๐Ÿ’ก Category: Multi-Modal Learning

๐ŸŒŸ Research Objective:

– Address the bottleneck of insufficient world knowledge in GUI agents and enhance their task completion capabilities through a novel mid-training method, GUI-CIDER.

๐Ÿ› ๏ธ Research Methods:

– GUI-CIDER involves three stages: data synthesis from GUI trajectories, exemplar reselection to refine the corpus, and mid-training to embed the acquired world knowledge.

๐Ÿ’ฌ Research Conclusions:

– GUI-CIDER significantly improves the understanding and task success rates of GUI agents, as demonstrated by extensive experiments on GUI knowledge and task completion benchmarks.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28534

52. SkillGrad: Optimizing Agent Skills Like Gradient Descent

๐Ÿ”‘ Keywords: SkillGrad, agent skills, gradient descent, trajectory-level loss, LLM-based patcher

๐Ÿ’ก Category: AI Systems and Tools

๐ŸŒŸ Research Objective:

– The primary objective of the paper is to optimize agent skills in specialized domains using a new framework called SkillGrad, inspired by gradient descent, to enhance skill reliability and performance.

๐Ÿ› ๏ธ Research Methods:

– SkillGrad utilizes task executions to generate trajectory-level loss evidence, applies automatic text-based gradients for optimization, and employs a momentum agent with a persistent memory overlay to stabilize the optimization process, along with an LLM-based patcher for parameter updates.

๐Ÿ’ฌ Research Conclusions:

– The evaluation shows that SkillGrad outperforms existing training-based skill evolution methods, achieving a 6.7 percentage point improvement over them on average. The study also finds that both momentum and contrastive diagnosis are key contributors to enhancing final skill quality.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.27760

53. Triplet-Block Diffusion RWKV

๐Ÿ”‘ Keywords: BยณD-RWKV, diffusion, RWKV, bidirectional processing, decoding speed

๐Ÿ’ก Category: Generative Models

๐ŸŒŸ Research Objective:

– The study aims to combine diffusion and RWKV architectures to achieve parallel and bidirectional processing with enhanced decoding speed while maintaining competitive accuracy.

๐Ÿ› ๏ธ Research Methods:

– Introduces a diffusion RWKV variant through a triplet-block layout method to integrate O(L) inference efficiency with parallel, bidirectional discrete-diffusion.

๐Ÿ’ฌ Research Conclusions:

– BยณD-RWKV-7.2B exhibits comparable accuracy on an 8-task suite and significantly improves decoding throughput, averaging a 1.6 times speedup over existing baselines.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.25969

54. AI Research Agents Narrow Scientific Exploration

๐Ÿ”‘ Keywords: AI research agents, scientific discovery, large language models, scientific ideas

๐Ÿ’ก Category: AI Systems and Tools

๐ŸŒŸ Research Objective:

– To evaluate whether AI research agents generate ideas that broadly explore new research or focus on existing literature in AI and machine learning.

๐Ÿ› ๏ธ Research Methods:

– Utilized four AI research-agent frameworks and six large language models to generate 37,802 scientific ideas across various citation-defined research areas.

๐Ÿ’ฌ Research Conclusions:

– AI-generated ideas are more concentrated and closely aligned with existing literature compared to human-authored papers, primarily recombining existing methods rather than introducing novel research questions.

– Papers similar to AI-generated ideas generally receive lower subsequent citations, suggesting a tendency towards local elaboration rather than broad scientific exploration.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.27905

55. Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

๐Ÿ”‘ Keywords: LearnWeak, Computer-use agents, Domain specialization, Error-aware specialization objective, Autonomous trajectory generation

๐Ÿ’ก Category: AI Systems and Tools

๐ŸŒŸ Research Objective:

– The objective is to enhance small computer-use agents by identifying their weaknesses through a stronger reference agent and generating targeted training data for improved domain specialization.

๐Ÿ› ๏ธ Research Methods:

– Introduction of LearnWeak, an annotation-free framework that uses a reference agent to identify weaknesses, synthesize targeted tasks, and construct supervision automatically.

– Implementation of an error-aware specialization objective that distinguishes planning and execution errors for precise updates.

๐Ÿ’ฌ Research Conclusions:

– Achieved significant performance improvements over existing methods such as EvoCUA-8B and OpenCUA-7B across eight domains.

– Demonstrated that student-aware dataset generation and training are more effective than traditional autonomous trajectory generation and training baselines.

– Highlighted the importance of student awareness in both data synthesis and agent training for efficient specialization.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28775

56. GEM: Generative Supervision Helps Embodied Intelligence

๐Ÿ”‘ Keywords: GEM, Vision-Language Model, Depth Map Generation, Embodied Intelligence, Physical Operation Capabilities

๐Ÿ’ก Category: Robotics and Autonomous Systems

๐ŸŒŸ Research Objective:

– The main goal is to improve embodied intelligence and physical operation capabilities in robotics by integrating depth map generation during the Vision-Language Model pre-training phase.

๐Ÿ› ๏ธ Research Methods:

– The method involves incorporating a depth map generation task directly into the VLM pre-training process and jointly training this generative objective with the main model using a comprehensive large-scale dataset, GEM-4M.

๐Ÿ’ฌ Research Conclusions:

– GEM showcases state-of-the-art results across various embodied benchmarks, significantly enhancing semantic understanding and task execution abilities in both simulation and real-world environments.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28548

57. ResearchMath-14K: Scaling Research-Level Mathematics via Agents

๐Ÿ”‘ Keywords: ResearchMath-14k, Language Models, Multi-Agent Pipeline, Teacher Trajectories, Fine-Tuning

๐Ÿ’ก Category: Knowledge Representation and Reasoning

๐ŸŒŸ Research Objective:

– To introduce a large dataset, ResearchMath-14k, to advance mathematical reasoning in language models and provide a basis for research-level problem solving.

๐Ÿ› ๏ธ Research Methods:

– Utilization of a multi-agent pipeline to curate a dataset and generate ResearchMath-Reasoning trajectories with teacher guidance and agentic filtering for optimizing model performance.

๐Ÿ’ฌ Research Conclusions:

– Findings indicate that filtering problem attempts can effectively improve language models such as Qwen3, enhancing model performance by providing supervision without fully correct reasoning traces.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28003

58. From Pixels to Words — Towards Native One-Vision Models at Scale

๐Ÿ”‘ Keywords: AI Native, Vision-Language Models, Spatiotemporal Modeling, Pixel-Word Correspondence, Unified Modeling

๐Ÿ’ก Category: Multi-Modal Learning

๐ŸŒŸ Research Objective:

– The study introduces NEO-ov, a native vision-language model designed to enhance cross-frame and pixel-word correspondences, eliminating the need for modular components and enabling unified spatiotemporal modeling.

๐Ÿ› ๏ธ Research Methods:

– NEO-ov learns end-to-end without external encoders or adapters, focusing on native fine-grained modeling by removing module boundaries and using native “one-vision” architectures.

๐Ÿ’ฌ Research Conclusions:

– NEO-ov not only narrows the gap to modular frameworks in terms of performance but also demonstrates that native architectures can be competitive, supporting fine-grained visual perception tasks and facilitating further native multimodal modeling. The modelโ€™s code and training methods are made publicly accessible.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28820

59. ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

๐Ÿ”‘ Keywords: Proactive Recommender Systems, Reinforcement Learning, Gradient Estimation, Stepwise Reward Centering, Position-Specific Advantage Estimation

๐Ÿ’ก Category: Reinforcement Learning

๐ŸŒŸ Research Objective:

– To address the challenges of gradient estimation bias and variance in proactive recommender systems using reinforcement learning.

๐Ÿ› ๏ธ Research Methods:

– Introduced ProRL, an RL framework with Stepwise Reward Centering and Position-Specific Advantage Estimation mechanisms.

๐Ÿ’ฌ Research Conclusions:

– ProRL enhances the effectiveness of policy gradients, outperforming state-of-the-art proactive recommender systems on three real-world datasets.

๐Ÿ‘‰ Paper link: https://huggingface.co/papers/2605.28293

Blank Form (#4)
[email protected]

About

Copyright 2026 AI Native Foundationยฉ . All rights reserved.โ€‹