AI Native Daily Paper Digest – 20250717

1. Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs
🔑 Keywords: Reasoning-Enhanced RAG, RAG-Enhanced Reasoning, Synergized RAG-Reasoning, Trustworthy, Human-Centric
💡 Category: Knowledge Representation and Reasoning
🔍 Research Objective:
– To survey how reasoning and retrieval can be integrated in Large Language Models to enhance factuality and multi-step inference.
🛠️ Research Methods:
– Analyzing and mapping how advanced reasoning optimizes each stage of Retrieval-Augmented Generation (RAG).
– Exploring how retrieved knowledge supports complex inference in RAG-Enhanced Reasoning frameworks.
– Featuring Synergized RAG-Reasoning frameworks that interleave iterative search and reasoning.
💬 Research Conclusions:
– Synergized RAG-Reasoning frameworks achieve state-of-the-art performance across knowledge-intensive benchmarks.
– The survey categorizes methods and datasets, and outlines future research directions toward more effective, multimodally adaptive, and human-centric RAG-Reasoning systems.
👉 Paper link: https://huggingface.co/papers/2507.09477
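The interleaved search-and-reason loop that the surveyed synergized frameworks share can be sketched as below. The toy corpus, the keyword retriever, and the stopping rule are illustrative stand-ins, not any surveyed system's actual components.

```python
# Minimal sketch of a synergized retrieve-and-reason loop: reasoning decides
# when evidence suffices and otherwise refines the next retrieval query.

TOY_CORPUS = {
    "capital": "Paris is the capital of France.",
    "river": "The Seine flows through Paris.",
}

def retrieve(query: str) -> list[str]:
    """Toy keyword retriever: return passages sharing a word with the query."""
    words = set(query.lower().split())
    return [doc for doc in TOY_CORPUS.values()
            if words & set(doc.lower().split())]

def reason(question: str, evidence: list[str]) -> tuple[str, bool]:
    """Toy 'reasoner': answer once any passage covers the question's topic."""
    for passage in evidence:
        if "capital" in question and "capital" in passage:
            return passage, True                      # enough evidence: stop
    return question + " (need more evidence)", False  # otherwise refine query

def rag_reasoning_loop(question: str, max_steps: int = 3) -> str:
    evidence: list[str] = []
    query = question
    answer = question
    for _ in range(max_steps):           # interleave search and reasoning
        evidence.extend(retrieve(query))
        answer, done = reason(question, evidence)
        if done:
            return answer
        query = answer                   # reasoning output steers retrieval
    return answer
```

The key design point the survey highlights is that retrieval and reasoning are not two sequential stages but a loop with a learned (here, hard-coded) termination decision.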

2. PhysX: Physical-Grounded 3D Asset Generation
🔑 Keywords: 3D generative models, physical properties, PhysXNet, PhysXGen, generative physical AI
💡 Category: Generative Models
🔍 Research Objective:
– To address the lack of physical properties in 3D generative models by introducing PhysX, an end-to-end paradigm for physical-grounded 3D asset generation.
🛠️ Research Methods:
– Introduction of PhysXNet, a physics-annotated dataset with five foundational dimensions: absolute scale, material, affordance, kinematics, and function description.
– Development of PhysXGen, a feed-forward framework employing a dual-branch architecture to integrate physical knowledge into 3D asset generation.
💬 Research Conclusions:
– PhysX significantly enhances the physical plausibility of generated 3D assets, validated through extensive experiments demonstrating superior performance and generalization.
– The release of all code, data, and models aims to support future research in the field of generative physical AI.
👉 Paper link: https://huggingface.co/papers/2507.12465

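The five annotation dimensions PhysXNet covers can be pictured as a record per asset. Field names, types, and units below are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical record type for PhysXNet's five foundational dimensions.

@dataclass
class PhysicalAsset:
    absolute_scale_m: float            # longest extent, in metres (assumed unit)
    material: str                      # e.g. "wood", "steel"
    affordance: list[str] = field(default_factory=list)       # e.g. ["graspable"]
    kinematics: dict[str, str] = field(default_factory=dict)  # part -> joint type
    function_description: str = ""     # free-text functional description

mug = PhysicalAsset(
    absolute_scale_m=0.10,
    material="ceramic",
    affordance=["graspable", "containable"],
    kinematics={},                     # a rigid mug has no articulated parts
    function_description="Holds liquids; the handle affords a pinch grasp.",
)
```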
3. SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?
🔑 Keywords: SWE-Perf, Large Language Models, code performance optimization, benchmark, GitHub repositories
💡 Category: AI Systems and Tools
🔍 Research Objective:
– Introduce SWE-Perf as the first benchmark specifically designed to evaluate Large Language Models on code performance optimization tasks using real-world repository data.
🛠️ Research Methods:
– Analyze 140 curated instances from performance-improving pull requests on GitHub, including codebases, target functions, performance tests, expert-authored patches, and executable environments.
💬 Research Conclusions:
– Identify a significant capability gap between existing Large Language Models and expert-level code optimization performance, highlighting opportunities for further research in this field.
👉 Paper link: https://huggingface.co/papers/2507.12415
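The core measurement such a benchmark depends on is timing a target function before and after an optimizing patch while preserving correctness. The two implementations below are toy stand-ins, not SWE-Perf instances.

```python
import timeit

# Toy "before/after patch" pair: the optimization replaces an intermediate
# list with a generator expression.

def target_before(n: int) -> int:
    return sum([i * i for i in range(n)])   # builds an intermediate list

def target_after(n: int) -> int:
    return sum(i * i for i in range(n))     # generator avoids the list

def measure(fn, n=10_000, repeats=5) -> float:
    """Best-of-repeats wall time, a common noise-robust runtime estimate."""
    return min(timeit.repeat(lambda: fn(n), number=10, repeat=repeats))

before = measure(target_before)
after = measure(target_after)
print(f"speedup: {before / after:.2f}x")
# A patch only counts if behavior is preserved:
assert target_before(1000) == target_after(1000)
```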

4. MMHU: A Massive-Scale Multimodal Benchmark for Human Behavior Understanding
🔑 Keywords: Human Behavior Analysis, Autonomous Driving, Motion Prediction, Behavior Question Answering
💡 Category: Robotics and Autonomous Systems
🔍 Research Objective:
– Introduce MMHU, a comprehensive benchmark for analyzing human behavior in autonomous driving, aiming to enhance driving safety through understanding human motions, trajectories, and intentions.
🛠️ Research Methods:
– Collect data from diverse sources, including Waymo, YouTube, and self-collected footage, yielding 57k human motion clips and 1.73M frames; develop a human-in-the-loop annotation pipeline to produce detailed behavior captions.
💬 Research Conclusions:
– A thorough dataset analysis and benchmarks for multiple tasks, including motion prediction and motion generation, offer a broad set of evaluation tools for human behavior in autonomous driving.
👉 Paper link: https://huggingface.co/papers/2507.12463
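Motion-prediction benchmarks of this kind are commonly scored with Average Displacement Error (ADE); whether MMHU uses exactly this metric is an assumption here.

```python
import math

def ade(pred: list[tuple[float, float]], gt: list[tuple[float, float]]) -> float:
    """Average Displacement Error: mean Euclidean distance between
    predicted and ground-truth positions over a trajectory."""
    assert len(pred) == len(gt), "trajectories must be the same length"
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists)

pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
gt   = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
print(ade(pred, gt))   # per-step errors 0, 1, 2 -> mean 1.0
```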

5. MOSPA: Human Motion Generation Driven by Spatial Audio
🔑 Keywords: Spatial Audio, Human Motion, MOSPA, Diffusion-based Framework, SAM Dataset
💡 Category: Generative Models
🔍 Research Objective:
– Introduce MOSPA, a diffusion-based generative framework, to model human motion in response to spatial audio and achieve state-of-the-art performance.
🛠️ Research Methods:
– Develop the SAM dataset containing high-quality spatial audio and motion data.
– Utilize a simple yet effective diffusion-based generative framework for modeling human motion driven by spatial audio.
💬 Research Conclusions:
– MOSPA successfully models the relationship between body motion and spatial audio, producing diverse and realistic human motions.
– The proposed method achieves state-of-the-art performance and will be open-sourced upon acceptance.
👉 Paper link: https://huggingface.co/papers/2507.11949

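The sampling pattern a conditional diffusion framework like MOSPA's follows can be sketched in miniature: start from noise and iteratively denoise a motion vector under an audio condition. The linear "denoiser" and schedule below are toy stand-ins for the learned network, not MOSPA's model.

```python
import random

def toy_denoiser(x, audio_feat, t):
    """Pretend network: pull the sample toward a target set by the audio.
    (t is unused here; a real denoiser conditions on the timestep.)"""
    target = [a * 2.0 for a in audio_feat]        # toy audio-to-motion mapping
    return [xi + 0.5 * (tg - xi) for xi, tg in zip(x, target)]

def sample_motion(audio_feat, steps=20, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in audio_feat]  # start from Gaussian noise
    for t in range(steps, 0, -1):                  # reverse diffusion steps
        x = toy_denoiser(x, audio_feat, t)
    return x

motion = sample_motion(audio_feat=[0.3, -0.1])
# With this toy denoiser the sample converges toward [0.6, -0.2].
```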
6. Seq vs Seq: An Open Suite of Paired Encoders and Decoders
🔑 Keywords: Large language model, Encoder-only, Decoder-only, Text generation, Open-source
💡 Category: Natural Language Processing
🔍 Research Objective:
– Introduce and compare encoder-only and decoder-only models using the state-of-the-art open-data Ettin suite, ranging from 17 million to 1 billion parameters.
🛠️ Research Methods:
– Train both encoder-only and decoder-only models on up to 2 trillion tokens using the same recipes to achieve state-of-the-art results.
💬 Research Conclusions:
– Encoder-only models excel in classification and retrieval, while decoder-only models excel in text generation; adapting models across architectures is less effective than using the original architecture.
– Open-source release of all training artifacts for community use.
👉 Paper link: https://huggingface.co/papers/2507.11412
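The encoder/decoder contrast ultimately comes down to the attention mask: encoders let every token attend to every other token, while decoders mask out future positions. A minimal framework-free sketch (Ettin itself is a Transformer suite; this only illustrates the masking difference):

```python
def attention_mask(seq_len: int, causal: bool) -> list[list[int]]:
    """mask[i][j] = 1 if position j is visible to position i, else 0."""
    return [[1 if (not causal or j <= i) else 0
             for j in range(seq_len)]
            for i in range(seq_len)]

encoder_mask = attention_mask(4, causal=False)  # full bidirectional visibility
decoder_mask = attention_mask(4, causal=True)   # lower-triangular (causal)
```

Bidirectional visibility is what makes encoders strong at classification and retrieval (every token sees full context), while the causal mask is what enables left-to-right text generation.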

7. DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering
🔑 Keywords: LLM agents, technical drawing revision, benchmark, open-source, function execution
💡 Category: AI Systems and Tools
🔍 Research Objective:
– Propose DrafterBench, an open-source benchmark for evaluating LLM agents on technical drawing revision, focusing on structured data comprehension and function execution.
🛠️ Research Methods:
– Develop DrafterBench with twelve task types and 46 customized functions/tools, tested across 1,920 tasks to assess AI agents’ capabilities in structured data comprehension, instruction following, and critical reasoning.
💬 Research Conclusions:
– DrafterBench offers detailed analysis of task accuracy and error statistics, providing deeper insight into agent capabilities and improvement targets for integrating LLMs into engineering applications.
👉 Paper link: https://huggingface.co/papers/2507.11527
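The function-execution pattern such a benchmark evaluates can be sketched as a tool registry plus a dispatcher: the agent emits a structured call, the harness executes it, and grading compares the resulting drawing state. The tool name and grading rule below are illustrative, not DrafterBench's actual 46 functions.

```python
TOOL_REGISTRY = {}

def tool(fn):
    """Register a callable as an agent-invocable tool."""
    TOOL_REGISTRY[fn.__name__] = fn
    return fn

@tool
def move_annotation(drawing: dict, label: str, dx: float, dy: float) -> dict:
    """Shift one labeled annotation by (dx, dy)."""
    x, y = drawing[label]
    drawing[label] = (x + dx, y + dy)
    return drawing

def execute(call: dict, drawing: dict) -> dict:
    """Dispatch one structured tool call emitted by the agent."""
    fn = TOOL_REGISTRY[call["tool"]]
    return fn(drawing, **call["args"])

drawing = {"door_tag": (10.0, 5.0)}
agent_call = {"tool": "move_annotation",
              "args": {"label": "door_tag", "dx": 2.0, "dy": 0.0}}
result = execute(agent_call, drawing)
# Grading step: compare against the expected revised drawing.
assert result["door_tag"] == (12.0, 5.0)
```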

8. AnyI2V: Animating Any Conditional Image with Motion Control
🔑 Keywords: AI Native, motion trajectories, conditional images, video generation, style transfer
💡 Category: Generative Models
🔍 Research Objective:
– The primary goal is to propose a training-free framework, AnyI2V, that animates conditional images with user-defined motion trajectories to enhance video generation versatility and flexibility.
🛠️ Research Methods:
– Introducing AnyI2V, which supports broader modalities and mixed conditional inputs, enabling features like style transfer and video editing without training, and integrating user-defined motion trajectories.
💬 Research Conclusions:
– AnyI2V provides superior performance in spatial- and motion-controlled video generation and opens up new possibilities in flexible and versatile video synthesis, as evidenced by extensive experiments.
👉 Paper link: https://huggingface.co/papers/2507.02857
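A user-defined motion trajectory of the kind this control signal relies on can be represented as sparse (frame, x, y) anchors interpolated to a per-frame path; this representation is an assumption for illustration, not AnyI2V's actual interface.

```python
def interpolate_trajectory(anchors, num_frames):
    """Linearly interpolate sparse (frame, x, y) anchors to every frame."""
    anchors = sorted(anchors)
    path = []
    for f in range(num_frames):
        # find the pair of anchors surrounding frame f
        for (f0, x0, y0), (f1, x1, y1) in zip(anchors, anchors[1:]):
            if f0 <= f <= f1:
                a = (f - f0) / (f1 - f0) if f1 != f0 else 0.0
                path.append((x0 + a * (x1 - x0), y0 + a * (y1 - y0)))
                break
        else:
            # before the first / after the last anchor: clamp to the endpoint
            f0, x0, y0 = anchors[0] if f < anchors[0][0] else anchors[-1]
            path.append((x0, y0))
    return path

path = interpolate_trajectory([(0, 0.0, 0.0), (4, 8.0, 4.0)], num_frames=5)
# frames 0..4 move linearly from (0, 0) to (8, 4)
```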

9. Lizard: An Efficient Linearization Framework for Large Language Models
🔑 Keywords: Subquadratic Architectures, Transformer-based LLMs, Hybrid Attention Mechanism, Gated Linear Attention, Hardware-Aware Training
💡 Category: Natural Language Processing
🔍 Research Objective:
– The research aims to develop “Lizard,” a linearization framework that transforms pre-trained Transformer-based LLMs into subquadratic architectures, allowing for efficient infinite-context generation while addressing memory and computational bottlenecks.
🛠️ Research Methods:
– Lizard utilizes a subquadratic attention mechanism approximating softmax attention, incorporates a gating module inspired by state-of-the-art linear models, and combines gated linear attention with sliding window attention enhanced by meta memory.
💬 Research Conclusions:
– The framework, Lizard, achieves near-lossless recovery of the teacher model’s performance on standard language modeling tasks and outperforms previous methods, notably improving by 18 points on the 5-shot MMLU benchmark and enhancing results on associative recall tasks.
👉 Paper link: https://huggingface.co/papers/2507.09025
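The gated linear-attention recurrence that makes such architectures subquadratic can be shown in a scalar-feature toy: a running key-value state is decayed by a gate each step, so memory stays constant regardless of context length. This is a one-dimensional sketch of the general recurrence, not Lizard's implementation.

```python
def gated_linear_attention(qs, ks, vs, gates):
    """One head, 1-D features: state S_t = g_t * S_{t-1} + k_t * v_t,
    output y_t = q_t * S_t. The state replaces the growing softmax KV cache."""
    state = 0.0
    outputs = []
    for q, k, v, g in zip(qs, ks, vs, gates):
        state = g * state + k * v      # gate decays old context
        outputs.append(q * state)      # constant-size state at any length
    return outputs

ys = gated_linear_attention(qs=[1.0, 1.0], ks=[0.5, 0.5],
                            vs=[2.0, 4.0], gates=[1.0, 0.5])
# step 1: state = 0.5*2.0 = 1.0            -> y = 1.0
# step 2: state = 0.5*1.0 + 0.5*4.0 = 2.5  -> y = 2.5
```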

10. SpatialTrackerV2: 3D Point Tracking Made Easy
🔑 Keywords: SpatialTrackerV2, 3D point tracking, monocular videos, end-to-end architecture, camera pose estimation
💡 Category: Computer Vision
🔍 Research Objective:
– Develop a feed-forward 3D point tracking method for monocular videos that integrates point tracking, monocular depth, and camera pose estimation into a unified end-to-end architecture.
🛠️ Research Methods:
– Decompose world-space 3D motion into scene geometry, camera ego-motion, and pixel-wise object motion with a fully differentiable and scalable architecture that trains on diverse datasets.
💬 Research Conclusions:
– Achieves a 30% performance improvement over existing 3D tracking methods and matches leading dynamic 3D reconstruction accuracy while operating 50 times faster.
👉 Paper link: https://huggingface.co/papers/2507.12462
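The three-way decomposition named above composes as: depth unprojection (scene geometry), then camera ego-motion, then a residual per-pixel object motion. The pinhole intrinsics and single-axis rotation below are simplifying assumptions for illustration, not SpatialTrackerV2's parameterization.

```python
import math

def unproject(u, v, depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Pixel + depth -> 3-D point in camera coordinates (pinhole model)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def cam_to_world(p, yaw=0.0, t=(0.0, 0.0, 0.0)):
    """Apply camera ego-motion: rotate about the y-axis, then translate."""
    x, y, z = p
    xw = math.cos(yaw) * x + math.sin(yaw) * z + t[0]
    zw = -math.sin(yaw) * x + math.cos(yaw) * z + t[2]
    return (xw, y + t[1], zw)

def world_point(u, v, depth, yaw, t, object_motion=(0.0, 0.0, 0.0)):
    """geometry -> ego-motion -> residual object motion."""
    p = cam_to_world(unproject(u, v, depth), yaw, t)
    return tuple(pi + mi for pi, mi in zip(p, object_motion))

# Principal-point pixel at 2 m depth, static camera, object moved 0.1 m in x:
pt = world_point(320.0, 240.0, 2.0, yaw=0.0, t=(0.0, 0.0, 0.0),
                 object_motion=(0.1, 0.0, 0.0))
```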
