AI Native Daily Paper Digest – 20250513

1. Seed1.5-VL Technical Report

🔑 Keywords: Vision-Language Foundation Model, Multimodal Understanding, Mixture-of-Experts, State-of-the-Art Performance, GUI Control

💡 Category: Multi-Modal Learning

🌟 Research Objective:

– To advance general-purpose multimodal understanding and reasoning with Seed1.5-VL.

๐Ÿ› ๏ธ Research Methods:

– Developed a 532M-parameter vision encoder paired with a Mixture-of-Experts LLM with 20B active parameters.

– Evaluated across public VLM benchmarks and internal suites.

💬 Research Conclusions:

– Seed1.5-VL achieved state-of-the-art performance on 38 out of 60 public benchmarks.

– Demonstrated superior performance in agent-centric tasks such as GUI control and gameplay, surpassing systems like OpenAI CUA and Claude 3.7.

– Exhibited strong reasoning capabilities effective for multimodal reasoning challenges.

👉 Paper link: https://huggingface.co/papers/2505.07062

2. MiMo: Unlocking the Reasoning Potential of Language Model — From Pretraining to Posttraining

🔑 Keywords: MiMo-7B, reasoning tasks, large language model, pre-training, post-training

💡 Category: Knowledge Representation and Reasoning

🌟 Research Objective:

– The primary goal is to develop MiMo-7B, a large language model optimized for reasoning tasks.

๐Ÿ› ๏ธ Research Methods:

– Enhancement of data preprocessing and implementation of a three-stage data mixing strategy during pre-training.

– Integration of reinforcement learning with a test-difficulty-driven code-reward scheme in post-training.
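
A minimal sketch of a test-difficulty-driven code reward, assuming the common setup where a generated program is scored against unit tests. The weighting scheme (tests that fewer sampled solutions pass earn more credit) is an illustrative assumption, not MiMo's exact formula.

```python
from typing import List

def code_reward(passed: List[bool], pass_rates: List[float]) -> float:
    """passed[i]: whether the policy's program passed test case i.
    pass_rates[i]: fraction of sampled solutions that pass test i,
    used as a difficulty proxy (rarely passed => harder test)."""
    weights = [1.0 - r for r in pass_rates]    # harder tests weigh more
    total = sum(weights) or 1.0
    earned = sum(w for w, ok in zip(weights, passed) if ok)
    return earned / total                      # dense, difficulty-aware reward

# Example: two easy tests passed, one hard test failed -> partial credit.
print(code_reward([True, True, False], [0.9, 0.8, 0.1]))  # 0.25
```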

💬 Research Conclusions:

– MiMo-7B-Base exhibits exceptional reasoning capabilities, outperforming even substantially larger 32B models.

– The advanced model, MiMo-7B-RL, excels in mathematics, programming, and general reasoning tasks, surpassing OpenAI o1-mini.

👉 Paper link: https://huggingface.co/papers/2505.07608

3. Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets

🔑 Keywords: Generative AI, 3D Generation, Data Scarcity, Algorithmic Limitations, Open Framework

💡 Category: Generative Models

🌟 Research Objective:

– Step1X-3D addresses the underdevelopment in 3D generative AI by tackling challenges like data scarcity and algorithmic limitations.

๐Ÿ› ๏ธ Research Methods:

– The study introduces a framework featuring a data curation pipeline that yields a 2M high-quality asset dataset, and a two-stage 3D-native architecture that combines a VAE-DiT geometry generator with diffusion-based texture synthesis.

💬 Research Conclusions:

– The framework achieves state-of-the-art performance, surpassing existing open-source methods and bridging 2D and 3D generation techniques, thereby setting a new standard for open research in controllable 3D asset generation.

👉 Paper link: https://huggingface.co/papers/2505.07747

4. Learning from Peers in Reasoning Models

🔑 Keywords: Prefix Dominance Trap, Learning from Peers (LeaP), error correction, error tolerance, reasoning

💡 Category: Knowledge Representation and Reasoning

🌟 Research Objective:

– To address the “Prefix Dominance Trap” in Large Reasoning Models (LRMs) by facilitating self-correction through peer interaction.

๐Ÿ› ๏ธ Research Methods:

– Introduction of Learning from Peers (LeaP), which uses a routing mechanism to share reasoning insights among parallel paths, together with fine-tuning of smaller models into the LeaP-T series to improve their performance.
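
A minimal sketch of one peer-sharing round under the setup summarized above: parallel reasoning paths produce short summaries of their progress, and a routing function decides which peers' summaries each path receives before it continues. The summarize and route callables are hypothetical stand-ins for the paper's components.

```python
from typing import Callable, List

def leap_round(paths: List[str],
               summarize: Callable[[str], str],
               route: Callable[[int, List[str]], List[str]]) -> List[str]:
    summaries = [summarize(p) for p in paths]   # short progress notes per path
    updated = []
    for i, path in enumerate(paths):
        peer_notes = route(i, summaries)        # which peers this path listens to
        updated.append(path + "\n[Peer insights]\n" + "\n".join(peer_notes))
    return updated                              # each path continues with shared context
```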

💬 Research Conclusions:

– LeaP demonstrates substantial improvements on reasoning tasks, providing robust error correction and error tolerance, with significant gains over baseline models on established benchmarks.

👉 Paper link: https://huggingface.co/papers/2505.07787

5. Unified Continuous Generative Models

🔑 Keywords: Continuous Generative Models, Unified Framework, State-of-the-art, Diffusion Transformer, ImageNet

💡 Category: Generative Models

🌟 Research Objective:

– Introduce a unified framework for training, sampling, and analyzing continuous generative models to achieve superior performance.

๐Ÿ› ๏ธ Research Methods:

– Implementation of the Unified Continuous Generative Models Trainer and Sampler (UCGM-T and UCGM-S), which demonstrate improved efficiency and performance on generative tasks.

💬 Research Conclusions:

– UCGM achieves state-of-the-art results on ImageNet, reaching strong FID scores in far fewer sampling steps and demonstrating its effectiveness over traditional multi-step and few-step models.

👉 Paper link: https://huggingface.co/papers/2505.07447

6. REFINE-AF: A Task-Agnostic Framework to Align Language Models via Self-Generated Instructions using Reinforcement Learning from Automated Feedback

🔑 Keywords: Large Language Models, human-annotated instruction data, semi-automated framework, Reinforcement Learning

💡 Category: Natural Language Processing

🌟 Research Objective:

– To explore the efficiency of small open-source Large Language Models (LLMs) in generating instruction datasets with reduced human involvement and costs.

๐Ÿ› ๏ธ Research Methods:

– Utilization of a semi-automated framework with small LLMs such as LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B.

– Integration of a Reinforcement Learning (RL)-based training algorithm into the LLMs-based framework.

💬 Research Conclusions:

– The RL-based framework shows significant improvements in 63-66% of tasks compared to previous methods.

👉 Paper link: https://huggingface.co/papers/2505.06548

7. AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection

🔑 Keywords: AttentionInfluence, reasoning-intensive pretraining, attention head masking, SmolLM corpus, weak-to-strong scaling

💡 Category: Natural Language Processing

🌟 Research Objective:

– To improve the complex reasoning ability of large language models (LLMs) by utilizing reasoning-intensive pretraining data without introducing domain-specific biases.

๐Ÿ› ๏ธ Research Methods:

– Introduced AttentionInfluence, a training-free method that uses attention head masking to identify and select reasoning-intensive data (see the sketch below).

– Applied the method with a 1.3B-parameter model to select data from the SmolLM corpus, then pretrained a larger 7B-parameter model on the resulting data mixture.
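
A minimal sketch of the selection idea, under the assumption that importance is measured as the loss increase caused by masking the chosen attention heads; `loss_with_mask` is a hypothetical helper standing in for a forward pass of the small reference model with or without head masking.

```python
from typing import Callable, List, Tuple

def attention_influence_scores(
    samples: List[str],
    loss_with_mask: Callable[[str, bool], float],
) -> List[Tuple[str, float]]:
    """Rank pretraining samples by how much masking important heads hurts."""
    scored = []
    for text in samples:
        base = loss_with_mask(text, False)     # full small model
        masked = loss_with_mask(text, True)    # important heads masked
        scored.append((text, masked - base))   # larger gap => more reasoning-intensive
    return sorted(scored, key=lambda p: p[1], reverse=True)
```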

💬 Research Conclusions:

– Demonstrated significant performance improvements on several knowledge-intensive benchmarks, showcasing an effective weak-to-strong scaling property for data selection strategies.

👉 Paper link: https://huggingface.co/papers/2505.07293

8. DanceGRPO: Unleashing GRPO on Visual Generation

🔑 Keywords: Generative Models, Reinforcement Learning, DanceGRPO, Visual Generation, Diffusion Models

💡 Category: Generative Models

🌟 Research Objective:

– Address the alignment challenges of generative model outputs with human preferences and improve RL-based visual generation methods.

๐Ÿ› ๏ธ Research Methods:

– Introduction of DanceGRPO, a unified RL framework adapting Group Relative Policy Optimization to work across diverse generative paradigms, tasks, foundational models, and reward models.
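
For context, the group-relative advantage at the heart of GRPO, which DanceGRPO carries over to visual generation: rewards for a group of samples generated from the same prompt are normalized by the group's own statistics, avoiding a learned value critic. A minimal sketch:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sample's reward against its group's mean and std."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four generations for one prompt, scored by a reward model.
print(group_relative_advantages([0.2, 0.9, 0.5, 0.4]))
```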

💬 Research Conclusions:

– DanceGRPO demonstrates significant improvements, outperforming baselines by up to 181%, and establishes itself as a robust solution for scaling Reinforcement Learning from Human Feedback tasks in visual generation.

👉 Paper link: https://huggingface.co/papers/2505.07818

9. WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

🔑 Keywords: LLM-based agents, WebGen-Bench, GPT-4o, web applications, web-navigation agent

💡 Category: AI Systems and Tools

🌟 Research Objective:

– The study aims to introduce WebGen-Bench, a novel benchmark to evaluate the ability of LLM-based agents to create multi-file website codebases from scratch.

๐Ÿ› ๏ธ Research Methods:

– Diverse instructions for website generation were compiled using human annotators and GPT-4o, covering major and minor categories of web applications.

– Test cases targeting specific functionalities were generated and manually refined to ensure accuracy. A web-navigation agent was used to automate testing and improve reproducibility.

💬 Research Conclusions:

– Among the evaluated code-agent frameworks, the best combination, Bolt.diy powered by DeepSeek-R1, achieved only 27.8% accuracy, indicating the challenging nature of the benchmark.

– Training Qwen2.5-Coder-32B-Instruct on Bolt.diy trajectories improved accuracy to 38.2%, outperforming the best proprietary model.

👉 Paper link: https://huggingface.co/papers/2505.03733

10. Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning

🔑 Keywords: Multimodal Reward Model, Vision-Language Models, Reward Model Architecture, Multimodal Reasoning, General-Purpose Reliable Reward Models

💡 Category: Multi-Modal Learning

🌟 Research Objective:

– Propose Skywork-VL Reward, a model that offers reward signals for multimodal understanding and reasoning tasks.

๐Ÿ› ๏ธ Research Methods:

– Construct a large-scale multimodal preference dataset covering multiple tasks and scenarios.

– Design a reward model architecture based on Qwen2.5-VL-7B-Instruct, adding a reward head and applying multi-stage fine-tuning with a pairwise ranking loss on preference data.
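
A minimal sketch of the pairwise ranking objective in its standard Bradley-Terry form; the scalar scores stand in for outputs of the reward head on the preferred and rejected responses (hypothetical helpers, not the paper's exact code).

```python
import math

def pairwise_ranking_loss(score_chosen: float, score_rejected: float) -> float:
    # -log sigmoid(s_chosen - s_rejected): small when the chosen response scores higher
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(pairwise_ranking_loss(2.0, 0.5))  # ~0.20: correct ordering, low loss
```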

💬 Research Conclusions:

– Skywork-VL Reward achieves state-of-the-art performance on multimodal VL-RewardBench and shows competitive results on the text-only RewardBench benchmark.

– The model advances general-purpose, reliable reward models for multimodal alignment and is publicly released for transparency and reproducibility.

👉 Paper link: https://huggingface.co/papers/2505.07263

11. Learning Dynamics in Continual Pre-Training for Large Language Models

🔑 Keywords: Continual Pre-Training, large language models, distribution shift, learning rate annealing, CPT scaling law

💡 Category: Natural Language Processing

🌟 Research Objective:

– The study aims to explore the learning dynamics in the Continual Pre-Training (CPT) process for large language models, focusing on the evolution of general and domain-specific performance at each training step.

๐Ÿ› ๏ธ Research Methods:

– The researchers analyzed the CPT loss curve, describing its transition using the decoupling effects of distribution shift and learning rate annealing. They derived a CPT scaling law to predict loss at any training step and across various learning rate schedules.

💬 Research Conclusions:

– The formulation provides a comprehensive understanding of critical CPT factors such as loss potential, peak learning rate, and replay ratio. It also allows training hyper-parameters to be customized to balance general and domain-specific performance, and is validated through extensive experiments across datasets and hyper-parameter settings.

👉 Paper link: https://huggingface.co/papers/2505.07796

12. Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent

🔑 Keywords: Retrieval-augmented generation, Large Language Models, Search Agent, Internal-External Knowledge Synergy, Reinforcement Learning

💡 Category: Knowledge Representation and Reasoning

🌟 Research Objective:

– The paper aims to address the limitations of current retrieval-augmented generation strategies in large language models by introducing the Reinforced Internal-External Knowledge Synergistic Reasoning Agent (IKEA).

๐Ÿ› ๏ธ Research Methods:

– IKEA utilizes a novel knowledge-boundary aware reward function and a knowledge-boundary aware training dataset to synergize internal and external knowledge, reducing unnecessary retrievals and optimizing search processes.
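
A minimal sketch of what a knowledge-boundary aware reward could look like: correctness earns the reward, and retrieval is penalized when the question was already answerable from the model's internal knowledge. The decomposition and penalty value are illustrative assumptions, not IKEA's exact function.

```python
def boundary_aware_reward(correct: bool, used_retrieval: bool,
                          answerable_internally: bool,
                          retrieval_penalty: float = 0.3) -> float:
    reward = 1.0 if correct else 0.0
    if used_retrieval and answerable_internally:
        reward -= retrieval_penalty   # discourage unnecessary search calls
    return reward
```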

💬 Research Conclusions:

– IKEA significantly outperforms baseline methods in multiple knowledge reasoning tasks, reducing retrieval frequency and demonstrating strong generalization capabilities.

👉 Paper link: https://huggingface.co/papers/2505.07596

13. MonetGPT: Solving Puzzles Enhances MLLMs’ Image Retouching Skills

🔑 Keywords: Generative editing, Procedural edits, Multimodal large language model, Explainability, Identity preservation

💡 Category: Multi-Modal Learning

🌟 Research Objective:

– Investigate whether a multimodal large language model (MLLM) can critique raw photographs and suggest suitable procedural edits.

๐Ÿ› ๏ธ Research Methods:

– Train MLLMs to understand image-processing operations through specially designed visual puzzles, and build a reasoning dataset from expert-edited photos for fine-tuning.

💬 Research Conclusions:

– The operation-aware MLLM can plan and propose edit sequences that maintain object details and resolution, with advantages in explainability and identity preservation compared to existing methods.

👉 Paper link: https://huggingface.co/papers/2505.06176

14. Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation

🔑 Keywords: Generative AI, empirical evaluation, AI Competitions, leakage, contamination

💡 Category: Generative Models

🌟 Research Objective:

– Highlight the inadequacy of traditional ML evaluation strategies for modern Generative AI models and propose AI Competitions as a rigorous alternative.

๐Ÿ› ๏ธ Research Methods:

– Discuss the current challenges of evaluating Generative AI models, focusing on unbounded input/output spaces and prediction dependence.

💬 Research Conclusions:

– Propose viewing AI Competitions as the gold standard for evaluating Generative AI models to effectively address issues like leakage and contamination.

👉 Paper link: https://huggingface.co/papers/2505.00612

15. UMoE: Unifying Attention and FFN with Shared Experts

🔑 Keywords: Sparse Mixture of Experts, Transformer models, attention layers, feed-forward network, efficient parameter sharing

💡 Category: Machine Learning

🌟 Research Objective:

– To unify Sparse Mixture of Experts (MoE) designs in attention and FFN layers, enhancing Transformer model performance.

๐Ÿ› ๏ธ Research Methods:

– Introducing a novel reformulation of the attention mechanism, uncovering an FFN-like structure within attention modules.

💬 Research Conclusions:

– The proposed UMoE architecture achieves superior performance among attention-based MoE designs while enabling efficient parameter sharing between FFN and attention components.

👉 Paper link: https://huggingface.co/papers/2505.07260

16. H³DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning

🔑 Keywords: Visuomotor policy learning, Generative models, Visual perception, Action prediction, Hierarchical structures

💡 Category: Robotics and Autonomous Systems

🌟 Research Objective:

– Introduce Triply-Hierarchical Diffusion Policy (H³DP) to enhance the integration between visual features and action generation in robotic manipulation.

๐Ÿ› ๏ธ Research Methods:

– Utilize a triply-hierarchical structure, comprising depth-aware input layering, multi-scale visual representations, and a hierarchically conditioned diffusion process to model the action distribution.

💬 Research Conclusions:

– H^{3}DP achieves a 27.5% average relative improvement over baselines across 44 simulation tasks and demonstrates superior performance in 4 real-world bimanual manipulation tasks.

👉 Paper link: https://huggingface.co/papers/2505.07819

17. Document Attribution: Examining Citation Relationships using Large Language Models

🔑 Keywords: Large Language Models, document summarization, attribution, textual entailment, attention mechanism

💡 Category: Natural Language Processing

🌟 Research Objective:

– To ensure the trustworthiness and interpretability of Large Language Models in document-based tasks through effective attribution techniques.

๐Ÿ› ๏ธ Research Methods:

– Proposing a zero-shot approach that uses textual entailment with flan-ul2 to improve performance on attribution benchmarks (see the sketch below).

– Exploring the attention mechanism with the smaller flan-t5-small model to enhance the attribution process.
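
A minimal sketch of the zero-shot entailment check, assuming a simple prompt format (the summary does not specify the exact prompt). google/flan-ul2 is a real Hugging Face checkpoint, but at roughly 20B parameters it typically needs device mapping or quantization to load in practice.

```python
from transformers import pipeline

nli = pipeline("text2text-generation", model="google/flan-ul2")

def supports(passage: str, claim: str) -> bool:
    """Zero-shot entailment: does the cited passage support the claim?"""
    prompt = (f"Premise: {passage}\nHypothesis: {claim}\n"
              "Does the premise entail the hypothesis? Answer yes or no.")
    out = nli(prompt, max_new_tokens=5)[0]["generated_text"]
    return out.strip().lower().startswith("yes")
```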

💬 Research Conclusions:

– The proposed methods improve reliability in attribution tasks, demonstrating better performance over existing baselines.

👉 Paper link: https://huggingface.co/papers/2505.06324

18. Overflow Prevention Enhances Long-Context Recurrent LLMs

🔑 Keywords: LLMs, long-context processing, recurrent memory, chunk-based inference, LongBench

💡 Category: Natural Language Processing

🌟 Research Objective:

– To investigate the efficiency of recurrent sub-quadratic models in improving long-context processing.

๐Ÿ› ๏ธ Research Methods:

– Experiments on fixed-size recurrent memory models using a chunk-based inference procedure to process only relevant input portions.
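
A minimal sketch of chunk-based inference under stated assumptions: split the long input into fixed-size chunks, score each against the query, and feed only the most relevant chunk to the fixed-size recurrent memory so it never overflows. The word-overlap scorer is a deliberately simple stand-in for the paper's relevance criterion.

```python
def select_chunk(document: str, query: str, chunk_words: int = 512) -> str:
    words = document.split()                      # word-level approximation of tokens
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    query_terms = set(query.lower().split())

    def overlap(chunk: str) -> int:
        return len(query_terms & set(chunk.lower().split()))

    return max(chunks, key=overlap)               # only this portion enters the model
```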

💬 Research Conclusions:

– Chunk-based inference enhances long-context task performance significantly, with improvements ranging from 14% to 51% on various models.

– The method achieves state-of-the-art results on the LongBench v2 benchmark.

– Raises questions about how effectively recurrent models exploit long-range dependencies, since the chunk-based strategy performs better even on cross-context tasks.

👉 Paper link: https://huggingface.co/papers/2505.07793

19. Continuous Visual Autoregressive Generation via Score Maximization

🔑 Keywords: Visual AutoRegressive modeling, Continuous VAR, strictly proper scoring rules, energy score

💡 Category: Generative Models

🌟 Research Objective:

– The paper introduces a Continuous VAR framework for autoregressive generation of continuous visual data without needing vector quantization.

๐Ÿ› ๏ธ Research Methods:

– The framework is based on strictly proper scoring rules, particularly exploring training objectives using the energy score, which is likelihood-free.
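
For reference, the (negatively oriented) energy score is a standard strictly proper scoring rule for β ∈ (0, 2); minimizing it needs only samples from the model, not densities, which is what makes the training objective likelihood-free:

```latex
% X, X' are i.i.d. samples from the model distribution P; y is the observation.
\[
  \mathrm{ES}_\beta(P, y)
    = \mathbb{E}_{X \sim P}\,\lVert X - y \rVert^{\beta}
    - \tfrac{1}{2}\,\mathbb{E}_{X, X' \sim P}\,\lVert X - X' \rVert^{\beta},
  \qquad \beta \in (0, 2).
\]
```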

💬 Research Conclusions:

– The approach overcomes the information loss seen in quantization-based methods and can extend previous methodologies such as GIVT and diffusion loss by using different strictly proper scores.

👉 Paper link: https://huggingface.co/papers/2505.07812

20. Physics-Assisted and Topology-Informed Deep Learning for Weather Prediction

🔑 Keywords: PASSAT, weather prediction, deep learning, advection equation, spherical graph neural network

💡 Category: Machine Learning

🌟 Research Objective:

– To develop a novel Physics-ASSisted And Topology-informed deep learning model called PASSAT for improving weather prediction by addressing the limitations of existing models.

๐Ÿ› ๏ธ Research Methods:

– PASSAT attributes weather evolution to the advection process and Earth-atmosphere interaction, solving equations on a spherical manifold using a spherical graph neural network.
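
The advection equation referenced above, in its standard form: a scalar weather field u is transported by the wind field v, with the gradient taken tangentially on the spherical manifold.

```latex
\[
  \frac{\partial u}{\partial t} + \mathbf{v} \cdot \nabla u = 0
\]
```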

💬 Research Conclusions:

– PASSAT outperforms state-of-the-art deep learning and operational numerical weather prediction models on the 5.625°-resolution ERA5 data set.

👉 Paper link: https://huggingface.co/papers/2505.04918

21. DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for Dynamic Reranking in Retrieval-Augmented Generation

🔑 Keywords: Retrieval-augmented generation (RAG), LLM-based rerankers, reinforcement learning (RL), DynamicRAG, knowledge-intensive tasks

💡 Category: Natural Language Processing

🌟 Research Objective:

– To propose DynamicRAG, a novel framework that dynamically adjusts the order and number of documents in retrieval-augmented generation systems to optimize generation quality in knowledge-intensive tasks.

๐Ÿ› ๏ธ Research Methods:

– The reranker in DynamicRAG is modeled as an agent optimized through reinforcement learning, using rewards based on the quality of LLM outputs.
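
A minimal sketch of the reranker-as-agent loop under stated assumptions: the policy chooses both an ordering and a cutoff k over the retrieved documents, the LLM answers from that context, and a quality judgment of the answer becomes the RL reward. All callables are hypothetical stand-ins.

```python
from typing import Callable, List, Tuple

def rerank_episode(query: str, docs: List[str],
                   policy: Callable[[str, List[str]], Tuple[List[int], int]],
                   generate: Callable[[str, List[str]], str],
                   judge: Callable[[str, str], float]) -> float:
    order, k = policy(query, docs)           # action: permutation + dynamic cutoff
    context = [docs[i] for i in order[:k]]   # adjustable number of documents
    answer = generate(query, context)
    return judge(query, answer)              # reward signal for the RL update
```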

💬 Research Conclusions:

– DynamicRAG achieves state-of-the-art results across seven knowledge-intensive datasets, demonstrating superior performance in knowledge retrieval and generation tasks.

👉 Paper link: https://huggingface.co/papers/2505.07233

22. INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning

🔑 Keywords: INTELLECT-2, Reinforcement Learning, Distributed Asynchronous Training, Decentralized Training, GRPO

💡 Category: Reinforcement Learning

🌟 Research Objective:

– To develop INTELLECT-2, a globally distributed RL training run of a 32 billion parameter language model using an asynchronous and decentralized approach.

๐Ÿ› ๏ธ Research Methods:

– Implementation of a novel training framework, PRIME-RL, supported by new components such as TOPLOC for verifying rollouts and SHARDCAST for efficient policy weight broadcasts.

– Modifications to the GRPO training recipe and data filtering techniques for enhanced training stability.

💬 Research Conclusions:

– Successfully improved the state of the art in reasoning models and open-sourced INTELLECT-2 to foster further research in decentralized training methods.

👉 Paper link: https://huggingface.co/papers/2505.07291

23. LlamaPIE: Proactive In-Ear Conversation Assistants

🔑 Keywords: LlamaPIE, real-time proactive assistant, hearable devices, semi-synthetic dialogue dataset, two-model pipeline

💡 Category: Human-AI Interaction

🌟 Research Objective:

– Introduce LlamaPIE, the first proactive assistant for enhancing human conversations through hearable devices, operating without explicit user invocation.

๐Ÿ› ๏ธ Research Methods:

– Developed a semi-synthetic dialogue dataset and a two-model pipeline with a smaller decision model and a larger response generation model.
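
A minimal sketch of the two-model pipeline: a small model first decides whether the assistant should speak at all, and only then does the larger model generate a short suggestion, keeping the assistance unobtrusive. Both callables are hypothetical stand-ins for the fine-tuned models.

```python
from typing import Callable, Optional

def proactive_turn(history: str,
                   should_respond: Callable[[str], bool],  # small decision model
                   respond: Callable[[str], str]           # larger response model
                   ) -> Optional[str]:
    if not should_respond(history):   # default to silence: no explicit invocation
        return None
    return respond(history)           # brief, proactive assistance
```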

💬 Research Conclusions:

– Evaluations show that LlamaPIE provides effective, unobtrusive assistance; user studies compare it favorably to both a no-assistance baseline and a reactive model, suggesting its potential to enhance live conversations.

👉 Paper link: https://huggingface.co/papers/2505.04066

24. Multi-Objective-Guided Discrete Flow Matching for Controllable Biological Sequence Design

🔑 Keywords: Biomolecule Engineering, Discrete Flow Matching, Multi-Objective, Peptide Generation, DNA Design

💡 Category: Generative Models

🌟 Research Objective:

– The study aims to design biological sequences that meet multiple functional and biophysical criteria using the Multi-Objective-Guided Discrete Flow Matching (MOG-DFM) framework.

๐Ÿ› ๏ธ Research Methods:

– The authors introduced MOG-DFM, a framework that guides pretrained discrete flow matching generators toward Pareto-efficient trade-offs across multiple scalar objectives.

– Two unconditional discrete flow matching models, PepDFM and EnhancerDFM, were trained for generating diverse peptides and functional enhancer DNA, respectively.

💬 Research Conclusions:

– MOG-DFM effectively generates peptide binders optimized across multiple properties including hemolysis, non-fouling, solubility, half-life, and binding affinity.

– The framework proves to be a powerful tool for designing DNA sequences with specific enhancer classes and DNA shapes, demonstrating its versatility and effectiveness in biomolecule design.

👉 Paper link: https://huggingface.co/papers/2505.07086
