Data Scaling Laws in Imitation Learning for Robotic Manipulation 2024-10-25
Multi-Draft Speculative Sampling: Canonical Architectures and Theoretical Limits 2024-10-25
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models 2024-10-24
WorldSimBench: Towards Video Generation Models as World Simulators 2024-10-24
Scaling Diffusion Language Models via Adaptation from Autoregressive Models 2024-10-24
DynamicCity: Large-Scale LiDAR Generation from Dynamic Scenes 2024-10-24
Scalable Ranked Preference Optimization for Text-to-Image Generation 2024-10-24
Lightweight Neural App Control 2024-10-24
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding 2024-10-24
ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding 2024-10-24
MedINST: Meta Dataset of Biomedical Instructions 2024-10-24
M-RewardBench: Evaluating Reward Models in Multilingual Settings 2024-10-24
TP-Eval: Tap Multimodal LLMs’ Potential in Evaluation by Customizing Prompts 2024-10-24
LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias 2024-10-24
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction 2024-10-23
SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes 2024-10-23
Aligning Large Language Models via Self-Steering Optimization 2024-10-23
JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation 2024-10-23
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs 2024-10-23
Improve Vision Language Model Chain-of-thought Reasoning 2024-10-23