VisionZip: Longer is Better but Not Necessary in Vision Language Models 2024-12-06
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection 2024-12-06
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion 2024-12-06
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction 2024-12-06
A Noise is Worth Diffusion Guidance 2024-12-06
Evaluating Language Models as Synthetic Data Generators 2024-12-06
NVILA: Efficient Frontier Visual Language Models 2024-12-06
Negative Token Merging: Image-based Adversarial Feature Guidance 2024-12-06
Structured 3D Latents for Scalable and Versatile 3D Generation 2024-12-06
MV-Adapter: Multi-view Consistent Image Generation Made Easy 2024-12-06
AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models 2024-12-06
Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis 2024-12-06
HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing 2024-12-06
Densing Law of LLMs 2024-12-06
Discriminative Fine-tuning of LVLMs 2024-12-06
Personalized Multimodal Large Language Models: A Survey 2024-12-06
ZipAR: Accelerating Autoregressive Image Generation through Spatial Locality 2024-12-06
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation 2024-12-06
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows 2024-12-06
Towards Universal Soccer Video Understanding 2024-12-06