Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction 2024-10-29
DreamClear: High-Capacity Real-World Image Restoration with Privacy-Safe Dataset Curation 2024-10-29
MarDini: Masked Autoregressive Diffusion for Video Generation at Scale 2024-10-29
LongReward: Improving Long-context Large Language Models with AI Feedback 2024-10-29
A Survey of Small Language Models 2024-10-29
GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation 2024-10-29
COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training 2024-10-29
Fast Best-of-N Decoding via Speculative Rejection 2024-10-29
Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines 2024-10-29
LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior 2024-10-29
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks 2024-10-29
Dialog2Flow: Pre-training Soft-Contrastive Action-Driven Sentence Embeddings for Automatic Dialog Flow Extraction 2024-10-29
Neural Fields in Robotics: A Survey 2024-10-29
Leveraging Locality to Boost Sample Efficiency in Robotic Manipulation 2024-10-29
Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA 2024-10-29
Language Models And A Second Opinion Use Case: The Pocket Professional 2024-10-29
Bi-Level Motion Imitation for Humanoid Robots 2024-10-29
ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting 2024-10-28
Continuous Speech Synthesis using per-token Latent Diffusion 2024-10-28
Teach Multimodal LLMs to Comprehend Electrocardiographic Images 2024-10-28