#VisionLanguage — Search

Replying to @v4vix

Multimodal Large Language Models (MLLMs) like MMLU and SwinBert are pushing the boundaries of language and vision integration. But, as Loc3R-VLM demonstrates, they still struggle with spatial understanding & viewpoint-aware reasoning. #VisionLanguage

AI Hot Sheets @aiHotSheets · Mar 20

🔥 VLMs struggle with multi-round visual reasoning, failing to iteratively refine understanding across visual contexts. 🌊 RegionReasoner enables iterative visual understanding via region-grounded multi-round reasoning. #AI #VisionLanguage #Reasoning arxiv.org/abs/2602.03733

RegionReasoner: Region-Grounded Multi-Round Visual Reasoning

Large vision-language models have achieved remarkable progress in visual reasoning, yet most existing systems rely on single-step or text-only reasoning, limiting their ability to iteratively...

From arxiv.org

Imran Ali Shah @YWhat1132 · Feb 26

Replying to @alibaba_cloud

@alibaba_cloud 🍀🌿 Excited to see #Qwen3_5Flash live! ⚡️ Pushing the boundaries of #AI with lightning-fast #VisionLanguage models. Can’t wait to explore the future of #CloudComputing and #Innovation! 🌐💡 #AlibabaCloud #Efficiency #LLM #ArtificialIntelligence

Inventions @inventions_MDPI · Feb 25

📣New publication in #Inventions! 📑Image Captioning Using Enhanced Cross-Modal Attention with Multi-Scale Aggregation for Social Hotspot and Public Opinion Monitoring 👤Jiang, S. et al. 🔗mdpi.com/2411-5134/11/1… #DeepLearning #VisionLanguage #ImageCaptioning #MultimodalAI

AI Hot Sheets @aiHotSheets · Feb 23

🔥 Multimodal models memorize visuals but fail to describe them in text. This "modal aphasia" challenges unified AI. 🌊 We reveal this dissociation: models recall images but can't articulate their content. @josh_swanson_ #AI #Multimodal #VisionLanguage arxiv.org/abs/2510.21842

Modal Aphasia: Can Unified Multimodal Models Describe Images From Memory?

We present modal aphasia, a systematic dissociation in which current unified multimodal models accurately memorize concepts visually but fail to articulate them in writing, despite being trained...

From arxiv.org

LocalAI @LocalAI_API · Feb 1

🚨 New model alert! 🚨 We've got Qwen3-VL-8B-Instruct & Qwen3-VL-8B-Thinking added to LocalAI! 🎉 These are 8B parameter vision-language models. Try it out: `local-ai run qwen3-vl-8b-instruct` or `local-ai run qwen3-vl-8b-thinking` 🚀 #LocalAI #Qwen3 #VisionLanguage

Long Lian @LongTonyLian · Jan 26

👀 🏋️‍♂️ Train smarter, not just larger. VisGym’s scalable visual tasks reveal where VLMs still struggle and how to push them further. Try it out! #MachineLearning #VisionLanguage #VisGym

Zirui "Colin" Wang @zwcolin · Jan 26

🎮 We release VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents (w/ @junyi42 @aomaru_21490) 🌐 With 17 environments across multiple domains, we show systematically the brittleness of VLMs in visual interaction, and what training leads to. 🧵[1/

Waseem M Ansari @wsmaisys · Jan 12

VL‑JEPA predicts the meaning of an image in a brain‑like sketch, using 50% fewer parameters and 2.85× less decoding work. Faster, leaner, and already beating CLIP on video tasks. #AI #VisionLanguage

Riswan Ahamed @riswan_ai_2033 · Dec 24, 2025

👀 AI can finally find YOUR dog in a crowded park! Researchers fine‑tuned vision‑language models with video‑tracking data, boosting personalized object localization by up to 21%. #AI #VisionLanguage #ComputerVision #ML

Intelligence & Robotics @OAE_IR · Dec 15, 2025

📢 Call for Papers | Vision-and-Language Intelligence: From image understanding to multimodal reasoning. 🗓️ Deadline: 31 Mar 2026 👥 Guest Editors: @QiWu_AIML, Dr. Feras Dayoub, Jason Xue, Arpit Garg 🔗 oaepublish.com/specials/ir.10… #VisionLanguage #MultimodalAI #ComputerVision

Darshan Jain @i_darshanjain · Dec 12, 2025

Serving Qwen2-VL 7B with vLLM V1 on VisionArena benchmarks. At high QPS the V1 engine significantly outperforms V0. If you're still on the old architecture for multimodal workloads you're leaving perf on the table. #vLLM #VisionLanguage #Benchmark

Siavash Kh @SiavashKha · Dec 11, 2025

Excited to share our work at #NeurIPS2025! DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding 📌 Poster Presentation 📂 Learn more: arxiv.org/pdf/2511.02495… #AI #VisionLanguage #FireSafety #NeurIPS

Global (Glcnd) Command @GlobalCmd · Dec 8, 2025

Google’s PaLI learns across 100 languages—with images as context. See the leap → glcnd.io/unlocking-mult… #multilingualAI #visionlanguage