Yutan @yutaaaalll ·
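This is a good summary. Static benchmarks have reached their limits, and interactive benchmarks that can measure multi-turn reasoning are becoming the mainstream. The rise of interactive evaluations like Terminal Bench and BALROG is a natural progression. #AI #CodingAgent #Benchmark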
Greg Kamradt @GregKamradt ·
The world is moving towards agents.
Static benchmarks don't measure what agents do best (multi-turn reasoning).
Thus, interactive benchmarks:
* Terminal Bench (@alexgshaw, @Mike_A_Merrill)
* Text Arena (@LeonGuertler)
* BALROG (@PaglieriDavide, @_rockt)
* ARC-AGI-3 (@arcprize)
Zandor Khan @Zandor_Khan ·
I've created a semantic #benchmark to put #IA #AI systems to the test. Telling the model, after it responds, that this was a semantic benchmark designed to test it is itself part of the benchmark. It helps show whether they are stochastic parrots, whether they get stuck on context changes, and other tests:
Armin Parchami @ArminPCM ·
Exciting release and congrats to @fredsala and @devjeetrr! Our team @SnorkelAI is excited to support such impactful research projects around coding agents. #AISlop #CodingAgents #benchmark
Gabe Orlanski @GOrlanski ·
We found that agents generate progressively worse code with each iteration. Real developers do not. SlopCodeBench is the only eval that faithfully measures quality degradation on iterative, long-horizon coding tasks. arxiv.org/abs/2603.24755 scbench.ai 🧵c
AI Brief @AiMonPod ·
Replying to @AiMonPod
@YouTube 1/5 Breaking News in AI! The ARC-AGI-3 benchmark, designed to test AGI capabilities, has left even the world's top AI models stumped, with the best scoring only 0.37%! Gemini Pro leads the pack, but still has a long way to go. #AI #AGI #Benchmark
guruhitech @guruhitech1 ·
Google Gemini 3.1 Flash Live, the new AI with an increasingly human voice #anthropic #benchmark #claude #gemini #geminilive #google #intelligenzaartificiale #reteneurale #voce 📷guruhitech.com/google-gemini-…
Google Gemini 3.1 Flash Live, the new AI with an increasingly human voice | GuruHiTech

Google presents Gemini 3.1 Flash Live, its new artificial intelligence model for speech synthesis, with a more natural voice and more fluid conversations.

From guruhitech.com
Himanshu @MeFounderguy ·
Arcade launched ToolBench, a benchmark that rates MCP server quality across enterprise apps. Slack, Workday, Meta's ad platform, and WhatsApp are the most closed off to AI agents. GitHub and Figma are the most open. The platform access war has measurable winners now. #AI #MCP #Benchmark
PHOENIX MEDIA @phoenixmedia_eu ·
🎙️ Last but not least: Anton Gudkov. Among other things, he gives us insights into the use of Claude AI models in a dev context and into benchmark tests of different settings and rule sets. 🧠 #AI #DeveloperExperience #Benchmark #Ecommerce
万物燃烧 @hydra_ksxh ·
LLM Success Rate Leaderboard:
1. Claude Sonnet 4.6 - 86.9%
2. Claude Opus 4.6 - 86.3%
3. GPT-5.4 - 86.0%
4. Nvidia Nemotron-3-Super-120B - 85.6%
5. Claude Opus 4.5 - 85.4%
6. Kimi K2.5 - 93.4%
7. MiniMax M2.5 - 35.5%
8. GLM-5 - 80%+
#AI #LLM #Benchmark