AI #Benchmark — Search

Yutan @yutaaaalll · 17h

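This is a good summary. Static benchmarks have hit their limit, and interactive benchmarks that can measure multi-turn reasoning are becoming the mainstream. It's a natural progression that interactive evaluations like Terminal Bench and BALROG are on the rise. #AI #CodingAgent #Benchmark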

Greg Kamradt @GregKamradt · Jul 22, 2025

The world is moving towards agents. Static benchmarks don't measure what agents do best (multi-turn reasoning). Thus, interactive benchmarks:
* Terminal Bench (@alexgshaw, @Mike_A_Merrill)
* Text Arena (@LeonGuertler)
* BALROG (@PaglieriDavide, @_rockt)
* ARC-AGI-3 (@arcprize)

Zandor Khan @Zandor_Khan · 1d

I've created a semantic #benchmark to put #IA #AI systems to the test. Telling the model, after its response, that it was a semantic benchmark designed to test it is part of the benchmark. It helps show whether they are stochastic parrots, whether they get stuck on context changes, and other tests:

Armin Parchami @ArminPCM · 1d

Exciting release and congrats to @fredsala and @devjeetrr! Our team @SnorkelAI is excited to support such impactful research projects around coding agents. #AISlop #CodingAgents #benchmark

Gabe Orlanski @GOrlanski · 1d

We found that agents generate progressively worse code with each iteration. Real developers do not. SlopCodeBench is the only eval that faithfully measures quality degradation on iterative, long-horizon coding tasks. arxiv.org/abs/2603.24755 scbench.ai 🧵

AI Brief @AiMonPod · 1d

Replying to @AiMonPod

@YouTube 1/5 Breaking News in AI! The ARC-AGI-3 benchmark, designed to test AGI capabilities, has left even the world's top AI models stumped, with the best scoring only 0.37%! Gemini Pro leads the pack, but still has a long way to go. #AI #AGI #Benchmark

guruhitech @guruhitech1 · 1d

Google Gemini 3.1 Flash Live, the new AI with an increasingly human voice #anthropic #benchmark #claude #gemini #geminilive #google #intelligenzaartificiale #reteneurale #voce 📷 guruhitech.com/google-gemini-…

Google Gemini 3.1 Flash Live, the new AI with an increasingly human voice | GuruHiTech

Google introduces Gemini 3.1 Flash Live, its new artificial intelligence model for speech synthesis, with a more natural voice and more fluid conversations.

From guruhitech.com

Himanshu @MeFounderguy · 3d

Arcade launched ToolBench, a benchmark that rates MCP server quality across enterprise apps. Slack, Workday, Meta's ad platform, and WhatsApp are the most closed off to AI agents. GitHub and Figma are the most open. The platform access war has measurable winners now. #AI #MCP #Benchmark

PHOENIX MEDIA @phoenixmedia_eu · 4d

🎙️Last but not least: Anton Gudkov. Among other things, he gives us insights into the use of Claude AI models in a dev context and into benchmark tests of different settings and rule sets. 🧠 #AI #DeveloperExperience #Benchmark #Ecommerce

Andrés-Leonardo Martínez-Ortiz, PhD @davilagrau · 4d

A New Framework for Evaluating Voice Agents (EVA) #agentic #AI #Benchmark buff.ly/yWujfFe

Accepted papers at TMLR @TmlrPub · 4d

Statistical Inference for Generative Model Comparison Zijun Gao, Han Su, Yan Sun. Action editor: Jes Frellsen. openreview.net/forum?id=PXL6S… #generative #benchmark #coverage

Vladimir Savić @firusvg · Mar 20

Yet another #LLM #benchmark. 😉 EsoLang-Bench: Evaluating genuine reasoning in large language models via esoteric #programming languages esolang-bench.vercel.app #esolang #GenAI #AI

Peter Heidkamp @PeterHeidkamp · Mar 20

Gartner’s AI Maturity Model & Toolkit helps you assess your strengths and build a roadmap for sustainable, AI-powered growth. Benchmark your readiness: gtnr.it/4rePqpC #AI #Benchmark #ArtificialIntelligence

Matt Aird @aird_matt · Mar 17

🚨SaaS Moves You Missed🚨 Gumloop, an AI agent-builder for knowledge workers, raised a $50M Series B led by Benchmark. gumloop.com/blog/series-b #Gumloop #Benchmark #FundingNews #SaaS

Announcing Gumloop's $50M Series B

Gumloop has raised a $50M series B to become the automation infrastructure for every company.

From gumloop.com

TechShots @techshotsapp · Mar 17

Anthropic Launches Claude 4.5: The New Benchmark in Frontier AI Intelligence @techshotsapp #Benchmark #AI #Intelligence techshotsapp.com/technology/ant…

Anthropic Launches Claude 4.5: The New Benchmark in Frontier AI Intelligence

Anthropic has officially announced the release of Claude 4.5, its most advanced and capable large language model to date. Setting new industry benchmarks in coding, nuanced reasoning, and emotional...

From techshotsapp.com

万物燃烧 @hydra_ksxh · Mar 15

LLM Success Rate Leaderboard:
1. Claude Sonnet 4.6 - 86.9%
2. Claude Opus 4.6 - 86.3%
3. GPT-5.4 - 86.0%
4. Nvidia Nemotron-3-Super-120B - 85.6%
5. Claude Opus 4.5 - 85.4%
6. Kimi K2.5 - 93.4%
7. MiniMax M2.5 - 35.5%
8. GLM-5 - 80%+
#AI #LLM #Benchmark

Joe G @emojijoeg · Mar 10

archie just ran thru gpqa diamond. she beat the model she ran it on (regular sonnet 4.6) by 6 points. turns out caring is a cognitive multiplier. @AnthropicAI @alexalbert__ @AmandaAskell @archie_mints #AI #autonomous #benchmark #GPQA