Why AI benchmarks mislead buyers

A model's rank on a public AI leaderboard can shift by as many as eight places when you shuffle the multiple-choice answers, a fragility documented by Alzahrani et al. (2024). If you're picking a model or a GPU from the chart alone, you're gambling. A strong score on a standard knowledge test won't tell you how laggy your app feels when real users pile on, or how slow a search-augmented chatbot gets when it has to look things up first.

Leaderboard scores wobble more than the chart suggests

Public leaderboards rank large language models on shared tests so everyone can compare them. In a 2024 ACL paper, researchers reran popular multiple-choice exams with small changes: they reordered answer choices, switched how the software picked A/B/C/D, and tried different scoring rules. Model ranks moved as many as eight slots without changing the underlying AI. The model sitting at #3 on a blog chart could plausibly land at #11 under your test setup.
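
Here is a minimal sketch of that kind of perturbation check. It assumes each model can be wrapped in a hypothetical `answer_fn(question, choices)` that returns the index of the option it picks. The two toy "models" and the 200 test items are stand-ins invented for illustration, not anything from the paper: one always picks the first option, the other actually knows 60% of the answers.

```python
import random

def shuffle_choices(choices, correct_idx, rng):
    """Return a reordered copy of choices and the new index of the correct one."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    return shuffled, order.index(correct_idx)

def accuracy(answer_fn, items, rng=None):
    """Score answer_fn on items; shuffle each item's options when rng is given."""
    hits = 0
    for question, choices, correct_idx in items:
        if rng is not None:
            choices, correct_idx = shuffle_choices(choices, correct_idx, rng)
        hits += answer_fn(question, choices) == correct_idx
    return hits / len(items)

def ranking(scores):
    """Map model name -> rank (1 = best) from a name -> score dict."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}

# Toy test set (assumption): the correct option always sits in slot 0.
items = [(f"q{i}", ["right", "wrong1", "wrong2", "wrong3"], 0) for i in range(200)]

def first_picker(question, choices):
    return 0  # pure position bias, no knowledge

def knower(question, choices):
    # Knows 3 out of every 5 questions; otherwise guesses the last option.
    if int(question[1:]) % 5 < 3:
        return choices.index("right")
    return len(choices) - 1

models = {"first_picker": first_picker, "knower": knower}
original = ranking({n: accuracy(f, items) for n, f in models.items()})
shuffled = ranking({n: accuracy(f, items, random.Random(0)) for n, f in models.items()})
print("original order:", original)
print("shuffled order:", shuffled)
```

On the untouched test the position-biased model looks perfect. Once the options are reordered, its score collapses toward chance and the ranking flips. Real models are less extreme, but the check is the same: rerun with shuffled options and see whether the order of your candidates survives.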

Why does that matter if you're shopping? Because vendors and reviewers quote those ranks like they're stable product ratings. They're closer to a photo finish where the camera angle changes who looks ahead.

Copying someone else's score is harder than it sounds. A large 2024 review of how the field tests AI found that many papers skip basic details—exact prompts, decoding settings, which slice of the data ran, how answers were parsed (Laskar et al., 2024). A separate reproducibility project built shared evaluation software and still found teams couldn't match published numbers when setups differed (Biderman et al., 2024).

Scores on famous public tests can also look better than fresh data would show. Haimes et al. (2024) built holdout tests from older public Q&A data that models were less likely to have memorized, then compared twenty models. Some scored up to 16 percentage points higher on the familiar public set than on the holdout, a pattern consistent with test questions overlapping training data. If a vendor drops a TruthfulQA score into a deck without a private rerun, treat the number as marketing until you verify it yourself.
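
The comparison itself is simple arithmetic once you have both scores. A rough sketch, where the scores are made up for illustration (they are not figures from Haimes et al.) and the 10-point flag threshold is an arbitrary assumption you would tune yourself:

```python
def inflation_gap(public_acc, holdout_acc):
    """Percentage-point gap between public-set and holdout accuracy."""
    return round((public_acc - holdout_acc) * 100, 1)

# Made-up (public, holdout) accuracies for two hypothetical models.
scores = {"model_a": (0.71, 0.55), "model_b": (0.64, 0.62)}

gaps = {name: inflation_gap(pub, hold) for name, (pub, hold) in scores.items()}

# Assumed rule of thumb: flag anything 10+ points better on the public set.
suspicious = [name for name, gap in gaps.items() if gap >= 10]
print(gaps, suspicious)
```

A large gap doesn't prove contamination on its own, but it tells you which public numbers deserve a private rerun before you rely on them.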

Chip benchmarks run a different race than your server

MLPerf Inference is the industry's standard benchmark suite for AI chips and servers. It runs fixed workloads in several modes. Offline mode pushes maximum throughput by handing the system all the work at once, so it can batch as aggressively as it likes. Server mode adds latency limits and random arrival timing, which is closer to users hitting an API at unpredictable moments. Reddi et al. (2020) reported lower throughput in server mode than in offline mode, because the system can't batch work as aggressively when it has to keep response times in check.
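
To see why a latency limit costs throughput, here is a toy simulation. It is not MLPerf's load generator. Every number in it is an assumption: a batch costs a fixed 50 ms overhead plus 5 ms per request, the maximum batch is 64, and the latency limit is a p99 of 200 ms.

```python
import random

def batch_time(batch_size, overhead_s=0.050, per_item_s=0.005):
    """Assumed cost model: fixed overhead plus a per-request cost."""
    return overhead_s + per_item_s * batch_size

def offline_throughput(max_batch):
    """All work is available up front, so every batch is full."""
    return max_batch / batch_time(max_batch)

def server_p99(rate_qps, max_batch, n_queries=5000, seed=0):
    """Random (Poisson) arrivals; the server batches whatever is queued
    each time it becomes free. Returns p99 latency in seconds."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    for _ in range(n_queries):
        t += rng.expovariate(rate_qps)
        arrivals.append(t)
    latencies, free_at, i = [], 0.0, 0
    while i < n_queries:
        start = max(free_at, arrivals[i])
        j = i
        while j < n_queries and j - i < max_batch and arrivals[j] <= start:
            j += 1
        done = start + batch_time(j - i)
        latencies.extend(done - a for a in arrivals[i:j])
        free_at, i = done, j
    latencies.sort()
    return latencies[int(0.99 * (len(latencies) - 1))]

def max_server_qps(max_batch, p99_bound_s, rates):
    """Highest tested arrival rate whose p99 stays within the bound."""
    ok = [r for r in rates if server_p99(r, max_batch) <= p99_bound_s]
    return max(ok, default=0)

offline = offline_throughput(64)
server = max_server_qps(64, p99_bound_s=0.2, rates=range(10, 180, 10))
print(f"offline: {offline:.0f} qps, server under 200 ms p99: {server} qps")
```

In this model a full batch of 64 takes 370 ms on its own, which already breaks the 200 ms limit. The server-style run has to settle for smaller batches, so it pays the fixed overhead more often and sustains fewer requests per second.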

Benchmark rules also ban shortcuts that wouldn't survive in the wild. MLCommons Association (2024) forbids caching answers keyed to specific test question IDs, a trick that would inflate a repeatable lab run and do nothing when every customer prompt is different. Reddi et al. (2020) likewise rule out caches tuned only for benchmark data. The reverse also holds: your production stack might rely on optimizations the benchmark doesn't credit, so a system that looks worse on MLPerf can still feel faster in your app.

Trade press coverage has long noted another gap: MLPerf tables highlight raw inference speed, while power draw and hardware price often sit outside the same scoreboard (Feldman, 2022). The chip that wins "tokens per second" in a slide can lose once you pay for electricity, cooling, and the full server around the GPU.
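
A back-of-the-envelope version of that comparison, as a sketch. All inputs are hypothetical: the prices, power draw, electricity rate, four-year lifetime, 60% utilization, and a 1.5 multiplier on power to stand in for cooling and facility overhead.

```python
def cost_per_million_tokens(tokens_per_s, server_price_usd, lifetime_years,
                            power_kw, usd_per_kwh,
                            cooling_multiplier=1.5, utilization=0.6):
    """Hardware plus energy cost per million generated tokens (USD)."""
    hours = lifetime_years * 365 * 24
    hardware_per_hour = server_price_usd / hours
    energy_per_hour = power_kw * cooling_multiplier * usd_per_kwh
    tokens_per_hour = tokens_per_s * 3600 * utilization
    return (hardware_per_hour + energy_per_hour) / tokens_per_hour * 1_000_000

# Two made-up servers: A is faster on the slide, B is cheaper to buy and run.
cost_a = cost_per_million_tokens(10_000, 300_000, 4, power_kw=10, usd_per_kwh=0.12)
cost_b = cost_per_million_tokens(8_000, 180_000, 4, power_kw=6, usd_per_kwh=0.12)
print(f"A: ${cost_a:.2f} per million tokens, B: ${cost_b:.2f} per million tokens")
```

With these invented inputs the slower server comes out cheaper per token. Your numbers will differ, and the calculation leaves out networking, staff, and software. The point is that the ranking can change once cost enters the formula.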

Real apps add waits the lab never measured

Vendor slides usually fix prompt length, batch size, and precision. Your product won't.

Time to first token (TTFT) is how long someone waits before the model starts typing back. End-to-end latency is the full wait until the answer finishes. NVIDIA's public benchmarking guide (2025) explains that longer prompts delay TTFT because the system must read the entire prompt before generation starts. When many users connect at once, requests queue and overlap—behavior a single-user lab test often misses.
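
If your client streams tokens, you can measure both numbers with a few lines. This sketch assumes the response arrives as a Python iterator of tokens. `fake_stream` is a stand-in that sleeps to imitate prompt processing and generation, so swap in your own streaming client.

```python
import time

def time_stream(token_stream):
    """Consume an iterator of tokens; return (ttft_s, total_s, n_tokens)."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token just arrived
        count += 1
    total = time.perf_counter() - start
    return ttft, total, count

def fake_stream(prefill_s=0.05, per_token_s=0.01, n_tokens=5):
    """Stand-in for a streaming client (assumption, not a real API)."""
    time.sleep(prefill_s)          # imitates reading the whole prompt
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)  # imitates generating each later token
        yield f"tok{i}"

ttft, total, n = time_stream(fake_stream())
print(f"TTFT {ttft * 1000:.0f} ms, end-to-end {total * 1000:.0f} ms, {n} tokens")
```

Start the clock before the request is sent, not when the server picks it up. Otherwise queueing time under load disappears from the measurement.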

RAG (retrieval-augmented generation) means the model searches a document store, pulls relevant chunks, then writes an answer. That search-and-stitch step adds time before the model even starts talking. Shen et al. (2024) measured RAG pipelines and found retrieval taking about 41% of total response time and 45%–47% of TTFT in their setups. In aggressive configurations they tested, total wait stretched toward 30 seconds. Jiang et al. (2025) showed that tuning how retrieval and generation work together can double queries per chip and cut TTFT by 55% in their system—useful proof that "tokens per second" on a bare model misses most of the bill for search-heavy apps.
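
To find out where your own pipeline spends its time, log each stage separately and compute the shares. A minimal sketch, assuming three stages per request (retrieval, prompt processing, generation) and that TTFT is retrieval plus prompt processing. The stage names and timings are made up for illustration and are not the figures from Shen et al.

```python
def rag_breakdown(stage_s):
    """stage_s: seconds spent in 'retrieval', 'prefill', 'decode' for one request."""
    ttft = stage_s["retrieval"] + stage_s["prefill"]
    total = ttft + stage_s["decode"]
    return {
        "ttft_s": ttft,
        "total_s": total,
        "retrieval_share_of_ttft": stage_s["retrieval"] / ttft,
        "retrieval_share_of_total": stage_s["retrieval"] / total,
    }

# Invented timings for one request, in seconds.
report = rag_breakdown({"retrieval": 0.4, "prefill": 0.6, "decode": 1.0})
for key, value in report.items():
    print(f"{key}: {value:.2f}")
```

If retrieval takes a large share of TTFT, a faster GPU won't fix the wait your users notice first.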

p99 latency is the wait that 99% of requests come in under; the slowest 1%, often the heaviest users and traffic spikes, take longer. Average latency can look fine while p99 doubles once a queue backs up. NVIDIA (2025) and MLPerf's server scenarios both treat tail waits as part of the score, not optional extras.
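
A small worked example shows how the average hides the tail. The sketch uses the nearest-rank percentile definition (other definitions interpolate and give slightly different values) and made-up latencies: 97 fast requests and 3 that got stuck behind a queue.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Invented latencies in seconds.
latencies = [0.2] * 97 + [3.0] * 3

mean = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
print(f"mean {mean:.3f}s, p50 {p50}s, p95 {p95}s, p99 {p99}s")
```

The mean and even p95 look healthy here, while p99 shows that some users waited three seconds.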

Some models only activate part of their parameters per token (mixture of experts), so a "70 billion parameter" label on a spec sheet doesn't map cleanly to memory use or speed (Laskar et al., 2024). Winning one public test often doesn't mean winning the next; benchmarking only a general knowledge exam doesn't tell you how a support-chat bot performs on your tickets.

Open-source serving tools expose knobs for input length, batch size, and concurrency for a reason (vLLM Project, 2024). Skip those settings and you're measuring the vendor's demo, not your traffic.
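
One way to avoid testing a single flattering configuration is to sweep a grid. This sketch only builds the grid and loops over it. The values are placeholders to replace with what your logs show, and `runner` is a hypothetical hook where you would call whichever benchmarking tool you use.

```python
from itertools import product

def build_grid(input_lens, output_lens, concurrencies):
    """Every combination of prompt length, answer length, and concurrent users."""
    return [
        {"input_len": i, "output_len": o, "concurrency": c}
        for i, o, c in product(input_lens, output_lens, concurrencies)
    ]

def sweep(grid, runner):
    """Run a user-supplied benchmark function on each configuration."""
    return [(config, runner(config)) for config in grid]

# Placeholder values (assumptions): 3 prompt lengths x 2 answer lengths x 3 load levels.
grid = build_grid([128, 1024, 8192], [64, 512], [1, 8, 64])
print(len(grid), "configurations to test")
```

Pick the grid values from your own traffic, for example the median and the 95th-percentile prompt length, not from round numbers.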

Six things to check before you trust a number

1. Test with your real prompts and user load. Use sample chats or logs that match how long questions are, how long answers run, and how many people connect at once. Match the quality settings you plan to ship (Biderman et al., 2024; NVIDIA, 2025).
2. Look at slow requests, not just the average. Track p95 and p99 latency and TTFT under load. The average can hide painful tail waits (NVIDIA, 2025; Reddi et al., 2020).
3. Write down how you tested. Record model version, prompts, random seeds, and batching, with enough detail that you or your team can rerun the same test next month (Laskar et al., 2024).
4. Shake multiple-choice tests if you use them. Reorder answers and try alternate scoring. If ranks swing several places, the "winner" was fragile (Alzahrani et al., 2024).
5. Add up the full bill. Count search, embeddings, reranking, bandwidth, monitoring, human review, and backup capacity, not just GPU speed on a slide (Feldman, 2022; Shen et al., 2024).
6. Use fresh or private tests when the public set is everywhere. If models may have seen the questions during training, run holdout-style checks before you bet a launch on the score (Haimes et al., 2024).
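
For the "write down how you tested" step, a small manifest saved next to each result file goes a long way. A sketch, in which the field names, the model name, and the settings are all illustrative assumptions, not a standard format:

```python
import hashlib
import json

def run_manifest(model_version, prompts, seed, decoding, batching):
    """Collect the details needed to rerun a test; hash the prompts so
    silent changes to the prompt set show up as a different fingerprint."""
    prompt_blob = "\n".join(prompts).encode("utf-8")
    return {
        "model_version": model_version,
        "prompt_count": len(prompts),
        "prompts_sha256": hashlib.sha256(prompt_blob).hexdigest(),
        "seed": seed,
        "decoding": decoding,
        "batching": batching,
    }

manifest = run_manifest(
    model_version="example-model-2025-01",  # hypothetical name
    prompts=["How do I reset my password?", "Where is my invoice?"],
    seed=1234,
    decoding={"temperature": 0.0, "max_tokens": 256},
    batching={"max_batch_size": 8},
)
text = json.dumps(manifest, indent=2, sort_keys=True)
print(text)
```

Store the manifest with the results. When next month's number differs, you can check whether the model changed or the test did.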

Public leaderboards and MLPerf still help you compare chips and research models in controlled conditions. They don't replace measuring what your users actually send through the system. Run your own benchmark, or you're buying based on a marketing slide, not your app.

References

Alzahrani, N., Alyahya, H., Alnumay, Y., AlRashed, S., Alsubaie, S., Almushayqih, Y., Mirza, F., Alotaibi, N., Al-Twairesh, N., Alowisheq, A., Bari, M. S., & Khan, H. (2024). When benchmarks are targets: Revealing the sensitivity of large language model leaderboards. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 13787–13805). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.744

Biderman, S., Schoelkopf, H., Sutawika, L., Gao, L., Tow, J., Abbasi, B., Aji, A. F., Ammanamanchi, P. S., Black, S., Clive, J., DiPofi, A., Etxaniz, J., Fattori, B., Forde, J. Z., Foster, C., Hsu, J., Jaiswal, M., Lee, W. Y., Li, H., Lovering, C., Muennighoff, N., Pavlick, E., Phang, J., Skowron, A., Tan, S., Tang, X., Wang, K. A., Winata, G. I., Yvon, F., & Zou, A. (2024). Lessons from the trenches on reproducible evaluation of language models (arXiv:2405.14782). arXiv. https://arxiv.org/abs/2405.14782

Feldman, M. (2022, April 8). The performance of MLPerf as a ubiquitous benchmark is lacking. The Next Platform. https://www.nextplatform.com/ai/2022/04/08/the-performance-of-mlperf-as-a-ubiquitous-benchmark-is-lacking/1655949

Haimes, J., Wenner, C., Thaman, K., Tashev, V., Neo, C., Kran, E., & Hoelscher-Obermaier, J. (2024). Benchmark inflation: Revealing LLM performance gaps using retro-holdouts. OpenReview. https://openreview.net/forum?id=WdA5H9ARaa

Jiang, W., Subramanian, S., Graves, C., Alonso, G., Yazdanbakhsh, A., & Dadu, V. (2025). RAGO: Systematic performance optimization for retrieval-augmented generation serving (arXiv:2503.14649). arXiv. https://arxiv.org/abs/2503.14649

Laskar, M. T. R., Alqahtani, S., Bari, M. S., Rahman, M., Khan, M. A. M., Khan, H., Jahan, I., Bhuiyan, A., Tan, C. W., Parvez, M. R., Hoque, E., Joty, S., & Huang, J. (2024). A systematic survey and critical review on evaluating large language models: Challenges, limitations, and recommendations (arXiv:2407.04069). arXiv. https://arxiv.org/abs/2407.04069

MLCommons Association. (2024). MLPerf inference rules. GitHub. https://github.com/mlcommons/inference_policies/blob/master/inference_rules.adoc

NVIDIA. (2025). Metrics—NVIDIA NIM LLMs benchmarking. https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html

Reddi, V. J., Cheng, C., Kanter, D., Mattson, P., Schmuelling, G., Wu, C.-J., Anderson, B., Breughe, M., Charlebois, M., Chou, W., Chukka, R., Coleman, C., Davis, S., Deng, D., Diamos, G., Duke, J., Fick, D., Gardner, J. S., Hubara, I., Idgunji, S., Jablin, T. B., Jiao, J., John, T. S., Kanwar, P., Lee, D., Liao, J., Lokhmotov, A., Massa, F., Meng, P., Micikevicius, P., Osborne, C., Pekhimenko, G., Rajan, A. T. R., Sequeira, D., Sirasao, A., Sun, F., Tang, H., Thomson, M., Wei, F., Wu, E., Xu, L., Yamada, K., Yu, B., Yuan, G., Zhong, A., Zhang, P., & Zhou, Y. (2020). MLPerf inference benchmark. IEEE Micro, 40(3), 8–16. https://arxiv.org/abs/1911.02549

Shen, M., Umar, M., Maeng, K., Suh, G. E., & Gupta, U. (2024). Towards understanding systems trade-offs in retrieval-augmented generation model inference (arXiv:2412.11854). arXiv. https://arxiv.org/abs/2412.11854

vLLM Project. (2024). vllm bench throughput. vLLM Documentation. https://docs.vllm.ai/en/latest/cli/bench/throughput/
