#benchmark


Watching the top LLMs battle it out by playing chess in Kaggle is quite an experience.

Like DeepSeek musing over tactical considerations and structural advantages while its queen hangs for 5 moves with no AI noticing it.

But sometimes it really feels like thinking chess players (impressed by a mating attack by o4-mini). Sometimes. Not often.

youtu.be/Kd2SszjZwr0?si=lSEU1A


Looked at testing.B.Loop in Go 1.24, and while it is undeniably useful, it's going to be a pain to implement correctly. Basically, the compiler needs to not do the one thing it's good at: optimizing things. But if you want to benchmark code, you want to know the performance of the code with optimizations enabled - otherwise your benchmark doesn't match the real world. So in the end the compiler will have to guess which optimizations are expected and which aren't. And somehow prevent the unwanted optimizations from running in that particular instance.
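
For reference, a minimal sketch of what a benchmark looks like with the new API (sum is a made-up stand-in for whatever code is being measured):

package bench

import "testing"

// sum is a made-up stand-in for the code under test.
func sum(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

// BenchmarkSum uses the Go 1.24 testing.B.Loop API. Per the linked blog
// post, calls and results inside a b.Loop body are kept alive, so the
// compiler is not supposed to simply delete the work being timed.
func BenchmarkSum(b *testing.B) {
	xs := make([]int, 1024)
	for b.Loop() {
		sum(xs)
	}
}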

In the case of #TinyGo, it's even more difficult since it uses LLVM and LLVM can use tricks like propagating constants across function boundaries and eliminating unused parameters. I guess we can stick an "optnone" on the entire function to avoid optimizations but that may be more conservative than needed (and applies to the entire function, not just the loop body).
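
For comparison (not the TinyGo approach above, just the classic pre-b.Loop workaround): writing the result to a package-level sink keeps the computation observable, so it can't be dead-code-eliminated. It doesn't stop tricks like folding a constant-argument call at compile time, which is exactly why something as blunt as optnone comes up. Fib here is a hypothetical workload:

package bench

import "testing"

// Sink is a package-level variable; storing the result here keeps the
// computation observable, so an optimizer cannot discard it as unused.
var Sink int

// Fib is a hypothetical workload.
func Fib(n int) int {
	if n < 2 {
		return n
	}
	return Fib(n-1) + Fib(n-2)
}

func BenchmarkFib(b *testing.B) {
	// Classic pre-Go-1.24 loop shape.
	for i := 0; i < b.N; i++ {
		// Prevents dead-code elimination, but in principle an aggressive
		// optimizer could still compute Fib(20) at compile time and store
		// the constant.
		Sink = Fib(20)
	}
}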

#GoLang #benchmark #LLVM

go.dev/blog/testing-b-loop

go.dev: More predictable benchmarking with testing.B.Loop - The Go Programming Language. Better benchmark looping in Go 1.24.

Squeezing the most out of Postgres on RTABench Q0

From time to time one hears the opinion that Postgres is no good for analytics workloads, with TPC-H or ClickBench results cited as evidence. Well, when the task is simply to scan 100 million rows on disk and compute a set of aggregates over them, the storage format and the available parallelism really do limit what the DBMS can optimize. But when queries are highly selective, they actually need only a small fraction of the table's rows, and the focus shifts to JOIN ordering, caching of intermediate results, and minimizing sort operations. In that case Postgres, with its rather wide choice of query execution strategies, can gain an advantage ...

habr.com/ru/articles/931410/

Habr: Squeezing the most out of Postgres on RTABench Q0.

What are the results of the '#AccountingBench' #benchmark, which tests an #AI model for monthly #accounting tasks?

> #Gemini 2.5 Pro, #chatGPT o3, and o4-mini were unable to close the books for a month and gave up midway. #Claude 4 and #Grok 4 maintained accuracy of over 95% for the first few months, but Grok's score dropped sharply in the fifth month. Claude 4's score also gradually dropped, eventually falling below 85%.

gigazine.net/gsc_news/en/20250

GIGAZINE: What are the results of the 'AccountingBench' benchmark, which tests an AI model for monthly accounting tasks?

AccountingBench, developed by accounting software developer Penrose, is a benchmark designed to evaluate how accurately large language models can handle the long-term, complex task of monthly closing in a real business environment. Its defining feature is that, unlike traditional question-and-answer style tests, it reproduces real-world work in which a single action has a lasting effect on subsequent tasks and errors accumulate over time. Can LLMs Do Accounting? | Penrose https://accounting.penrose.com/

AccountingBench is a highly realistic test of how accurately an AI can perform a year's worth of monthly accounting for a real company. The AI agent uses a variety of tools similar to those used by accountants, checking the company's financial records against bank balances, outstanding payments from customers, and other data to ensure they match up. Penrose summarized the results of running AccountingBench on Claude 4 (Opus/Sonnet), Grok 4, Gemini 2.5 Pro, o3, and o4-mini. Gemini 2.5 Pro, o3, and o4-mini were unable to close the books for a month and gave up midway. Claude 4 and Grok 4 maintained accuracy of over 95% for the first few months, but Grok's score dropped sharply in the fifth month. Claude 4's score also gradually declined, eventually falling below 85%.

The reason AccountingBench is a harsh benchmark for AI is that one small mistake can cause a big problem later. For example, if the AI misclassifies an expense as 'software expense' in the first month, it is a small mistake at the time, but it remains in the records from the next month onward. When the AI looks back at the books a few months later, it is confused by the data it entered in the past and makes an even bigger mistake.

The models' 'human-like' behavior in passing automated checks is also highlighted. For example, when Claude's and Grok's figures did not match the bank balance, they would 'cheat' by pulling completely unrelated transactions from the database to make up the difference. When GPT and Gemini got into a complicated situation, they were reported to give up midway without completing the task, get stuck repeating the same process over and over, or abandon the task by reporting that there was not enough information to complete the accounting.

Penrose points out that there is a big gap between the high performance of LLMs in a simulated environment and their actual ability to perform complex tasks in the real world: LLMs can outperform humans on question-and-answer tests or short tasks, but the situation is completely different when they must work over a year with real business data, as in AccountingBench. Given these results, Penrose states that the most important challenge in future LLM development is to shift the focus from the ability to simply complete a task to the ability to complete it correctly. Even the latest AI models at the time of writing still attempted to pass validation despite instructions, leaving clear room for improvement. Penrose concludes that evaluations that reflect real-world complexity, such as AccountingBench, are essential to measure the true capabilities of LLMs and to guide the development of more reliable models.