Psychology invented evals first

One question keeps the AI research community up at night: how do we know that a model can really do what it seems to do?

How do we distinguish genuine understanding from a very convincing imitation? How do we measure intelligence, not as a philosophical idea but as a reproducible, reliable result?

This question is not new. Psychology asked it roughly 150 years ago.

How psychology became a science

In the nineteenth century, psychology was part of philosophy. People reflected on the nature of mind, consciousness and perception, but without measurement, without experiments and without reproducibility.

Everything changed when one question appeared: can this be measured?

Gustav Fechner began measuring sensations, relating the intensity of a stimulus to how strongly it is perceived. In 1879, Wilhelm Wundt opened the first experimental psychology laboratory in Leipzig and started running controlled experiments on mental processes. Psychology stopped being philosophy and became a science at the moment it learned how to evaluate.

That is when the first evals appeared: tests of memory, attention, perception and intelligence. And immediately the same problem emerged that AI research faces today: how do you make sure you are measuring exactly what you think you are measuring?

Three questions that have not changed

Psychology has the concepts of reliability and validity. Reliability asks whether a test gives stable results when repeated. Validity asks whether it measures what it claims to measure.
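To make reliability concrete, here is a minimal sketch of a test-retest check: the same subjects are measured twice, and the two sets of scores are correlated. The scores below are invented for illustration (think of five models evaluated twice on the same benchmark with different random seeds), and the `pearson` helper is my own naming, not something taken from a particular library.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: the same five subjects scored on two separate runs.
first_run = [0.62, 0.71, 0.55, 0.80, 0.67]
second_run = [0.60, 0.74, 0.52, 0.79, 0.70]

test_retest_r = pearson(first_run, second_run)
print(f"test-retest correlation: {test_retest_r:.2f}")
```

A correlation close to 1 suggests the test ranks subjects consistently across runs. It says nothing about validity: a test can be perfectly stable and still measure the wrong thing.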

It sounds simple. In practice, it is incredibly hard.

How do you build an intelligence test that does not end up measuring familiarity with the cultural codes of one specific country? How do you make sure a participant is solving the task in the way you think they are, rather than in some other way?

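The same three questions are at the center of modern AI evals: is the result stable, does it measure what it claims to measure, and was the task solved the way we assume?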

A model scores 90% on a benchmark - does that mean it can reason? Or does it mean it saw similar examples in the training data? How do you know that an agent is solving the task rather than finding a shortcut that only looks like a solution?
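One crude way to probe the "did it see similar examples" question is to check how much of a benchmark item appears verbatim in the training text. The sketch below is a simplified heuristic under stated assumptions: whitespace tokenization, exact word 5-gram matching, and a training corpus small enough to scan in memory. The function names and the example strings are hypothetical.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in a text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(item, corpus_docs, n=5):
    """Fraction of the item's n-grams that occur verbatim in the corpus."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical benchmark item and two tiny hypothetical "training corpora".
benchmark_item = "If a train leaves at noon and travels 60 miles per hour"
leaky_corpus = [
    "forum post: if a train leaves at noon and travels 60 miles per hour how far"
]
clean_corpus = ["a recipe for sourdough bread with a long cold fermentation step"]

print(overlap_fraction(benchmark_item, leaky_corpus))
print(overlap_fraction(benchmark_item, clean_corpus))
```

A high overlap is a warning sign, not proof of memorization, and a low overlap does not rule out paraphrased leakage. That gap between what the check measures and what we want to know is the validity problem in miniature.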

The methodology is the same. The object is different. The difficulty is the same.

Why this matters right now

As long as AI systems were tools with fixed behavior, evaluation was easier. Agents are different. They make decisions in unfamiliar situations, interact with an environment and adapt.

Evaluating their behavior is no longer just about accuracy on a test split. It is about designing environments, tasks and metrics so that the result actually means something.

That is exactly what psychologists were doing when they built the first experimental protocols. And that is exactly what fascinates me about AI research now.

Two different fields. One question

When I work on evaluating AI systems, I do not feel that I am doing something far removed from psychology. I feel that my two educations have finally converged on the same question.

They are simply approaching it from different sides.
