
Psychology invented evals first

There is a question that keeps much of the AI research community up at night: how do we know that a model can really do what it seems to do?

How do we distinguish genuine understanding from a very convincing imitation? How do we measure intelligence - not as a philosophical idea, but as a reproducible, reliable result?

This question is not new. Psychology started asking it roughly 150 years ago.

How psychology became a science

In the nineteenth century, psychology was part of philosophy. People reflected on the nature of mind, consciousness and perception - but without measurement, without experiments and without reproducibility.

Everything changed when one question appeared: can this be measured?

Gustav Fechner began measuring sensations, relating the strength of a physical stimulus to the strength of the experience it produces. In 1879, Wilhelm Wundt opened the first experimental psychology laboratory in Leipzig and started running controlled experiments on mental processes. Psychology stopped being philosophy and became a science at the moment it learned how to evaluate.

That is when the first evals appeared: tests of memory, attention, perception and intelligence. And immediately the same problem emerged that AI research faces today: how do you make sure you are measuring exactly what you think you are measuring?

Three questions that have not changed

Psychology has the concepts of reliability and validity. Reliability asks whether a test gives stable results when repeated. Validity asks whether it measures what it claims to measure.

It sounds simple. In practice, it is incredibly hard.

How do you build an intelligence test that does not end up measuring familiarity with the cultural codes of one specific country? How do you make sure a participant is solving the task in the way you think they are, rather than in some other way?

The same three questions are at the center of modern AI evals.

A model scores 90% on a benchmark - does that mean it can reason? Or does it mean it saw similar examples in the training data? How do you know that an agent is solving the task rather than finding a shortcut that only looks like a solution?
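One rough way to probe the "did it see similar examples" question is to check how much of a benchmark item already appears, word for word, in a sample of training text. The sketch below is a simplified illustration under stated assumptions: the helper names are hypothetical, tokens are just whitespace-separated words, and real contamination checks are considerably more careful.

```python
def ngrams(text, n=3):
    """Set of word n-grams in a text, lowercased and split on whitespace."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_item, training_text, n=3):
    """Share of the item's n-grams that also occur in the training text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    return len(item_grams & ngrams(training_text, n)) / len(item_grams)

# Hypothetical benchmark question and training snippet.
question = "the cat sat on the mat"
snippet = "yesterday the cat sat on the mat quietly"
print(overlap_ratio(question, snippet))
```

A high ratio suggests the item may have leaked into training data. A low ratio does not prove the opposite, since paraphrases pass straight through a check like this.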

The methodology is the same. The object is different. The difficulty is the same.

Why this matters right now

As long as AI systems were tools with fixed behavior, evaluation was easier. Agents are different. They make decisions in unfamiliar situations, interact with an environment and adapt.

Evaluating their behavior is no longer about accuracy on a test split. It is about designing environments, tasks and metrics so that the result actually means something.

That is exactly what psychologists were doing when they built the first experimental protocols. And that is exactly what fascinates me in AI research now.

Two different fields. One question

When I work on evaluating AI systems, I do not feel that I am doing something far removed from psychology. I feel that the two fields I trained in have finally converged on the same question.

They are simply approaching it from different sides.