Why Two Assessors Watching the Same Performance Give Different Scores

Try this: Take two of your most experienced assessors, sit them side by side, and have them independently score the same trainee performing a task — an intubation, a hose advance, a weld, whatever your world involves. Don't let them confer. Then compare.

If you've done this before, you know what tends to happen. The scores don't match. Sometimes by a little. Sometimes by a lot.

Well — that should bother us. Because if two qualified people watching the same performance reach different conclusions, then a trainee's score depends partly on their skill and partly on the luck of who happened to be holding the clipboard. The technical name for this is “inter-rater reliability” — the degree to which independent assessors agree on the same performance. It sounds dry and academic, and so it gets waved past. That's a mistake. In skill assessment, it isn't a side issue. It is the issue.

Here's why. An assessment is a measurement. If I weigh a bag of flour on two scales and get two numbers, I trust neither — a measurement is supposed to reflect the thing being measured, not the instrument. An assessment that changes with the assessor isn't really measuring the trainee at all. It's measuring some unpredictable blend of the trainee and whoever showed up that day. An unreliable measurement can't be used for anything important. It cannot tell us if the person is qualified. It cannot tell us if our training is working. It cannot tell us if our training is improving or getting worse.

So why does it happen?

It's tempting to assume someone is doing it wrong — too soft, too harsh, not paying attention. Occasionally that's true. Mostly it isn't. Inter-rater disagreement is usually produced by good, conscientious experts doing their honest best. The problem isn't the people. It's the assessment system we've handed them.

Two assessors watching the same intubation aren't really watching the same thing. They're watching the same event, but filtering it through different mental models of what “good” looks like, different thresholds for “good enough,” and different readings of whatever rubric — if any — they were given. Left to supply their own definitions, people supply surprisingly varied ones. That's not a character flaw. It's the default state of any process that leans on expert judgment without giving it enough structure to converge. And that is actually good news: if the problem is structural, it's fixable structurally. We don't need better people. We need a better system around the good people we already have.

The four levers that actually work

First — rubric design. A weak rubric asks for a 1-to-4 rating on “airway management” and leaves the assessor to decide what a 3 means. That's not a rubric; it's a container for private opinion. A strong rubric defines, in concrete observable terms, what each level actually looks like — pushing the judgment off the assessor and into the rubric itself.

Second: indicator-level specificity. “Demonstrates situational awareness” is broad and vague; two assessors will read it two ways. Instead, think “Scans the monitor before administering the medication.” This either happened or it didn't. Two people are much more likely to agree on whether something happened than on whether it constitutes satisfactory performance. If you break fuzzy global competencies into small, observable indicators, you shrink the space in which assessors can disagree. This is probably the single most effective thing you can do to achieve consistency, because recording what happened leaves far less room for subjectivity than rating how good it was. It even helps when an excellent rubric exists but assessors aren't reading it. If you do nothing else, change your forms so that, as much as possible, they collect observations rather than judgments.
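To make the idea concrete, here is a minimal Python sketch of a form that collects observations (did it happen, yes or no) and computes simple percent agreement between two assessors. Apart from the monitor example, the indicator names, the sample data, and the function name `percent_agreement` are all invented for illustration; treat this as one possible shape, not a prescribed method.

```python
# Hypothetical indicators; only the first comes from the text above.
INDICATORS = [
    "Scans the monitor before administering the medication",
    "Confirms patient identity aloud",
    "States the dose before drawing it up",
]

def percent_agreement(a: dict, b: dict) -> float:
    """Share of indicators on which two assessors recorded the same observation."""
    shared = [name for name in a if name in b]
    if not shared:
        raise ValueError("no indicators in common")
    matches = sum(a[name] == b[name] for name in shared)
    return matches / len(shared)

# Made-up observations: True means "I saw it happen".
assessor_1 = {INDICATORS[0]: True, INDICATORS[1]: True, INDICATORS[2]: False}
assessor_2 = {INDICATORS[0]: True, INDICATORS[1]: False, INDICATORS[2]: False}

print(f"{percent_agreement(assessor_1, assessor_2):.0%}")  # 67%
```

Because each entry is a yes-or-no observation, a mismatch points at one specific moment in the performance that the two assessors can go back and discuss.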

Third — assessor calibration. Have your assessors score the same performance independently, then surface the disagreements and talk them through. Why a 2 here when you gave a 4? That conversation is where a shared standard actually gets built — not by memo, but by experienced people arguing productively over real cases. And it has to recur, because calibration drifts and new assessors arrive.
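As a rough illustration of "surfacing the disagreements," the sketch below takes two assessors' independent scores and lists the items where they differ, largest gap first, as an agenda for the calibration conversation. The rubric item names, the scores, and the function `calibration_agenda` are assumptions made up for this example.

```python
def calibration_agenda(scores_a: dict, scores_b: dict, min_gap: int = 1) -> list:
    """Return (item, score_a, score_b, gap) tuples, largest gap first."""
    rows = []
    # Only compare items that both assessors actually scored.
    for item in scores_a.keys() & scores_b.keys():
        gap = abs(scores_a[item] - scores_b[item])
        if gap >= min_gap:
            rows.append((item, scores_a[item], scores_b[item], gap))
    # Sort by gap (descending), then by name so the order is stable.
    return sorted(rows, key=lambda row: (-row[3], row[0]))

# Made-up scores on a 1-to-4 scale for the same performance.
scores_a = {"airway management": 2, "equipment check": 4, "communication": 3}
scores_b = {"airway management": 4, "equipment check": 4, "communication": 2}

agenda = calibration_agenda(scores_a, scores_b)
for item, a, b, gap in agenda:
    print(f"{item}: {a} vs {b} (gap {gap})")
```

Sorting by gap puts the "2 versus 4" cases at the top of the discussion, which is where a shared standard is most likely to be missing.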

And finally — tooling that enforces consistency. A beautiful rubric dies quietly on a paper form each assessor fills out differently and files in a drawer. Good tooling presents every assessor the same indicators in the same order, requires each be addressed, and captures results in a structured way you can actually compare — and measure over time. (Full disclosure: that's a large part of what my company works on, so weigh my enthusiasm accordingly. The point stands without any product.)
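The sketch below shows, under stated assumptions, what "same indicators, same order, each one addressed" might look like in code. It is not a description of any particular product; the form contents and the names `FORM_INDICATORS`, `unaddressed`, and `submit` are hypothetical.

```python
from typing import Optional

# Hypothetical form definition: every assessor sees these, in this order.
FORM_INDICATORS = (
    "Checks equipment before starting",
    "Scans the monitor before administering the medication",
    "Verbalizes the plan to the team",
)

def unaddressed(responses: dict) -> list:
    """Return indicators still missing an observation, in form order."""
    return [name for name in FORM_INDICATORS if responses.get(name) is None]

def submit(assessor: str, responses: dict) -> dict:
    """Accept a form only when every indicator has been addressed."""
    missing = unaddressed(responses)
    if missing:
        raise ValueError(f"unaddressed indicators: {missing}")
    # Store results in form order so records from different assessors line up.
    observations = [(name, responses[name]) for name in FORM_INDICATORS]
    return {"assessor": assessor, "observations": observations}

# A partially completed form: the third indicator was skipped.
partial: dict = {
    "Checks equipment before starting": True,
    "Scans the monitor before administering the medication": False,
}
print(unaddressed(partial))  # ['Verbalizes the plan to the team']
```

Rejecting incomplete forms at submission time is one design choice; a tool could instead flag gaps for follow-up, as long as skipped indicators cannot pass unnoticed.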

You'll never drive disagreement to zero. Complex skill assessment always involves some amount of expert judgment, and a little variation among experts is healthy. The goal isn't perfect agreement — it's eliminating the disagreement that comes from vague rubrics, fuzzy indicators, and uncalibrated assessors. That category is almost always larger than we'd like to admit, which means there's a great deal of recoverable consistency sitting in most programs, waiting for someone to go and get it.
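If you want a number to track as you work on that gap, one commonly used statistic is Cohen's kappa, which measures agreement between two raters after discounting the agreement they would reach by chance. The article does not prescribe a statistic, so this choice is an assumption, and the pass/fail data below is made up. A minimal sketch:

```python
from collections import Counter

def cohens_kappa(ratings_a: list, ratings_b: list) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    # Observed agreement: fraction of items given the same rating.
    observed = sum(x == y for x, y in zip(ratings_a, ratings_b)) / n
    # Expected agreement if each rater assigned categories independently
    # at their own observed rates.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)

# Made-up results for four trainees scored by two assessors.
rater_a = ["pass", "pass", "fail", "fail"]
rater_b = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(rater_a, rater_b))  # 0.5
```

A kappa of 1 means perfect agreement and 0 means no better than chance. Recomputing it after each calibration round gives a simple way to see whether consistency is actually improving.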

So run the experiment. Look honestly at the gap. Then ask which of the four levers would have closed it. Because an assessment that depends on who's holding the clipboard isn't an assessment — it's an opinion pretending to be a measurement. And everyone relying on the result deserves better than that.

Until next time, thanks for reading and keep well.

About the author

Murray Goldberg is the founder and CEO of SkillGrader, a platform for objective observational skill assessment. A former tenured faculty member in Computer Science at the University of British Columbia, Murray focused his research on learning technologies, and in 1995 he created WebCT, the first widely used learning management system in higher education, which eventually served 14 million students in 80 countries. He has spent three decades working to advance the art and science of learning and assessment.
