How to Run Pre- and Post-Training Surveys That Show Whether Training Worked

Satisfaction scores can't tell you whether training worked. A baseline survey, a matched post-survey, and a one-week follow-up can – here's how to build them.

July 2, 2026 · 8 min read · By Matthew Stublefield


The pre-game survey for our Operation Aetherfall pilot is thirteen questions long and takes ten to fifteen minutes. Participants fill it out before anyone explains what the day is about – before the welcome, before the framing talk, before a single word about positions versus interests. That sequencing is the whole trick, and it's the part most training programs get wrong.

To find out whether training worked, run a short baseline survey before the session, re-administer the same instrument immediately after, and send a five-minute follow-up one week later. The baseline captures how people describe their behavior before the training tells them what good looks like. The post-survey measures shifts in awareness and confidence. The follow-up catches the only evidence that counts for much: whether anyone did something differently back at work.

I'm a product manager, not a psychometrician. The instruments we built for Conflict Campaign's pilot came out of a research synthesis covering more than 200 studies on instructional design, and the choices below – what to ask, when to ask it, how to read the answers – are the ones that survived contact with that literature. Measurement is one half of the wrapper around a training session; the structured debrief is the other, and if you only build one, build the debrief. The debrief converts experience into learning. The surveys tell you whether the conversion happened.

Why measure training at all?

Because the alternative is trusting the smile sheet. The standard end-of-workshop survey measures satisfaction – Kirkpatrick's Level 1, "reaction" – and satisfaction is a poor proxy for everything you care about. People can rate a workshop five stars and change nothing on Monday. By default, only 10-15% of trained skills transfer to the job at all, and no satisfaction score will warn you which side of that line your program landed on.

There's nothing wrong with asking whether people enjoyed the day. Enjoyment matters for whether they'd come back. The problem is stopping there, because satisfaction data answers "was this pleasant?" when the question you're paying to answer is "did anything move?" The research behind our instruments identified conflict-style self-awareness as the most measurable outcome of a single session, and pre/post comparison as the most practical way to measure it for a pilot. You don't need a control group or a validated psychometric battery. You need the same questions asked twice, with a session in between.

Why survey people before the session starts?

A baseline is only clean if you take it before you've taught anyone the right answers. Our pre-game survey goes out before any instruction – as pre-work when possible, at the very start of the session otherwise. Roughly half an hour into our session plan, the pre-brief introduces positions versus interests and data-first framing for disagreeing with authority. From that moment on, every participant knows what the training wants them to say. Ask "when my manager makes a decision I disagree with, what do I do?" after the framing talk and you're measuring recall. Ask it before, and you're measuring the person.

The one-sentence version: the baseline should capture the participant who walked in the door, and that person disappears about thirty minutes after the session starts.

What should pre- and post-training survey questions measure?

For skills training, three kinds of items earn their place: conflict-approach self-ratings, situational-judgment scenarios, and self-efficacy ratings. Our pre-game instrument uses all three, and the post-game instrument repeats them word for word. Identical items are what make the comparison mean anything.

Here's what each type looks like, with examples drawn from our pilot instruments:

Conflict-approach self-ratings. Forced-choice pairs with no right answer marked. "When a colleague disagrees with my approach, I tend to: (a) explain my reasoning more thoroughly, or (b) ask questions to understand what's driving their perspective." Five of these cover the skill areas the session targets – interest-based framing, disagreeing with authority, consensus-building, escalation awareness, and avoidance.
Situational-judgment scenarios. A short workplace situation with four plausible responses. "Your manager announces a deadline your team can't meet without cutting corners on quality. What do you do?" One option is the inquiry-driven response the research favors – asking what's driving the timeline and what matters most to protect – but every option is a real strategy that real people use. You're measuring the starting distribution, not grading anyone.
Self-efficacy ratings. Confidence on a 1-5 scale for specific behaviors: "Presenting information that contradicts what a leader has decided." "Speaking up about a concern, even when it would be easier to stay quiet." Five items, matched to the same skill areas as the forced-choice pairs.

The post-game survey adds two blocks the baseline can't have. First, open-ended reflection: "What's one thing you noticed about how you handle conflict that you didn't fully realize before today?" plus a prompt to describe a moment from the exercise where someone changed the direction of a conflict, and what made it work. Second, an implementation intention – a fill-in-the-blank commitment: "The next time I [specific situation at work], instead of [my usual response], I will [specific new action]." That item pulls double duty. It's assessment data, and it's the strongest transfer mechanism the session produces.

Satisfaction items go last, and they stay small: four 1-5 ratings covering whether the exercise felt relevant to real work, whether people felt safe taking risks, whether the debrief connected the experience to their job, and whether they left with a specific plan. Even the "reaction" block is really asking about transfer conditions rather than enjoyment.

Should training surveys be anonymous?

Matched ID codes without names are the practical middle ground. Full anonymity has a real cost: if you can't link a person's pre-survey to their post-survey, you can only compare group-level distributions, and with a small cohort that throws away most of your signal. Full identification has a different cost. Several of these items ask people to admit things – that they go along with decisions they disagree with, that they wait for "the right moment" to share bad news – and honesty drops when a name sits at the top of the page.

The fix is a self-generated code each participant writes on both surveys. First pet's name plus birth month works fine. You get individual-level pre/post pairs, and nobody's answer sheet has a name on it.
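If the responses end up in a spreadsheet or script, pairing them by code might look something like the Python sketch below. The `code` and `q1` field names, the sample codes, and the normalization rule (lowercase, letters and digits only) are assumptions for illustration, not part of the pilot instruments. The idea is only that small differences in how someone writes their code the second time shouldn't break the match.

```python
def normalize_code(raw):
    # Keep only letters and digits, lowercased, so "Biscuit-March" and
    # "biscuit march" resolve to the same key.
    return "".join(ch for ch in raw.lower() if ch.isalnum())


def pair_responses(pre, post):
    # Index each survey by its normalized code (hypothetical "code" field).
    pre_by_code = {normalize_code(r["code"]): r for r in pre}
    post_by_code = {normalize_code(r["code"]): r for r in post}
    # A pair exists only when the same code shows up on both surveys.
    matched = {
        code: (pre_by_code[code], post_by_code[code])
        for code in pre_by_code
        if code in post_by_code
    }
    # Codes seen on only one survey need a human look, not a guess.
    unmatched = sorted(set(pre_by_code) ^ set(post_by_code))
    return matched, unmatched


# Made-up sample rows, not pilot data.
pre = [{"code": "Biscuit March", "q1": "a"}, {"code": "Rex-July", "q1": "b"}]
post = [{"code": "biscuit-march", "q1": "b"}, {"code": "Momo May", "q1": "a"}]

matched, unmatched = pair_responses(pre, post)
print(sorted(matched))  # ['biscuitmarch']
print(unmatched)        # ['momomay', 'rexjuly']
```

Two participants who happen to pick the same code would collide in this sketch; with a cohort of six that's easy to catch by eye.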

Two honest caveats. The implementation intention isn't anonymous and shouldn't pretend to be – participants share it aloud during the debrief, because public commitment is part of what makes it work. And with a group of six, "anonymous" is a thin promise anyway; a facilitator who just spent the whole day with these people can often guess whose handwriting is whose. Say that plainly and let people calibrate, rather than promising privacy you can't deliver.

When should you run each survey?

Three timestamps: before any instruction, immediately after the debrief, and seven days out.

The post-survey happens before anyone leaves the room. Ten to fifteen minutes at the end of the day, while attention is still yours – response rate is effectively 100%, and the open-ended reflections double as a final consolidation exercise. A survey emailed the next morning competes with everything else in an inbox and loses.

The one-week follow-up goes out by email, sized to five to seven minutes – short enough to respect people's time, long enough to capture signal. It leads with the implementation intention: has the situation you described come up? If yes, did you try your planned response, or default to the usual one? If you didn't try it, what got in the way – forgot in the moment, felt higher-stakes than expected, the real situation felt too different from what you practiced? Then a handful of yes/no application items: in the past week, have you asked someone what's driving their concern during a disagreement? Shared uncomfortable information with someone in authority? Noticed a disagreement starting to feel personal? A few salience ratings round it out – how much have you thought about the workshop, how relevant does it feel now that you're back in the work.
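If the follow-up is built in a form tool or a script, the opening branch on the implementation intention could be sketched roughly like this. The function name and the choice to skip straight to the application items when the situation hasn't come up are assumptions; the question wording is paraphrased from the description above, not copied from the pilot instrument.

```python
from typing import Optional

# Barrier options paraphrased from the follow-up described above.
BARRIERS = [
    "forgot in the moment",
    "felt higher-stakes than expected",
    "the real situation felt too different from what I practiced",
]


def next_intention_question(came_up: Optional[bool] = None,
                            tried_plan: Optional[bool] = None) -> Optional[str]:
    """Return the next question to show, or None when this block is finished."""
    if came_up is None:
        return "Has the situation you described come up?"
    if not came_up:
        # Assumption: nothing more to probe, move on to the yes/no application items.
        return None
    if tried_plan is None:
        return "Did you try your planned response, or default to the usual one?"
    if not tried_plan:
        return "What got in the way? Options: " + "; ".join(BARRIERS)
    return None


print(next_intention_question())
print(next_intention_question(came_up=True))
print(next_intention_question(came_up=True, tried_plan=False))
```

Keeping the barrier question behind the "didn't try it" branch means nobody has to answer a question that doesn't apply to them, which helps keep the follow-up inside its five-to-seven-minute budget.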

That follow-up is the earliest look at what Kirkpatrick calls Level 3, actual behavior on the job. One week is too soon to claim durable change. It's exactly the right moment to learn whether the plan survived contact with a real workweek, and what's blocking the people it didn't survive for.

How do you read the results?

Directional shifts, not grades. The forced-choice items were never scored right-or-wrong, so don't start now. Look for movement toward the responses the training targeted – more "ask what's driving their perspective," less "explain my reasoning more thoroughly." And treat any changed answer as data: a participant whose responses moved at all just demonstrated that the experience prompted self-reflection, whichever direction they moved.
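A minimal Python sketch of that reading, assuming each participant's pre and post answers to one item are already paired. The answer data is invented for illustration, and the label names are mine. The fourth label only arises on the four-option scenarios, where someone can move between two responses that the training didn't target.

```python
from collections import Counter


def classify_shift(pre_choice, post_choice, target):
    # Direction of movement relative to the response the training targeted.
    if pre_choice == post_choice:
        return "unchanged"
    if post_choice == target:
        return "toward target"
    if pre_choice == target:
        return "away from target"
    return "changed, neither targeted"


# Invented (pre, post) answers to one forced-choice item for six participants.
# "b" stands for "ask questions to understand what's driving their perspective".
pairs = [("a", "b"), ("a", "b"), ("b", "b"), ("a", "a"), ("b", "a"), ("a", "b")]

tally = Counter(classify_shift(pre, post, target="b") for pre, post in pairs)
for label, count in tally.items():
    # Report counts, not percentages, at this sample size.
    print(f"{count} of {len(pairs)}: {label}")
# 3 of 6: toward target
# 2 of 6: unchanged
# 1 of 6: away from target
```

The output is a tally of directions, not a score. "Away from target" still counts as a changed answer worth reading, not a failure.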

Scenario shifts toward inquiry-driven responses indicate the cognitive learning happened – Kirkpatrick's Level 2. Self-efficacy is the interesting one, because it can legitimately go down. A participant who walked in rating themselves 4-out-of-5 confident at "recognizing when a disagreement is becoming personal" and walked out a 2 hasn't gotten worse. They've discovered the skill is harder than they thought, which for awareness-building objectives is exactly the outcome you wanted. A confidence drop paired with a sharp open-ended reflection is a win, and you'll misread your own pilot if you count it as a loss.
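One way to keep that reading honest in a script is to compute the per-item change but label a drop as something to read alongside the reflection, never as a loss. The sketch below assumes ratings are stored as a dict per participant; the function names, label text, and sample ratings are illustrative, not taken from the pilot.

```python
def efficacy_changes(pre, post):
    # Post minus pre on the 1-5 confidence scale, for items present on both surveys.
    return {item: post[item] - pre[item] for item in pre if item in post}


def label_change(delta):
    # A drop is flagged for reading, not scored as a failure.
    if delta < 0:
        return "dropped: read the open-ended reflection before judging"
    if delta > 0:
        return "rose"
    return "unchanged"


# Invented ratings for one participant.
pre = {
    "recognizing when a disagreement is becoming personal": 4,
    "speaking up about a concern": 3,
}
post = {
    "recognizing when a disagreement is becoming personal": 2,
    "speaking up about a concern": 4,
}

changes = efficacy_changes(pre, post)
for item, delta in changes.items():
    print(f"{item}: {delta:+d} ({label_change(delta)})")
```

Averaging these changes across a cohort would hide exactly the case described above, so the sketch stays at the level of one person and one item.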

Then there's the sample-size question. Our pilot seats six players. Running statistics on six people is theater – no t-tests, no percentages quoted to a decimal place. At that scale you read every response individually, look for direction, and mine the open-ended answers, which at small n are the richest data you have. "Four of six shifted toward inquiry responses, and the two who didn't both wrote reflections about avoidance" is an honest pilot finding. "Confidence improved 23%" from six people is a costume.

Small numbers also demand honesty about what you've shown. Directional pre/post shifts mean awareness moved, and awareness is a real, worthwhile outcome for a single session. They don't prove behavior changed. The one-week follow-up gives you the first hint of that; a longer horizon gives you more. Report what you measured, at the level you measured it.

Thirteen questions before, the same thirteen after, five minutes a week later. It's not a research program. It's enough to know whether the day did anything.

Put this into practice

Operation Aetherfall is a complete, pilot-tested scenario kit – facilitator guide, printable table pack, and assessment set – for running this kind of training with your own team.
