Validated Instruments for Measuring Conflict Skills (and When to Write Your Own)

TKI, ROCI-II, psychological safety scales, SJTs – what validated conflict assessments measure, what they cost, and when to write your own instead.

9 min read · By Matthew Stublefield
[Image: close-up of a metal vernier caliper's measurement scales]

There's a folder in the Conflict Campaign research vault labeled Assessments. Four instruments live in it: the ROCI-II conflict questionnaire, a team psychological safety assessment, the Marlowe-Crowne social desirability scale, and a transcription of Van Dyne and LePine's six-item voice behavior scale. Adam and I gathered them while working out how we'd know whether Operation Aetherfall actually changes anyone's behavior – and I like that folder as a miniature of the whole measurement problem, right down to the fact that one of the four instruments exists mostly to catch people fibbing on the other three.

To measure conflict resolution skills, pair a validated self-report instrument with at least one behavioral measure, and run both before training and again three to six months after. The standard self-report options fall into three families: conflict-style inventories (the Thomas-Kilmann Conflict Mode Instrument and the Rahim Organizational Conflict Inventory-II), psychological safety and voice scales (Edmondson's seven-item team scale, Van Dyne and LePine's voice behavior items), and situational judgment tests. None of them measures skill directly. They measure preference, perception, or intention – which is why any serious plan triangulates self-report against something you can observe.

Measurement is the quieter half of the work I described in how to debrief a training exercise. We synthesized more than 200 studies before designing our first scenario, and the assessment literature was the most humbling stretch of that reading. I'm a product manager, not a psychometrician. This is the map I wish someone had handed me at the start.

What are the standard conflict resolution assessment tools?

Two instruments dominate conflict-style assessment, and they make a tidy contrast in licensing philosophy.

The Thomas-Kilmann Conflict Mode Instrument (TKI) is the most widely used conflict assessment in the world, with more than 50 years of validation evidence behind it. Thirty forced-choice item pairs, 10 to 15 minutes to complete, five modes as output – competing, collaborating, compromising, avoiding, accommodating – arranged along axes of assertiveness and cooperativeness. It's commercial: $21.95 per paper administration, $45 online, with certification training sold separately to practitioners. Its scoring is also ipsative, meaning your five mode scores always sum to 30. You're being compared against yourself, never against anyone else, which quietly rules out most of the statistics you might want to run across a team.
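To make the ipsative point concrete, here is a minimal sketch. It assumes a simplified forced-choice format in which each of 30 pairs awards one point to one of the five modes. The respondents and their answers are invented, and this is not the TKI's actual item key.

```python
from collections import Counter

MODES = ["competing", "collaborating", "compromising", "avoiding", "accommodating"]

def score_forced_choice(choices):
    """Count how often each mode was picked across the forced-choice pairs."""
    counts = Counter(choices)
    return {mode: counts.get(mode, 0) for mode in MODES}

# Hypothetical respondents: the mode each one chose in each of 30 pairs.
alex = (["competing"] * 12 + ["collaborating"] * 8 + ["compromising"] * 5
        + ["avoiding"] * 3 + ["accommodating"] * 2)
sam = (["avoiding"] * 10 + ["accommodating"] * 9 + ["compromising"] * 6
       + ["collaborating"] * 4 + ["competing"] * 1)

alex_scores = score_forced_choice(alex)
sam_scores = score_forced_choice(sam)

# Every profile adds up to the number of pairs, whoever filled it in.
print(sum(alex_scores.values()), sum(sam_scores.values()))  # 30 30
```

Because the total is fixed, a point gained on one mode is a point lost on another. Alex's 12 on competing says competing outranks Alex's other modes. It does not say Alex competes more than Sam in any absolute sense.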

The Rahim Organizational Conflict Inventory-II (ROCI-II) is the publicly available counterpart. Twenty-eight Likert items, about eight minutes, five styles (integrating, obliging, dominating, avoiding, compromising), internal consistency of 0.72 to 0.77 across subscales. Its quiet superpower is that it comes in three forms – conflict with your supervisor, with your subordinates, with your peers – because the same person often handles those three situations three different ways. And its scoring is non-ipsative, so group averages and between-person comparisons actually work.
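Internal consistency figures like these are commonly reported as Cronbach's alpha. That is my assumption here; I haven't confirmed which statistic sits behind the 0.72 to 0.77 range. If you want to run the same check on your own data, a minimal sketch looks like this, with invented responses for one hypothetical four-item subscale.

```python
from statistics import variance

def cronbach_alpha(responses):
    """responses: one list of item scores per respondent, all for a single subscale."""
    k = len(responses[0])                      # number of items
    item_columns = list(zip(*responses))       # regroup scores by item
    item_var_sum = sum(variance(col) for col in item_columns)
    total_var = variance([sum(r) for r in responses])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical data: five respondents, four Likert items (1-5).
subscale = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [4, 4, 5, 4],
]
print(round(cronbach_alpha(subscale), 2))
```

With a handful of respondents the number bounces around a lot, so treat it as a sanity check rather than a validation result.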

A third option, the Conflict Dynamics Profile (CDP), measures fifteen behavioral scales instead of styles – including "hot buttons," the emotional triggers that escalate you – and offers a 360-degree version that checks your self-perception against ratings from your boss, peers, and direct reports. It's commercial, requires certification to administer, and carries a thinner peer-reviewed evidence base than the TKI or ROCI-II, though healthcare professionals using its feedback reported real gains in self-awareness.

| Instrument | What it measures | Format | Licensing | The catch |
| --- | --- | --- | --- | --- |
| TKI | Five conflict-style preferences | 30 forced-choice pairs, 10–15 min | Commercial ($21.95 paper / $45 online) | Ipsative scoring; preference, not skill |
| ROCI-II | Five conflict styles by relationship (boss, peer, report) | 28 Likert items, ~8 min | Publicly available | Factor structure shifts in some populations |
| CDP | 15 conflict behaviors, including emotional triggers | Self-report or 360-degree | Commercial, certification required | Less peer-reviewed documentation |
| Edmondson psychological safety scale | Team climate for interpersonal risk-taking | 7 Likert items | Published in the research literature | Measures climate, not individual skill |
| Voice behavior scale (Van Dyne & LePine, 1998) | Speaking up with ideas and concerns | 6 items, 7-point scale | Published in the research literature | Self-report; pair with observation |
| Marlowe-Crowne Form C | Social desirability tendency | 10 true/false items | Published in the research literature | Can't separate honest saints from good liars |

What can a conflict-style score actually tell you?

A style score tells you what someone prefers, not what they can do. That's the TKI's own documented limitation – it measures preference rather than actual behavior or competency – and it's the reason style inventories make weak outcome measures. A six-week conflict workshop at Michigan State University administered the TKI ten weeks apart, before and after, and found a trend toward more cooperative conflict management that wasn't statistically significant. That result repeats across the literature: styles are stable-ish traits, and stable things make poor progress bars.

Styles also wobble by relationship. Plenty of people collaborate with peers and avoid with their boss, which is exactly why the ROCI-II ships three separate forms.

Where style inventories earn their money is at the front of training, as awareness and shared vocabulary. Handing a team the language of "avoiding" and "accommodating" changes the debrief conversations that follow. So here's the stance I'll defend to any certified practitioner: a style inventory is a conversation starter, not a scoreboard.

How do you measure psychological safety and willingness to speak up?

Use Edmondson's seven-item team psychological safety scale for climate, and a voice behavior scale for the speaking-up behavior itself. These constructs sit closer to what conflict training actually changes – the willingness to engage at all – than any style label does.

Edmondson's scale, developed in 1999, remains the most widely used psychological safety measure in the field. Seven items, a five- or seven-point Likert response, reliability typically 0.75 to 0.90 across studies, with 25+ years of validation in healthcare, corporate, and educational settings. The items are blunt in a useful way: "It is safe to take a risk on this team," and reverse-scored, "If you make a mistake on this team, it is often held against you." It measures the team, not the individual, and it predicts things worth predicting – team learning behavior, burnout, safety culture. If you need to justify caring about the construct at all, the business case for psychological safety has the receipts.
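Scoring a scale like this takes two steps: flip the reverse-scored items, then average up to the team. Here is a rough sketch, assuming a seven-point response format. The `REVERSED` positions are hypothetical and depend on the order you present the items in, and the answers are invented.

```python
from statistics import mean

SCALE_MAX = 7    # assumption: seven-point response format
REVERSED = {0}   # hypothetical: positions of reverse-scored items in your item order

def score_respondent(answers):
    """Flip reverse-scored items, then average one person's seven answers."""
    adjusted = [
        (SCALE_MAX + 1 - a) if i in REVERSED else a
        for i, a in enumerate(answers)
    ]
    return mean(adjusted)

def team_score(team_answers):
    """The construct is team-level, so report the team mean, not individuals."""
    return mean(score_respondent(a) for a in team_answers)

# Hypothetical two-person team; item 0 is the reverse-scored one.
team = [
    [2, 6, 6, 6, 6, 6, 6],   # 2 on a reversed item becomes 6
    [1, 7, 7, 7, 7, 7, 7],   # 1 on a reversed item becomes 7
]
print(team_score(team))  # 6.5
```

Forgetting the flip is the classic mistake: a low raw answer on "mistakes are held against you" is good news, and the unflipped number would drag a healthy team's score down.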

The voice behavior scale is the one we transcribed into our own assessment packet. Van Dyne and LePine published it in 1998: six items on a seven-point agree/disagree scale, including "I speak up and encourage others in my work unit to get involved in issues that affect our work." Our reasoning was simple. If conflict training works, it should show up as more voice – more people raising problems, disagreeing openly, recommending changes – and this scale asks about those behaviors directly instead of asking people to characterize their style.

Where do situational judgment tests fit?

Situational judgment tests hand people realistic workplace scenarios and ask them to pick the most and least effective responses. Meta-analytic research shows SJTs predict job performance 51% better than traditional interviews, which makes them the strongest self-administered option on this page. The clever feature is dual scoring: ask for the "most effective" response and you're measuring judgment; ask for the "most likely" response and you're measuring behavioral tendency. The gap between someone's two answers is itself worth discussing in a debrief.

A full battery runs 30 to 45 minutes. The catch is that scenarios have to mirror your workplace's actual conflicts – interdepartmental turf fights for one org, faculty disagreements for another – and writing good scenarios is the real cost. If you already run scenario-based training, you're partway there.
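The dual-scoring idea can be sketched in a few lines. Everything here is hypothetical: the scenario names, the options, and the expert effectiveness ratings in `KEY` are placeholders for whatever your own subject-matter experts would agree on.

```python
# Hypothetical expert key: effectiveness rating (1-5) for each response option.
KEY = {
    "turf_fight": {"A": 5, "B": 2, "C": 1},
    "missed_deadline": {"A": 3, "B": 5, "C": 2},
}

def dual_score(most_effective, most_likely):
    """Score judgment and tendency separately, and list where they diverge."""
    judgment = sum(KEY[s][opt] for s, opt in most_effective.items())
    tendency = sum(KEY[s][opt] for s, opt in most_likely.items())
    gaps = [s for s in KEY if most_effective[s] != most_likely[s]]
    return judgment, tendency, gaps

judgment, tendency, gaps = dual_score(
    most_effective={"turf_fight": "A", "missed_deadline": "B"},
    most_likely={"turf_fight": "A", "missed_deadline": "C"},
)
print(judgment, tendency, gaps)  # 10 7 ['missed_deadline']
```

The `gaps` list is the debrief material: this person knows the better response to a missed deadline and expects to do something else anyway.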

How bad is the social-desirability problem?

Bad enough that it deserves its own instrument. Social desirability bias is the tendency to answer in whatever way makes you look good, and the measured gaps are not subtle. During COVID, 94.5% of 1,434 respondents said, when asked directly, that they washed their hands properly; an indirect questioning technique that shields individual answers put the figure at 78.1%. Among 1,622 medical students, 2.71% admitted personal involvement in academic misconduct while 27.11% reported observing it in others – a tenfold gap. And in a review of health questionnaire studies, only about 7% used a social desirability scale at all; of those that did, 43 to 45% found the bias genuinely influenced their outcomes.
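For anyone who wants the arithmetic behind those two gaps spelled out, here it is, using only the percentages quoted above. The variable names are mine.

```python
# Handwashing: direct question vs. indirect questioning technique.
direct_pct, indirect_pct = 94.5, 78.1
handwash_gap = round(direct_pct - indirect_pct, 1)
print(handwash_gap)  # 16.4 percentage points of apparent over-reporting

# Academic misconduct: admitted personally vs. observed in others.
admitted_pct, observed_pct = 2.71, 27.11
misconduct_ratio = round(observed_pct / admitted_pct, 1)
print(misconduct_ratio)  # 10.0, the "tenfold gap"
```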

Conflict-training surveys sit at maximum pressure. "I am willing to engage in conflict" – who circles Strongly Disagree the week after their company paid for conflict training? The divergence shows up in behavior too: when researchers directly observed 100 therapy sessions across 36 clinicians and compared the observations with the clinicians' own accounts of their practice, self-report was the most discrepant of the three measurement types tested.

You have two defenses. The statistical one is the Marlowe-Crowne scale – Form C is ten true/false items like "I have never deliberately said something that hurt someone's feelings" – which estimates each respondent's tendency to self-flatter so you can control for it. It has a known flaw: a high score might mean someone's lying, or might mean they're genuinely that conscientious, and the scale can't tell you which. The cheaper defense is item design. "How many times in the past month did you initiate a difficult conversation?" is much harder to fudge than an attitude statement, because inventing a specific count takes effort that agreeing with a virtue doesn't.
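One simple way to put a social desirability score to work is to check how strongly it tracks the outcome you care about. I'm offering this as a sketch, not as the method any of the studies above used. It assumes you already have a 0 to 10 Form C score per respondent (the scoring key isn't reproduced here), and all the numbers are invented.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / spread

# Hypothetical data: social desirability score (0-10) and self-rated
# willingness to engage in conflict (1-7) for six respondents.
sd_scores = [2, 3, 5, 6, 8, 9]
self_rated = [4, 3, 5, 5, 7, 6]

r = pearson(sd_scores, self_rated)
print(round(r, 2))
```

A strong positive correlation is a warning that the self-ratings may be riding on impression management. A weak one is mildly reassuring, nothing more, given the scale's known flaw.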

When should you write your own instrument?

Write your own when you're a small team measuring directional movement against specific training objectives. Use a validated instrument when you need numbers you can defend to someone outside the room. With eight or twelve participants, nothing on this list will produce statistics a researcher would accept anyway – what a team lead actually needs to know is whether more people are raising problems earlier than they were in March. A short homegrown survey, run before and after, answers that, and I've written up the mechanics in pre- and post-training surveys. Homegrown also wins when your objective is narrower than any published construct. "Disagrees with the boss in sprint planning without going dark for two days afterward" is not a TKI mode.

The design principles come from the same research that validates the big instruments. Ask about behavior, not attitude – counts and recent specifics over agree/disagree virtue statements. Force differentiation: "When I raise a problem, my manager listens and acts / listens but doesn't act / becomes defensive" beats a five-point agreement scale, because every option describes a real state rather than a level of enthusiasm. Mix in reverse-worded items so straight-line agreers reveal themselves.
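Reverse-worded items only pay off if you actually look for the straight-liners. Here's a minimal check, assuming a five-point agreement scale and a survey that includes at least one reverse-worded item. The function name and the sample answers are hypothetical.

```python
def flags_straight_line(answers, scale_max=5):
    """True when every raw answer is identical and isn't the scale midpoint.

    On a survey with reverse-worded items, giving the same non-neutral answer
    to everything means agreeing with statements that contradict each other.
    """
    midpoint = (scale_max + 1) / 2
    return len(set(answers)) == 1 and answers[0] != midpoint

print(flags_straight_line([5, 5, 5, 5, 5, 5]))  # True
print(flags_straight_line([3, 3, 3, 3, 3, 3]))  # False
print(flags_straight_line([5, 1, 4, 2, 5, 4]))  # False
```

A flag is a reason to look closer, not grounds to throw a response out. With eight or twelve participants, every response you discard hurts.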

Borrow the bones of validated scales where licensing allows. We didn't invent our voice items; we used Van Dyne and LePine's six, on their seven-point format, because item wording is where decades of validity work actually lives. Make anonymity believable – coded identifiers to match pre and post responses, no names, online and self-paced, and say all of that out loud, since online self-administered surveys show measurably less social desirability bias than anything with an interviewer attached. And time it honestly: baseline before training, follow-up at three to six months. A survey one week out is measuring enthusiasm and intention, which is a legitimate but different job – one I've broken down in the one-week follow-up survey.
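Matching pre and post responses by coded identifier can be as plain as this. The codes and counts below are made up. The measure is the behavioral item from earlier: difficult conversations initiated in the past month.

```python
# Hypothetical responses keyed by self-chosen anonymous codes, no names.
pre = {"K7Q2": 1, "M3X9": 0, "P8L4": 2}    # baseline, before training
post = {"K7Q2": 3, "M3X9": 2, "Z1B6": 4}   # follow-up, three to six months later

# Only people who answered both rounds can show change.
matched = sorted(pre.keys() & post.keys())
changes = {code: post[code] - pre[code] for code in matched}

# Codes that appear in only one round (dropouts, new joiners, mistyped codes).
unmatched = sorted(pre.keys() ^ post.keys())

print(changes)    # {'K7Q2': 2, 'M3X9': 2}
print(unmatched)  # ['P8L4', 'Z1B6']
```

Keep an eye on the unmatched list. If it's long, people are probably forgetting their codes, and a recipe they can regenerate from memory tends to work better than a random string.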

One last finding worth stealing. In a pharmacy training program where students learned the exact assessment rubric beforehand and practiced against it, their self-ratings stopped diverging from trained observers' ratings. Publish your criteria. Self-assessment gets dramatically more accurate when people know precisely what's being assessed.

Adam came to counseling after 13 years as an electrical engineer, so between the two of us there's one person who instinctively trusts instruments and one who instinctively trusts conversations. The measurement plan we landed on trusts neither alone. Measure before, measure again at three months, and keep one instrument in the packet whose only job is to tell you how much to believe the others.

Put this into practice

Operation Aetherfall is a complete, pilot-tested scenario kit – facilitator guide, printable table pack, and assessment set – for running this kind of training with your own team.