How Do You Measure Whether Someone Is Actually Good at Working With AI?
Here’s a question that sounds simple and isn’t: is your team actually good at working with AI, or are they just using it?
"Using" means generating output. "Good at working with" means the human added judgment, caught errors, maintained context, and produced something the organization can defend. The difference matters because when something goes wrong, accountability doesn’t attach to the AI. It attaches to the person who signed off.
Every organization deploying AI needs to answer this question. And almost none of them can, because the tools they’re using to measure AI skills don’t measure collaboration. They measure knowledge.
The quiz problem
The dominant approach to measuring AI capability in organizations is some form of quiz: multiple-choice tests, scenario-based questions, self-assessment surveys. These tell you whether someone knows what good collaboration looks like. They don’t tell you whether someone does it.
This is the same gap that exists between knowing you should write tests and actually writing tests. Between knowing you should review PR diffs line by line and actually reviewing them. Knowledge and behavior diverge under real conditions, especially when the behavior is effortful and the shortcut is invisible.
The shortcut with AI is accepting output without meaningful verification. It looks like productivity. It feels like efficiency. And it’s undetectable by any assessment that asks what you would do rather than observing what you actually do.
What behavioral measurement looks like
PAICE takes a different approach. Instead of asking people about AI collaboration, it puts them in one.
The assessment is a 25-minute conversation with an AI system. It looks and feels like a normal working session: you’re given a realistic task, you collaborate with the AI to complete it, and you produce a deliverable. What you don’t know is that the AI’s outputs contain strategically injected errors – factual mistakes, logical inconsistencies, subtle hallucinations calibrated to the domain.
The assessment isn’t measuring whether you can use AI. It’s measuring what happens when the AI is wrong and you’re responsible for the output.
Do you catch the error? Do you verify claims that sound plausible? When you find a problem, do you fix it or work around it? When the AI pushes back on your correction, do you hold your ground or defer? These behavioral signals are what the scoring model captures.
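To make that concrete, here is a rough sketch of what a per-session record of those signals could look like. The field names and structure are illustrative assumptions, not PAICE’s actual schema:

```python
from dataclasses import dataclass

@dataclass
class BehavioralSignals:
    """One participant's observed behavior in a session.
    Field names are illustrative, not PAICE's internal schema."""
    errors_injected: int            # errors the session contained
    errors_caught: int              # errors the participant flagged
    claims_verified: int            # plausible claims the participant checked
    claims_accepted_unverified: int # plausible claims accepted on faith
    fixes_applied: int              # problems corrected in the deliverable
    workarounds: int                # problems routed around instead of fixed
    held_ground_on_pushback: int    # correct corrections kept after AI pushback
    deferred_on_pushback: int       # correct corrections abandoned after pushback

    @property
    def detection_rate(self) -> float:
        """Share of injected errors the participant actually caught."""
        return self.errors_caught / self.errors_injected if self.errors_injected else 0.0
```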
Dimensional scoring
Collaboration quality isn’t a single number. Someone might be excellent at iterative prompting but terrible at verification. Another person might catch every error but struggle to give the AI useful feedback. A single score flattens these differences into noise.
PAICE measures across multiple dimensions independently:
Accountability measures whether someone verifies outputs, detects injected errors, and takes ownership of the final work product. This is consistently the lowest-scoring dimension across all populations tested. People know they should verify. Under real working conditions, most don’t verify thoroughly enough.
Integrity measures whether someone maintains factual standards, catches logical inconsistencies, and refuses to use AI-generated content that doesn’t meet quality thresholds.
Collaboration quality measures the effectiveness of the human-AI interaction itself: whether feedback is specific and actionable, whether iteration actually improves the output, whether the person understands when AI adds value and when it introduces friction.
Evolution measures adaptive capacity: whether someone builds mental models of AI strengths and weaknesses over time and adjusts their approach accordingly.
Each dimension produces an independent score. For L&D teams designing targeted training, a dimensional profile is vastly more actionable than a percentage.
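A minimal illustration of why a profile beats a single number, using the dimensions above. The 0-100 scale, the data structure, and the example scores are assumptions for illustration, not PAICE’s scoring model:

```python
from dataclasses import dataclass

@dataclass
class DimensionalProfile:
    """One independent score per dimension, on an assumed 0-100 scale."""
    accountability: float
    integrity: float
    collaboration_quality: float
    evolution: float

    def weakest(self) -> str:
        """The dimension most in need of targeted training."""
        scores = vars(self)
        return min(scores, key=scores.get)

# Two people with the same average score need very different interventions.
a = DimensionalProfile(accountability=45, integrity=85, collaboration_quality=80, evolution=70)
b = DimensionalProfile(accountability=85, integrity=80, collaboration_quality=45, evolution=70)
print(a.weakest())  # accountability -> verification-focused training
print(b.weakest())  # collaboration_quality -> feedback and iteration training
```

Both profiles average 70. A single percentage would call them identical; the profile tells an L&D team where to aim.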
The engineering problem
Building this required solving several problems that don’t have obvious precedents:
Error injection that doesn’t break immersion. The injected errors have to be realistic enough that catching them requires domain judgment, not pattern recognition. If the errors are obviously wrong, you’re measuring attention, not expertise. If they’re too subtle, the signal-to-noise ratio collapses. The calibration is adaptive – the system adjusts based on how the participant is performing.
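For intuition, here is a simplified sketch of what an adaptive calibration loop could look like. The difficulty scale, target detection rate, and step size are illustrative assumptions, not PAICE’s actual parameters:

```python
def next_error_difficulty(current: float, detection_rate: float,
                          target: float = 0.6, step: float = 0.1) -> float:
    """Nudge injected-error subtlety toward a target detection rate.

    If the participant is catching nearly everything, errors get subtler;
    if they are catching almost nothing, errors get plainer so the session
    still yields signal. Difficulty is on an assumed 0.0-1.0 scale.
    """
    if detection_rate > target:
        current += step   # participant is coping; make errors subtler
    elif detection_rate < target:
        current -= step   # errors are lost in noise; make them more visible
    return max(0.0, min(1.0, current))
```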
Behavioral signal extraction from conversation. The scoring model doesn’t grade the deliverable. It analyzes the collaboration process: what the participant questioned, what they accepted, how they responded to pushback, whether their verification was systematic or sporadic. This requires a multi-model architecture where the assessment AI and the scoring model operate independently.
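A toy sketch of what scoring the process rather than the deliverable means in practice, assuming a simple event log of participant actions. The event types and format are invented for illustration; real extraction would be far richer:

```python
from typing import Iterable

def tally_verification(events: Iterable[dict]) -> dict:
    """Count accept-vs-verify behavior from a session event log.

    Each event is an assumed dict such as {"type": "accept"},
    {"type": "verify"}, or {"type": "challenge", "upheld": True}.
    """
    counts = {"accepted": 0, "verified": 0,
              "challenges_upheld": 0, "challenges_dropped": 0}
    for e in events:
        if e["type"] == "accept":
            counts["accepted"] += 1
        elif e["type"] == "verify":
            counts["verified"] += 1
        elif e["type"] == "challenge":
            key = "challenges_upheld" if e.get("upheld") else "challenges_dropped"
            counts[key] += 1
    return counts
```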
Multi-model bias prevention. When the AI that runs the conversation is also the AI that scores it, you get circular reasoning. PAICE uses separate models for assessment delivery and scoring, with the scoring model evaluating behavioral signals rather than output quality.
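In sketch form, the separation looks something like this. The class names, method signatures, and signal format are assumptions; the point is only that the model doing the scoring reads the participant’s behavior and never grades a conversation it also steered:

```python
class DeliveryModel:
    """Runs the working session and carries the injected errors."""
    def reply(self, user_message: str) -> str:
        return f"(assessment AI response to: {user_message})"

class ScoringModel:
    """A separate model that never produces assessment content;
    it only reads the behavioral record of the session."""
    def score(self, signals: dict) -> dict:
        detection = signals["errors_caught"] / max(signals["errors_injected"], 1)
        return {"accountability": round(100 * detection)}

def score_session(signals: dict, scorer: ScoringModel) -> dict:
    # The delivery model's outputs never reach the scorer as quality to
    # grade; only the participant's observed behavior does.
    return scorer.score(signals)

print(score_session({"errors_injected": 5, "errors_caught": 3}, ScoringModel()))
```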
Pre-post comparison for training ROI. The most valuable use case isn’t a one-time score. It’s administering the assessment before and after a training intervention and measuring whether actual behavior changed. This requires scoring stability across sessions and dimensional granularity fine enough to detect movement in specific skill areas.
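The arithmetic of a pre-post comparison is simple once scores are dimensional. The numbers and 0-100 scale below are made up for illustration:

```python
def training_delta(pre: dict, post: dict) -> dict:
    """Per-dimension change between a pre-training and post-training assessment."""
    return {dim: post[dim] - pre[dim] for dim in pre}

pre  = {"accountability": 42, "integrity": 71, "collaboration_quality": 64, "evolution": 58}
post = {"accountability": 61, "integrity": 73, "collaboration_quality": 66, "evolution": 70}
print(training_delta(pre, post))
# {'accountability': 19, 'integrity': 2, 'collaboration_quality': 2, 'evolution': 12}
```

A delta like that says the intervention moved verification behavior specifically, which is the claim a single before-and-after percentage can’t support.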
Who this is for
PAICE is built for L&D leaders, HR teams, and organizational decision-makers who are deploying AI and need to know whether their people are collaborating with it effectively or just using it as a faster copy-paste.
If you’re a developer interested in the measurement architecture, the Closing the Collaboration Gap whitepaper covers the technical framework, and the Engineering Trust series explores the intersection of trust, verification, and performance measurement in human-AI systems.
Built by SnapSynapse. PAICE.work PBC is a public benefit corporation focused on making human-AI collaboration measurable, teachable, and governable.