Case 02 · Enterprise AI · Agentic UX · Ongoing

How do you trust an AI agent you can't manually test?

Checkpoint — the confidence layer for AI agents in production.

ROLE

Lead UX Designer (IC)

INDUSTRY

Enterprise AI / Agentic CX

STATUS

V1 shipped · V2 in dev · V3 explored → V4 in design

01

The situation

The bet nobody could verify

The gap

Every AI agent in production is a bet that the configurations hold. The people making that bet — bot builders, QA teams, product owners — had no scalable way to know if it was good.

The cost

Delayed launches, limited agent scope, and reactive firefighting that eroded client trust in the AI programme.

The answer

Checkpoint makes the invisible risk visible — structured, automated evaluation that answers "is this ready?" before deployment and "is this still working?" after.

02

The reframe

The ask vs the task

The ask

The PM had mapped a set of nine touchpoints. It showed what the product did, not what the user needed to do.

The task

I argued the real job was "providing the user confidence across various agents." I built a journey around that job and layered three personas through it: the QA Lead who owns quality and runs tests, the CX Manager who arrives from analytics with a symptom, not a test plan, and the bot builder.

What it unlocked

Conversations with QA leads confirmed it — users ran evaluations but rarely acted on results, because the product made acting on them hard. That finding, not the brief, drove every design decision that followed.

03

The enhancements

Four versions: product and experience enhancements

V1 — Simple form

Shipped. Proved the "run" concept works, but gave no guidance to users who didn't know what to test.

V2 — Wizard

The initial draft had five steps of uneven weight; condensed to three because uneven step weight breaks pacing worse than density does. A live cost estimate was added after engineering flagged that runs carry real cost with no user visibility. Solved depth, but assumed vocabulary an entire persona didn't have.

V3 — Conversational

Natural language in, structured config out. Built because the journey map showed the CX Manager persona arrives from analytics with a symptom, not a test plan. Solved the vocabulary gap and surfaced the real problem: the issue was never the form.

The pivot

Users read the insight, navigated away, lost context, and didn't follow through. The steepest drop-off wasn't in running an evaluation; it was between the recommendation and the action.

V4 — Persistent toolbar

A floating panel on every page. Pin a recommendation, carry it to where the fix lives, act with the insight in view, re-run to verify. Collapsed by default so it doesn't compete with primary tasks; scoped to only what serves the insight-to-action loop — no dashboards, no settings.

04

Iterations

Sketches for various iterations

An NDA restricts me from presenting the actual screen designs.


05

The choice

Why not just automate the fix?

Competing approaches

Other tools in this space automate the fix outright or route it through a copilot.

Checkpoint keeps the human in control while removing every friction that stops them from following through. When a misconfigured agent runs at scale, a wrong fix costs thousands.

© 2026 by Mahima Jain
