Mahima | Lead Product Designer
01
The situation
The bet nobody could verify
The gap
Every AI agent in production is a bet that the configurations hold. The people making that bet — bot builders, QA teams, product owners — had no scalable way to know if it was good.
The cost
Delayed launches, limited agent scope, and reactive firefighting that eroded client trust in the AI programme.
The answer
Checkpoint makes the invisible risk visible — structured, automated evaluation that answers "is this ready?" before deployment and "is this still working?" after.
02
The reframe
The ask vs the task
The ask
The PM had a set steps of nine touchpoints. It mapped what the product did, not what the user needed to do.
The task
I argued the real job was "providing the user conffidence over various agents." I built a journey around that job and layered three personas through it — the QA Lead who owns quality and runs tests, the CX Manager who arrives from analytics with a symptom not a test plan, and the bot builder.
What it unlocked
Conversations with QA leads confirmed it — users ran evaluations but rarely acted on results, because the product made acting on them hard. That finding, not the brief, drove every design decision that followed.




03
The enhancements
Four versions - product and experience enhancement
V1 — Simple form
Shipped. Proved the "run" concept works. But no guidance for users who didn't know what to test.
V2 — Wizard
The initial draft had five steps of uneven weight; condensed to three because uneven step weight breaks pacing worse than density does. A live cost estimate was added after engineering flagged that runs carry real cost with no user visibility. Solved depth, but assumed vocabulary an entire persona didn't have.
V3 — Conversational
Natural language in, structured config out. Built because the journey map showed the CX Manager persona arrives from analytics with a symptom, not a test plan. Solved the vocabulary gap and surfaced the real problem: the issue was never the form.
The pivot
Users read the insight, navigated away, lost context, didn't follow through. The steepest drop-off wasn't in running an evaluation it was between the recommendation and the action.
V4 — Persistent toolbar
A floating panel on every page. Pin a recommendation, carry it to where the fix lives, act with the insight in view, re-run to verify. Collapsed by default so it doesn't compete with primary tasks; scoped to only what serves the insight-to-action loop — no dashboards, no settings.
04
Iterations
Sketches for various iterations
NDA restricts me to present the actual screen designs

05
The choice
Why not just automate the fix?
Competing approaches
Other tools in this space automate the fix outright or route it through a copilot.
Checkpoint keeps the human in control while removing every friction that stops them from following through. When a misconfigured agent runs at scale, a wrong fix costs thousands..