User Guide
Human-in-the-Loop Review
How reviewers query, challenge, and sign off AI assessment findings
Version 1.0 | April 2026
Programme Insights | programmeinsights.co.uk
Table of Contents
1. Why Human-in-the-Loop Review
2. The Six Interaction Types
3. The Review Interface
4. Challenging a Finding
5. Force Override
6. The Audit Trail
7. Review Progress and Analytics
8. How It Flows Into Reports
9. Getting Started
1. Why Human-in-the-Loop Review
AI assessment without traceable human oversight is just a chatbot with a dashboard. Consultancies need to prove their review methodology. This feature makes the human review traceable, defensible, and auditable.
The Problem
When a consultancy delivers an IPA gateway review report, they need to demonstrate four things:
- Every finding was reviewed by a qualified professional
- Challenges were considered and either incorporated or explained
- Professional judgement was applied — the AI did not simply produce a report that was rubber-stamped
- The review process was systematic — not ad hoc
Without Programme Insights, a consultancy's review workbook is an Excel spreadsheet with criteria, RAG ratings, and short comments. No traceability of how ratings were determined. No evidence of systematic review methodology. Typically produced by 2–3 consultants over 2–3 weeks.
With human-in-the-loop review, the same workbook becomes a structured assessment with full evidence citations, a complete audit trail showing who reviewed what, when, and why, and traceable challenges with resolutions. Produced in 2–3 days with 1–2 reviewers using AI as the first-pass assessor.
The Escalation Model
Programme Insights follows an escalation-based model: the AI handles the assessment autonomously, and human reviewers focus on the exceptions — the findings that need professional judgement.
The Stanford Enterprise AI Playbook (March 2026), studying 51 enterprise deployments, found that escalation-based models deliver significantly higher productivity gains than approval-only models where humans must sign off every output.
Key Statistic
71% median productivity gain when AI handles the baseline and humans review exceptions — versus 30% for approval-only models where every output requires human sign-off.
Source: Stanford Enterprise AI Playbook, March 2026
In practice, the AI handles the 80%+ of criteria where the evidence is clear. The human reviewer focuses on the 15–20% that need professional judgement. The audit trail proves the methodology was applied.
2. The Six Interaction Types
Every finding in your assessment results is interactive. You can query it, challenge it, add context, accept it, dismiss it, or override it. Each action triggers a different system response and is recorded in the audit trail.
| Action | What It Does | Changes Rating? | Cost |
| --- | --- | --- | --- |
| Accept | Stamps the finding as reviewed and agreed. The professional sign-off. Most frequent action, lowest friction: a single checkmark toggle. | No | Zero (database write only) |
| Query | Ask the system to explain its reasoning. The response is grounded in retrieved evidence, the scoring rubric, and calibration rules. Read-only, with no rating impact. | Never | Single LLM call, ~3–8 seconds |
| Challenge | Provide reasoning or additional context that should change the rating. Triggers a targeted re-assessment of that single criterion. You propose a new rating and justify it. | Potentially | Re-retrieval + LLM scoring, ~15–30 seconds |
| Force Override | Override the AI's rating based on professional judgement. Only available after a rejected challenge. Requires mandatory justification (minimum 50 characters). The resulting badge displays "Human Override". | Always | Zero (database write only) |
| Add Context | Provide information not in the uploaded documents: verbal updates, recent decisions, known issues. Appears as a reviewer annotation in reports. Does not trigger re-assessment. | No | Zero (database write only) |
| Dismiss | Mark a criterion as not applicable with a reason. Dismissed criteria are excluded from aggregate scoring, greyed out in the interface, and clearly marked in reports. | Removes from aggregation | Zero (deterministic recalculation) |
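The table above can be read as a small lookup from action to rating effect. The following is a minimal sketch of that lookup in Python. The names `Action`, `RatingEffect`, `EFFECTS`, and `calls_llm` are illustrative assumptions and are not taken from the product.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"
    QUERY = "query"
    CHALLENGE = "challenge"
    FORCE_OVERRIDE = "force_override"
    ADD_CONTEXT = "add_context"
    DISMISS = "dismiss"


class RatingEffect(Enum):
    NONE = "none"          # rating is not touched
    POSSIBLE = "possible"  # rating may change after re-assessment
    ALWAYS = "always"      # rating is set by the reviewer
    EXCLUDED = "excluded"  # criterion is removed from aggregate scoring


# Hypothetical mapping that mirrors the "Changes Rating?" column.
EFFECTS = {
    Action.ACCEPT: RatingEffect.NONE,
    Action.QUERY: RatingEffect.NONE,
    Action.CHALLENGE: RatingEffect.POSSIBLE,
    Action.FORCE_OVERRIDE: RatingEffect.ALWAYS,
    Action.ADD_CONTEXT: RatingEffect.NONE,
    Action.DISMISS: RatingEffect.EXCLUDED,
}


def calls_llm(action: Action) -> bool:
    """Per the Cost column, only Query and Challenge involve an LLM call."""
    return action in {Action.QUERY, Action.CHALLENGE}
```

Keeping the effect of each action in one place makes it easy to check that every action is recorded in the audit trail with the right expectation about rating changes.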
The Action Strip
Each criterion row in the assessment view has a persistent action strip on the right side:
[Query] [Challenge] [Add Context] [More ↓] [✓ Accept]
- Query, Challenge, Add Context are always visible as buttons
- More dropdown contains Dismiss (Force Override only appears after a rejected challenge)
- Accept is a standalone checkmark toggle — most frequent action, needs lowest friction
Example: Query in Action
You ask: "Why did you rate this GREEN when the cost estimate is from 2024?"
System responds: "This criterion assesses whether a cost estimation methodology exists, not whether estimates are current. The FBC Section 4.2 demonstrates a structured QRA-based methodology with P50/P80 ranges. The age of the estimate would be captured under criterion GR-3.4 (Cost estimate currency)."
3. The Review Interface
The review interface uses a side drawer pattern — the criteria list stays visible on the left while the interaction content opens on the right. You never lose context of what you are reviewing.
Why a Side Drawer
- Not a modal — you need to see the finding you are interacting with. A modal obscures context.
- Not inline expansion — interaction content (AI explanations, re-assessment results, audit history) can be lengthy. Expanding inline would push other findings out of view.
- Not a floating chat — interactions are anchored to specific findings, not free-form. This is structured professional review, not conversation.
Layout Overview
Criteria Browser (60%)
[Category: Financial Case]
RED  GR-3.1 Robust cost estimation methodology →
     Evidence: FBC S4.2, Cost Review Memo • Gap: No QRA or P50 analysis
AMB  GR-3.2 Cost contingency and risk allowance
GRN  GR-3.3 Funding approval and affordability
RED  GR-3.4 Cost estimate currency and indexation

Interaction Drawer (40%)
GR-3.1: Cost estimation methodology | RED | v3
Tabs: Explain | History | Audit
Assessment justification, key evidence, gaps, and guidance reference appear here...
[User input area]
[Submit] [Cancel]
Drawer Content
When you click on any criterion, the drawer opens with three tabs:
Explain Tab (Default)
Shows the assessment justification, key evidence retrieved, gaps identified, and the guidance reference used. Includes an "ask a follow-up" input for queries.
History Tab
Shows rating changes over time — across assessment runs and any challenge-driven re-assessments. Each version shows the rating, date, and what triggered the change.
Audit Tab
Shows all interactions on this criterion by all users, newest first. Every query, challenge, context addition, acceptance, and override is listed with timestamps, user names, and full text.
Opening and Closing
- Click any criterion row to open the drawer with that finding's detail
- Click a different criterion to switch the drawer content without closing it
- Press Esc or click outside the drawer to close it
- The drawer slides in from the right (300ms ease-out transition)
Interactive Mockup
See the full interactive mockup at programmeinsights.co.uk/mockup-hitl-side-drawer.html
4. Challenging a Finding
When you disagree with a finding, challenge it. The system re-assesses with your input and either changes the rating or explains why the original stands. Both outcomes are recorded.
Step-by-Step Walkthrough
- You are scanning the criteria browser. You see GR-3.1: Robust cost estimation methodology rated RED. Based on your experience, you believe this should be AMBER.
- Click the Challenge button on the GR-3.1 row. The side drawer slides open with the challenge form pre-focused.
- Select your proposed rating from the dropdown: AMBER. Write your justification in the text area (minimum 20 characters — forces substantive reasoning).
- Click Submit Challenge. The button changes to a spinner: "Re-assessing with your input... (15–30s)". The system runs a targeted re-assessment of that single criterion, injecting your challenge text as additional context.
- Result appears. One of two outcomes:
- Rating changed: Banner shows "Rating updated: RED → AMBER. See explanation below." The new justification appears with your context highlighted.
- Rating unchanged: Banner shows "Rating remains RED after re-assessment. See explanation below." The AI explains why the original rating holds despite your input.
Example Challenge
You write: "This should be AMBER. The QRA was done by the delivery team, not an independent estimator. IPA would flag this."
System responds (rating changed): "Rating updated to AMBER. While a QRA methodology is documented in FBC Section 4.2, the lack of independent verification represents a material weakness that would be flagged in an IPA Gate 3 review."
When the System Disagrees
This is the most important UX moment. The system does not simply say "no". It:
- Acknowledges your point — never dismisses it
- Explains the specific evidence that supports the original rating
- Distinguishes scope — often your concern is valid but applies to a different criterion
- Offers alternatives — "Your concern about independent verification may be better captured under criterion GR-3.4. Would you like to review that criterion?"
By Design
The system can disagree with you. This is intentional — it maintains the integrity of the assessment. If you still disagree after a rejected challenge, you have Force Override (see Section 5).
What Happens Under the Hood
The re-assessment does not re-run the full assessment pipeline (that would take 2–5 minutes). It runs a targeted 3-step process:
1. Re-Retrieve
If your challenge references specific documents or evidence, a targeted retrieval runs against your claims. Skipped if the challenge is purely interpretive.
2. Re-Score
The same scoring rubric is applied, but with your challenge text injected as additional context. The system considers your input fairly — it does not defer simply because you challenged.
3. Compare & Record
The new rating is compared to the original. If changed, the criterion updates and aggregates recalculate. Either way, the full re-assessment response is stored in the audit trail.
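The three steps above can be sketched as a single function. This is an illustration under stated assumptions, not the product's implementation: `reassess`, `Reassessment`, and the injected `retrieve` and `score` callables are hypothetical names, and the 20-character minimum comes from the challenge form described earlier in this section.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

MIN_CHALLENGE_CHARS = 20  # minimum justification length on the challenge form


@dataclass
class Reassessment:
    original_rating: str
    new_rating: str
    explanation: str

    @property
    def changed(self) -> bool:
        return self.new_rating != self.original_rating


def reassess(
    original_rating: str,
    challenge_text: str,
    retrieve: Optional[Callable[[str], List[str]]],
    score: Callable[[List[str], str], Tuple[str, str]],
) -> Reassessment:
    """Run the three steps for a single criterion (hypothetical helper)."""
    if len(challenge_text.strip()) < MIN_CHALLENGE_CHARS:
        raise ValueError("Challenge justification is too short")

    # 1. Re-retrieve: pass retrieve=None when the challenge is purely interpretive.
    evidence = retrieve(challenge_text) if retrieve is not None else []

    # 2. Re-score: the challenge text is supplied as additional context.
    new_rating, explanation = score(evidence, challenge_text)

    # 3. Compare and record: the caller stores the result whether or not it changed.
    return Reassessment(original_rating, new_rating, explanation)
```

Passing the retriever and scorer in as arguments keeps the sketch independent of any particular retrieval or LLM backend. The `changed` property decides which of the two result banners would be shown.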
5. Force Override
When the AI disagrees with your challenge but your professional judgement takes precedence, Force Override is your final authority. It is the safety valve that ensures human expertise always wins.
When It Becomes Available
Force Override is not available as a first action. It only appears after a challenge has been rejected by the system. This sequence is deliberate:
AI Rating → Challenge Submitted → AI Rejects Challenge → Force Override Available
This means every override has context: you challenged, the system explained why it disagreed, and you still chose to override. The full chain is preserved.
How to Override
- After your challenge is rejected, the Force Override button appears below the system's response in the drawer.
- Click Force Override. A form appears with a dropdown for your chosen rating and a justification text area.
- Write your justification — minimum 50 characters. This must explain your professional reasoning. One-word justifications are not accepted.
- Click Confirm Override. The rating changes immediately to your specified value.
What Changes
- The criterion's RAG badge updates to your chosen rating
- A Human Override badge appears on the finding with a distinct visual marker
- Category and overall ratings recalculate to reflect the change
- The full chain is recorded in the audit trail: AI rating → your challenge → AI rejection reasoning → your override with justification
Example Override Justification
"Overriding to AMBER. As a chartered cost consultant with 20 years' experience, the absence of independent QRA verification is a material weakness that IPA would flag at Gate 3. The AI's assessment of the methodology as adequate does not account for IPA's expectation of independence."
Rules and Limits
| Rule | Detail |
| --- | --- |
| One override per criterion | Once overridden, the rating is final. No re-overriding. |
| Maximum 3 challenges before override | After 3 rejected challenges, Force Override is the only remaining path. The system prompts: "You can Force Override the rating with your professional justification." |
| Reviewer, Admin, or Project Owner role required | Viewers cannot override. Only users with the Reviewer, Admin, or Project Owner role can. |
| Minimum 50-character justification | The justification text appears in the audit trail and reports. It must be substantive. |
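The rules above combine into a simple gate. The sketch below shows one way to express them in Python. The function names and the way state is passed in (a role string, a count of rejected challenges, an already-overridden flag) are assumptions for illustration only.

```python
OVERRIDE_ROLES = {"Reviewer", "Admin", "Project Owner"}
MIN_JUSTIFICATION_CHARS = 50
MAX_REJECTED_CHALLENGES = 3


def can_challenge(rejected_challenges: int) -> bool:
    """After 3 rejected challenges, Force Override is the only remaining path."""
    return rejected_challenges < MAX_REJECTED_CHALLENGES


def can_override(role: str, rejected_challenges: int, already_overridden: bool) -> bool:
    """Override needs an allowed role, a prior rejected challenge, and no earlier override."""
    return (
        role in OVERRIDE_ROLES
        and rejected_challenges >= 1
        and not already_overridden
    )


def validate_override(
    role: str,
    rejected_challenges: int,
    already_overridden: bool,
    justification: str,
) -> None:
    """Raise if the override request breaks any rule; return None otherwise."""
    if not can_override(role, rejected_challenges, already_overridden):
        raise PermissionError("Force Override is not available for this criterion")
    if len(justification.strip()) < MIN_JUSTIFICATION_CHARS:
        raise ValueError("Justification must be at least 50 characters")
```

For example, `can_override("Viewer", 1, False)` is false because of the role, and `can_override("Admin", 0, False)` is false because no challenge has been rejected yet.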
The Audit Trail Records Everything
The override, the AI's reasoning for its original rating, the AI's reasoning for rejecting your challenge, and your justification for overriding — all appear in the audit trail. Nothing is hidden.
6. The Audit Trail
The audit trail is not a compliance checkbox. For consultancies delivering IPA gateway reviews, the audit trail IS the professional work product. It proves systematic review methodology was applied.
What Gets Recorded
Every interaction records the following data:
| Field | Description |
| --- | --- |
| Timestamp | When the interaction occurred (date and time to the second) |
| User | Who performed the interaction (name and role) |
| Action Type | Query, Challenge, Accept, Force Override, Add Context, or Dismiss |
| Comment | What the reviewer wrote: the question, justification, or context |
| Original Rating | The rating before the interaction |
| New Rating | The rating after (null if unchanged) |
| AI Response | The system's explanation, re-assessment result, or acknowledgement |
| Duration | How long the system took to respond (in milliseconds) |
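The fields above map naturally onto an immutable record. The following Python sketch is one possible shape. The class name `AuditRecord`, the attribute names, and the `newest_first` helper are assumptions, not the product's schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)  # frozen: an audit entry should not change once written
class AuditRecord:
    timestamp: datetime            # date and time to the second
    user_name: str
    user_role: str
    action_type: str               # Query, Challenge, Accept, Force Override, Add Context, Dismiss
    comment: str                   # question, justification, or context
    original_rating: str
    new_rating: Optional[str]      # None when the rating is unchanged
    ai_response: str
    duration_ms: int               # system response time in milliseconds


def newest_first(records: List[AuditRecord]) -> List[AuditRecord]:
    """Order entries as the Audit tab shows them: reverse chronological."""
    return sorted(records, key=lambda r: r.timestamp, reverse=True)
```

Using `None` for an unchanged rating mirrors the "null if unchanged" rule in the table and keeps rating changes easy to filter.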
The Audit Timeline
In the Audit tab of the side drawer, interactions appear in reverse chronological order. Here is an example for criterion GR-3.1:
GR-3.1: Robust cost estimation methodology
Rating: RED → AMBER (challenged)
15 Apr 2026, 14:32
"Confirmed after reviewing supporting documents"
15 Apr 2026, 11:45
"QRA exists but not independently verified"
AI: "Rating updated to AMBER. While a QRA methodology is documented, the lack of independent verification represents a material weakness..."
15 Apr 2026, 09:15
"Why RED when there is a QRA in the FBC?"
AI: "The RED rating reflects two factors: (1) the QRA uses 2024 base costs without indexation..."
14 Apr 2026, 22:00
Original rating: RED | Confidence: 0.85
Why It Matters
The interaction audit trail provides something consultancies have never had before: perfect traceability of the review process. When a client or IPA assessor asks "how did you arrive at this rating?", the answer is not a verbal explanation — it is a timestamped record of every query, challenge, resolution, and sign-off.
Interactive Mockup
See the full audit timeline mockup at programmeinsights.co.uk/mockup-hitl-audit-timeline.html
7. Review Progress and Analytics
Track your review progress at a glance. Visual indicators on every criterion show its interaction state, and a progress bar per category tells you how much of the assessment has been reviewed.
Progress Bar
A review progress bar sits at the top of each category in the criteria browser:
Financial Case
12 of 24 criteria reviewed (50%)
The progress bar gives you a sense of completion and helps you track where you left off between sessions.
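The label on the progress bar is simple arithmetic over the criteria in a category. A minimal sketch follows, assuming a hypothetical `review_progress` helper and whole-number rounding. Neither the name nor the rounding rule is confirmed by the product.

```python
def review_progress(reviewed: int, total: int) -> str:
    """Format a category progress label from reviewed and total criteria counts."""
    if total <= 0:
        raise ValueError("A category needs at least one criterion")
    if not 0 <= reviewed <= total:
        raise ValueError("Reviewed count must be between 0 and total")
    percent = round(100 * reviewed / total)  # rounding rule is an assumption
    return f"{reviewed} of {total} criteria reviewed ({percent}%)"


print(review_progress(12, 24))  # 12 of 24 criteria reviewed (50%)
```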
Visual Indicators Per Criterion
Each criterion row in the browser shows its current interaction state with an icon:
| Icon | State | Meaning |
| --- | --- | --- |
| (none) | Unreviewed | No human interaction yet |
| ✓ | Accepted | Reviewer confirmed the finding |
| 💬 | Has comments | Query, context, or challenge exists on this criterion |
| ⇔ | Challenged & changed | Rating was changed via a successful challenge |
| 🛡 | Force overridden | Rating set by human professional judgement, overriding AI |
| ABC | Dismissed | Not applicable, with documented reason |
Keyboard Shortcuts
When a criterion is focused in the browser, you can work through your review without touching the mouse:
| Key | Action |
| --- | --- |
| Q | Open Query form |
| C | Open Challenge form |
| A | Open Add Context form |
| Enter | Accept (toggle reviewed status) |
| D | Dismiss with reason |
| O | Force Override (when available) |
| ↑ ↓ | Navigate between criteria |
| Esc | Close the interaction drawer |
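The shortcut table amounts to a key-to-action map with one conditional entry. The sketch below is illustrative only. The key names follow the browser `KeyboardEvent.key` convention (`"Enter"`, `"Escape"`, `"ArrowUp"`), which is an assumption about the interface, and `SHORTCUTS` and `resolve_shortcut` are hypothetical names.

```python
from typing import Optional

# Hypothetical key map mirroring the shortcut table.
SHORTCUTS = {
    "q": "query",
    "c": "challenge",
    "a": "add_context",
    "Enter": "accept",
    "d": "dismiss",
    "o": "force_override",
    "ArrowUp": "previous_criterion",
    "ArrowDown": "next_criterion",
    "Escape": "close_drawer",
}


def resolve_shortcut(key: str, override_available: bool = False) -> Optional[str]:
    """Return the action for a key press, or None if the key does nothing."""
    # Single letters are matched case-insensitively; named keys are matched as-is.
    lookup = key.lower() if len(key) == 1 else key
    action = SHORTCUTS.get(lookup)
    # O only works once Force Override has become available for the criterion.
    if action == "force_override" and not override_available:
        return None
    return action
```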
Batch Review
For categories where most criteria are straightforward, batch review saves time:
- Toggle Review Mode at the top of the criteria browser. Checkboxes appear on each criterion row.
- Select multiple criteria by checking boxes, or use Select All in Category.
- Click Accept All Selected to stamp all as reviewed in a single action, with bulk audit entries created.
You can also use sequential review mode: arrow keys or the "Next" button in the drawer advances to the next criterion without closing it. Work through an entire category without mouse interaction: review, accept (Enter), next (arrow), review, challenge (C), type, submit, next.
Interactive Mockup
See the review progress mockup at programmeinsights.co.uk/mockup-hitl-review-progress.html
8. How It Flows Into Reports
Your review interactions do not disappear into a database. They appear in three distinct places in your PDF reports, turning AI assessment into a documented professional review.
1. Inline Reviewer Notes on Findings
Each finding that has been reviewed shows a Reviewer Notes section beneath the assessment justification in the Full Assessment report:
GR-3.1: Robust cost estimation methodology
Rating: AMBER (revised from RED)
Assessment: The cost estimation methodology documented in FBC Section 4.2 demonstrates a structured QRA-based approach...
Reviewer Notes
• Challenged by J. Reece (15/04/2026): "QRA exists but not independently verified." Rating revised RED → AMBER.
• Accepted by S. Chen (15/04/2026): Confirmed after review.
2. Review Summary Section
A new section appears between the Executive Summary and Detailed Findings in all report types:
Review Summary
| Assessment reviewed by | James Reece (Head of PMO), Sarah Chen (Cost Analyst) |
| Review period | 14–15 April 2026 |
| Criteria reviewed | 24 of 24 (100%) |
| Ratings challenged | 3 (2 changed, 1 upheld) |
| Criteria dismissed | 1 (GR-4.7: PFI criterion, not applicable) |
| Additional context | 5 annotations |
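The "Ratings challenged" line in the summary can be derived from challenge outcomes in the audit trail. A minimal sketch, assuming each challenge is reduced to an `(original_rating, new_rating)` pair with `None` for an unchanged rating. The helper name `summarise_challenges` is hypothetical.

```python
from typing import List, Optional, Tuple


def summarise_challenges(outcomes: List[Tuple[str, Optional[str]]]) -> str:
    """Count challenges that changed a rating versus those where it was upheld."""
    changed = sum(1 for old, new in outcomes if new is not None and new != old)
    upheld = len(outcomes) - changed
    return f"{len(outcomes)} ({changed} changed, {upheld} upheld)"


outcomes = [("RED", "AMBER"), ("AMBER", "GREEN"), ("RED", None)]
print(summarise_challenges(outcomes))  # 3 (2 changed, 1 upheld)
```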
3. Full Interaction Log (Premium Reports)
For assurance-grade reports, an appendix lists every interaction with full text. This is the work product that consultancies attach to their deliverables — the equivalent of a traditional review workbook, but with perfect traceability.
The appendix includes:
- Every interaction, chronologically ordered, per criterion
- Full text of reviewer comments and AI responses
- Rating change history with before/after values
- Override justifications with the full challenge-rejection-override chain
- Dismiss reasons for excluded criteria
The Report IS the Work Product
A completed PI assessment with full reviewer interactions is equivalent to a traditional review workbook — structured assessment, evidence citations, professional sign-off, challenge resolution, and override justification — but with perfect traceability and produced in days rather than weeks.
9. Getting Started
A practical checklist to work through your first review. Start with the category that matters most and work outward.
- Open your assessment results. Navigate to the assessment you want to review from your dashboard. The criteria browser shows all findings grouped by category.
- Start with the category that matters most. If you are preparing for an IPA gateway, start with the category most likely to draw scrutiny. The progress bar at the top of each category shows 0% reviewed — this is your starting point.
- Work through criteria: accept the clear ones, query the uncertain ones. For findings where the AI's assessment matches your view, hit Enter to accept and move on. For findings where you want to understand the reasoning, press Q to query.
- Challenge findings you disagree with and provide evidence. When you see a rating that does not match your professional judgement, press C to challenge. Select your proposed rating and write your justification. The more specific your reasoning, the better the re-assessment.
- Use Force Override only when your professional judgement demands it. If the system rejects your challenge and you still disagree, override it. Write a clear justification — this appears in the audit trail and reports.
- Check the progress bar — aim for 100% reviewed. A fully reviewed assessment produces the strongest reports. The progress bar tracks your completion per category.
- Generate your report — the audit trail appears automatically. When you generate a PDF report, all reviewer notes, the review summary, and the interaction log (premium reports) are included without any extra steps.
Typical Review Session
A typical assessment has 24–60 criteria per category. A thorough review usually involves accepting 15–20 criteria, querying 5–10, challenging 3–5, and adding context to a handful. Most reviewers complete a category in 30–45 minutes.
Quick Reference: Action Summary
Most Common Workflow
1. Scan criteria in the browser
2. Accept obvious findings (Enter)
3. Query anything unclear (Q)
4. Challenge disagreements (C)
5. Override if needed (O)
6. Generate report
Key Shortcuts
Enter — Accept
Q — Query
C — Challenge
A — Add Context
D — Dismiss
O — Force Override
Esc — Close drawer