User Guide

Human-in-the-Loop Review

How reviewers query, challenge, and sign off AI assessment findings

Version 1.0 | April 2026 | Programme Insights | programmeinsights.co.uk

1. Why Human-in-the-Loop Review

AI assessment without traceable human oversight is just a chatbot with a dashboard. Consultancies need to prove their review methodology. This feature makes the human review traceable, defensible, and auditable.

The Problem

When a consultancy delivers an IPA gateway review report, they need to demonstrate four things:

  1. Every finding was reviewed by a qualified professional
  2. Challenges were considered and either incorporated or explained
  3. Professional judgement was applied — the AI did not simply produce a report that was rubber-stamped
  4. The review process was systematic — not ad hoc

Without Programme Insights, a consultancy's review workbook is an Excel spreadsheet of criteria, RAG ratings, and short comments. It carries no record of how ratings were determined and no evidence of a systematic review methodology, and it typically takes 2–3 consultants 2–3 weeks to produce.

With human-in-the-loop review, the same workbook becomes a structured assessment with full evidence citations, traceable challenges with resolutions, and a complete audit trail showing who reviewed what, when, and why. It is produced in 2–3 days by 1–2 reviewers, with AI as the first-pass assessor.

The Escalation Model

Programme Insights follows an escalation-based model: the AI handles the assessment autonomously, and human reviewers focus on the exceptions — the findings that need professional judgement.

The Stanford Enterprise AI Playbook (March 2026), studying 51 enterprise deployments, found that escalation-based models deliver significantly higher productivity gains than approval-only models where humans must sign off every output.

Key Statistic
71% median productivity gain when AI handles the baseline and humans review exceptions — versus 30% for approval-only models where every output requires human sign-off. Source: Stanford Enterprise AI Playbook, March 2026

In practice, the AI handles the 80%+ of criteria where the evidence is clear. The human reviewer focuses on the 15–20% that need professional judgement. The audit trail proves the methodology was applied.

2. The Six Interaction Types

Every finding in your assessment results is interactive. You can query it, challenge it, add context, accept it, dismiss it, or override it. Each action triggers a different system response and is recorded in the audit trail.

Action | What It Does | Changes Rating? | Cost
Accept | Stamps the finding as reviewed and agreed. The professional sign-off. Most frequent action, lowest friction: a single checkmark toggle. | No | Zero (database write only)
Query | Ask the system to explain its reasoning. The response is grounded in retrieved evidence, the scoring rubric, and calibration rules. Read-only, with no rating impact. | Never | Single LLM call, ~3–8 seconds
Challenge | Provide reasoning or additional context that should change the rating. Triggers a targeted re-assessment of that single criterion. You propose a new rating and justify it. | Potentially | Re-retrieval + LLM scoring, ~15–30 seconds
Force Override | Override the AI's rating based on professional judgement. Only available after a rejected challenge. Requires mandatory justification (minimum 50 characters). The resulting badge displays "Human Override". | Always | Zero (database write only)
Add Context | Provide information not in the uploaded documents: verbal updates, recent decisions, known issues. Appears as a reviewer annotation in reports. Does not trigger re-assessment. | No | Zero (database write only)
Dismiss | Mark a criterion as not applicable with a reason. Dismissed criteria are excluded from aggregate scoring, greyed out in the interface, and clearly marked in reports. | Removes from aggregation | Zero (deterministic recalculation)
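
The table above can be read as a small lookup of action properties. The following is a minimal Python sketch of that idea, not the product's data model: the names `Action`, `ACTIONS`, and `zero_cost_actions` are hypothetical, and the values are copied from the "Changes Rating?" and "Cost" columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    changes_rating: str  # wording taken from the "Changes Rating?" column
    calls_llm: bool      # True when the table lists an LLM call as the cost


# Hypothetical lookup keyed by action name; values mirror the table rows.
ACTIONS = {
    "Accept": Action("No", calls_llm=False),
    "Query": Action("Never", calls_llm=True),
    "Challenge": Action("Potentially", calls_llm=True),
    "Force Override": Action("Always", calls_llm=False),
    "Add Context": Action("No", calls_llm=False),
    "Dismiss": Action("Removes from aggregation", calls_llm=False),
}


def zero_cost_actions() -> list[str]:
    """Return the actions that complete without an LLM call, sorted by name."""
    return sorted(name for name, action in ACTIONS.items() if not action.calls_llm)


print(zero_cost_actions())
```

Only Query and Challenge involve an LLM call, which is why they are the only actions with a noticeable wait.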

The Action Strip

Each criterion row in the assessment view has a persistent action strip on the right side:

[Query] [Challenge] [Add Context] [More ↓] [✓ Accept]

Example: Query in Action
You ask: "Why did you rate this GREEN when the cost estimate is from 2024?"

System responds: "This criterion assesses whether a cost estimation methodology exists, not whether estimates are current. The FBC Section 4.2 demonstrates a structured QRA-based methodology with P50/P80 ranges. The age of the estimate would be captured under criterion GR-3.4 (Cost estimate currency)."

3. The Review Interface

The review interface uses a side drawer pattern — the criteria list stays visible on the left while the interaction content opens on the right. You never lose context of what you are reviewing.

Why a Side Drawer

Layout Overview

Criteria Browser (60%)
  [Category: Financial Case]
  RED  GR-3.1  Robust cost estimation methodology
       Evidence: FBC S4.2, Cost Review Memo • Gap: No QRA or P50 analysis
  AMB  GR-3.2  Cost contingency and risk allowance
  GRN  GR-3.3  Funding approval and affordability
  RED  GR-3.4  Cost estimate currency and indexation

Interaction Drawer (40%)
  GR-3.1: Cost estimation methodology
  RED  v3
  [Explain] [History] [Audit]
  Assessment justification, key evidence, gaps, and guidance reference appear here...
  [User input area]
  [Submit] [Cancel]

Drawer Content

When you click on any criterion, the drawer opens with three tabs:

Explain Tab (Default)

Shows the assessment justification, key evidence retrieved, gaps identified, and the guidance reference used. Includes an "ask a follow-up" input for queries.

History Tab

Shows rating changes over time — across assessment runs and any challenge-driven re-assessments. Each version shows the rating, date, and what triggered the change.

Audit Tab

Shows all interactions on this criterion by all users, newest first. Every query, challenge, context addition, acceptance, and override is listed with timestamps, user names, and full text.

Opening and Closing

Interactive Mockup
See the full interactive mockup at programmeinsights.co.uk/mockup-hitl-side-drawer.html

4. Challenging a Finding

When you disagree with a finding, challenge it. The system re-assesses with your input and either changes the rating or explains why the original stands. Both outcomes are recorded.

Step-by-Step Walkthrough

  1. You are scanning the criteria browser. You see GR-3.1: Robust cost estimation methodology rated RED. Based on your experience, you believe this should be AMBER.
  2. Click the Challenge button on the GR-3.1 row. The side drawer slides open with the challenge form pre-focused.
  3. Select your proposed rating from the dropdown: AMBER. Write your justification in the text area (minimum 20 characters — forces substantive reasoning).
  4. Click Submit Challenge. The button changes to a spinner: "Re-assessing with your input... (15–30s)". The system runs a targeted re-assessment of that single criterion, injecting your challenge text as additional context.
  5. Result appears. One of two outcomes:
    • Rating changed: Banner shows "Rating updated: RED → AMBER. See explanation below." The new justification appears with your context highlighted.
    • Rating unchanged: Banner shows "Rating remains RED after re-assessment. See explanation below." The AI explains why the original rating holds despite your input.
Example Challenge
You write: "This should be AMBER. The QRA was done by the delivery team, not an independent estimator. IPA would flag this."

System responds (rating changed): "Rating updated to AMBER. While a QRA methodology is documented in FBC Section 4.2, the lack of independent verification represents a material weakness that would be flagged in an IPA Gate 3 review."
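
The form rules and the two result banners from the walkthrough can be sketched as follows. This is an illustrative Python sketch only: the function names are hypothetical, the set of valid ratings is assumed to be RED, AMBER, and GREEN, and ignoring surrounding whitespace when counting characters is an assumption.

```python
MIN_CHALLENGE_CHARS = 20  # minimum justification length stated in step 3
RATINGS = ("RED", "AMBER", "GREEN")  # assumed rating scale


def validate_challenge(proposed_rating: str, justification: str) -> None:
    """Raise ValueError if the challenge form would be rejected."""
    if proposed_rating not in RATINGS:
        raise ValueError(f"Unknown rating: {proposed_rating}")
    # Assumption: whitespace-only padding does not count towards the minimum.
    if len(justification.strip()) < MIN_CHALLENGE_CHARS:
        raise ValueError("Justification must be at least 20 characters")


def result_banner(original: str, new: str) -> str:
    """Return the banner text for the two documented outcomes."""
    if new != original:
        return f"Rating updated: {original} → {new}. See explanation below."
    return f"Rating remains {original} after re-assessment. See explanation below."


validate_challenge("AMBER", "The QRA was done by the delivery team, not an independent estimator.")
print(result_banner("RED", "AMBER"))
```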

When the System Disagrees

This is the most important UX moment. The system does not simply say "no". It:

  1. Acknowledges your point — never dismisses it
  2. Explains the specific evidence that supports the original rating
  3. Distinguishes scope — often your concern is valid but applies to a different criterion
  4. Offers alternatives — "Your concern about independent verification may be better captured under criterion GR-3.4. Would you like to review that criterion?"
By Design
The system can disagree with you. This is intentional: it maintains the integrity of the assessment. If you still disagree after a rejected challenge, you can use Force Override (see Section 5).

What Happens Under the Hood

The re-assessment does not re-run the full assessment pipeline (that would take 2–5 minutes). It runs a targeted 3-step process:

1. Re-Retrieve

If your challenge references specific documents or evidence, a targeted retrieval runs against your claims. Skipped if the challenge is purely interpretive.

2. Re-Score

The same scoring rubric is applied, but with your challenge text injected as additional context. The system considers your input fairly — it does not defer simply because you challenged.

3. Compare & Record

The new rating is compared to the original. If changed, the criterion updates and aggregates recalculate. Either way, the full re-assessment response is stored in the audit trail.
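
The three steps above can be sketched as a single function. This is a minimal sketch under stated assumptions, not the actual pipeline: `reassess`, `ReassessmentResult`, and the `retrieve` and `score` callables are hypothetical stand-ins for the real retrieval and LLM scoring components, and how the system decides that a challenge references evidence is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical signatures for the retrieval and scoring components.
Retriever = Callable[[str, str], List[str]]
Scorer = Callable[[str, List[str], str], Tuple[str, str]]


@dataclass
class ReassessmentResult:
    original_rating: str
    new_rating: str
    changed: bool
    explanation: str  # stored in the audit trail whether or not the rating changed


def reassess(criterion_id: str, original_rating: str, challenge_text: str,
             references_evidence: bool, retrieve: Retriever, score: Scorer) -> ReassessmentResult:
    # Step 1, Re-Retrieve: skipped when the challenge is purely interpretive.
    evidence = retrieve(criterion_id, challenge_text) if references_evidence else []
    # Step 2, Re-Score: same rubric, with the challenge text as additional context.
    new_rating, explanation = score(criterion_id, evidence, challenge_text)
    # Step 3, Compare & Record: the caller stores the result either way.
    return ReassessmentResult(original_rating, new_rating,
                              new_rating != original_rating, explanation)


# Usage with stub components standing in for retrieval and the LLM.
result = reassess(
    "GR-3.1", "RED", "QRA exists but not independently verified",
    references_evidence=True,
    retrieve=lambda cid, text: ["FBC Section 4.2"],
    score=lambda cid, evidence, text: ("AMBER", "Rating updated to AMBER."),
)
print(result.changed)
```

Passing the retriever and scorer in as parameters keeps the sketch runnable without any real model or document store.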

5. Force Override

When the AI disagrees with your challenge but your professional judgement takes precedence, Force Override is your final authority. It is the safety valve that ensures human expertise always wins.

When It Becomes Available

Force Override is not available as a first action. It only appears after a challenge has been rejected by the system. This sequence is deliberate:

AI Rating → Challenge Submitted → AI Rejects Challenge → Force Override Available

This means every override has context: you challenged, the system explained why it disagreed, and you still chose to override. The full chain is preserved.

How to Override

  1. After your challenge is rejected, the Force Override button appears below the system's response in the drawer.
  2. Click Force Override. A form appears with a dropdown for your chosen rating and a justification text area.
  3. Write your justification — minimum 50 characters. This must explain your professional reasoning. One-word justifications are not accepted.
  4. Click Confirm Override. The rating changes immediately to your specified value.

What Changes

Example Override Justification
"Overriding to AMBER. As a chartered cost consultant with 20 years' experience, the absence of independent QRA verification is a material weakness that IPA would flag at Gate 3. The AI's assessment of the methodology as adequate does not account for IPA's expectation of independence."

Rules and Limits

Rule | Detail
One override per criterion | Once overridden, the rating is final. No re-overriding.
Maximum 3 challenges before override | After 3 rejected challenges, Force Override is the only remaining path. The system prompts: "You can Force Override the rating with your professional justification."
Reviewer, Admin, or Project Owner role required | Viewers cannot override. Only users with the Reviewer, Admin, or Project Owner role can.
Minimum 50-character justification | The justification text appears in the audit trail and reports. It must be substantive.
The Audit Trail Records Everything
The override, the AI's reasoning for its original rating, the AI's reasoning for rejecting your challenge, and your justification for overriding — all appear in the audit trail. Nothing is hidden.
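
The rules in this section can be combined into one eligibility check. The sketch below is illustrative only: `check_force_override` and its parameters are hypothetical names, and counting characters after trimming whitespace is an assumption.

```python
MIN_JUSTIFICATION_CHARS = 50
OVERRIDE_ROLES = {"Reviewer", "Admin", "Project Owner"}  # Viewers are excluded


def check_force_override(role: str, rejected_challenges: int,
                         already_overridden: bool, justification: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a Force Override attempt."""
    if role not in OVERRIDE_ROLES:
        return False, "role cannot override"
    if already_overridden:
        return False, "criterion already overridden; the rating is final"
    if rejected_challenges < 1:
        return False, "Force Override only appears after a rejected challenge"
    # Assumption: leading and trailing whitespace does not count.
    if len(justification.strip()) < MIN_JUSTIFICATION_CHARS:
        return False, "justification must be at least 50 characters"
    return True, "ok"


print(check_force_override("Viewer", 1, False, "x" * 60))
```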

6. The Audit Trail

The audit trail is not a compliance checkbox. For consultancies delivering IPA gateway reviews, the audit trail IS the professional work product. It proves systematic review methodology was applied.

What Gets Recorded

Every interaction records the following data:

Field | Description
Timestamp | When the interaction occurred (date and time to the second)
User | Who performed the interaction (name and role)
Action Type | Query, Challenge, Accept, Force Override, Add Context, or Dismiss
Comment | What the reviewer wrote: the question, justification, or context
Original Rating | The rating before the interaction
New Rating | The rating after (null if unchanged)
AI Response | The system's explanation, re-assessment result, or acknowledgement
Duration | How long the system took to respond (in milliseconds)
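
The fields above map naturally onto a record type. This is a minimal Python sketch, assuming one record per interaction; the class and attribute names are hypothetical, and the time of day and duration in the example are made-up illustration values.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class AuditRecord:
    timestamp: datetime          # date and time to the second
    user_name: str
    user_role: str
    action_type: str             # Query, Challenge, Accept, Force Override, Add Context, Dismiss
    comment: str
    original_rating: str
    new_rating: Optional[str]    # None when the rating is unchanged
    ai_response: Optional[str]
    duration_ms: int

    def rating_changed(self) -> bool:
        return self.new_rating is not None and self.new_rating != self.original_rating


example = AuditRecord(
    timestamp=datetime(2026, 4, 15, 10, 30, 0),  # hypothetical time of day
    user_name="James Reece",
    user_role="Head of PMO",
    action_type="Challenge",
    comment="QRA exists but not independently verified",
    original_rating="RED",
    new_rating="AMBER",
    ai_response="Rating updated to AMBER.",
    duration_ms=18000,  # hypothetical duration
)
print(example.rating_changed())
```

Making the record immutable (`frozen=True`) reflects the idea that audit entries are appended, never edited.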

The Audit Timeline

In the Audit tab of the side drawer, interactions appear in reverse chronological order. Here is an example for criterion GR-3.1:

GR-3.1: Robust cost estimation methodology
Rating: RED → AMBER (challenged)
[Accept] Sarah Chen
"Confirmed after reviewing supporting documents"
[Challenge → AMBER] James Reece
"QRA exists but not independently verified"
AI: "Rating updated to AMBER. While a QRA methodology is documented, the lack of independent verification represents a material weakness..."
[Query] James Reece
"Why RED when there is a QRA in the FBC?"
AI: "The RED rating reflects two factors: (1) the QRA uses 2024 base costs without indexation..."
[System] Assessment v3 completed
Original rating: RED | Confidence: 0.85

Why It Matters

The interaction audit trail provides something consultancies have never had before: perfect traceability of the review process. When a client or IPA assessor asks "how did you arrive at this rating?", the answer is not a verbal explanation — it is a timestamped record of every query, challenge, resolution, and sign-off.

Interactive Mockup
See the full audit timeline mockup at programmeinsights.co.uk/mockup-hitl-audit-timeline.html

7. Review Progress and Analytics

Track your review progress at a glance. Visual indicators on every criterion show its interaction state, and a progress bar per category tells you how much of the assessment has been reviewed.

Progress Bar

A review progress bar sits at the top of each category in the criteria browser:

Financial Case 12 of 24 criteria reviewed (50%)

The progress bar gives you a sense of completion and helps you track where you left off between sessions.
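
The label on the progress bar is simple arithmetic. A minimal sketch follows, assuming the percentage is rounded to a whole number; the function name `progress_label` is hypothetical, and the rounding rule is an assumption.

```python
def progress_label(reviewed: int, total: int) -> str:
    """Build the progress text shown for a category."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= reviewed <= total:
        raise ValueError("reviewed must be between 0 and total")
    percent = round(100 * reviewed / total)  # assumed: whole-number percentage
    return f"{reviewed} of {total} criteria reviewed ({percent}%)"


print(progress_label(12, 24))
```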

Visual Indicators Per Criterion

Each criterion row in the browser shows its current interaction state with an icon:

Icon | State | Meaning
(none) | Unreviewed | No human interaction yet
 | Accepted | Reviewer confirmed the finding
💬 | Has comments | Query, context, or challenge exists on this criterion
 | Challenged & changed | Rating was changed via a successful challenge
🛡 | Force overridden | Rating set by human professional judgement, overriding AI
ABC | Dismissed | Not applicable, with documented reason

Keyboard Shortcuts

When a criterion is focused in the browser, you can work through your review without touching the mouse:

Key | Action
Q | Open Query form
C | Open Challenge form
A | Open Add Context form
Enter | Accept (toggle reviewed status)
D | Dismiss with reason
O | Force Override (when available)
↑ ↓ | Navigate between criteria
Esc | Close the interaction drawer
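
The shortcut table can be expressed as a key-to-action map with one availability check for Force Override. This is a sketch only: the action identifiers and `resolve_shortcut` are hypothetical, and the key names ("Enter", "Escape", "ArrowUp", "ArrowDown") are assumed to follow browser key naming.

```python
# Hypothetical mapping from key name to action identifier.
SHORTCUTS = {
    "q": "query",
    "c": "challenge",
    "a": "add_context",
    "Enter": "accept",
    "d": "dismiss",
    "o": "force_override",
    "ArrowUp": "previous_criterion",
    "ArrowDown": "next_criterion",
    "Escape": "close_drawer",
}


def resolve_shortcut(key: str, override_available: bool):
    """Return the action for a key press, or None if the key does nothing."""
    # Letter keys are matched case-insensitively; named keys are matched as-is.
    lookup = key.lower() if len(key) == 1 else key
    action = SHORTCUTS.get(lookup)
    # O only works once a challenge has been rejected.
    if action == "force_override" and not override_available:
        return None
    return action


print(resolve_shortcut("Q", override_available=False))
```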

Batch Review

For categories where most criteria are straightforward, batch review saves time:

  1. Toggle Review Mode at the top of the criteria browser. Checkboxes appear on each criterion row.
  2. Select multiple criteria by checking boxes, or use Select All in Category.
  3. Click Accept All Selected to stamp all as reviewed in a single action, with bulk audit entries created.

You can also use sequential review mode: arrow keys or the "Next" button in the drawer advances to the next criterion without closing it. Work through an entire category without mouse interaction: review, accept (Enter), next (arrow), review, challenge (C), type, submit, next.
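
Batch acceptance creates one audit entry per selected criterion. A minimal sketch of that behaviour, assuming a single shared timestamp for the whole batch; `accept_all_selected` and the dictionary keys are hypothetical.

```python
from datetime import datetime, timezone


def accept_all_selected(selected_ids, user, now=None):
    """Return one Accept audit entry per selected criterion."""
    # Assumption: every entry in the batch shares the same timestamp.
    now = now or datetime.now(timezone.utc)
    return [
        {"criterion_id": cid, "action_type": "Accept", "user": user, "timestamp": now}
        for cid in selected_ids
    ]


entries = accept_all_selected(["GR-3.2", "GR-3.3"], user="Sarah Chen")
print(len(entries))
```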

Interactive Mockup
See the review progress mockup at programmeinsights.co.uk/mockup-hitl-review-progress.html

8. How It Flows Into Reports

Your review interactions do not disappear into a database. They appear in three distinct places in your PDF reports, turning AI assessment into a documented professional review.

1. Inline Reviewer Notes on Findings

Each finding that has been reviewed shows a Reviewer Notes section beneath the assessment justification in the Full Assessment report:

GR-3.1: Robust cost estimation methodology
Rating: AMBER (revised from RED)
Assessment: The cost estimation methodology documented in FBC Section 4.2 demonstrates a structured QRA-based approach...
Reviewer Notes
• Challenged by J. Reece (15/04/2026): "QRA exists but not independently verified." Rating revised RED → AMBER.
• Accepted by S. Chen (15/04/2026): Confirmed after review.

2. Review Summary Section

A new section appears between the Executive Summary and Detailed Findings in all report types:

Review Summary
Assessment reviewed by | James Reece (Head of PMO), Sarah Chen (Cost Analyst)
Review period | 14–15 April 2026
Criteria reviewed | 24 of 24 (100%)
Ratings challenged | 3 (2 changed, 1 upheld)
Criteria dismissed | 1 (GR-4.7: PFI criterion, not applicable)
Additional context | 5 annotations
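
The challenge, dismissal, and context counts in the summary can be derived from the interaction records. The sketch below assumes each interaction is a dictionary with `criterion_id`, `action_type`, and `new_rating` keys, where `new_rating` is `None` when the rating was unchanged; these names and the `summarise` function are hypothetical.

```python
def summarise(interactions):
    """Tally challenges, dismissals, and context annotations for the Review Summary."""
    challenges = [i for i in interactions if i["action_type"] == "Challenge"]
    changed = sum(1 for c in challenges if c["new_rating"] is not None)
    dismissed = {i["criterion_id"] for i in interactions if i["action_type"] == "Dismiss"}
    return {
        "ratings_challenged": len(challenges),
        "changed": changed,
        "upheld": len(challenges) - changed,
        "criteria_dismissed": len(dismissed),
        "additional_context": sum(1 for i in interactions if i["action_type"] == "Add Context"),
    }


# Made-up interactions for illustration only.
sample = [
    {"criterion_id": "GR-3.1", "action_type": "Challenge", "new_rating": "AMBER"},
    {"criterion_id": "GR-3.4", "action_type": "Challenge", "new_rating": None},
    {"criterion_id": "GR-4.7", "action_type": "Dismiss", "new_rating": None},
    {"criterion_id": "GR-3.2", "action_type": "Add Context", "new_rating": None},
]
print(summarise(sample))
```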

3. Full Interaction Log (Premium Reports)

For assurance-grade reports, an appendix lists every interaction with full text. This is the work product that consultancies attach to their deliverables — the equivalent of a traditional review workbook, but with perfect traceability.

The appendix includes:

The Report IS the Work Product
A completed PI assessment with full reviewer interactions is equivalent to a traditional review workbook — structured assessment, evidence citations, professional sign-off, challenge resolution, and override justification — but with perfect traceability and produced in days rather than weeks.

9. Getting Started

A practical checklist to work through your first review. Start with the category that matters most and work outward.

  1. Open your assessment results. Navigate to the assessment you want to review from your dashboard. The criteria browser shows all findings grouped by category.
  2. Start with the category that matters most. If you are preparing for an IPA gateway, start with the category most likely to draw scrutiny. The progress bar at the top of each category shows 0% reviewed — this is your starting point.
  3. Work through criteria: accept the clear ones, query the uncertain ones. For findings where the AI's assessment matches your view, hit Enter to accept and move on. For findings where you want to understand the reasoning, press Q to query.
  4. Challenge findings you disagree with — provide evidence. When you see a rating that does not match your professional judgement, press C to challenge. Select your proposed rating and write your justification. The more specific your reasoning, the better the re-assessment.
  5. Use Force Override only when your professional judgement demands it. If the system rejects your challenge and you still disagree, override it. Write a clear justification — this appears in the audit trail and reports.
  6. Check the progress bar — aim for 100% reviewed. A fully reviewed assessment produces the strongest reports. The progress bar tracks your completion per category.
  7. Generate your report — the audit trail appears automatically. When you generate a PDF report, all reviewer notes, the review summary, and the interaction log (premium reports) are included without any extra steps.
Typical Review Session
A typical assessment has 24–60 criteria per category. A thorough review usually involves accepting 15–20 criteria, querying 5–10, challenging 3–5, and adding context to a handful. Most reviewers complete a category in 30–45 minutes.

Quick Reference: Action Summary

Most Common Workflow

  1. Scan criteria in the browser
  2. Accept obvious findings (Enter)
  3. Query anything unclear (Q)
  4. Challenge disagreements (C)
  5. Override if needed (O)
  6. Generate report

Key Shortcuts

Enter: Accept
Q: Query
C: Challenge
A: Add Context
D: Dismiss
O: Force Override
Esc: Close drawer