Human-in-the-Loop Review

1. Why Human-in-the-Loop Review

AI assessment without traceable human oversight is just a chatbot with a dashboard. Consultancies need to prove their review methodology. This feature makes the human review traceable, defensible, and auditable.

The Problem

When a consultancy delivers an IPA gateway review report, it needs to demonstrate four things:

- Every finding was reviewed by a qualified professional
- Challenges were considered and either incorporated or explained
- Professional judgement was applied: the AI did not simply produce a report that was rubber-stamped
- The review process was systematic, not ad hoc

Without Programme Insights, a consultancy's review workbook is an Excel spreadsheet with criteria, RAG ratings, and short comments. It offers no traceability of how ratings were determined and no evidence of a systematic review methodology, and it is typically produced by 2–3 consultants over 2–3 weeks.

With human-in-the-loop review, the same workbook becomes a structured assessment with full evidence citations, a complete audit trail showing who reviewed what, when, and why, and traceable challenges with resolutions. It is produced in 2–3 days by 1–2 reviewers using AI as the first-pass assessor.

The Escalation Model

Programme Insights follows an escalation-based model: the AI handles the assessment autonomously, and human reviewers focus on the exceptions — the findings that need professional judgement.

The Stanford Enterprise AI Playbook (March 2026), studying 51 enterprise deployments, found that escalation-based models deliver significantly higher productivity gains than approval-only models where humans must sign off every output.

Key Statistic

71% median productivity gain when AI handles the baseline and humans review exceptions — versus 30% for approval-only models where every output requires human sign-off. Source: Stanford Enterprise AI Playbook, March 2026

In practice, the AI handles the 80–85% of criteria where the evidence is clear, and the human reviewer focuses on the remaining 15–20% that need professional judgement. The audit trail proves the methodology was applied.

2. The Six Interaction Types

Every finding in your assessment results is interactive. You can query it, challenge it, add context, accept it, dismiss it, or override it. Each action triggers a different system response and is recorded in the audit trail.

Action	What It Does	Changes Rating?	Cost
Accept	Stamps the finding as reviewed and agreed. The professional sign-off. Most frequent action, lowest friction — a single checkmark toggle.	No	Zero — database write only
Query	Ask the system to explain its reasoning. The response is grounded in retrieved evidence, the scoring rubric, and calibration rules. Read-only — no rating impact.	Never	Single LLM call, ~3–8 seconds
Challenge	Provide reasoning or additional context that should change the rating. Triggers a targeted re-assessment of that single criterion. You propose a new rating and justify it.	Potentially	Re-retrieval + LLM scoring, ~15–30 seconds
Force Override	Override the AI's rating based on professional judgement. Only available after a rejected challenge. Requires mandatory justification (minimum 50 characters). The resulting badge displays "Human Override".	Always	Zero — database write only
Add Context	Provide information not in the uploaded documents — verbal updates, recent decisions, known issues. Appears as a reviewer annotation in reports. Does not trigger re-assessment.	No	Zero — database write only
Dismiss	Mark a criterion as not applicable with a reason. Dismissed criteria are excluded from aggregate scoring, greyed out in the interface, and clearly marked in reports.	Removes from aggregation	Zero — deterministic recalculation
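The table above can also be read as a small lookup structure. The Python sketch below is illustrative only: the `Action` and `ActionProfile` names and the three boolean fields are assumptions made for this example, not the product's actual data model. The values themselves are transcribed from the table.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ACCEPT = "Accept"
    QUERY = "Query"
    CHALLENGE = "Challenge"
    FORCE_OVERRIDE = "Force Override"
    ADD_CONTEXT = "Add Context"
    DISMISS = "Dismiss"


@dataclass(frozen=True)
class ActionProfile:
    # Field names are hypothetical; they paraphrase the table columns.
    may_change_rating: bool   # can the criterion's own rating differ afterwards?
    calls_llm: bool           # does the action cost an LLM call?
    affects_aggregates: bool  # can category or overall ratings move?


PROFILES = {
    Action.ACCEPT: ActionProfile(False, False, False),
    Action.QUERY: ActionProfile(False, True, False),
    Action.CHALLENGE: ActionProfile(True, True, True),
    Action.FORCE_OVERRIDE: ActionProfile(True, False, True),
    Action.ADD_CONTEXT: ActionProfile(False, False, False),
    Action.DISMISS: ActionProfile(False, False, True),
}


def zero_cost_actions() -> list[str]:
    """Names of the actions that involve no LLM call."""
    return [a.value for a in Action if not PROFILES[a].calls_llm]
```

Only Query and Challenge involve an LLM call; the other four actions are database writes or deterministic recalculations.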

The Action Strip

Each criterion row in the assessment view has a persistent action strip on the right side:

Query | Challenge | Add Context | More ↓ | ✓ Accept

- Query, Challenge, and Add Context are always visible as buttons
- The More dropdown contains Dismiss, plus Force Override once a challenge has been rejected
- Accept is a standalone checkmark toggle: it is the most frequent action, so it needs the lowest friction

Example: Query in Action

You ask: "Why did you rate this GREEN when the cost estimate is from 2024?"

System responds: "This criterion assesses whether a cost estimation methodology exists, not whether estimates are current. The FBC Section 4.2 demonstrates a structured QRA-based methodology with P50/P80 ranges. The age of the estimate would be captured under criterion GR-3.4 (Cost estimate currency)."

3. The Review Interface

The review interface uses a side drawer pattern — the criteria list stays visible on the left while the interaction content opens on the right. You never lose context of what you are reviewing.

Why a Side Drawer

- Not a modal: you need to see the finding you are interacting with, and a modal obscures that context.
- Not inline expansion: interaction content (AI explanations, re-assessment results, audit history) can be lengthy, and expanding inline would push other findings out of view.
- Not a floating chat: interactions are anchored to specific findings, not free-form. This is structured professional review, not conversation.

Layout Overview

Criteria Browser (60%)

[Category: Financial Case]

RED GR-3.1 Robust cost estimation methodology →

Evidence: FBC S4.2, Cost Review Memo • Gap: No QRA or P50 analysis

AMB GR-3.2 Cost contingency and risk allowance

GRN GR-3.3 Funding approval and affordability

RED GR-3.4 Cost estimate currency and indexation

Interaction Drawer (40%)

GR-3.1: Cost estimation methodology

RED v3

Explain | History | Audit

Assessment justification, key evidence, gaps, and guidance reference appear here...

[User input area]

Submit Cancel

Drawer Content

When you click on any criterion, the drawer opens with three tabs:

Explain Tab (Default)

Shows the assessment justification, key evidence retrieved, gaps identified, and the guidance reference used. Includes an "ask a follow-up" input for queries.

History Tab

Shows rating changes over time — across assessment runs and any challenge-driven re-assessments. Each version shows the rating, date, and what triggered the change.

Audit Tab

Shows all interactions on this criterion by all users, newest first. Every query, challenge, context addition, acceptance, and override is listed with timestamps, user names, and full text.

Opening and Closing

- Click any criterion row to open the drawer with that finding's detail
- Click a different criterion to switch the drawer content without closing it
- Press Esc or click outside the drawer to close it
- The drawer slides in from the right (300ms ease-out transition)

Interactive Mockup

See the full interactive mockup at programmeinsights.co.uk/mockup-hitl-side-drawer.html

4. Challenging a Finding

When you disagree with a finding, challenge it. The system re-assesses with your input and either changes the rating or explains why the original stands. Both outcomes are recorded.

Step-by-Step Walkthrough

1. You are scanning the criteria browser. You see GR-3.1: Robust cost estimation methodology rated RED. Based on your experience, you believe this should be AMBER.
2. Click the Challenge button on the GR-3.1 row. The side drawer slides open with the challenge form pre-focused.
3. Select your proposed rating from the dropdown: AMBER. Write your justification in the text area (minimum 20 characters, which forces substantive reasoning).
4. Click Submit Challenge. The button changes to a spinner: "Re-assessing with your input... (15–30s)". The system runs a targeted re-assessment of that single criterion, injecting your challenge text as additional context.
5. The result appears, with one of two outcomes:
   - Rating changed: Banner shows "Rating updated: RED → AMBER. See explanation below." The new justification appears with your context highlighted.
   - Rating unchanged: Banner shows "Rating remains RED after re-assessment. See explanation below." The AI explains why the original rating holds despite your input.
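The form rules and the two outcome banners in the walkthrough can be sketched in a few lines of Python. This is a minimal illustration, not the product's code: the function names are hypothetical, and trimming whitespace before counting characters is an assumption. The 20-character minimum, the rating names, and the banner wording come from the walkthrough.

```python
RATINGS = ("RED", "AMBER", "GREEN")
MIN_CHALLENGE_CHARS = 20  # minimum stated in the walkthrough


def validate_challenge(proposed_rating: str, justification: str) -> list[str]:
    """Return a list of validation errors; an empty list means the form can be submitted."""
    errors = []
    if proposed_rating not in RATINGS:
        errors.append(f"Unknown rating: {proposed_rating}")
    # Assumption: surrounding whitespace does not count towards the minimum.
    if len(justification.strip()) < MIN_CHALLENGE_CHARS:
        errors.append(f"Justification must be at least {MIN_CHALLENGE_CHARS} characters")
    return errors


def outcome_banner(original: str, reassessed: str) -> str:
    """Pick the banner text for a completed re-assessment."""
    if reassessed != original:
        return f"Rating updated: {original} → {reassessed}. See explanation below."
    return f"Rating remains {original} after re-assessment. See explanation below."
```

For example, `outcome_banner("RED", "RED")` produces the "Rating remains RED" banner, which is the case that later makes Force Override available.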

Example Challenge

You write: "This should be AMBER. The QRA was done by the delivery team, not an independent estimator. IPA would flag this."

System responds (rating changed): "Rating updated to AMBER. While a QRA methodology is documented in FBC Section 4.2, the lack of independent verification represents a material weakness that would be flagged in an IPA Gate 3 review."

When the System Disagrees

This is the most important UX moment. The system does not simply say "no". It:

- Acknowledges your point and never dismisses it
- Explains the specific evidence that supports the original rating
- Distinguishes scope: often your concern is valid but applies to a different criterion
- Offers alternatives: "Your concern about independent verification may be better captured under criterion GR-3.4. Would you like to review that criterion?"

By Design

The system can disagree with you. This is intentional — it maintains the integrity of the assessment. If you still disagree after a rejected challenge, you have Force Override (see Section 5).

What Happens Under the Hood

The re-assessment does not re-run the full assessment pipeline (that would take 2–5 minutes). It runs a targeted 3-step process:

1. Re-Retrieve

If your challenge references specific documents or evidence, a targeted retrieval runs against your claims. Skipped if the challenge is purely interpretive.

2. Re-Score

The same scoring rubric is applied, but with your challenge text injected as additional context. The system considers your input fairly — it does not defer simply because you challenged.

3. Compare & Record

The new rating is compared to the original. If changed, the criterion updates and aggregates recalculate. Either way, the full re-assessment response is stored in the audit trail.
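The three steps above can be sketched as a single function. This is a rough outline under stated assumptions, not the actual pipeline: `Criterion`, `Reassessment`, and `reassess` are hypothetical names, the retrieval and scoring steps are passed in as plain callables, and how the system decides that a challenge "references evidence" is not described here, so it appears as a boolean flag.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Criterion:
    criterion_id: str
    rating: str
    evidence: list[str] = field(default_factory=list)


@dataclass
class Reassessment:
    original_rating: str
    new_rating: str
    changed: bool
    explanation: str


def reassess(
    criterion: Criterion,
    challenge_text: str,
    references_evidence: bool,
    retrieve: Callable[[str], list[str]],
    score: Callable[[Criterion, list[str], str], tuple[str, str]],
    audit_log: list[dict],
) -> Reassessment:
    # 1. Re-retrieve: skipped when the challenge is purely interpretive.
    evidence = list(criterion.evidence)
    if references_evidence:
        evidence += retrieve(challenge_text)

    # 2. Re-score: same rubric, with the challenge text as additional context.
    new_rating, explanation = score(criterion, evidence, challenge_text)

    # 3. Compare and record: the response is stored whether or not the rating moved.
    result = Reassessment(criterion.rating, new_rating, new_rating != criterion.rating, explanation)
    audit_log.append({
        "criterion": criterion.criterion_id,
        "action": "Challenge",
        "original_rating": result.original_rating,
        "new_rating": new_rating if result.changed else None,
        "ai_response": explanation,
    })
    if result.changed:
        criterion.rating = new_rating  # aggregates would be recalculated here
    return result
```

Passing the retriever and scorer in as arguments keeps the sketch independent of any particular LLM or search backend, and makes the "either way, record it" behaviour easy to test with stubs.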

5. Force Override

When the AI disagrees with your challenge but your professional judgement takes precedence, Force Override is your final authority. It is the safety valve that ensures human expertise always wins.

When It Becomes Available

Force Override is not available as a first action. It only appears after a challenge has been rejected by the system. This sequence is deliberate:

AI Rating → Challenge Submitted → AI Rejects Challenge → Force Override Available

This means every override has context: you challenged, the system explained why it disagreed, and you still chose to override. The full chain is preserved.

How to Override

1. After your challenge is rejected, the Force Override button appears below the system's response in the drawer.
2. Click Force Override. A form appears with a dropdown for your chosen rating and a justification text area.
3. Write your justification (minimum 50 characters). This must explain your professional reasoning. One-word justifications are not accepted.
4. Click Confirm Override. The rating changes immediately to your specified value.

What Changes

- The criterion's RAG badge updates to your chosen rating
- A Human Override badge appears on the finding with a distinct visual marker
- Category and overall ratings recalculate to reflect the change
- The full chain is recorded in the audit trail: AI rating → your challenge → AI rejection reasoning → your override with justification

Example Override Justification

"Overriding to AMBER. As a chartered cost consultant with 20 years' experience, the absence of independent QRA verification is a material weakness that IPA would flag at Gate 3. The AI's assessment of the methodology as adequate does not account for IPA's expectation of independence."

Rules and Limits

Rule	Detail
One override per criterion	Once overridden, the rating is final. No re-overriding.
Maximum 3 challenges before override	After 3 rejected challenges, Force Override is the only remaining path. The system prompts: "You can Force Override the rating with your professional justification."
Reviewer, Admin, or Project Owner role required	Viewers cannot override. Only users with the Reviewer, Admin, or Project Owner role can.
Minimum 50-character justification	The justification text appears in the audit trail and reports. It must be substantive.
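The rules in this table amount to a handful of checks that must all pass before an override is accepted. The sketch below is one way to express them, assuming a per-criterion state object; `ReviewState`, `override_errors`, and the error wording are hypothetical, and the role set uses the three roles named in the table.

```python
from dataclasses import dataclass

MIN_OVERRIDE_CHARS = 50
MAX_CHALLENGES = 3
OVERRIDE_ROLES = {"Reviewer", "Admin", "Project Owner"}


@dataclass
class ReviewState:
    # Hypothetical per-criterion counters; not the product's schema.
    rejected_challenges: int = 0
    overridden: bool = False


def challenges_remaining(state: ReviewState) -> int:
    """How many more challenges can be submitted before override is the only path."""
    return max(0, MAX_CHALLENGES - state.rejected_challenges)


def override_errors(state: ReviewState, role: str, justification: str) -> list[str]:
    """Return the reasons an override is not allowed; an empty list means it may proceed."""
    errors = []
    if role not in OVERRIDE_ROLES:
        errors.append("Role cannot override")
    if state.rejected_challenges == 0:
        errors.append("Force Override requires a rejected challenge first")
    if state.overridden:
        errors.append("Criterion already overridden; the rating is final")
    if len(justification.strip()) < MIN_OVERRIDE_CHARS:
        errors.append(f"Justification must be at least {MIN_OVERRIDE_CHARS} characters")
    return errors
```

Returning every failed rule, not just the first, lets a form show the reviewer all the problems at once.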

The Audit Trail Records Everything

The override, the AI's reasoning for its original rating, the AI's reasoning for rejecting your challenge, and your justification for overriding — all appear in the audit trail. Nothing is hidden.

6. The Audit Trail

The audit trail is not a compliance checkbox. For consultancies delivering IPA gateway reviews, the audit trail IS the professional work product. It proves systematic review methodology was applied.

What Gets Recorded

Every interaction records the following data:

Field	Description
Timestamp	When the interaction occurred (date and time to the second)
User	Who performed the interaction (name and role)
Action Type	Query, Challenge, Accept, Force Override, Add Context, or Dismiss
Comment	What the reviewer wrote — the question, justification, or context
Original Rating	The rating before the interaction
New Rating	The rating after (null if unchanged)
AI Response	The system's explanation, re-assessment result, or acknowledgement
Duration	How long the system took to respond (in milliseconds)
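As a rough illustration, the fields above map naturally onto a single record type. The following Python sketch assumes one record per interaction; the class name, attribute names, and types are assumptions for this example, and the real storage schema may differ.

```python
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class AuditEntry:
    timestamp: datetime        # date and time to the second
    user_name: str
    user_role: str
    action_type: str           # Query, Challenge, Accept, Force Override, Add Context, or Dismiss
    comment: str
    original_rating: str
    new_rating: Optional[str]  # None when the rating is unchanged
    ai_response: str
    duration_ms: int

    def rating_changed(self) -> bool:
        return self.new_rating is not None

    def to_record(self) -> dict:
        """Flatten to a plain dict with an ISO 8601 timestamp (seconds precision)."""
        record = asdict(self)
        record["timestamp"] = self.timestamp.isoformat(timespec="seconds")
        return record


# Example values: the role and duration are placeholders, not real measurements.
entry = AuditEntry(
    timestamp=datetime(2026, 4, 15, 11, 45, 0),
    user_name="James Reece",
    user_role="Reviewer",
    action_type="Challenge",
    comment="QRA exists but not independently verified",
    original_rating="RED",
    new_rating="AMBER",
    ai_response="Rating updated to AMBER.",
    duration_ms=21000,
)
```

Making the record immutable (`frozen=True`) reflects the idea that audit entries are appended and never edited.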

The Audit Timeline

In the Audit tab of the side drawer, interactions appear in reverse chronological order. Here is an example for criterion GR-3.1:

GR-3.1: Robust cost estimation methodology

Rating: RED → AMBER (challenged)

[Accept] Sarah Chen

15 Apr 2026, 14:32

"Confirmed after reviewing supporting documents"

[Challenge → AMBER] James Reece

15 Apr 2026, 11:45

"QRA exists but not independently verified"

AI: "Rating updated to AMBER. While a QRA methodology is documented, the lack of independent verification represents a material weakness..."

[Query] James Reece

15 Apr 2026, 09:15

"Why RED when there is a QRA in the FBC?"

AI: "The RED rating reflects two factors: (1) the QRA uses 2024 base costs without indexation..."

[System] Assessment v3 completed

14 Apr 2026, 22:00

Original rating: RED | Confidence: 0.85
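The newest-first ordering of this timeline is a plain sort on the timestamp. A minimal sketch, assuming the display format shown above and an English locale for the month abbreviation (the `newest_first` helper and the label strings are hypothetical):

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%d %b %Y, %H:%M"  # e.g. "15 Apr 2026, 14:32"; %b is locale-dependent


def newest_first(entries: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Sort (label, timestamp) pairs into reverse chronological order."""
    return sorted(
        entries,
        key=lambda e: datetime.strptime(e[1], TIMESTAMP_FORMAT),
        reverse=True,
    )


timeline = newest_first([
    ("System: Assessment v3 completed", "14 Apr 2026, 22:00"),
    ("Query: James Reece", "15 Apr 2026, 09:15"),
    ("Accept: Sarah Chen", "15 Apr 2026, 14:32"),
    ("Challenge: James Reece", "15 Apr 2026, 11:45"),
])
```

In a real system the sort would run on stored timestamps, not on display strings; parsing is used here only to keep the example self-contained.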

Why It Matters

The interaction audit trail provides something consultancies have never had before: perfect traceability of the review process. When a client or IPA assessor asks "how did you arrive at this rating?", the answer is not a verbal explanation — it is a timestamped record of every query, challenge, resolution, and sign-off.

Interactive Mockup

See the full audit timeline mockup at programmeinsights.co.uk/mockup-hitl-audit-timeline.html

7. Review Progress and Analytics

Track your review progress at a glance. Visual indicators on every criterion show its interaction state, and a progress bar per category tells you how much of the assessment has been reviewed.

Progress Bar

A review progress bar sits at the top of each category in the criteria browser:

Financial Case 12 of 24 criteria reviewed (50%)

The progress bar gives you a sense of completion and helps you track where you left off between sessions.
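The label on the progress bar is simple arithmetic. A small sketch, assuming the percentage is rounded down to a whole number (the rounding rule and the `progress_label` name are assumptions):

```python
def progress_label(category: str, reviewed: int, total: int) -> str:
    """Format the per-category progress text shown above the criteria list."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= reviewed <= total:
        raise ValueError("reviewed must be between 0 and total")
    percent = reviewed * 100 // total  # assumption: round down
    return f"{category} {reviewed} of {total} criteria reviewed ({percent}%)"
```

Calling `progress_label("Financial Case", 12, 24)` reproduces the example text above.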

Visual Indicators Per Criterion

Each criterion row in the browser shows its current interaction state with an icon:

Icon	State	Meaning
(none)	Unreviewed	No human interaction yet
✓	Accepted	Reviewer confirmed the finding
💬	Has comments	Query, context, or challenge exists on this criterion
⇔	Challenged & changed	Rating was changed via a successful challenge
🛡	Force overridden	Rating set by human professional judgement, overriding AI
ABC	Dismissed	Not applicable, with documented reason

Keyboard Shortcuts

When a criterion is focused in the browser, you can work through your review without touching the mouse:

Key	Action
`Q`	Open Query form
`C`	Open Challenge form
`A`	Open Add Context form
`Enter`	Accept (toggle reviewed status)
`D`	Dismiss with reason
`O`	Force Override (when available)
`↑ ↓`	Navigate between criteria
`Esc`	Close the interaction drawer
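One way to picture the shortcut table is as a key-to-action lookup with a single guard for Force Override. The sketch below is illustrative: the action identifiers are hypothetical, and the key names for Enter, the arrows, and Esc follow the browser `KeyboardEvent.key` values as an assumption about how key presses are reported.

```python
from typing import Optional

SHORTCUTS = {
    "q": "open_query_form",
    "c": "open_challenge_form",
    "a": "open_add_context_form",
    "Enter": "toggle_accept",
    "d": "dismiss_with_reason",
    "o": "force_override",
    "ArrowUp": "previous_criterion",
    "ArrowDown": "next_criterion",
    "Escape": "close_drawer",
}


def action_for_key(key: str, override_available: bool = False) -> Optional[str]:
    """Resolve a key press to an action name, or None if the key does nothing."""
    # Single letters are matched case-insensitively; named keys are matched exactly.
    normalised = key.lower() if len(key) == 1 else key
    action = SHORTCUTS.get(normalised)
    if action == "force_override" and not override_available:
        return None  # O only works after a rejected challenge
    return action
```

The guard mirrors the "when available" note on the `O` shortcut: the key is ignored until a challenge on the focused criterion has been rejected.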

Batch Review

For categories where most criteria are straightforward, batch review saves time:

1. Toggle Review Mode at the top of the criteria browser. Checkboxes appear on each criterion row.
2. Select multiple criteria by checking boxes, or use Select All in Category.
3. Click Accept All Selected to stamp all as reviewed in a single action, with bulk audit entries created.

You can also use sequential review mode: arrow keys or the "Next" button in the drawer advances to the next criterion without closing it. Work through an entire category without mouse interaction: review, accept (Enter), next (arrow), review, challenge (C), type, submit, next.
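For the batch path, a bulk accept can be thought of as fanning one click out into one audit entry per selected criterion. This is a sketch under that assumption; the function name, the dictionary keys, and the `bulk` flag are hypothetical.

```python
from datetime import datetime


def accept_all_selected(selected_ids: list[str], user: str, when: datetime) -> list[dict]:
    """Create one Accept audit entry per selected criterion, all sharing a timestamp."""
    stamp = when.isoformat(timespec="seconds")
    return [
        {"criterion": cid, "action": "Accept", "user": user, "timestamp": stamp, "bulk": True}
        for cid in selected_ids
    ]
```

Writing a separate entry for each criterion keeps the per-criterion Audit tab complete even when the acceptance happened in bulk.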

Interactive Mockup

See the review progress mockup at programmeinsights.co.uk/mockup-hitl-review-progress.html

8. How It Flows Into Reports

Your review interactions do not disappear into a database. They appear in three distinct places in your PDF reports, turning AI assessment into a documented professional review.

1. Inline Reviewer Notes on Findings

Each finding that has been reviewed shows a Reviewer Notes section beneath the assessment justification in the Full Assessment report:

GR-3.1: Robust cost estimation methodology

Rating: AMBER (revised from RED)

Assessment: The cost estimation methodology documented in FBC Section 4.2 demonstrates a structured QRA-based approach...

Reviewer Notes

• Challenged by J. Reece (15/04/2026): "QRA exists but not independently verified." Rating revised RED → AMBER.

• Accepted by S. Chen (15/04/2026): Confirmed after review.
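The challenge note in this example follows a fixed pattern: abbreviated name, date as DD/MM/YYYY, quoted comment, then the rating change. A minimal formatting sketch, assuming that pattern holds for every challenge note (both helper names are hypothetical):

```python
from datetime import date
from typing import Optional


def short_name(full_name: str) -> str:
    """'James Reece' -> 'J. Reece'; single-word names are returned unchanged."""
    first, *rest = full_name.split()
    return f"{first[0]}. {' '.join(rest)}" if rest else first


def challenge_note(reviewer: str, on: date, comment: str,
                   old_rating: str, new_rating: Optional[str]) -> str:
    """Format one reviewer-note bullet for a challenge."""
    note = f'• Challenged by {short_name(reviewer)} ({on:%d/%m/%Y}): "{comment.rstrip(".")}."'
    if new_rating is not None:
        # Only changed ratings get a suffix; wording for upheld ratings is not shown above.
        note += f" Rating revised {old_rating} → {new_rating}."
    return note
```

With the values from the example (J. Reece, 15 April 2026, RED to AMBER), the function reproduces the first bullet above.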

2. Review Summary Section

A new section appears between the Executive Summary and Detailed Findings in all report types:

Review Summary

Assessment reviewed by	James Reece (Head of PMO), Sarah Chen (Cost Analyst)
Review period	14–15 April 2026
Criteria reviewed	24 of 24 (100%)
Ratings challenged	3 (2 changed, 1 upheld)
Criteria dismissed	1 (GR-4.7 — PFI criterion, not applicable)
Additional context	5 annotations
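Several of the figures in this summary are counts over the interaction log. The sketch below shows one way they could be derived, assuming each log entry is a dictionary with `action`, `criterion`, and (for challenges) `new_rating` keys; those key names and the `review_summary` function are assumptions for illustration.

```python
def review_summary(entries: list[dict]) -> dict:
    """Derive the challenge, dismissal, and context counts from audit entries."""
    challenges = [e for e in entries if e["action"] == "Challenge"]
    # A challenge counts as "changed" when it recorded a new rating.
    changed = sum(1 for e in challenges if e.get("new_rating") is not None)
    upheld = len(challenges) - changed
    dismissed = {e["criterion"] for e in entries if e["action"] == "Dismiss"}
    context = sum(1 for e in entries if e["action"] == "Add Context")
    return {
        "Ratings challenged": f"{len(challenges)} ({changed} changed, {upheld} upheld)",
        "Criteria dismissed": len(dismissed),
        "Additional context": context,
    }
```

Because the summary is computed from the same log that feeds the Audit tab, the report and the interface cannot drift apart.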

3. Full Interaction Log (Premium Reports)

For assurance-grade reports, an appendix lists every interaction with full text. This is the work product that consultancies attach to their deliverables — the equivalent of a traditional review workbook, but with perfect traceability.

The appendix includes:

- Every interaction, chronologically ordered, per criterion
- Full text of reviewer comments and AI responses
- Rating change history with before/after values
- Override justifications with the full challenge-rejection-override chain
- Dismiss reasons for excluded criteria

The Report IS the Work Product

A completed PI assessment with full reviewer interactions is equivalent to a traditional review workbook — structured assessment, evidence citations, professional sign-off, challenge resolution, and override justification — but with perfect traceability and produced in days rather than weeks.

9. Getting Started

A practical checklist to work through your first review. Start with the category that matters most and work outward.

1. Open your assessment results. Navigate to the assessment you want to review from your dashboard. The criteria browser shows all findings grouped by category.
2. Start with the category that matters most. If you are preparing for an IPA gateway, start with the category most likely to draw scrutiny. The progress bar at the top of each category shows 0% reviewed, which is your starting point.
3. Work through criteria: accept the clear ones, query the uncertain ones. For findings where the AI's assessment matches your view, hit Enter to accept and move on. For findings where you want to understand the reasoning, press Q to query.
4. Challenge findings you disagree with, and provide evidence. When you see a rating that does not match your professional judgement, press C to challenge. Select your proposed rating and write your justification. The more specific your reasoning, the better the re-assessment.
5. Use Force Override only when your professional judgement demands it. If the system rejects your challenge and you still disagree, override it. Write a clear justification, because it appears in the audit trail and reports.
6. Check the progress bar and aim for 100% reviewed. A fully reviewed assessment produces the strongest reports. The progress bar tracks your completion per category.
7. Generate your report. The audit trail appears automatically: when you generate a PDF report, all reviewer notes, the review summary, and the interaction log (premium reports) are included without any extra steps.

Typical Review Session

A typical assessment has 24–60 criteria per category. A thorough review of a category usually involves accepting 15–20 criteria, querying 5–10, challenging 3–5, and adding context to a handful. Most reviewers complete a category in 30–45 minutes.

Quick Reference: Action Summary

Most Common Workflow

1. Scan criteria in the browser

2. Accept obvious findings (Enter)

3. Query anything unclear (Q)

4. Challenge disagreements (C)

5. Override if needed (O)

6. Generate report

Key Shortcuts

Enter — Accept

Q — Query

C — Challenge

A — Add Context

D — Dismiss

O — Force Override

Esc — Close drawer
