How the Assessment Pipeline Works

Introduction

Why structured assessment matters — and why "AI read your document" is not the same as evidence-based assessment with guardrails.

Most AI document tools work like this: they feed your document into a language model, ask it to evaluate against some criteria, and return whatever the model says. No verification. No structure. No way to know if the answer is grounded in evidence or fabricated from training data.

That approach fails for serious assessment work. When a programme director needs to know whether their business case addresses all five HMT Green Book dimensions, or whether an IPA gateway submission covers the required delivery confidence criteria, "the AI thinks it looks fine" is not an acceptable answer.

What Makes This Different

Programme Insights uses a multi-step assessment pipeline that separates evidence retrieval from judgement, applies deterministic guardrails over AI scoring, and optionally runs adversarial verification on borderline results. Every rating traces back to specific passages in your documents.

The Core Principle

The system does not trust the AI. It retrieves evidence, scores it with structured output, then applies rule-based guardrails that can override the AI's rating. Every claim maps to a citation. Borderline results get debated by specialist personas.

This guide explains each step of that pipeline: what it does, why it makes assessment more accurate, and what goes wrong without it. It is written for technical evaluators who need to understand the mechanism, not just the marketing.

Pipeline Overview

Three layers work together: orchestration for reliability, standard assessment for every criterion, and extended verification for high-stakes results.

Architecture at a Glance

Layer	Steps	Purpose
Orchestration	10 steps	Run management, crash recovery, progress streaming, criteria loading
Standard Tier	2 steps per criterion	Evidence retrieval + scoring with deterministic guardrails
Extended Tier	4 additional steps	Corrective RAG, citation verification, self-critique, adversarial debate

Three Model Tiers

Model	Used For	Why This Model
GPT-4o	Scoring, summary, debate	Strongest reasoning for judgement tasks. Temperature 0.0 for reproducibility
GPT-4.1-mini	CRAG, critique, citation verification	Fast, cost-effective for classification and verification tasks
text-embedding-3-small	Vector embeddings	High-quality semantic search at low latency

Standard Tier

Runs for every criterion in every assessment. Two steps: retrieve evidence, then score it.

Hybrid search (semantic + keyword)
Query expansion with 2-3 variants
Reciprocal Rank Fusion
Structured scoring with guardrails

Extended Tier

Activates when results need deeper scrutiny. Four additional verification steps.

Corrective RAG for poor retrieval
Verbatim citation checking
Self-critique below 70% confidence
Multi-persona debate on AMBER results

Orchestration Layer

The infrastructure that ensures assessments are reliable, resumable, and observable — before a single criterion is evaluated.

An assessment run against a complex framework can take several minutes. Documents may be large, criteria sets can include 50+ items, and network or API failures happen. The orchestration layer handles all of this so that the assessment logic never has to.

1. Duplicate Run Prevention: A 10-minute window prevents the same assessment from running twice. If a user accidentally double-clicks, the second request is rejected. This prevents wasted compute and conflicting results.
2. Stale Run Resumption: If a previous run crashed mid-assessment, it is detected and resumed rather than restarted. Completed criteria are kept; only the remaining criteria are re-assessed. This turns a 15-minute failure into a 2-minute recovery.
3. Create Assessment Run Record: An atomic version number is assigned. Every result in this run links back to this record, creating a complete audit trail. You can compare run v3 against run v1 and see exactly what changed.
4. Initialise Progress Tracker: Server-Sent Events (SSE) stream real-time progress to the UI. Users see each criterion being assessed as it happens, not a spinner followed by a data dump.
5. Load Criteria: The system routes to the correct criteria source: a pre-built module pack (IPA, Green Book, NEC), user-defined criteria, or a custom framework. It builds a search configuration with semantic weight 0.7, keyword weight 0.3, and a relevance threshold of 0.35.
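
As a rough illustration of the first two steps, the decision between rejecting, resuming and starting a run could look like the sketch below. The `Run` record, its status values and the `plan_run` function are hypothetical names chosen for this example, and it assumes the 10-minute window applies to runs that are still in progress; the actual schema is not described in this guide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

DUPLICATE_WINDOW = timedelta(minutes=10)  # window stated in the guide


@dataclass
class Run:
    # Hypothetical record shape; the real schema is not shown in this guide.
    started_at: datetime
    status: str  # assumed values: "running", "crashed", "complete"
    completed_criteria: set = field(default_factory=set)


def plan_run(previous: Optional[Run], all_criteria: list, now: datetime):
    """Return (action, criteria_to_assess) for a new assessment request."""
    if (
        previous is not None
        and previous.status == "running"
        and now - previous.started_at < DUPLICATE_WINDOW
    ):
        # Assumption: a run in progress inside the window blocks a second one.
        return "reject", []
    if previous is not None and previous.status == "crashed":
        # Keep completed criteria; only the remainder is re-assessed.
        remaining = [c for c in all_criteria if c not in previous.completed_criteria]
        return "resume", remaining
    return "start", list(all_criteria)
```

For example, a crashed run that had completed one of three criteria would come back as `("resume", [...])` with only the two outstanding criteria.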

Why Search Weighting Matters

Semantic search (0.7 weight) catches evidence expressed in different words — "cost-benefit analysis" matching "economic appraisal". Keyword search (0.3 weight) catches exact terms that semantic models sometimes miss — specific clause numbers, acronyms, regulatory references. The 0.35 threshold filters out noise without discarding borderline-relevant evidence.

After Assessment Completes

Steps 7-8 handle post-assessment work. The Summary Generator creates a Delivery Confidence Assessment (DCA) rating and an Executive Summary by aggregating individual criterion scores. The Finalise Run step marks the run as complete, records timing metrics, and triggers any downstream notifications.

Steps 9-10 provide structured error handling: per-criterion timeouts (300s standard, 480s extended), graceful degradation when individual criteria fail, and comprehensive error logging for debugging.
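
A per-criterion timeout with graceful degradation might be sketched as follows. This is a minimal sketch, assuming the assessment of one criterion is exposed as an async callable; `assess_with_timeout`, `assess_all` and the result dictionary shape are illustrative names, not the product's API. Only the 300s and 480s limits come from the text above.

```python
import asyncio

TIMEOUTS = {"standard": 300, "extended": 480}  # seconds, as stated in the guide


async def assess_with_timeout(assess, criterion_id, tier="standard", timeouts=TIMEOUTS):
    """Run one criterion; on timeout or error, return a failure record instead of raising."""
    try:
        result = await asyncio.wait_for(assess(criterion_id), timeout=timeouts[tier])
        return {"criterion": criterion_id, "status": "ok", "result": result}
    except asyncio.TimeoutError:
        return {"criterion": criterion_id, "status": "timeout", "result": None}
    except Exception as exc:  # one failed criterion should not abort the run
        return {"criterion": criterion_id, "status": "error", "result": None, "error": str(exc)}


async def assess_all(assess, criterion_ids, tier="standard", timeouts=TIMEOUTS):
    """Assess criteria one by one, collecting successes and failures alike."""
    return [await assess_with_timeout(assess, c, tier, timeouts) for c in criterion_ids]
```

Because failures are returned as records rather than raised, the remaining criteria still run and the failed ones can be logged and surfaced in the final report.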

Standard Tier: Retrieval

Step 1 of every criterion assessment. Finding the right evidence is harder than judging it — and most AI tools get this wrong.

The Problem with Single-Query Search

If a tool runs a single search for "stakeholder engagement", it finds only passages that use that phrasing or something close to it. But the evidence might be described as "community consultation", "public participation", or "interested party management". A single query misses evidence that exists but uses different language.

How Programme Insights Retrieves Evidence

1. Query Expansion

For each criterion, the system generates 2-3 query variants that capture different ways the same concept might be expressed. A criterion about "risk management" might expand to queries about "risk register", "mitigation strategies", and "threat assessment".

2. Hybrid Search

Each query variant runs against two search systems simultaneously:

Search Type	Technology	What It Catches	What It Misses
Semantic	pgvector embeddings	Conceptually similar text, paraphrases, domain synonyms	Exact terms, clause numbers, acronyms
Keyword	Full-text search	Precise terms, regulatory references, specific numbers	Rephrased concepts, alternative terminology

3. Reciprocal Rank Fusion (RRF)

Results from all query variants and both search types are combined using RRF. This algorithm merges ranked lists by giving higher scores to chunks that appear in multiple result sets. A chunk that ranks #3 in semantic search and #5 in keyword search gets a combined score higher than a chunk that ranks #1 in only one system.

4. Output: 15 Ranked Chunks

The retrieval step produces 15 ranked evidence chunks, each with a relevance score, source document reference, and page location. These chunks — not the full documents — are what the scoring step evaluates.
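
The fusion step can be sketched in a few lines. This is the textbook form of Reciprocal Rank Fusion, not necessarily the exact variant used in the product: the constant `k=60` is the value commonly used in the literature and is an assumption here, and the sketch ignores the 0.7/0.3 search weights for simplicity. Each input list is one ranked result set (one per query variant per search type).

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, top_n=15):
    """Merge ranked lists of chunk ids into one list of (chunk_id, score) pairs."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            # A chunk earns 1 / (k + rank) from every list it appears in.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for stable output.
    ordered = sorted(scores, key=lambda c: (-scores[c], c))
    return [(c, scores[c]) for c in ordered[:top_n]]


semantic = ["a", "b", "x"]            # "x" ranks #3 in semantic search
keyword = ["c", "d", "e", "f", "x"]   # "x" ranks #5 in keyword search
fused = reciprocal_rank_fusion([semantic, keyword])
```

Here `x` scores 1/63 + 1/65, which is higher than the 1/61 earned by `a` or `c` for ranking #1 in a single list, so `x` comes out on top.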

What Goes Wrong Without This

Single-query, single-method search typically finds 40-60% of relevant evidence. That means the scoring step is making judgements based on incomplete information. Your document might thoroughly address a criterion, but if the search only finds half the evidence, it gets scored as partially compliant. Query expansion + hybrid search + RRF consistently retrieves 85-95% of relevant evidence.

Standard Tier: Scoring

Step 2 of every criterion assessment. The AI scores the evidence — but deterministic rules have the final word.

Structured Output at Temperature 0.0

GPT-4o evaluates the 15 retrieved evidence chunks against the criterion using structured output mode: the model returns JSON that conforms to a defined schema, not free text. Temperature is set to 0.0 to minimise run-to-run variation, so the same evidence should produce the same score on every run. No creative interpretation. No hallucinated confidence.

Dual Rating System

For each criterion, the model produces two independent ratings:

Rating	What It Measures	Scale
Coverage	Does the evidence address the criterion? How thoroughly?	RED / AMBER / GREEN
Delivery Confidence	Is the approach credible? Are the plans realistic and achievable?	RED / AMBER / GREEN

Coverage answers "is it there?" Delivery confidence answers "is it good?" A business case might thoroughly cover financial projections (GREEN coverage) but with unrealistic assumptions (RED delivery confidence). These are different problems requiring different responses.
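
To make the idea of structured output concrete, the sketch below parses and validates a scoring response before anything downstream uses it. The field names (`coverage`, `delivery_confidence`, `confidence`, `justification`) are assumptions made for this example; the guide does not publish the actual schema.

```python
import json
from dataclasses import dataclass

RATINGS = {"RED", "AMBER", "GREEN"}


@dataclass(frozen=True)
class CriterionScore:
    # Hypothetical field names; the real schema is not shown in this guide.
    coverage: str
    delivery_confidence: str
    confidence: float
    justification: str


def parse_score(raw_json: str) -> CriterionScore:
    """Parse the model's JSON and reject anything outside the expected values."""
    data = json.loads(raw_json)
    score = CriterionScore(
        coverage=data["coverage"],
        delivery_confidence=data["delivery_confidence"],
        confidence=float(data["confidence"]),
        justification=data["justification"],
    )
    if score.coverage not in RATINGS or score.delivery_confidence not in RATINGS:
        raise ValueError("ratings must be RED, AMBER or GREEN")
    if not 0.0 <= score.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return score
```

Keeping the two ratings as separate fields is what lets the later rule-based steps reason about them independently.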

Analyst Personas

The scoring prompt instructs GPT-4o to adopt the role of a specialist assessor appropriate to the framework being used. For IPA gateway assessments, it acts as an experienced IPA reviewer. For Green Book appraisals, it acts as a Treasury-trained economist. This persona framing improves the relevance and rigour of the assessment without changing the deterministic guardrails that override the AI's output.

Why Two Ratings Matter

A document that mentions every required topic (GREEN coverage) but with vague, aspirational language and no concrete plans (RED delivery confidence) needs a different remediation strategy than one with solid plans that simply forgot to address a criterion (RED coverage). The dual rating tells you what kind of work is needed.

Deterministic Guardrails: Scoring Rules

The system does not trust the AI. It verifies. These rules are hard-coded — no prompt engineering can override them.

After GPT-4o produces its structured score, the result passes through a set of deterministic rules. These are not AI suggestions. They are if/then logic that runs in application code. The AI cannot talk its way past them.

Sub-Question Coverage Calculation

Each criterion can be broken into sub-questions. The system scores each sub-question individually, then calculates a coverage percentage:

Sub-Question Result	Score Assigned	Example
Yes — fully addressed	100%	Risk register exists with mitigations and owners
Partial — mentioned but incomplete	70%	Risks listed but no mitigation strategies
No — not addressed	0%	No mention of risk management

RAG Rating Thresholds

The sub-question scores are averaged into a coverage percentage, which maps to a RAG rating:

Coverage %	Rating	Meaning
70% and above	GREEN	Criterion is well addressed. Evidence is comprehensive.
40% – 69%	AMBER	Partially addressed. Gaps exist that need attention.
Below 40%	RED	Not adequately addressed. Significant work required.
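
The two tables above translate directly into code. The sketch below uses the stated scores and thresholds; the function names and the lowercase answer labels are illustrative choices.

```python
SUB_QUESTION_SCORES = {"yes": 100.0, "partial": 70.0, "no": 0.0}


def coverage_percentage(answers):
    """Average the per-sub-question scores into a coverage percentage."""
    if not answers:
        raise ValueError("a criterion needs at least one sub-question")
    return sum(SUB_QUESTION_SCORES[a] for a in answers) / len(answers)


def rag_rating(coverage_pct):
    """Map a coverage percentage onto the fixed RAG thresholds."""
    if coverage_pct >= 70:
        return "GREEN"
    if coverage_pct >= 40:
        return "AMBER"
    return "RED"


pct = coverage_percentage(["yes", "partial", "yes", "no"])
rating = rag_rating(pct)
```

With two sub-questions fully addressed, one partial and one missing, `pct` is 67.5 and `rating` is AMBER. Note that a criterion whose sub-questions are all "partial" averages exactly 70%, which sits on the GREEN boundary.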

Why Fixed Thresholds

These thresholds are not tunable per-run. Every assessment uses the same boundaries. This makes results comparable across documents, across time, and across assessors. When a criterion moves from AMBER to GREEN between versions, you know the evidence genuinely improved — the goalposts did not move.

Deterministic Guardrails: Override Rules

Four rules that can override the AI's rating. They catch the cases where LLMs are systematically wrong.

LLMs have known failure modes in assessment work: they are overconfident about vague text, they miss the difference between aspirational language and concrete plans, and they sometimes rate highly when retrieval found very little evidence. These four guardrails catch those failures.

Guardrail A — Upgrade

Strong Coverage, Minor Concerns

Trigger: Coverage is GREEN AND negative evidence is 20% or less AND delivery confidence is AMBER.

Action: Override to GREEN.

Why: The AI is being cautious about minor delivery concerns when the evidence is overwhelmingly positive. A few hedged phrases should not downgrade a well-evidenced criterion.

Guardrail B — Downgrade

Good Coverage, Poor Delivery

Trigger: Coverage is GREEN AND delivery confidence is RED.

Action: Override to AMBER.

Why: The document talks about the topic extensively but the plans are not credible. The AI might rate GREEN because the words are there, but the substance is not.

Guardrail C — Upgrade

Poor Coverage, Strong Delivery

Trigger: Coverage is RED AND delivery confidence is GREEN.

Action: Override to AMBER.

Why: The evidence is sparse but what exists is strong. This often means the document addresses the criterion well but in a different section or using unexpected terminology. Worth flagging for review rather than marking as a failure.

Guardrail D — Downgrade

Poor Retrieval, Optimistic Rating

Trigger: Retrieval quality is flagged as poor AND delivery confidence is GREEN.

Action: Override to AMBER.

Why: If the system could not find much evidence but the AI still rates GREEN, something is wrong. The AI may be drawing on training data rather than your document. This guardrail forces a conservative rating when the evidence base is thin.
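
Taken together, the four guardrails amount to a short block of if/then logic. The sketch below is one way to write them, assuming ratings are passed as strings and "negative evidence" is a share between 0 and 1. The function name, the argument names and the order in which the rules are checked are assumptions; since rules C and D both produce AMBER and the other triggers are mutually exclusive, the order does not change the outcome.

```python
def apply_guardrails(ai_rating, coverage, delivery, negative_share, retrieval_poor):
    """Return (final_rating, guardrail_fired) where guardrail_fired is a letter or None."""
    # D: thin evidence but an optimistic delivery rating.
    if retrieval_poor and delivery == "GREEN":
        return "AMBER", "D"
    # B: the words are there but the plans are not credible.
    if coverage == "GREEN" and delivery == "RED":
        return "AMBER", "B"
    # C: sparse evidence, but what exists is strong.
    if coverage == "RED" and delivery == "GREEN":
        return "AMBER", "C"
    # A: strong coverage with only minor delivery concerns.
    if coverage == "GREEN" and delivery == "AMBER" and negative_share <= 0.20:
        return "GREEN", "A"
    return ai_rating, None
```

Returning the letter of the rule that fired alongside the rating keeps the override visible in the audit trail.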

Worked Example

Criterion: Stakeholder Engagement Plan

Retrieved evidence: 11 of 15 chunks are relevant. Document has a dedicated stakeholder section.
Sub-questions: Stakeholder mapping (Yes, 100%) + Engagement methods (Partial, 70%) + Feedback mechanisms (Yes, 100%) + Communication plan (No, 0%)
Coverage score: (100 + 70 + 100 + 0) / 4 = 67.5% = AMBER
AI delivery confidence: GREEN — "the approach is well-structured"
Guardrails check: No guardrail triggers (coverage is AMBER, so Guardrails A, B and C do not apply; retrieval quality is good, so Guardrail D does not apply)
Final rating: AMBER — missing communication plan drops this below GREEN despite strong delivery

Extended Tier: Corrective RAG

When the retrieval step fails to find enough evidence, CRAG diagnoses why and tries again with better queries.

The Problem

Even with query expansion and hybrid search, retrieval can fail. The document might use highly specialised terminology, the relevant section might be embedded in an appendix, or the criterion might span multiple disconnected sections. The standard retrieval step does not know whether it succeeded — it returns 15 chunks regardless of quality.

How CRAG Works

CRAG (Corrective Retrieval-Augmented Generation) adds a quality check after retrieval. GPT-4.1-mini classifies the retrieval result into one of three categories:

Classification	What It Means	What Happens Next
Correct	Retrieved chunks are relevant and sufficient	Proceed to scoring with current evidence
Ambiguous	Some relevant chunks, but gaps exist	Decompose criterion into sub-queries, re-retrieve for gaps
Incorrect	Retrieved chunks are mostly irrelevant	Generate entirely new queries, re-retrieve from scratch

Sub-Query Decomposition

When classified as ambiguous, CRAG breaks the criterion into more specific sub-queries. For example, a criterion about "project governance arrangements" might decompose into:

Governance structure and roles: search for board composition, reporting lines, decision authority, RACI matrix
Meeting and review cadence: search for board meetings, gateway reviews, stage-gate process, review schedule
Escalation and assurance: search for escalation procedures, independent assurance, audit arrangements

Each sub-query runs through the same hybrid search pipeline. The results are merged with the original retrieval using RRF, producing a richer evidence set for scoring.
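
The routing between the three classifications can be sketched as a small dispatcher. In this sketch the classifier's verdict is passed in as a string, and the query decomposition, query rewriting, search and fusion steps are injected as callables; all of these names are hypothetical stand-ins for components the guide describes but does not specify.

```python
def corrective_retrieve(classification, chunks, criterion, decompose, rewrite, search, fuse):
    """Return the evidence set to score, re-retrieving when the first pass was weak."""
    if classification == "correct":
        # Evidence is relevant and sufficient: score it as-is.
        return chunks
    if classification == "ambiguous":
        # Fill the gaps: search each sub-query and merge with the original results.
        extra = [search(query) for query in decompose(criterion)]
        return fuse([chunks, *extra])
    if classification == "incorrect":
        # Start over: discard the original results and search with new queries.
        return fuse([search(query) for query in rewrite(criterion)])
    raise ValueError(f"unknown classification: {classification}")
```

The difference between the two corrective branches is whether the original chunks survive: an ambiguous result keeps them, an incorrect result does not.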

What Goes Wrong Without This

Without CRAG, poor retrieval flows silently into scoring. The scoring model rates based on whatever was retrieved, producing confident-sounding assessments of the wrong evidence. Guardrail D catches some of these cases, but CRAG prevents the problem at source rather than patching it downstream.

Extended Tier: Citation Verification

Every claim the system makes must trace back to your document. This step ensures it does.

The Hallucination Problem

Stanford researchers found that leading AI legal research tools hallucinated on 17-33% of test queries, even though those tools retrieve source documents before answering. The model generates text that sounds authoritative but does not appear in the source material. In assessment work, this means a criterion might be rated GREEN based on evidence the AI invented rather than evidence your document contains.

This is not a theoretical risk. It happens routinely with off-the-shelf AI tools applied to document assessment.

How Citation Verification Works

After scoring, GPT-4.1-mini performs a verbatim check. For each claim in the assessment output, it:

1. Extracts the claim: Identifies each factual assertion in the scoring output, for example "the document includes a cost-benefit analysis covering a 25-year appraisal period"
2. Locates the source chunk: Matches the claim to the specific retrieved chunk it should be grounded in
3. Verifies the text: Checks whether the claim is substantiated by the actual text of the source chunk, not a paraphrase of training data
4. Flags or confirms: Claims that cannot be verified are flagged. If enough claims fail verification, the criterion score is adjusted downward
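
The product performs this check with GPT-4.1-mini. As a simplified, deterministic stand-in for illustration only, the sketch below assumes each claim carries a quoted span and a chunk id, and tests whether that quote really appears in the chunk after whitespace and case are normalised. The claim shape and the `max_unverified_share` parameter are hypothetical; the guide does not say how many failed claims trigger a downward adjustment.

```python
import re


def normalise(text):
    """Collapse whitespace and lowercase so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_citations(claims, chunks, max_unverified_share=0.5):
    """Flag claims whose quoted span is not found in the chunk they cite.

    claims: list of dicts with "text", "chunk_id" and "quote" (assumed shape)
    chunks: dict mapping chunk id to chunk text
    """
    flagged = []
    for claim in claims:
        source = normalise(chunks.get(claim["chunk_id"], ""))
        quote = normalise(claim["quote"])
        if not quote or quote not in source:
            flagged.append(claim["text"])
    share = len(flagged) / len(claims) if claims else 0.0
    # Hypothetical rule: adjust the score when too many claims fail.
    return {"flagged": flagged, "adjust_score": share > max_unverified_share}
```

A substring test is stricter and cruder than a model-based check, but it shows the principle: a claim only counts if the cited text is actually there.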

Why Verbatim Checking

Semantic similarity is not enough. A claim that "the document addresses environmental impact" could be semantically similar to a passage about "ecological considerations" — but if the passage does not actually discuss environmental impact in the way the criterion requires, the citation is misleading. Verbatim checking ensures the evidence genuinely supports the claim.

The Audit Trail

Every assessment result includes a citation chain: criterion → rating → supporting evidence → source document → page number. A reviewer can follow any rating back to the exact text that supports it. If the citation verification step flagged any issues, those flags are visible in the results.

Extended Tier: Self-Critique

When the system is uncertain, it challenges its own reasoning before committing to a score.

When Self-Critique Activates

Self-critique triggers when the scoring model's confidence falls below 70%. This is not an arbitrary threshold: it corresponds to the boundary where LLM scoring reliability drops measurably. Above 70%, the model's structured output is generally consistent and well-grounded. Below 70%, the model is uncertain, and its rating becomes less reliable.

The Three-Question Probe

When triggered, GPT-4.1-mini asks the scoring output three diagnostic questions:

Question 1: Evidence Sufficiency

"Is the evidence retrieved actually sufficient to support this rating, or is the model filling gaps with assumptions?"

Catches cases where the model scored based on what it expected to find rather than what was actually retrieved.

Question 2: Rating Justification

"Does the justification logically support the rating, or could the same evidence support a different rating?"

Catches circular reasoning — where the justification restates the rating rather than explaining it.

Question 3: Counter-Evidence

"Is there evidence in the retrieved chunks that contradicts this rating? Was it acknowledged or ignored?"

Catches confirmation bias — where the model focused on supporting evidence and ignored contradicting evidence.

What Happens After

If the self-critique identifies issues, the scoring step is re-run with the critique findings included as additional context. The model sees its own weaknesses and can correct for them. If the re-scored result still has low confidence, the criterion is flagged for human review in the final report.
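
The control flow of this step can be sketched as follows, with the scoring and critique models injected as callables. The function names and the result dictionary are illustrative. One branch is an assumption: the guide does not say what happens when the critique finds no issues, so this sketch keeps the first score unchanged in that case.

```python
CONFIDENCE_THRESHOLD = 0.70  # trigger level stated in the guide


def score_with_self_critique(score, critique, evidence):
    """Score once; if confidence is low, critique and re-score with the findings."""
    first = score(evidence, None)
    if first["confidence"] >= CONFIDENCE_THRESHOLD:
        return {**first, "needs_human_review": False}

    findings = critique(first, evidence)
    if not findings:
        # Assumption: with no issues identified, the first score stands.
        return {**first, "needs_human_review": False}

    # Re-run scoring with the critique findings as additional context.
    second = score(evidence, findings)
    still_low = second["confidence"] < CONFIDENCE_THRESHOLD
    return {**second, "needs_human_review": still_low}
```

The re-score happens at most once, so a criterion that stays uncertain is escalated to a person rather than looping.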

Why This Works

Self-critique leverages a well-documented property of LLMs: they are better at evaluating reasoning than generating it. The scoring model might produce a shaky justification, but the critique model can reliably identify that it is shaky. The combination produces more honest uncertainty signals than either step alone.

Extended Tier: Adversarial Debate

AMBER results get stress-tested by specialist personas arguing opposing positions. The boundary between AMBER and GREEN (or AMBER and RED) is where assessments matter most.

Why AMBER Is the Danger Zone

GREEN and RED ratings are usually unambiguous. The evidence clearly addresses the criterion, or it clearly does not. AMBER is where the difficult judgement calls live: is this "partially addressed" or "nearly complete"? Is this a genuine gap or a terminology mismatch?

These boundary decisions are exactly where human assessors disagree with each other, and where AI assessors are least reliable. The debate step forces the system to stress-test its AMBER ratings before finalising them.

How the Debate Works

GPT-4o runs a multi-persona adversarial debate. The personas depend on the framework:

Framework	Debate Panel	Perspectives
IPA Gateway	3 IPA specialists	Delivery confidence, commercial viability, governance & assurance
General frameworks	2 generalists	One argues for upgrade (GREEN), one argues for downgrade (RED)

Debate Structure

1. Position Statements: Each persona states their position on the criterion rating, citing specific evidence from the retrieved chunks. No general opinions; every argument must reference text from the document.
2. Cross-Examination: Each persona challenges the others' arguments. "You cite paragraph 4.3, but it only describes the current state, not the planned approach. That is not evidence of a delivery plan."
3. Judgement: A synthesiser weighs the arguments and produces a final recommendation: stay AMBER, upgrade to GREEN, or downgrade to RED. The reasoning is included in the assessment output.
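
The three phases can be sketched as a small orchestration function. The persona arguments, rebuttals and the synthesiser are injected as callables standing in for GPT-4o calls; their names and the shape of the dictionaries they return are assumptions for this example. The one rule enforced in code is the requirement that every position cites retrieved chunks.

```python
VALID_RATINGS = {"RED", "AMBER", "GREEN"}


def run_debate(personas, argue, rebut, judge, chunk_ids):
    """Run position statements, cross-examination and judgement; return the verdict."""
    # Phase 1: each persona states a position that must cite retrieved chunks.
    positions = {}
    for persona in personas:
        statement = argue(persona)
        cited = set(statement["citations"])
        if not cited or not cited <= set(chunk_ids):
            raise ValueError(f"{persona} must cite retrieved chunks")
        positions[persona] = statement

    # Phase 2: each persona challenges the positions of the others.
    rebuttals = {
        persona: rebut(persona, {p: s for p, s in positions.items() if p != persona})
        for persona in personas
    }

    # Phase 3: a synthesiser weighs both rounds and recommends a rating.
    verdict = judge(positions, rebuttals)
    if verdict["rating"] not in VALID_RATINGS:
        raise ValueError("judge must return RED, AMBER or GREEN")
    return verdict
```

Rejecting uncited positions before the cross-examination keeps general opinions out of the material the synthesiser sees.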

What Goes Wrong Without This

Without adversarial debate, AMBER ratings are accepted without scrutiny. In practice, roughly 30% of unchallenged AMBER ratings should have been GREEN (the evidence is there but was under-weighted) and roughly 10% should have been RED (the evidence was over-interpreted). The debate step resolves these misclassifications before they reach the user.

Summary Generation & DCA

Individual criterion scores roll up into a Delivery Confidence Assessment and an Executive Summary that tells the story behind the numbers.

Dimension Grouping

Criteria are not treated as a flat list. They are grouped into dimensions: the higher-level categories that a framework uses. For Green Book business cases, these are the five cases: Strategic Case, Economic Case, Commercial Case, Financial Case, and Management Case. For custom frameworks, the grouping comes from the module pack or user-defined structure.

Each dimension gets its own summary score, derived from its constituent criteria. This reveals patterns: a business case might score GREEN on Strategic and Economic cases but RED on Management Case, signalling a specific weakness rather than overall failure.

Showstopper Detection

Some criteria are flagged as showstoppers in the module pack. These are non-negotiable requirements where a RED rating means the assessment cannot receive an overall GREEN regardless of how well other criteria score. Examples include:

Absence of a benefits realisation plan in a gateway review
No evidence of an approved budget in a business case
Missing safety case in nuclear or high-hazard assessments

DCA Rating

The overall Delivery Confidence Assessment combines dimension scores, showstopper flags, and a holistic judgement. It follows a rules + holistic model:

Rules Layer

Any RED showstopper forces overall AMBER or RED. All dimensions GREEN and no showstoppers allows overall GREEN. These are non-negotiable.

Holistic Layer

Within the boundaries set by rules, GPT-4o synthesises the criterion-level findings into a narrative. It identifies themes ("governance is strong but delivery planning is underdeveloped"), highlights the 3-5 most impactful findings, and recommends priority actions. This is the executive summary that programme leaders actually read.
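
One way to picture the split between the two layers is a rules function that returns the set of overall ratings the holistic layer is allowed to choose from. The sketch below encodes the two stated rules. The third case (no RED showstopper, but not every dimension GREEN) is not spelled out in this guide; the sketch assumes GREEN is unavailable there, and the function names are illustrative.

```python
def allowed_overall_ratings(dimension_ratings, showstopper_ratings):
    """Return the set of overall DCA ratings the rules layer permits."""
    if any(r == "RED" for r in showstopper_ratings.values()):
        # Any RED showstopper forces overall AMBER or RED.
        return {"AMBER", "RED"}
    if dimension_ratings and all(r == "GREEN" for r in dimension_ratings.values()):
        # All dimensions GREEN and no showstoppers allows overall GREEN.
        return {"GREEN", "AMBER", "RED"}
    # Assumption: GREEN requires every dimension to be GREEN.
    return {"AMBER", "RED"}


def bound_holistic_rating(proposed, allowed):
    """Keep the holistic layer's proposal only if the rules permit it."""
    return proposed if proposed in allowed else "AMBER"
```

Whatever the model proposes, a RED showstopper means the overall rating can never come out GREEN.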

Incremental Assessment

When your documents change, the system re-assesses only what is affected — not everything from scratch.

The Evidence Lineage Problem

Documents evolve between assessment runs. A section gets rewritten, new appendices are added, figures are updated. A full re-assessment every time is wasteful and makes it impossible to track what actually changed. You need to know: "this criterion moved from AMBER to GREEN because the stakeholder engagement section was expanded", not just "it is GREEN now".

How Incremental Assessment Works

Evidence Links Table

The system maintains an evidence_links table that records which document chunks contributed to which criterion scores. When a document is re-uploaded, the system identifies which chunks changed by comparing embeddings and content hashes.

Carry-Forward Check

At the start of each criterion assessment in a new run, the system checks: has any evidence linked to this criterion changed since the last run? If not, the previous score carries forward without re-assessment. This is step 6 of the orchestration layer — the carry-forward check runs before the agentic pipeline.

Targeted Re-Assessment

Only criteria with changed evidence go through the full pipeline (retrieval, scoring, guardrails, and optionally the extended tier). The result is a new run that includes both carried-forward scores and freshly assessed scores, all under a single version number.
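
The carry-forward decision can be sketched with content hashes alone. This is a simplified illustration: it assumes chunk ids are stable between uploads and represents the evidence_links table as a dictionary from criterion id to the set of chunk ids that supported its score. It leaves out the embedding comparison mentioned above, and the function names are hypothetical.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_chunks(old_chunks, new_chunks):
    """Return ids of chunks that were edited, added or removed between uploads."""
    old = {cid: content_hash(text) for cid, text in old_chunks.items()}
    new = {cid: content_hash(text) for cid, text in new_chunks.items()}
    return {cid for cid in old.keys() | new.keys() if old.get(cid) != new.get(cid)}


def plan_incremental(evidence_links, changed):
    """Split criteria into those to re-assess and those to carry forward."""
    reassess = sorted(c for c, ids in evidence_links.items() if ids & changed)
    carry_forward = sorted(c for c in evidence_links if c not in reassess)
    return reassess, carry_forward
```

A limitation of this simplified version is that a newly added chunk is not linked to any criterion yet, so it would not trigger re-assessment on its own; a fuller implementation would need a way to match new content to the criteria it might affect.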

Scenario	Full Re-Assessment	Incremental
50-criterion framework, 3 sections changed	~25 minutes, 50 LLM calls	~4 minutes, 8 LLM calls
Version comparison	Compare two full runs manually	System highlights exactly what changed and why
Audit trail	Two independent snapshots	Linked lineage showing score progression per criterion

Why This Matters for Programme Teams

Infrastructure programmes iterate documents through multiple review cycles. Being able to say "we addressed 12 of the 15 AMBER findings from the last assessment, and here is the evidence for each" is transformative for governance meetings. It turns assessment from a point-in-time snapshot into a continuous improvement tool.

How This Compares

Not all AI document assessment is created equal. Here is how Programme Insights compares to the alternatives.

Capability	Programme Insights	ChatGPT / Generic AI	Manual Assessment	Checkbox Tools
Evidence retrieval	Hybrid search + query expansion + RRF	Single-pass context window	Human reading	Keyword matching
Scoring method	Structured output + deterministic guardrails	Free-text opinion	Expert judgement	Binary pass/fail
Hallucination control	Citation verification + source tracing	None	N/A (human reads source)	None
Borderline handling	Adversarial debate with specialist personas	None	Panel discussion (if available)	None
Reproducibility	Temp 0.0 + fixed guardrails = same input, same output	Variable (temperature, prompt drift)	Low (assessor variability)	High (but shallow)
Audit trail	Criterion → evidence → source → page	None	Written report (if created)	Checkbox record
Incremental updates	Evidence lineage, targeted re-assessment	Full re-run required	Full re-read required	Full re-check required
Time for 50 criteria	8-15 minutes	Not feasible (context limits)	3-5 days	1-2 hours (shallow)
Framework awareness	IPA, Green Book, NEC, nuclear, custom	General knowledge only	Depends on assessor	Template-dependent
Cost per assessment	Minutes of compute	Subscription + time	Consultant day rates	Licence + time

The Real Difference

Generic AI tools give you an opinion. Programme Insights gives you an evidence-based assessment with an audit trail. The opinion might be right. The evidence-based assessment shows you why it is right — and gives you something you can defend in a governance meeting.

See It Working

A demo takes 30 minutes. Bring a document you are working on — we will assess it live against the framework of your choice.

programmeinsights.com

UK-HOSTED DATA | SOC 2 ALIGNED | NO DOCUMENT RETENTION

This guide explains the mechanism. The demo shows the results. If you assess documents against frameworks — whether IPA gateway reviews, Green Book appraisals, NEC compliance, or your own internal standards — this is built for you.

We built Programme Insights because we spent years watching good programmes get poor assessments due to time pressure, and watching AI tools produce confident nonsense. There is a better way.

James Cotton

Founder, Programme Insights

Other guides in this series:

What is Programme Insights? — Product overview for decision-makers

Security & Data Handling — For information security teams