White Paper

AI-Powered Assessment

A Practitioner's Guide to Structured Document Assessment for UK Infrastructure Programmes
James Reece

Founder, Programme Insights

Published April 2026 | 60+ pages | programmeinsights.com

About the Author

James Reece

Founder, Programme Insights

James has spent over 15 years working in UK infrastructure programme management. His career includes roles on some of the UK's most significant projects: Heathrow Terminal 2, HS2, Network Rail, and nuclear decommissioning programmes. He has worked across programme controls, assurance, and PMO delivery on programmes ranging from hundreds of millions to tens of billions of pounds.

He built Programme Insights because he lived the problem it solves. Years of sitting in gateway review rooms, preparing document submissions, and watching experienced professionals spend weeks reading documents that should take hours. Watching organisations use ChatGPT for assurance and hoping nobody asked where the findings came from. Watching good programmes get RED ratings because evidence wasn't assembled properly, not because the work hadn't been done.

Programme Insights is the tool he wished existed when he was the one preparing submissions at 11pm the night before a review.

"This guide is written from the perspective of someone who has sat in gateway review rooms, prepared document submissions, and experienced firsthand the consequences of both thorough and inadequate assessment. It isn't academic theory. It's what I've learned from being in the room."

Contents

Part One: The Problem
Chapter 1 The Document Assessment Challenge
Chapter 2 Why Current Approaches Fall Short
Chapter 3 What Good Assessment Actually Looks Like
Part Two: The Method
Chapter 4 The Principles of Structured AI Assessment
Chapter 5 The Assessment Pipeline
Chapter 6 Accuracy Over Speed
Part Three: Application
Chapter 7 Use Cases Across Domains
Chapter 8 Both Sides of the Table
Chapter 9 The Configurable Engine Approach
Part Four: Reality Check
Chapter 10 What AI Assessment Cannot Do
Chapter 11 The Future of Document Assessment
Appendices
Appendix A Framework Reference
Appendix B Glossary
About Programme Insights

Part One

The Problem

Why document assessment on major programmes is broken, what organisations have tried, and what good actually looks like.

Chapter 1

The Document Assessment Challenge

The Scale of the Problem

A major UK infrastructure programme generates documentation measured in thousands of pages. A single IPA gateway review requires evidence against 218 criteria. A nuclear licence application involves documentation spanning 36 licence conditions, each with detailed sub-requirements that cross-reference between documents. A Green Book business case assessment at Full Business Case stage requires evidence across 98 criteria spanning five cases.

These aren't abstract numbers. They represent real documents that someone has to read, assess, and form a judgement about. Not skim. Not summarise. Actually read, against specific criteria, and produce findings that will be scrutinised by reviewers, regulators, and board members who are professionally trained to find gaps.

On Heathrow Terminal 2, the gateway preparation involved hundreds of documents across multiple workstreams. On HS2, the scale was measured in tens of thousands of pages. On nuclear programmes, a single submission package can take a team of experienced professionals weeks to prepare and months for the regulator to assess.

The question isn't whether these documents need to be reviewed. They do. The question is whether the way we review them is fit for purpose.

Why It Matters

The consequences of inadequate document assessment are not theoretical. They are career-defining, programme-defining, and in regulated sectors, safety-defining.

A RED rating at an IPA gateway review doesn't just mean "try again." It means the programme is paused. Additional oversight is imposed. The Senior Responsible Owner's competence is questioned. Stakeholder confidence collapses. The programme team spends months recovering credibility that took years to build.

In tender evaluation, inconsistent assessment across evaluators produces disputed results, legal challenges, and procurement delays measured in months. In nuclear, a missing cross-reference in a safety case delays regulatory approval and carries implications that go well beyond programme timelines.

Every one of these consequences is preventable. Not by doing more review, but by doing better review.

Scenario

The Gateway Surprise

A major programme team spent six weeks preparing for their Gate 3 review. The PMO assembled the document pack, section leads provided their updates, and the team felt confident. They had been through this before.

In the review room, the IPA reviewer identified that three of the five business case areas lacked quantified evidence for benefits realisation. The Benefits Management Strategy existed, but the evidence trail from strategy to individual benefit profiles to realisation plans was incomplete. The team knew the work had been done. But the documents didn't prove it.

The review returned AMBER/RED. The programme wasn't paused, but the conditions imposed consumed the next three months of PMO capacity.

The gaps were fixable. They should have been caught before the reviewer found them.

Scenario

The Inconsistent Evaluation

A public sector body evaluated tenders from four bidders against 42 quality criteria. Five evaluators assessed the submissions independently. When scores were compiled, two bidders were separated by less than 2% overall. But individual evaluator scores for the same submission varied by up to 30 percentage points on identical criteria.

The losing bidder challenged. The evaluation panel couldn't demonstrate that the same criteria had been applied consistently across all evaluators. The procurement was delayed by four months while the evaluation was repeated with additional moderation.

The criteria were clear. The application of those criteria was not.

Scenario

The Missing Cross-Reference

A nuclear licensee submitted documentation supporting a licence condition compliance demonstration. The safety case team had prepared thorough documentation. But a cross-reference between the Safety Assessment Principles response and a supporting technical assessment was incorrectly numbered. The ONR inspector couldn't verify the claim against the source document.

The issue wasn't the quality of the work. It was the traceability of the evidence. A single broken cross-reference created a regulatory query that took three weeks to resolve.

"I've sat in the room when a gateway reviewer reads out findings you didn't know were coming. The silence is unforgettable. Not because the findings were unreasonable, but because they were right — and you should have caught them first."

The Current State

The dominant approach to document assessment on major programmes is manual review by experienced professionals. It works, up to a point. A good reviewer with deep domain knowledge can identify gaps, assess quality, and form judgements that are genuinely valuable.

But it depends entirely on the individual. One reviewer catches what another misses. Fatigue sets in on the third hundred-page document. Implicit knowledge — the experienced reviewer who "knows what good looks like" — is invaluable but inconsistent. It can't be scaled, it can't be replicated, and it can't be audited.

At the other end of the spectrum, organisations are increasingly turning to generic AI tools. Teams paste documents into ChatGPT and ask "is this any good?" They get a confident answer. That answer may be accurate. It may be fabricated. There's no way to tell, because there's no citation, no criteria framework, and no audit trail.

Between these extremes — expensive manual review and unstructured AI — there's a gap. Nothing that reads documents against specific criteria AND produces traceable, evidence-based findings. Nothing that maintains the rigour of expert review while addressing the scale of the problem.

That gap is what this guide addresses.

Chapter 2

Why Current Approaches Fall Short

Organisations facing the document assessment challenge have four options available to them today. Each addresses part of the problem. None addresses all of it.

Manual Expert Review

The gold standard, and for good reason. An experienced professional who has sat through dozens of gateway reviews, who understands the nuances of Green Book methodology, who knows what an ONR inspector is actually looking for — that person provides assessment quality that no technology can fully replicate.

The problems are practical, not conceptual.

Reviewer fatigue is real. Assessment quality degrades on the fourth document of the day. The same reviewer who catches subtle gaps in the morning misses obvious ones in the afternoon. This isn't a failure of competence. It's a failure of scale. Human concentration has limits, and major programme document sets exceed them.

Implicit criteria resist standardisation. Ask three experienced reviewers what "good" looks like for a Benefits Management Strategy, and you'll get three different answers. All valid. All different. The expertise is in their heads, not in a framework that can be consistently applied. When the reviewer changes, the assessment changes.

No systematic coverage verification. A reviewer reads the document from beginning to end and forms a view. But did they check every criterion? Did they verify every cross-reference? The honest answer, for even the most diligent reviewer, is that some things get missed. Not because they aren't competent, but because reading 200 pages against 218 criteria while maintaining perfect recall is not something humans do reliably.

Cost and time. A thorough manual review of a gateway submission by a qualified professional costs tens of thousands of pounds and takes weeks. For a consultancy delivering across multiple engagements, this represents a significant resource commitment on every programme.

The Core Tension

Manual review provides depth and judgement but not consistency or scale. The reviewer who catches the nuanced gap in one document misses the obvious gap in another because they've been reading for six hours straight. The quality is there, but it's unpredictable.

Generic AI Tools (ChatGPT, Copilot, Gemini)

The appeal is obvious. Upload a document, ask the AI to review it, get an answer in seconds. Teams across the UK infrastructure sector are already doing this. Some are doing it well. Most are doing it without understanding the risks.

"AI said it's fine" is not defensible at a gateway review. When an IPA reviewer asks how you verified that your Risk Management Strategy addresses all relevant criteria, "we asked ChatGPT" is not an answer that protects your reputation. There's no audit trail, no criteria framework, and no evidence citation. The AI produced a confident assessment. Whether that assessment is accurate is unknowable, because there's nothing to verify it against.

Hallucination risk is not hypothetical. Research from Stanford's Institute for Human-Centered AI found that legal AI tools hallucinate between 17% and 33% of the time — confidently citing cases that don't exist, references that can't be found, and conclusions that have no basis in the source material. The same problem applies to document assessment. A generic AI tool will confidently tell you that evidence exists for a criterion when it doesn't. It will cite a document section that says something different from what the AI claims. It will fabricate cross-references.

In a domain where findings are scrutinised by professional reviewers and regulators, a hallucinated finding is worse than no finding at all. It creates false confidence. The team believes the evidence is there. It isn't.

No criteria framework. Generic AI assesses against "general quality." It doesn't know that IPA criterion 4.2.3 requires quantified evidence of benefits realisation with baseline measurements. It doesn't know that ONR Licence Condition 14 requires specific arrangements for safety documentation. It assesses against what it thinks "good" looks like, which is an average of everything it has ever read. That's not how assessment works in regulated environments.

No traceability. When a generic AI tool says "the document adequately addresses risk management," you can't verify that claim. Where in the document? Against which criteria? Based on what evidence? The assessment is an opinion, not a finding. In domains where findings must be auditable, opinions are worthless.

The Real Risk

False Confidence

A programme team used a generic AI tool to pre-assess their gateway submission. The AI returned a broadly positive assessment, identifying minor formatting issues but concluding that all major criteria were addressed. The team proceeded with confidence.

The IPA review identified seven criteria where evidence was either missing or insufficient. Three of those gaps were in areas the AI had rated as "adequately covered." The AI hadn't hallucinated — it had simply assessed against the wrong standard. It didn't know what the IPA was looking for, so it assessed against what "good" looks like in general terms.

A wrong GREEN rating is worse than no rating at all.

Compliance Tracking Tools (Diligent, LogicGate)

These are serious platforms built for governance, risk, and compliance. They track status. They manage workflows. They produce dashboards showing which requirements have been addressed and which haven't.

What they don't do is read documents.

Someone still has to open the document, read it against the criteria, and tick a box. The platform tracks whether the box has been ticked. It doesn't verify whether the box should have been ticked. The assessment quality depends entirely on the person doing the reading — which brings us back to the same manual review problems: inconsistency, fatigue, and implicit criteria.

Compliance tools automate the tracking. They don't automate the assessment. For organisations that need to know whether their documents actually address the criteria — not just whether someone has said they do — tracking tools solve the wrong problem.

Spreadsheet-Based Assessment

The default for most programme teams. A criteria matrix in Excel, with columns for each document and rows for each criterion. A reviewer reads the document and enters a RAG rating for each cell. Sometimes with comments. Sometimes without.

It works at small scale. For a team assessing a single document against 20 criteria, a spreadsheet is adequate. For a programme assessing a suite of documents against 218 criteria with cross-references between them, it collapses. Version control becomes impossible. Consolidating assessments from multiple reviewers requires manual reconciliation. There's no way to verify that a GREEN rating in cell D47 is actually supported by evidence in the document.

Spreadsheets track the reviewer's opinion. They don't track the evidence behind it.

The Fundamental Gap

| Approach | Reads Documents | Uses Criteria | Cites Evidence | Consistent | Scalable | Auditable |
|---|---|---|---|---|---|---|
| Manual expert review | Yes | Implicit | Sometimes | No | No | Partially |
| Generic AI | Yes | No | No | Partially | Yes | No |
| Compliance tools | No | Yes | No | N/A | Yes | Yes (status only) |
| Spreadsheets | No | Partially | Rarely | No | No | No |
| Structured AI assessment | Yes | Yes | Yes | Yes | Yes | Yes |

The gap is clear. Nothing available today reads documents against specific criteria and produces traceable, evidence-based findings at scale. That's the problem structured AI assessment solves.

Chapter 3

What Good Assessment Actually Looks Like

Before discussing how to build better assessment, we need to define what "better" means. Not in abstract terms, but in the specific, measurable terms that matter when your assessment will be scrutinised by an IPA reviewer, an ONR inspector, or a procurement evaluation panel.

The Assessment Quality Hierarchy

Assessment quality exists on a hierarchy. Most current approaches operate at the lower levels. The goal is to operate consistently at the higher ones.

  1. Coverage. Have all criteria been addressed? The most basic question. Does the document mention the topic at all?
  2. Evidence. Is there specific, citable evidence for each criterion? Not just mentioned: demonstrated with traceable text.
  3. Quality. Is the evidence sufficient and convincing? Does it actually satisfy the criterion, or merely gesture at it?
  4. Consistency. Would different assessors reach the same conclusion? Can the assessment be reproduced reliably?
  5. Traceability. Can every finding be verified against source documents? A sceptical reviewer can check every claim.

Most manual review operates consistently at levels 1 and 2. A good reviewer on a good day will reach levels 3 and 4. Level 5 — full traceability where every finding links to specific evidence in source documents — is rare in manual review because the effort required to document every citation is prohibitive at scale.

Structured AI assessment can consistently operate at levels 1 through 4, with human oversight providing the expert judgement required at level 5. The AI reads every document against every criterion, cites specific evidence, scores deterministically, and produces consistent results regardless of time of day or number of documents already reviewed. The human reviews the AI's findings, applies expert judgement where the AI flags uncertainty, and makes the final decisions.

Four Properties of Quality Assessment

Specific

A quality assessment doesn't say "the risk management approach is adequate." It says "IPA criterion 3.4.1 (Risk Management Strategy) is addressed in Section 4.2 of the Programme Risk Management Plan, paragraphs 4.2.3-4.2.7, which describes the risk identification process, risk categorisation approach, and escalation thresholds. However, no quantified risk appetite statement was found against criterion 3.4.2." Specificity is the difference between an opinion and a finding.

Traceable

Every finding must link to evidence in the source documents. "The Benefits Management Strategy addresses benefit measurement" is an opinion. "Section 3.1 of the Benefits Management Strategy (page 12, paragraphs 3-4) describes baseline measurement methods for three of the programme's seven primary benefits" is a finding. A reviewer can turn to page 12 and verify the claim. If they can't verify it, the finding is worthless.

Consistent

The same documents assessed against the same criteria should produce the same results, regardless of who — or what — is doing the assessment. This is the consistency that manual review cannot guarantee and that generic AI does not attempt. When assessment results change because a different person read the document, or because the AI was asked on a different day, the assessment process has failed.

Comprehensive

Every criterion must be assessed. Not most criteria. Not the important ones. Every single criterion. A gateway reviewer will check the ones you skipped. A nuclear regulator will identify the licence condition you assumed was covered. Comprehensive assessment means zero gaps in coverage, even if some criteria are assessed as "not applicable" or "insufficient evidence found."

"The standard isn't 'does the AI do as well as a human?' The standard is 'does the AI plus the human do better than the human alone?' If the answer is yes — and it is — then the question is how to structure that collaboration to get the best from both."

The Human + AI Model

The goal of structured AI assessment is not to replace expert reviewers. It's to make them better.

Consider the division of labour. AI is good at reading every word of every document against every criterion without fatigue. It's good at systematic coverage checking, citation verification, and consistent scoring. Humans are good at interpreting ambiguous evidence, assessing quality of argument, understanding organisational context, and making judgement calls.

In the human + AI model, the AI handles the reading. The expert handles the decisions. Neither replaces the other. Together, they operate at assessment quality levels that neither achieves independently.

This isn't a compromise. It's the intended design. The best possible outcome for document assessment is a structured AI system doing the systematic work, with experienced professionals applying the judgement that only experience can provide.

Part Two

The Method

The principles behind structured AI assessment, how the multi-step agentic pipeline with deterministic guardrails works, and why accuracy matters more than speed.

Chapter 4

The Principles of Structured AI Assessment

The most common mistake in applying AI to document assessment is focusing on the AI. The AI is the mechanism. What matters is the structure.

An unstructured AI reading a document produces an unstructured response. A structured AI reading a document against a defined criteria framework produces a structured assessment. The difference is not in the AI's capability. It's in the methodology wrapped around it.

Eight principles underpin effective structured AI assessment. Each addresses a specific failure mode that occurs when AI is applied without discipline.

Principle 1: Criteria-First, Not Document-First

The natural instinct when reviewing a document is to start reading it and form a view. This is how experienced reviewers work, and it relies on their accumulated knowledge of what "good" looks like.

Structured AI assessment starts from the other direction. Start from what you're looking for, not from what's in front of you. Define the criteria. Define what evidence would satisfy each criterion. Then search the documents for that evidence.

This inversion is fundamental. A document-first approach asks "what does this document say?" A criteria-first approach asks "does this document provide evidence for criterion X?" The first approach is open-ended and subjective. The second is specific and testable.

When an IPA reviewer assesses a gateway submission, they have a criteria framework. They assess the submission against it. Structured AI assessment does the same thing, but systematically and completely.

Principle 2: Sub-Question Decomposition

Complex criteria cannot be assessed as single questions. "Does the business case demonstrate value for money?" is not a question that can be answered in one pass. It must be decomposed into specific, testable sub-questions.

Each criterion is broken into components that can be individually verified. Each sub-question produces a three-way answer (Yes, Partial, or No) with a percentage mapping that feeds directly into the scoring: Yes = 100%, Partial = 70%, No = 0%.

One criterion becomes eight testable questions. The aggregate percentage across all sub-questions produces the coverage score for that criterion: GREEN at 70% or above, AMBER between 40-69%, RED below 40%. This is not a rough guide. It is the actual formula the system applies. The score is arithmetic, not opinion.
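The arithmetic can be sketched in a few lines. This is a minimal illustration rather than the production code: it assumes the aggregate is an unweighted mean of the sub-question percentages, uses the Yes = 100%, Partial = 70%, No = 0% mapping given under Principle 4, and treats anything below 70% (including values between 69% and 70%) as AMBER. The function names and the sample answers are hypothetical.

```python
from typing import Iterable

# Mapping stated in the guide: Yes = 100%, Partial = 70%, No = 0%.
ANSWER_PERCENT = {"yes": 100.0, "partial": 70.0, "no": 0.0}


def coverage_score(answers: Iterable[str]) -> float:
    """Unweighted mean percentage across one criterion's sub-questions (an assumption)."""
    values = [ANSWER_PERCENT[a.lower()] for a in answers]
    if not values:
        raise ValueError("a criterion needs at least one sub-question")
    return sum(values) / len(values)


def rag_rating(score: float) -> str:
    """GREEN at 70% or above, AMBER from 40% up to 70%, RED below 40%."""
    if score >= 70.0:
        return "GREEN"
    if score >= 40.0:
        return "AMBER"
    return "RED"


# Hypothetical criterion with eight sub-questions: 4 Yes, 2 Partial, 2 No.
answers = ["yes", "yes", "yes", "yes", "partial", "partial", "no", "no"]
score = coverage_score(answers)  # (4*100 + 2*70 + 2*0) / 8 = 67.5
print(f"{score:.1f}% -> {rag_rating(score)}")  # prints "67.5% -> AMBER"
```

Because the rating is a pure function of the sub-question answers, anyone holding the answers can recompute it by hand and check the result.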

This is the difference between "the business case adequately demonstrates value for money" and "the business case scores 62% on this criterion (AMBER): quantified costs and benefits are present with NPV calculation, but sensitivity analysis is partial and optimism bias adjustments are absent." The second is useful and verifiable. The first is an opinion.

Principle 3: Evidence Citation

Every finding must link to specific text in source documents. Not "Section 4 discusses risk management." Not "the document addresses this topic." But "Section 4.2, paragraphs 3-5, page 14: 'The programme has established a Risk Management Strategy that defines risk appetite as...'"

This principle exists because of how assessment findings are consumed. A gateway reviewer who sees a finding will want to verify it. An SRO preparing for a board meeting needs to know exactly where the evidence is. A nuclear regulator will check every citation. If the citation is vague, the finding is treated as unverified.

Evidence citation also addresses the hallucination problem directly. When the AI must cite specific text, fabricated findings become immediately identifiable. The citation either points to real text in the document or it doesn't. There's no ambiguity.
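One way to picture that check is a verbatim match between the quoted text and the cited document. The sketch below is an assumption about how such a check could work, not a description of the actual verifier: the `Citation` type, the `corpus` dictionary, and the whitespace normalisation are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    document: str  # hypothetical document identifier
    quote: str     # text the model claims appears in that document


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so line wraps do not break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def citation_is_verifiable(citation: Citation, corpus: dict[str, str]) -> bool:
    """True only if the quote occurs verbatim (after normalisation) in the cited document."""
    source = corpus.get(citation.document)
    if source is None:
        return False  # the cited document is not in the assessed set
    quote = normalise(citation.quote)
    return bool(quote) and quote in normalise(source)


# Hypothetical source text and two candidate citations.
corpus = {
    "risk-plan": "4.2 Risk Appetite\nThe programme has established a Risk Management\n"
                 "Strategy that defines risk appetite."
}
real = Citation("risk-plan", "established a Risk Management Strategy")
fabricated = Citation("risk-plan", "a quantified risk appetite statement is provided")

print(citation_is_verifiable(real, corpus))        # True
print(citation_is_verifiable(fabricated, corpus))  # False
```

A strict substring match like this will reject paraphrased citations as well as fabricated ones, which is the conservative failure direction for an audit trail.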

Principle 4: Deterministic Scoring with Rules-Based Guardrails

Assessment scores should be determined by the evidence found, not by probabilistic guesses. The coverage percentage from sub-questions (Yes = 100%, Partial = 70%, No = 0%) produces a deterministic score. That score is then checked by rules-based guardrails that override the LLM's own rating when the evidence says something different from what the model concluded.

This is the principle that separates structured AI assessment from every other AI tool on the market. The system does not trust the LLM's rating. It calculates its own from sub-question evidence and overrides when they disagree.

Deterministic scoring eliminates the "AI confidence problem." Generic AI tools produce a confidence score that represents how certain the model is about its answer. That's a measure of the AI's internal state, not a measure of the evidence. A deterministic score based on evidence found is objective and verifiable.

The scoring methodology is transparent. RAG ratings (Red, Amber, Green) are the native language of programme assurance, and the thresholds are explicit: GREEN at 70% or above, AMBER from 40% to 69%, RED below 40%.

Then four guardrail rules apply, overriding the LLM's rating when the evidence contradicts it. These guardrails catch the most common failure mode of LLM-based assessment: conservatism. An LLM that rates everything AMBER because it's hedging is corrected when the sub-question evidence clearly supports GREEN. Chapter 5 describes these guardrails in detail.
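The general shape of an override can be sketched as follows. This is not one of the four guardrail rules, which are described in Chapter 5. It only illustrates the idea stated above: the system computes its own rating from sub-question coverage and prefers it when the LLM's rating disagrees. The `Rating` type and `apply_override` function are hypothetical names.

```python
from dataclasses import dataclass


def rag_from_coverage(coverage: float) -> str:
    """Thresholds from the guide: GREEN >= 70%, AMBER 40% up to 70%, RED < 40%."""
    if coverage >= 70.0:
        return "GREEN"
    if coverage >= 40.0:
        return "AMBER"
    return "RED"


@dataclass(frozen=True)
class Rating:
    final: str
    overridden: bool
    reason: str


def apply_override(llm_rating: str, coverage: float) -> Rating:
    """Prefer the evidence-derived rating and record when it replaces the LLM's."""
    computed = rag_from_coverage(coverage)
    proposed = llm_rating.upper()
    if proposed == computed:
        return Rating(computed, False, "LLM rating agrees with sub-question evidence")
    reason = f"LLM proposed {proposed}; coverage of {coverage:.1f}% implies {computed}"
    return Rating(computed, True, reason)


# A hedging model proposes AMBER although coverage supports GREEN.
result = apply_override("amber", 85.0)
print(result.final, result.overridden)  # GREEN True
```

Recording the reason alongside the final rating keeps the override itself auditable.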

Principle 5: Confidence Calibration and Self-Critique

The system must know when it's uncertain. An assessment that's confident about everything is an assessment that can't be trusted.

When the system's confidence in a finding drops below 70%, it triggers an automated self-critique: three targeted questions that probe whether the rating is justified. The self-critique can adjust the rating up or down based on the answers. This isn't a human step. The system questions its own work before presenting it.

Beyond self-critique, the system explicitly flags when:

These are the areas where human review is most valuable. Instead of reading every page of every document, the expert reviewer focuses on the areas where the AI has flagged uncertainty. This is how AI augments expertise rather than replacing it.

Principle 6: Multi-Model Orchestration

Not every task in the assessment pipeline requires the same AI model. The highest-capability model (GPT-4o) handles scoring, summary generation, and adversarial debate — the tasks where nuance matters most. A lighter model (GPT-4.1-mini) handles retrieval quality checks, citation verification, and self-critique — tasks that require accuracy but not the same depth of reasoning. A specialised embedding model (text-embedding-3-small) handles semantic search.

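This isn't cost optimisation. It's quality optimisation. Each model is used where its strengths apply. The scoring model runs at temperature 0.0 to keep its output as repeatable as possible; a temperature of zero minimises sampling variation but does not by itself guarantee identical output on every call, which is one reason the final score is computed arithmetically rather than taken from the model. The retrieval model runs with broader tolerance to catch evidence that uses different terminology. Using the wrong model for a task doesn't just waste money; it degrades assessment quality.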
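A routing table makes the allocation explicit. The sketch below is illustrative only: the task keys and the `ModelChoice` type are hypothetical, and a temperature is recorded only for scoring because that is the only setting stated above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelChoice:
    model: str
    temperature: Optional[float]  # None where no value is stated in the guide


# Task names are hypothetical labels; the model allocation follows the text above.
MODEL_ROUTING = {
    "scoring": ModelChoice("gpt-4o", 0.0),
    "summary_generation": ModelChoice("gpt-4o", None),
    "adversarial_debate": ModelChoice("gpt-4o", None),
    "retrieval_quality_check": ModelChoice("gpt-4.1-mini", None),
    "citation_verification": ModelChoice("gpt-4.1-mini", None),
    "self_critique": ModelChoice("gpt-4.1-mini", None),
    "semantic_search": ModelChoice("text-embedding-3-small", None),
}


def model_for(task: str) -> ModelChoice:
    """Look up the configured model, failing loudly for an unknown task."""
    try:
        return MODEL_ROUTING[task]
    except KeyError:
        raise ValueError(f"no model configured for task {task!r}") from None


print(model_for("scoring"))
```

Keeping the allocation in one table means a model change is a configuration change, and the configuration can be snapshotted with each run.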

Principle 7: Incremental Assessment

Documents change. A programme team updates the Benefits Management Strategy, uploads a new version, and needs to know how the assessment has changed. Running the entire assessment from scratch every time a document changes is wasteful and slow.

Incremental assessment tracks the evidence lineage: which criteria used which chunks of which documents. When a document changes, only the criteria that relied on the changed content are re-assessed. Unaffected criteria carry forward their previous results. This reduces re-assessment time from 30-60 minutes to minutes, depending on the scope of changes.
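Evidence lineage is essentially an index from document chunks to the criteria that relied on them. The sketch below shows one possible shape, assuming chunks are identified by a (document, chunk) pair. The class, identifiers, and sample data are hypothetical.

```python
from collections import defaultdict

ChunkRef = tuple[str, str]  # (document_id, chunk_id), both hypothetical identifiers


class EvidenceLineage:
    """Records which criteria used which chunks, so changes can be traced back."""

    def __init__(self) -> None:
        self._criteria_by_chunk: dict[ChunkRef, set[str]] = defaultdict(set)

    def record(self, criterion_id: str, document_id: str, chunk_id: str) -> None:
        self._criteria_by_chunk[(document_id, chunk_id)].add(criterion_id)

    def affected_by(self, changed_chunks: set[ChunkRef]) -> set[str]:
        """Criteria that relied on at least one changed chunk."""
        affected: set[str] = set()
        for ref in changed_chunks:
            affected |= self._criteria_by_chunk.get(ref, set())
        return affected


lineage = EvidenceLineage()
lineage.record("C-001", "benefits-strategy", "chunk-3")
lineage.record("C-002", "benefits-strategy", "chunk-7")
lineage.record("C-003", "risk-plan", "chunk-1")

all_criteria = {"C-001", "C-002", "C-003"}
to_reassess = lineage.affected_by({("benefits-strategy", "chunk-3")})
carried_forward = all_criteria - to_reassess
print(sorted(to_reassess), sorted(carried_forward))  # ['C-001'] ['C-002', 'C-003']
```

A lookup like this only catches criteria that previously cited the changed content. Newly added text that is relevant to a criterion with no prior evidence would need separate handling, which this sketch does not attempt.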

This principle enables the continuous assessment model described in Chapter 11. Without evidence lineage, continuous assessment would mean running the full pipeline on every document change. With it, assessment updates happen in near-real-time as documents evolve.

Principle 8: Human Oversight

AI augments expert judgement. It never replaces it. This isn't a caveat or a disclaimer. It's a design principle.

The AI produces an assessment. A human reviews that assessment. The human can override any finding, add context the AI couldn't capture, and make final judgements on ambiguous evidence. The AI's output is a starting point for expert review, not a final answer.

This principle protects against the most dangerous failure mode in AI assessment: over-reliance. When organisations treat AI output as a final answer rather than a first pass, they lose the expert judgement that makes assessment valuable. The whole point is to free experts from the mechanical work of reading and finding, so they can focus on the intellectual work of interpreting and deciding.

Chapter 5

The Assessment Pipeline

The principles from Chapter 4 are implemented through a multi-step agentic pipeline with deterministic guardrails. This is not a simple sequence of prompts to an LLM. It's a 10-step orchestration layer that manages crash recovery, incremental re-assessment, and progress streaming, wrapping a per-criterion assessment loop that itself contains up to six agentic steps depending on the assessment tier configured.

I'm going to describe the actual architecture. Not a simplified version. Not a diagram with boxes and arrows that hides the complexity. The real thing, because the real thing is what earns trust.

The Orchestration Layer

Before a single criterion is assessed, the orchestration layer handles the infrastructure that makes assessment reliable at scale. These steps aren't glamorous. They're essential.

  1. Duplicate run prevention. A 10-minute window check ensures that the same assessment can't be triggered twice. On a programme where multiple people have access, this prevents conflicting runs and wasted compute.
  2. Stale run resumption. If a previous run crashed — server restart, timeout, network failure — the system detects it and reclaims the stalled run rather than starting over. Crash recovery, not crash restart.
  3. Create or resume assessment run record. Each run gets an atomic version number and a snapshot of the engine configuration (which criteria framework, which assessment tier, which thresholds). This means the assessment output is fully reproducible: you can look at run #7 and know exactly what configuration produced it.
  4. Initialise progress tracker. The system opens a Server-Sent Events (SSE) stream to the UI, so the user sees each criterion being assessed in real time. Not a spinner. Not "processing." Actual progress: "Assessing criterion 47 of 218: Benefits Management Strategy."
  5. Update project status. The project moves to "processing" state, preventing document uploads or configuration changes mid-assessment.
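The duplicate-run check in step 1 can be sketched as a simple time-window comparison. This is a minimal illustration under assumptions: it presumes the start time of the most recent run for the same assessment is available, and the function name is hypothetical. A real system would also need to make the check-and-create step atomic in its data store.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DUPLICATE_WINDOW = timedelta(minutes=10)  # window length taken from the guide


def is_duplicate_run(last_started: Optional[datetime], now: datetime) -> bool:
    """True if the same assessment was already started within the last 10 minutes."""
    if last_started is None:
        return False  # no earlier run on record
    return now - last_started < DUPLICATE_WINDOW


started = datetime(2026, 4, 1, 9, 0, tzinfo=timezone.utc)
print(is_duplicate_run(started, started + timedelta(minutes=4)))   # True
print(is_duplicate_run(started, started + timedelta(minutes=12)))  # False
```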

After the per-criterion loop completes:

  1. Generate summary. The system produces a Delivery Confidence Assessment (DCA) and executive summary (described in detail below).
  2. Finalise run. The assessment is marked complete, results are saved, status resets, and a consistency check runs to verify that every criterion has a result.
  3. Error handling. If the run fails at any point, it's marked as failed and the project resets to draft state for retry. No half-finished assessments reach the user.

This orchestration exists because assessment at scale breaks without it. A programme submitting 218 criteria against a suite of documents can't afford an assessment that fails silently at criterion 147 and presents incomplete results as if they were complete.
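The consistency check at finalisation amounts to comparing the framework's criteria with the results actually recorded. A minimal sketch, with hypothetical identifiers and status strings:

```python
def missing_results(framework_ids: list[str], results: dict[str, str]) -> list[str]:
    """Criteria in the framework with no recorded result, in framework order."""
    return [cid for cid in framework_ids if cid not in results]


def finalise(framework_ids: list[str], results: dict[str, str]) -> str:
    """Mark the run complete only when every criterion has a result."""
    if missing_results(framework_ids, results):
        # Fail loudly rather than present a partial assessment as complete.
        return "failed"
    return "complete"


# Hypothetical run over 218 criteria in which criterion 147 produced no result.
framework_ids = [f"C-{n:03d}" for n in range(1, 219)]
results = {cid: "GREEN" for cid in framework_ids if cid != "C-147"}

print(missing_results(framework_ids, results))  # ['C-147']
print(finalise(framework_ids, results))         # failed
```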

The Per-Criterion Assessment Loop

The orchestration layer calls the per-criterion loop once for every criterion in the framework. For IPA, that's 218 iterations. Each iteration is self-contained: it can succeed, fail, or time out without affecting other criteria.

Before running the full assessment on each criterion, the system checks whether the work has already been done.

Carry-Forward: Don't Re-Assess What Hasn't Changed

When documents change between assessment runs, not every criterion needs re-assessment. The system tracks evidence lineage — which criteria used which chunks from which documents. If a criterion's source evidence hasn't changed since the last run, its results carry forward unchanged. Only criteria affected by document changes are re-assessed.

This is what makes iterative assessment practical. A programme team updates one document in a suite of twenty. The system re-assesses the 15 criteria that relied on evidence from that document. The other 203 carry forward in seconds.
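A minimal sketch of the carry-forward decision, assuming chunks are identified by an ID and compared by content hash. The hashing approach, the ID format, and both function names are assumptions for illustration; the text only states that lineage is tracked per criterion and per chunk.

```python
import hashlib


def chunk_hash(text: str) -> str:
    """Content fingerprint used to detect edited chunks."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_chunks(previous_hashes: dict[str, str],
                   current_text: dict[str, str]) -> set[str]:
    """IDs of previously seen chunks that were edited or removed."""
    return {
        chunk_id
        for chunk_id, old_hash in previous_hashes.items()
        if chunk_id not in current_text
        or chunk_hash(current_text[chunk_id]) != old_hash
    }


def partition_criteria(evidence_links: dict[str, set[str]],
                       changed: set[str]) -> tuple[list[str], list[str]]:
    """Split criteria into (re-assess, carry forward) by evidence lineage."""
    reassess, carry_forward = [], []
    for criterion, chunk_ids in evidence_links.items():
        (reassess if chunk_ids & changed else carry_forward).append(criterion)
    return reassess, carry_forward


previous = {"d1#0": chunk_hash("alpha"), "d1#1": chunk_hash("beta"),
            "d2#0": chunk_hash("gamma")}
current = {"d1#0": "alpha", "d1#1": "beta, revised", "d2#0": "gamma"}
links = {"C1": {"d1#0"}, "C2": {"d1#1", "d2#0"}, "C3": {"d2#0"}}

reassess, carry_forward = partition_criteria(links, changed_chunks(previous, current))
print(reassess, carry_forward)
```

Only C2 touched the edited chunk, so only C2 is re-assessed. A real system would also need a rule for brand-new chunks, which no existing criterion links to yet.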

For criteria that do need assessment, the pipeline splits into two tiers: Standard and Extended.

Standard Tier: Retrieval + Scoring

Every criterion goes through the Standard Tier. This is the core of the assessment — two agentic steps that find the evidence and produce a rating.

Step 1: Retrieval

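The system doesn't perform a single search. It runs a hybrid retrieval process: semantic and keyword search across the document suite, with each criterion's query expanded into 2-3 variants and the results fused using Reciprocal Rank Fusion.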

The output: 15 ranked evidence chunks, each with a relevance score, source document, and location within that document. The system retrieves generously — 15 chunks rather than 3 or 5 — because missing evidence at this stage can't be recovered later.
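Chapter 6 names Reciprocal Rank Fusion as the method for merging semantic and keyword results. A minimal sketch of that technique follows. The constant `k = 60` is the conventional default from the RRF literature and is an assumption here, as are the function name and the chunk IDs; the cut-off of 15 is the figure from the text.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]],
                           k: int = 60,
                           top_n: int = 15) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs into one scored ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every chunk it returns.
            scores[chunk_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:top_n]


semantic_hits = ["c3", "c1", "c2"]   # hypothetical vector-search ranking
keyword_hits = ["c1", "c4", "c3"]    # hypothetical keyword-search ranking
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print([chunk_id for chunk_id, _ in fused])
```

Chunks that appear high in both lists rise to the top, which is why a chunk found by only one search method can still make the final set but rarely leads it.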

Timeouts: 300 seconds for standard criteria, 480 seconds for extended criteria that require deeper analysis. These are long timeouts because thorough retrieval across a large document suite takes time. This is a deliberate design choice. Speed without thoroughness is speed without value.
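One way to enforce these timeouts while keeping each criterion isolated is to wrap every call so that a timeout or error becomes a recorded outcome rather than an exception. The sketch below uses Python's asyncio for this. The function name, the outcome dictionary, and the stand-in `fake_assess` coroutine are assumptions; the demo uses a tiny timeout so it finishes quickly.

```python
import asyncio

# Timeout values from the text, in seconds.
TIMEOUT_SECONDS = {"standard": 300, "extended": 480}


async def assess_with_timeout(criterion_id, assess, timeout):
    """Run one criterion; report a failure or timeout instead of raising."""
    try:
        result = await asyncio.wait_for(assess(criterion_id), timeout=timeout)
        return {"criterion": criterion_id, "status": "ok", "result": result}
    except asyncio.TimeoutError:
        return {"criterion": criterion_id, "status": "timed_out", "result": None}
    except Exception:
        return {"criterion": criterion_id, "status": "failed", "result": None}


async def _demo():
    async def fake_assess(criterion_id):
        # Stand-in for retrieval and scoring; "slow" overruns the demo timeout.
        await asyncio.sleep(0.2 if criterion_id == "slow" else 0)
        return "GREEN"

    return [await assess_with_timeout(c, fake_assess, timeout=0.05)
            for c in ["fast", "slow"]]


outcomes = asyncio.run(_demo())
print([o["status"] for o in outcomes])
```

In real use the `timeout` argument would come from `TIMEOUT_SECONDS` according to the criterion's tier.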

Step 2: Scoring

GPT-4o processes the retrieved evidence at temperature 0.0 (the most deterministic setting available) and produces structured output.

Temperature 0.0 is critical. At higher temperatures, the model introduces variability: the same evidence might produce slightly different ratings on different runs. At 0.0, the model is pinned to its most consistent output, and any residual variation is caught by the deterministic guardrails described below. Reproducibility is not optional in assessment that will be scrutinised by reviewers and regulators.

THE DETERMINISTIC GUARDRAILS

After the LLM produces its rating, the system doesn't trust it. It calculates its own rating from the sub-question evidence and overrides the LLM when they disagree.

The coverage calculation is arithmetic, not probabilistic. Each sub-question is scored: Yes = 100%, Partial = 70%, No = 0%. The average across all sub-questions produces the coverage percentage. The RAG thresholds are fixed: GREEN ≥70%, AMBER 40-69%, RED <40%.
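The arithmetic is simple enough to show directly. The weights and thresholds below are the ones stated above; the function and variable names are illustrative only.

```python
# Sub-question weights and RAG thresholds as stated in the text.
SCORES = {"Yes": 100, "Partial": 70, "No": 0}


def coverage_percent(answers: list[str]) -> float:
    """Average sub-question score, as a percentage."""
    return sum(SCORES[answer] for answer in answers) / len(answers)


def rag_from_coverage(coverage: float) -> str:
    """Map a coverage percentage onto the fixed RAG thresholds."""
    if coverage >= 70:
        return "GREEN"
    if coverage >= 40:
        return "AMBER"
    return "RED"


answers = ["Yes", "Yes", "Yes", "Partial", "Yes", "Yes"]
coverage = coverage_percent(answers)
print(coverage, rag_from_coverage(coverage))
```

Five Yes answers and one Partial give 570 / 6 = 95%, which sits well inside the GREEN band.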

Then four guardrail rules are checked:

Guardrail A: If the evidence shows GREEN coverage with 20% or fewer negative findings and only AMBER delivery concerns → force the rating to GREEN. The LLM was being conservative. The evidence supports a stronger rating.

Guardrail B: If the evidence shows GREEN coverage but RED delivery concerns → force AMBER. The evidence exists but there are serious questions about how it's being implemented.

Guardrail C: If the evidence shows RED coverage but the LLM rated GREEN on delivery → force AMBER. You can't deliver well what you haven't demonstrated.

Guardrail D: If the retrieval quality was poor but the LLM rated GREEN on delivery → force AMBER. If we couldn't find good evidence, we can't be confident in a strong rating.

These four rules are the most important lines of code in the entire system. They mean a programme team can never receive a GREEN rating that isn't supported by quantified evidence. And they mean an LLM that's hedging toward AMBER when the evidence clearly supports GREEN gets corrected. The guardrails work in both directions. They're not cautious. They're accurate.
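The four rules can be sketched as a single function. This is an illustration under assumptions, not the shipped code: the order in which the rules are checked (D, C, B, A), the reading of "the LLM rated GREEN on delivery" as the LLM's overall rating being GREEN, the `"NONE"` value for no delivery concerns, and the fall-through to the LLM's rating are all choices made here for the example.

```python
def apply_guardrails(llm_rating: str,
                     coverage_rag: str,
                     negative_share: float,
                     delivery_concern: str,
                     retrieval_quality: str) -> str:
    """Return the final rating after the four guardrail checks.

    delivery_concern is "NONE", "AMBER" or "RED" (hypothetical encoding).
    negative_share is the fraction of sub-questions with negative findings.
    """
    # Guardrail D: poor retrieval cannot support a GREEN.
    if retrieval_quality == "poor" and llm_rating == "GREEN":
        return "AMBER"
    # Guardrail C: RED coverage cannot support a GREEN.
    if coverage_rag == "RED" and llm_rating == "GREEN":
        return "AMBER"
    # Guardrail B: GREEN coverage with RED delivery concerns.
    if coverage_rag == "GREEN" and delivery_concern == "RED":
        return "AMBER"
    # Guardrail A: strong coverage, few negatives, at most AMBER concerns.
    if (coverage_rag == "GREEN" and negative_share <= 0.20
            and delivery_concern in ("NONE", "AMBER")):
        return "GREEN"
    # No rule applied: keep the LLM's rating (assumption for this sketch).
    return llm_rating


# A conservative AMBER from the LLM, corrected upward by Guardrail A.
print(apply_guardrails("AMBER", "GREEN", 1 / 6, "NONE", "good"))
```

Checking the downgrading rules first is the cautious ordering: if both Guardrail A and Guardrail D could apply, the weaker retrieval wins.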

Worked Example: Guardrail A in Action

Criterion: "The Benefits Management Strategy identifies individual benefits with baseline measurements."

Sub-questions and evidence:

  1. Does a Benefits Management Strategy exist? — Yes (100%)
  2. Are individual benefits identified? — Yes (100%)
  3. Do benefits have assigned owners? — Yes (100%)
  4. Are baseline measurements stated? — Partial (70%) — baselines stated for 5 of 7 benefits
  5. Is a benefits realisation timeline provided? — Yes (100%)
  6. Are measurement methods described? — Yes (100%)

Coverage calculation: (100 + 100 + 100 + 70 + 100 + 100) / 6 = 95% → GREEN

LLM rated: AMBER (noting the two missing baselines)

Guardrail A fires: GREEN coverage (95%), one partial finding out of six (about 17%, within the 20% limit even if counted as negative), no delivery concerns. Rating overridden to GREEN.

The LLM was being conservative about two missing baselines. The evidence overwhelmingly supports GREEN. A human reviewer looking at 5 of 7 baselines complete would rate this GREEN too.

Extended Tier: Four Additional Steps

The Extended Tier adds four steps to the assessment loop. These steps exist because Standard Tier assessment, while accurate, has known failure modes. The Extended Tier addresses each one.

Step 3: CRAG (Corrective Retrieval-Augmented Generation)

After scoring, the system classifies the retrieval quality for that criterion as good, ambiguous, or poor. If retrieval was ambiguous or poor — meaning the evidence found was weak, off-topic, or sparse — the system doesn't accept the result. It tries again.

CRAG decomposes the original criterion into sub-queries (specific aspects of the criterion that might surface evidence missed by the first retrieval) and runs up to two additional retrieval rounds. If new evidence is found, the criterion is re-scored with the expanded evidence set.

This matters because documents use inconsistent terminology. A first retrieval pass might miss evidence for "risk appetite" because the document calls it "risk tolerance framework." CRAG generates a sub-query for "risk tolerance framework" and finds the evidence.
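The corrective loop can be sketched with the model calls abstracted away. In this minimal version the `decompose`, `retrieve` and `grade` callables are hypothetical stand-ins for the sub-query generator, the retriever, and the retrieval-quality classifier. Only the three quality labels and the two-round limit come from the text.

```python
from typing import Callable


def corrective_retrieval(
    criterion: str,
    initial_evidence: list[str],
    decompose: Callable[[str, int], list[str]],
    retrieve: Callable[[str], list[str]],
    grade: Callable[[list[str]], str],
    max_rounds: int = 2,
) -> tuple[list[str], int]:
    """Retry retrieval with sub-queries until quality is good or rounds run out."""
    evidence = list(initial_evidence)  # never mutate the caller's list
    rounds = 0
    while rounds < max_rounds and grade(evidence) != "good":
        rounds += 1
        for sub_query in decompose(criterion, rounds):
            for chunk in retrieve(sub_query):
                if chunk not in evidence:
                    evidence.append(chunk)
    return evidence, rounds


# Toy index in which the evidence is filed under different terminology.
index = {"risk tolerance framework":
         ["Section 6.2 sets out the risk tolerance framework."]}

evidence, rounds = corrective_retrieval(
    "risk appetite",
    [],
    decompose=lambda criterion, round_no: ["risk tolerance framework"],
    retrieve=lambda query: index.get(query, []),
    grade=lambda found: "good" if found else "poor",
)
print(rounds, evidence)
```

If new evidence is found, the criterion would then be re-scored with the expanded set, as described above.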

Step 4: Citation Verification

This is the step that addresses the 17-33% hallucination rate found in other AI tools. Every evidence quote cited in the scoring output is checked against the source chunks. Does the quoted text appear verbatim in the source document? If not, it's flagged or removed.

This step uses GPT-4.1-mini rather than GPT-4o — a deliberate choice. Citation verification is a precision task (does this text match?) not a reasoning task (what does this evidence mean?). The lighter model handles it accurately at lower cost and latency.
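The check itself is easy to picture. The sketch below swaps the model for plain string matching, purely to illustrate what "appears verbatim" means; it is not how the pipeline is described as working. Collapsing whitespace before comparing is an assumption, made because text extraction often changes line breaks.

```python
import re


def normalise(text: str) -> str:
    """Collapse runs of whitespace so line breaks don't defeat a match."""
    return re.sub(r"\s+", " ", text).strip()


def verify_citations(quotes: list[str], source_chunks: list[str]) -> dict[str, bool]:
    """Map each quote to True if it appears verbatim in any source chunk."""
    sources = [normalise(chunk) for chunk in source_chunks]
    verdicts = {}
    for quote in quotes:
        needle = normalise(quote)
        # An empty quote is never treated as verified.
        verdicts[quote] = bool(needle) and any(needle in s for s in sources)
    return verdicts


chunks = ["Baselines are stated for five of the\nseven benefits."]
quotes = ["stated for five of the seven benefits",
          "all benefits have baselines"]
verdicts = verify_citations(quotes, chunks)
print(verdicts)
```

Quotes that come back False are the ones that would be flagged or removed before the output is produced.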

Why This Step Is Non-Negotiable

An IPA reviewer can verify any citation in your assessment within minutes. An ONR inspector will verify every citation. A procurement evaluator will spot-check citations when scores are close. If a single citation points to text that doesn't support the finding, the credibility of the entire assessment is compromised. Citation verification isn't a nice-to-have. It's the foundation of trust.

Step 5: Self-Critique

If the system's confidence in its rating is below 70%, it triggers a self-critique cycle. Three targeted questions probe whether the rating is justified.

Based on the answers, the self-critique can adjust the rating up or down. This catches the cases where the LLM scored conservatively because the evidence was phrased differently from what it expected, or where the LLM scored generously because it over-interpreted weak evidence.

The self-critique is a system questioning its own work. That sentence is worth pausing on. Most AI tools present their first answer as the final answer. This system treats its first answer as a hypothesis to be tested.

Step 6: Adversarial Debate

This step fires only for AMBER boundary cases — criteria rated AMBER where the evidence is close to a GREEN or RED threshold. The system doesn't debate criteria that are clearly GREEN or clearly RED. It focuses its most expensive analytical tool on the cases that matter most.

The debate is multi-persona and adversarial. For IPA assessments, three virtual IPA specialist personas argue the case. For other frameworks, two general assessment personas take opposing positions. One argues the evidence supports a higher rating. The other argues it doesn't. The exchange is stored as a signal alongside the criterion result, giving the human reviewer direct visibility into the contested reasoning.

This is not theatre. Boundary AMBER cases are where assessments fail in practice. A programme team rated AMBER when the evidence supported GREEN loses confidence in the tool. A programme team rated GREEN when the evidence was genuinely AMBER loses confidence in their readiness. The debate step forces the system to stress-test the cases where the difference matters most.
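The gating for Steps 5 and 6 can be sketched as two small predicates. The 70% confidence trigger is from the text. The definition of "close to a threshold" is not given, so the 5-point margin below is purely an assumption for illustration, as are the function names.

```python
SELF_CRITIQUE_CONFIDENCE = 0.70      # trigger level stated in the text
GREEN_THRESHOLD, RED_THRESHOLD = 70.0, 40.0


def needs_self_critique(confidence: float) -> bool:
    """Self-critique fires when confidence is below 70%."""
    return confidence < SELF_CRITIQUE_CONFIDENCE


def is_boundary_amber(rating: str, coverage: float, margin: float = 5.0) -> bool:
    """True for AMBER ratings whose coverage sits near a RAG threshold.

    margin is a hypothetical value; the guide does not specify one.
    """
    if rating != "AMBER":
        return False
    distance = min(abs(coverage - GREEN_THRESHOLD), abs(coverage - RED_THRESHOLD))
    return distance <= margin


for rating, coverage in [("AMBER", 68.0), ("AMBER", 55.0), ("GREEN", 71.0)]:
    print(rating, coverage, is_boundary_amber(rating, coverage))
```

Restricting the debate to this narrow band is what keeps the most expensive step affordable across a 218-criterion run.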

Summary and DCA Generation

After all criteria are assessed, the system generates the summary that stakeholders will actually read. This isn't a simple aggregation of RAG ratings.

Dimension grouping. Criteria are grouped by dimension — for IPA, the five cases (Strategic, Economic, Commercial, Financial, Management). For other frameworks, the dimensions are defined by the module configuration. Each dimension gets its own summary narrative, generated from the distribution of RAG ratings within that dimension.

Showstopper detection. The system checks for blocker criteria — criteria that, if rated RED, should prevent an overall GREEN or AMBER assessment regardless of other results. A RED rating on a safety-critical criterion in nuclear assessment, for example, overrides an otherwise positive picture.
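Dimension grouping and showstopper detection can be sketched together. The tuple layout, the criterion IDs, and the function names are hypothetical. The override rule follows the text: a RED on a blocker criterion rules out an overall GREEN or AMBER.

```python
from collections import Counter, defaultdict


def group_by_dimension(results):
    """Count RAG ratings per dimension.

    results is an iterable of (criterion_id, dimension, rating) tuples.
    """
    distribution = defaultdict(Counter)
    for _criterion_id, dimension, rating in results:
        distribution[dimension][rating] += 1
    return {dimension: dict(counts) for dimension, counts in distribution.items()}


def find_showstoppers(results, blocker_ids):
    """Blocker criteria that were rated RED."""
    return [cid for cid, _dimension, rating in results
            if cid in blocker_ids and rating == "RED"]


def apply_showstoppers(preliminary_rating, showstoppers):
    """Any RED blocker forces the overall rating to RED."""
    return "RED" if showstoppers else preliminary_rating


results = [("S1", "Strategic", "GREEN"),
           ("S2", "Strategic", "AMBER"),
           ("M1", "Management", "RED")]
showstoppers = find_showstoppers(results, blocker_ids={"M1"})
print(group_by_dimension(results))
print(apply_showstoppers("AMBER", showstoppers))
```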

Delivery Confidence Assessment. The DCA combines a rules-based preliminary calculation (from the distribution of RAG ratings across dimensions) with GPT-4o's holistic professional judgement. The rules provide the floor. The LLM provides the nuance. Neither alone is sufficient.

Executive summary. The final output includes a verdict headline, a narrative explaining the assessment outcome, key strengths identified, critical issues requiring attention, and a specific recommendation. The format is board-ready by design: one page that tells the SRO whether to worry, supported by the full evidence trail for anyone who wants to verify.

Incremental Assessment: Evidence Lineage

The system maintains an evidence_links table that tracks which criteria used which document chunks. This evidence lineage enables two capabilities that fundamentally change how assessment works in practice:

Targeted re-assessment. When a document changes, the system identifies which criteria relied on evidence from the changed sections. Only those criteria are re-assessed. Everything else carries forward. A 218-criterion assessment that took 45 minutes on first run might take 8 minutes on a re-run after one document changes.

Impact visibility. Before re-running the assessment, the system can show which criteria will be affected by a document change. The programme team knows what to expect before they trigger the re-assessment. No surprises.
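An impact preview of this kind reduces to one query against the lineage table. The sketch below uses an in-memory SQLite database. The table name `evidence_links` is from the text, but the column names, the sample rows, and the `affected_criteria` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (run, criterion, chunk) evidence link.
conn.execute(
    "CREATE TABLE evidence_links ("
    "run_id INTEGER, criterion_id TEXT, document_id TEXT, chunk_id TEXT)"
)
conn.executemany(
    "INSERT INTO evidence_links VALUES (?, ?, ?, ?)",
    [
        (7, "C047", "benefits_strategy.pdf", "b-12"),
        (7, "C047", "risk_register.pdf", "r-03"),
        (7, "C012", "risk_register.pdf", "r-09"),
        (7, "C101", "programme_plan.pdf", "p-21"),
    ],
)


def affected_criteria(conn, run_id, document_id):
    """Criteria from a given run that cited evidence in the changed document."""
    rows = conn.execute(
        "SELECT DISTINCT criterion_id FROM evidence_links "
        "WHERE run_id = ? AND document_id = ? ORDER BY criterion_id",
        (run_id, document_id),
    ).fetchall()
    return [row[0] for row in rows]


print(affected_criteria(conn, 7, "risk_register.pdf"))
```

Narrowing the filter from document to changed chunk IDs would give the section-level precision described above.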

This is what makes the write-assess-improve cycle from Chapter 8 practical. Without incremental assessment, each iteration of a document would require a full assessment run. With it, the feedback loop is fast enough to be part of the drafting process.

Three LLM Tiers

The pipeline uses three models, each chosen for the task it performs best:

| Model | Tasks | Why This Model |
| --- | --- | --- |
| GPT-4o | Scoring, summary generation, adversarial debate | Maximum reasoning capability for the highest-stakes decisions |
| GPT-4.1-mini | CRAG, citation verification, self-critique | Accurate for precision tasks without the cost or latency of the full model |
| text-embedding-3-small | Vector embeddings for semantic search | Purpose-built for embedding quality, not generation |

This isn't cost optimisation dressed up as architecture. Using GPT-4o for citation verification would be slower and no more accurate. Using GPT-4.1-mini for adversarial debate would produce shallower analysis. The right model for the right task produces better assessment, not just cheaper assessment.

Chapter 6

Accuracy Over Speed

The marketing claim for AI assessment is speed. "Review documents in seconds." "Instant compliance checks." "Upload and get results immediately."

The buying decision, however, is not about speed. It's about trust.

Speed is the headline. Accuracy is the line item that matters when you're deciding whether to brief your SRO based on the output. When you're deciding whether to include the assessment in a regulatory submission. When you're deciding whether your professional reputation stands behind the findings.

The Counterintuitive Argument

A thorough structured AI assessment of a major document suite takes 30-60 minutes. Not 30 seconds. Not instant.

This is a feature, not a limitation.

In those 30-60 minutes, the system is running hybrid semantic and keyword retrieval across multiple documents, expanding queries into 2-3 variants per criterion, fusing results with Reciprocal Rank Fusion, scoring every criterion with GPT-4o at temperature 0.0, applying deterministic guardrails that override the LLM when the evidence disagrees, verifying every citation against source text, triggering self-critique on low-confidence findings, running adversarial debate on boundary AMBER cases, and generating a board-ready report with full evidence trails. It's doing what would take a team of reviewers weeks to do manually.

A 30-second assessment that confidently tells you "everything looks fine" should make you nervous, not reassured. If the system didn't have time to search properly, it didn't search properly. Speed without thoroughness is speed without value.

"The organisations I work with don't need faster assessment. They need assessment they can trust. Speed matters — hours instead of weeks is transformative. But seconds instead of hours? That's a warning sign, not a selling point."

What Accuracy Means in Document Assessment

Accuracy in AI assessment has five dimensions. Each matters independently, and each requires different engineering to achieve.

Precision: Findings Are Correct

When the system says evidence was found, evidence was actually found. When it says a gap exists, a gap actually exists. False positives (claiming evidence exists when it doesn't) create dangerous false confidence. False negatives (claiming gaps exist when evidence is present) waste reviewer time and erode trust in the system.

Precision is achieved through three layered mechanisms. The deterministic guardrails override the LLM's rating when the sub-question evidence says something different: the system doesn't trust the model's judgement, it calculates its own. Citation verification checks that every quoted evidence passage actually appears verbatim in the source document, catching the fabricated quotes behind the 17-33% hallucination rate found in other tools. And self-critique triggers on low-confidence findings, questioning whether the evidence really supports the rating before it reaches the output.

Recall: Gaps Are Genuinely Gaps

When the system says "no evidence found," that should mean the evidence genuinely isn't in the documents — not that the system failed to find it. Recall problems occur when evidence exists but uses different terminology, is in an unexpected location, or is spread across multiple sections.

Recall is addressed through hybrid retrieval (semantic + keyword search with Reciprocal Rank Fusion), query expansion into 2-3 variants per criterion, and CRAG. If the initial retrieval was poor — weak, sparse, or off-topic results — CRAG decomposes the criterion into sub-queries and tries again, up to two additional rounds. The system doesn't accept "no evidence found" on the first attempt. It actively works to find evidence before concluding it isn't there.

Citation Accuracy: References Point to Real Evidence

Every citation in the assessment output must point to real text in the source documents that says what the assessment claims it says. This is the most directly verifiable dimension of accuracy, because any reviewer can check any citation.

Citation accuracy of less than 100% is not acceptable. One fabricated citation destroys confidence in all the others. The citation verification step checks every quoted passage against the source chunks verbatim. Quotes that can't be verified are flagged or removed before they reach the output. This is why the system uses a dedicated verification step with GPT-4.1-mini rather than relying on the scoring model to get its citations right first time.

Scoring Consistency: Same Documents, Same Results

Running the same assessment on the same documents should produce the same scores. Variability in scoring undermines the entire proposition. If the score changes depending on when the assessment was run, the assessment process is unreliable.

Consistency is achieved through two mechanisms working together. GPT-4o runs at temperature 0.0, minimising the randomness that other tools introduce through higher temperature settings. Then the deterministic guardrails apply fixed arithmetic rules to the sub-question evidence: same evidence percentages always produce the same RAG threshold, regardless of any residual model variation. The scoring is doubly deterministic: the model is pinned to its most consistent output, and the rules override it when consistency requires it.

Boundary Accuracy: AMBER Cases Get Extra Scrutiny

The most consequential accuracy failures happen at rating boundaries. A criterion that's borderline GREEN/AMBER will define whether the programme team relaxes or mobilises. Generic AI tools handle boundary cases the same as clear-cut cases. This system deploys adversarial debate specifically on AMBER boundary cases — multiple specialist personas arguing whether the evidence supports a higher or lower rating. The debate doesn't change the score directly. It provides the human reviewer with the contested reasoning, so they can make the final call with full visibility into why the boundary case is a boundary case.

The Cost of False Confidence

The most dangerous outcome in document assessment is not a missed gap. It's a false GREEN.

A missed gap leaves the programme unaware. But a false GREEN actively misleads. The team believes the criterion is addressed. They brief the SRO that the submission is ready. The SRO tells the programme board. And then the gateway reviewer finds a gap that the team was specifically told didn't exist.

False confidence does more damage than no assessment at all. Without assessment, the team knows they're uncertain. With a false GREEN, the team is certain they're ready — and wrong.

This is why accuracy takes priority over speed, and why every step in the pipeline exists to verify findings before they reach the output. The processing time is the cost of reliability. And reliability is the only thing that matters when your reputation is on the line.

The Trade-Off Made Explicit

Structured AI assessment trades processing time and computational cost for accuracy. Here's what that means in practice:

| What You Get | What It Costs |
| --- | --- |
| Every criterion assessed individually | 30-60 minutes per assessment run |
| Every citation verified against source | Higher computational cost per assessment |
| Confidence calibration on every finding | More complex output requiring review |
| Deterministic, reproducible scores | No "instant" results |
| Board-ready PDF with full evidence trail | Not a chatbot-style Q&A experience |

Every organisation I've worked with has made this trade-off willingly. Because the alternative — fast but unreliable assessment that can't be trusted — isn't actually useful. It's just fast.

Part Three

Application

How structured AI assessment applies across domains, on both sides of the assessment table, and why configurability matters.

Chapter 7

Use Cases Across Domains

The principles and pipeline described in Part Two are domain-agnostic. The value is domain-specific. Each application area has its own criteria frameworks, its own document conventions, its own stakeholder expectations, and its own consequences for getting it wrong.

Gateway Review Preparation

The IPA gateway review process is the primary assurance mechanism for major UK government programmes. Reviews are conducted at key decision points — from Strategic Outline Case through to Operations Review and Benefits Realisation. Each review assesses the programme against a structured criteria framework, and the outcome is a confidence rating that directly affects whether the programme proceeds.

The ratings are blunt. GREEN: successful delivery appears likely. AMBER: successful delivery appears feasible but significant issues require management attention. AMBER/RED: successful delivery is in doubt, with major risks or issues apparent. RED: successful delivery appears to be unachievable.

For the programme director, a RED rating is a career event. The programme is paused. Additional oversight is imposed. The SRO's judgement is questioned. The narrative shifts from "on track" to "in trouble," and recovering from that narrative takes months.

How structured AI assessment transforms preparation:

Instead of the PMO spending weeks reading documents and compiling a readiness assessment, the document suite is assessed against the IPA's criteria framework in hours. The output identifies exactly which criteria are covered, where the evidence is, and where the gaps are. The PMO can then focus effort on closing gaps rather than searching for them.

The critical value is visibility. Before the reviewer arrives, the programme team knows what the reviewer will find. No surprises in the room.

The Value Proposition

A gateway review costs the programme time, money, and attention regardless of the outcome. The difference between an AMBER and a RED — between "manageable issues" and "programme paused" — is often a handful of gaps that were fixable if visible. Structured AI assessment makes them visible.

Tender Evaluation

Bid assessment has a consistency problem. When five evaluators assess four submissions against 42 criteria, you get 840 individual assessments. The quality of each depends on the evaluator's interpretation of the criterion, their reading of the submission, and their scoring approach. Moderation catches the most obvious inconsistencies, but systematic consistency across 840 assessments is beyond what moderation alone can deliver.

Structured AI assessment addresses this directly. Every submission is evaluated against exactly the same criteria, decomposed into the same sub-questions, with evidence cited from the same locations. The consistency isn't approximate. It's structural.

Cross-bidder comparison becomes evidence-based rather than score-based. Instead of "Bidder A scored 72% and Bidder B scored 68%," the output shows "Bidder A provided quantified evidence for risk management approach (Section 4.3, paragraphs 2-4), while Bidder B's risk section referenced a risk register without describing the management approach." Evaluators can see exactly where the differences lie, not just that they exist.

For procurement teams facing challenge risk, this traceability is invaluable. Every score can be traced back to specific evidence in the submissions. A losing bidder can't challenge consistency when the assessment framework is demonstrably identical across all submissions.

Nuclear Regulatory Compliance

Nuclear assessment operates in a different regulatory environment from other infrastructure. The stakes are non-negotiable. ONR licence conditions are not guidance — they are legal requirements. The Safety Assessment Principles (SAPs) provide the framework against which safety cases are judged. Getting this wrong isn't a programme delay. It's a regulatory hold that can cost years.

There are 36 licence conditions, each with specific requirements for documentation, arrangements, and evidence. The Licence Condition handbook (LC handbook) details what the regulator expects to see. Safety Assessment Principles number in the hundreds, covering fault analysis, radiation protection, and safety management systems.

Traceability is non-negotiable. An ONR inspector will follow every citation. They will check that the safety case document says what the assessment claims it says. They will verify that cross-references between documents are accurate. Any broken link in the evidence chain creates a regulatory query, and regulatory queries consume months.

Structured AI assessment provides the level of traceability that nuclear demands. Every finding cites specific text. Every cross-reference is verified. Every gap is documented with sufficient detail for the licensee to take remedial action. The assessment doesn't replace the regulatory process. It prepares the licensee to meet it.

Business Case Assessment (Green Book)

HM Treasury's Green Book provides the framework for appraisal and evaluation of policies, projects, and programmes. Business cases follow the Five Case Model: Strategic, Economic, Commercial, Financial, and Management cases. At each stage — Strategic Outline Case, Outline Business Case, Full Business Case — the documentation requirements increase in depth and specificity.

The common gaps in business cases are predictable. Quantification of benefits that remains qualitative. Optimism bias adjustments that reference the guidance but don't show the calculation. Sensitivity analysis that tests one variable but not the ones that matter. Risk assessment in the Financial Case that doesn't link to the risk assessment in the Management Case.

These gaps exist not because teams don't know about them, but because the volume of requirements makes it easy to lose track. A Full Business Case assessment against Green Book criteria involves 98 individual checkpoints. Missing one is easy. Missing one that the Treasury or IPA reviewer catches is expensive.

Structured AI assessment catches the gaps that humans miss through volume fatigue. It doesn't judge whether the sensitivity analysis tests the right variables — that's expert judgement. But it ensures that a sensitivity analysis exists, that it references the economic model, and that it addresses assumptions identified as uncertain elsewhere in the business case.

Construction Compliance (CDM)

The Construction (Design and Management) Regulations 2015 impose documentation requirements on duty holders: the client, principal designer, and principal contractor. Documentation must demonstrate that duty holder responsibilities are being discharged. The Health and Safety File must be maintained throughout the project.

CDM compliance assessment is often done at the end of a phase, when someone remembers it needs to be done. Structured AI assessment enables continuous monitoring — assessing CDM documentation against the 48 criteria on an ongoing basis rather than as a retrospective exercise. Gaps are identified when they can be closed, not when the HSE inspector arrives.

The specific value is in completeness checking. CDM documentation requirements span multiple duty holders and multiple project phases. Ensuring that every requirement is addressed by someone, at some point, in some document, is a tracking challenge that structured assessment handles systematically.

Contract Compliance (NEC)

NEC4 contracts rely on rigorous procedural compliance. Early warnings, compensation events, programme submissions, and payment applications all have contractual timescales and documentation requirements. A missed early warning that should have been raised under clause 15.1 becomes a compensation event dispute. A programme submission that doesn't meet clause 31.2 requirements can be rejected.

NEC compliance assessment reviews contract administration documentation against the contractual requirements. Are early warnings being raised within the required timeframes? Do programme narratives meet the specified level of detail? Are compensation event quotations supported by the required substantiation?

For programme teams managing dozens of compensation events across multiple contracts, systematic assessment ensures that procedural compliance doesn't slip through the volume.

Chapter 8

Both Sides of the Table

This is the chapter that no other publication in this field addresses. Every discussion of AI-powered assessment focuses on one side: the assessment of documents after they're written. But the same methodology transforms how documents are prepared in the first place.

Anyone who has prepared a major document submission and received a formal assessment knows the experience from both sides. The frustration of writing a document you believed was thorough, only to discover it missed criteria you didn't know were there. The frustration of assessing a document that clearly represents months of work but doesn't address the criteria it was supposed to address.

Structured AI assessment changes both experiences.

Preparing Document Submissions

The traditional approach to document preparation is write, review internally, submit, and hope. The internal review catches what it catches, limited by the same inconsistency and coverage gaps described in Chapter 2. The team submits with varying degrees of confidence, ranging from "we think this is good" to "we've run out of time."

Assessment-informed preparation inverts this. Knowing the assessment criteria changes how you write documents from the very beginning.

The write-assess-improve cycle:

  1. Draft the document against the criteria framework (not in a vacuum)
  2. Run AI assessment against the relevant criteria at the end of each major section
  3. Identify gaps early while the drafting team still has context and momentum
  4. Improve and re-assess until coverage meets the required threshold
  5. Submit with confidence because you've already been assessed against the same criteria the reviewer will use

This cycle changes the economics of document preparation. Instead of discovering gaps at the end — when fixing them means reopening completed sections, re-engaging subject matter experts, and delaying submission — gaps are discovered during drafting when they're easiest and cheapest to address.

Making the assessor's job easier. Documents prepared with assessment criteria in mind are structured differently. Evidence is clearly signposted. Cross-references are precise. Section headings map to criteria areas. The document doesn't just contain the right information — it presents it in a way that an assessor can verify.

"The best document submissions I've seen were written by people who understood exactly what the reviewer was looking for. Not because they were gaming the system, but because they structured their evidence to be verifiable. That's not manipulation. That's good practice."

Receiving and Reviewing Submissions

The other side of the table has different challenges. A reviewing body — whether an IPA review team, a procurement evaluation panel, or an internal assurance function — needs to assess submissions thoroughly, consistently, and within a timeframe that often feels inadequate.

First-pass filtering. Structured AI assessment serves as a first-pass filter before expert review. The AI assessment identifies which documents need deep expert attention and which are clearly adequate. Instead of reading everything at the same depth, the expert reviewer can focus time where it adds the most value: on the ambiguous areas, the uncertain findings, and the criteria where AI confidence is low.

This doesn't reduce the rigour of the review. It redirects the rigour to where it matters most. An expert spending three hours on areas of genuine uncertainty produces better assessment than the same expert spending three hours reading documents that are clearly adequate.

Quality assurance of your own review. After completing a manual review, running the structured AI assessment as a cross-check ensures coverage. Did the manual review address all 218 criteria? Did it check the cross-references the AI identified as broken? Were there evidence gaps the manual reviewer missed? The AI assessment becomes a quality assurance mechanism for the review itself.

Pre-moderation. In tender evaluation, where moderation of scores across evaluators is critical, AI assessment provides a baseline. Evaluators can see the AI's evidence-based assessment alongside their own. Where scores differ significantly, the evidence trail makes it clear whether the difference is in interpretation (legitimate) or in coverage (a gap to address).
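
A minimal sketch of how such a baseline comparison might work, assuming a 0 to 5 scoring scale and a two-point divergence threshold. Both values, and every name in the code, are illustrative assumptions.

```python
def moderation_flags(ai_scores, evaluator_scores, threshold=2):
    """List (criterion, evaluator, evaluator_score, ai_score) where scores diverge.

    The 0-5 scale and the threshold of 2 points are illustrative assumptions.
    """
    flags = []
    for evaluator, scores in sorted(evaluator_scores.items()):
        for criterion, score in sorted(scores.items()):
            ai = ai_scores.get(criterion)
            if ai is not None and abs(score - ai) >= threshold:
                flags.append((criterion, evaluator, score, ai))
    return flags


ai_baseline = {"Q1": 4, "Q2": 2, "Q3": 3}
evaluators = {
    "evaluator_a": {"Q1": 4, "Q2": 4, "Q3": 3},
    "evaluator_b": {"Q1": 3, "Q2": 2, "Q3": 5},
}
flags = moderation_flags(ai_baseline, evaluators)
for criterion, evaluator, score, ai in flags:
    print(f"{criterion}: {evaluator} scored {score}, AI baseline {ai}")
```

Each flagged pair is an agenda item for the moderation meeting. Whether the difference is interpretation or coverage is still settled by reading the evidence trail.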

The Feedback Loop

When organisations use structured AI assessment on both sides of the table — to prepare submissions and to review them — a feedback loop emerges that improves document quality over time.

Organisations that run AI assessment before submission learn, programme by programme, what "good" looks like for each criteria area. They build institutional knowledge about common gaps, effective evidence structures, and optimal document formats. This knowledge compounds. Each submission is better than the last, not because the people change, but because the methodology captures and applies what's been learned.

This is the real competitive advantage. Not the speed of a single assessment, but the systematic improvement of document quality across an organisation. The firms and programmes that adopt structured AI assessment earliest will build this institutional knowledge first. The gap between those who use it and those who don't will widen with every assessment cycle.

Chapter 9

The Configurable Engine Approach

Every domain described in Chapter 7 has its own criteria framework. IPA gateways use one set. Green Book business cases use another. ONR has its own. CDM, NEC, DCO — each has its own assessment standard. And beyond published frameworks, every consultancy has its own methodology, its own quality criteria, its own proprietary approach.

A one-size-fits-all assessment tool is a tool that fits none of these properly.

Why Configurability Matters

The value of structured AI assessment comes from assessing against specific criteria. The criteria are what make the assessment meaningful. An assessment against generic quality standards tells you very little. An assessment against the 218 IPA gateway criteria tells you exactly where you stand.

But the criteria change. IPA updates its guidance. ONR revises the licence condition handbook. HM Treasury updates the Green Book supplementary guidance. A consultancy evolves its methodology. Any assessment platform that hardcodes criteria will be out of date within a year.

The configurable engine approach separates the assessment capability (the pipeline) from the assessment framework (the criteria). The platform provides the ingestion, decomposition, search, verification, scoring, and reporting capabilities. The criteria are loaded as configurations. Change the criteria, and the assessment changes. No code. No development cycle. No waiting for a vendor to update.
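
The separation can be sketched in a few lines. The example below assumes a hypothetical JSON layout for a criteria framework and uses a deliberately naive keyword check in place of the real pipeline. None of the names or the schema come from Programme Insights; they exist only to show that changing the configuration changes the assessment while the code stays the same.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    id: str
    question: str
    keywords: tuple


def load_framework(config_text):
    """Parse a criteria framework from JSON. The schema is an assumption."""
    data = json.loads(config_text)
    return [
        Criterion(c["id"], c["question"], tuple(k.lower() for k in c["keywords"]))
        for c in data["criteria"]
    ]


def assess(documents, criteria):
    """Placeholder pipeline: a keyword check standing in for retrieval and scoring."""
    corpus = " ".join(documents.values()).lower()
    return {c.id: any(k in corpus for k in c.keywords) for c in criteria}


FRAMEWORK_V1 = """{"criteria": [
  {"id": "FIN-01", "question": "Is an NPV calculation included?",
   "keywords": ["net present value"]}
]}"""

# A revised framework adds a criterion. Only the configuration changes.
FRAMEWORK_V2 = """{"criteria": [
  {"id": "FIN-01", "question": "Is an NPV calculation included?",
   "keywords": ["net present value"]},
  {"id": "RISK-01", "question": "Are escalation thresholds defined?",
   "keywords": ["escalation threshold"]}
]}"""

docs = {"financial_case.txt": "The Net Present Value is set out in Annex B."}
result_v1 = assess(docs, load_framework(FRAMEWORK_V1))
result_v2 = assess(docs, load_framework(FRAMEWORK_V2))
print(result_v1, result_v2)
```

The same `assess` function runs against both versions. The second run reports on the new criterion without any change to the pipeline code.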

What Configurability Enables

Proprietary Methodology Assessment

Consultancies don't assess against published frameworks alone. They have proprietary methodologies — their own approach to risk assessment, their own quality criteria for business cases, their own standards for project controls documentation. A configurable platform allows these proprietary frameworks to be loaded as assessment criteria, turning a consultancy's methodology into an automated assessment.

This is strategically important. The consultancy's methodology is its intellectual property. A platform that supports proprietary criteria means the consultancy keeps control of its methodology while gaining the benefits of automated assessment. The methodology isn't exposed. The assessment results are produced using it.

Multi-Framework Assessment

Major programmes often need to demonstrate compliance with multiple frameworks simultaneously. A programme preparing for an IPA gateway review while also maintaining CDM compliance and managing NEC contract obligations needs assessment against multiple criteria sets. A configurable platform runs all three in a single assessment cycle, against the same document suite, with consistent methodology.

Client-Specific Criteria

Every client has nuances. National Highways' requirements differ from Network Rail's. The Ministry of Defence's assessment criteria differ from the Department for Transport's. A configurable platform supports client-specific criteria sets that reflect these nuances, rather than forcing all clients into a generic framework.

Evolving Standards

When criteria change — as they inevitably do — the platform adapts through configuration, not through a development cycle. Load the updated criteria, and the next assessment uses them. The turnaround is days, not months.

The Difference Between "Configurable" and "Custom-Built"

A custom-built assessment tool takes months to develop, costs hundreds of thousands of pounds, and is obsolete when the criteria change. It assesses against one framework and does it well, but the investment in building it makes it inflexible.

A configurable platform takes hours to configure for a new framework, costs a fraction of custom development, and adapts when criteria change. It assesses against any framework that can be expressed as structured criteria, and each new framework builds on the capabilities developed for every previous one.

For consultancies operating across multiple domains with multiple clients, configurability isn't a nice-to-have. It's the difference between a tool that works on one engagement and a platform that works across the practice.

For Consultancies

Your methodology, your criteria, your standards. Configured as a module in a platform built for the major projects sector. Your framework becomes an automated assessment that your teams can run across every engagement, producing consistent results that your clients can trust. The IP stays yours. The capability becomes scalable.

Part Four

Reality Check

What AI assessment cannot do, and where the field is heading. Honest limitations and a grounded view of the future.

Chapter 10

What AI Assessment Cannot Do

This is the most important chapter in this guide.

Not because the limitations are surprising. Most practitioners will have already intuited several of them. But because the willingness to state limitations clearly, without hedging, is what separates a useful guide from a vendor pitch.

If you read nothing else in this document, read this chapter. The value of structured AI assessment depends entirely on understanding what it can and cannot do. Overestimating its capabilities is more dangerous than not using it at all.

It Cannot Replace Expert Judgement

This is stated as a principle in Chapter 4 and it's worth restating here as a limitation, because the temptation to ignore it will grow as the technology improves.

AI identifies gaps and cites evidence. Humans decide what to do about them. A RED rating against a criterion doesn't automatically mean the programme is in trouble. It might mean the evidence exists but wasn't captured in the documents assessed. It might mean the criterion is being addressed through a different mechanism. It might mean the gap is known and a plan is in place.

Expert judgement provides context that no document assessment can capture. The reviewer who knows that the Risk Management Strategy is being rewritten this month will interpret a RED rating differently from someone seeing it cold. The programme director who knows that a specific gap is covered by a ministerial direction will interpret the finding differently from the AI.

AI assessment produces findings. Expert judgement produces decisions. Conflating the two is the most common and most dangerous misuse of the technology.

It Cannot Assess Quality of Argument

AI can verify that a benefits realisation plan exists and contains quantified benefits with baseline measurements. It cannot judge whether the underlying assumptions are reasonable.

A business case might claim that a £2 billion infrastructure programme will deliver £8 billion in economic benefits over 30 years. The AI can verify that the claim is made, that the calculation methodology is stated, and that the supporting evidence is cited. But it cannot judge whether the assumptions underpinning the calculation are credible. That requires domain expertise, knowledge of comparable programmes, and professional judgement about what constitutes a reasonable forecast.

Similarly, a risk assessment might identify 15 risks and rate all of them as AMBER. The AI can verify the risk register exists and is populated. But it cannot judge whether 15 risks is suspiciously low for a programme of that complexity, or whether rating everything AMBER suggests the team hasn't differentiated between major and minor risks. An experienced reviewer spots these patterns instantly. AI does not.

It Cannot Assess Completeness of Thought

AI assesses what's written against criteria. It cannot identify important considerations that aren't mentioned at all.

If a programme's business case doesn't mention the impact of a major policy change announced last month, the AI has no way to flag this. The criteria don't mention the policy change (it's too recent for the framework to have been updated). The document doesn't mention it (the authors may not have considered it). The AI sees no gap because the criteria don't require it and the document doesn't address it.

An experienced reviewer who reads industry news would catch this immediately. "I notice the business case doesn't address the implications of the Spending Review announcement from March. How does that affect the Financial Case?"

These are the unknown unknowns of assessment. AI can systematically check for known unknowns (criteria that aren't addressed). It cannot identify unknown unknowns (things that should have been considered but weren't part of any criteria framework). This is where human expertise is irreplaceable.

It Struggles with Highly Ambiguous Criteria

Some criteria are specific and testable: "Does the Financial Case include a Net Present Value calculation?" Others are broad and subjective: "Does the programme demonstrate appropriate governance arrangements?"

The word "appropriate" requires human interpretation. What constitutes appropriate governance for a £100 million programme is different from what constitutes appropriate governance for a £10 billion programme. The AI can check that governance arrangements are described. It cannot judge whether they are appropriate for the programme's scale, complexity, and risk profile.

The more specific the criteria, the better the AI assessment. Binary questions ("Is X present?") produce reliable results. Qualitative questions ("Is X adequate?") require human review. When designing criteria frameworks for AI assessment, specificity is a design choice that directly affects assessment quality.
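
One way to act on this when designing a framework is to triage criteria by wording before any assessment runs. The sketch below is a simple heuristic under stated assumptions: the list of subjective qualifiers is illustrative, and the function name `review_route` is hypothetical.

```python
import re

# Illustrative list of subjective qualifiers; a real framework would tune this.
SUBJECTIVE = re.compile(
    r"\b(appropriate|adequate|sufficient|robust|reasonable|proportionate)\b",
    re.IGNORECASE,
)


def review_route(criterion_text):
    """Return 'human' for criteria with subjective qualifiers, else 'automated'."""
    return "human" if SUBJECTIVE.search(criterion_text) else "automated"


specific = "Does the Financial Case include a Net Present Value calculation?"
broad = "Does the programme demonstrate appropriate governance arrangements?"
for text in (specific, broad):
    print(review_route(text), "-", text)
```

A word list will never capture every subjective criterion. Its value is in making the design choice visible: each criterion routed to a human is a candidate for rewriting into more specific, testable sub-questions.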

It Can Miss Context That a Human Reader Would Catch

Natural language contains subtlety that AI handles imperfectly. Hedging language — "the programme intends to develop a benefits realisation plan" vs. "the programme has developed a benefits realisation plan" — signals intent rather than action. An experienced reviewer reads this as a gap: the plan doesn't exist yet. AI may read it as evidence that the topic is addressed.

Contradictions between sections present similar challenges. Section 3 states the project will be delivered in two phases. Section 7 describes three phases. A human reader catches this on the second read. AI, which processes sections independently, may not link these contradictions unless specifically instructed to cross-check.
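
A cross-check of this kind has to be requested explicitly. The sketch below shows one narrow version, assuming sections are available as plain text: it extracts every "N phases" claim and reports which sections make which claim. The pattern and the names are illustrative assumptions, and it only catches this one type of contradiction.

```python
import re
from collections import defaultdict

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}
PHASE_CLAIM = re.compile(
    r"\b(\d+|" + "|".join(NUMBER_WORDS) + r")\s+phases\b", re.IGNORECASE
)


def phase_claims(sections):
    """Map each claimed phase count to the sections that state it."""
    claims = defaultdict(list)
    for name, text in sections.items():
        for match in PHASE_CLAIM.finditer(text):
            token = match.group(1).lower()
            count = NUMBER_WORDS[token] if token in NUMBER_WORDS else int(token)
            claims[count].append(name)
    return dict(claims)


sections = {
    "Section 3": "The project will be delivered in two phases.",
    "Section 7": "Construction is sequenced across three phases.",
}
claims = phase_claims(sections)
contradiction = len(claims) > 1  # more than one distinct count was claimed
print(claims, contradiction)
```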

Tone and confidence matter too. A section that begins "Subject to further analysis..." is signalling uncertainty. A section that begins "The analysis confirms..." is signalling confidence. Both may contain the same factual content, but the implications for readiness are different. Human readers calibrate for this. AI generally does not.
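
Some of these cues can be surfaced mechanically so that a reviewer sees them alongside the finding. The sketch below flags a short list of hedging phrases. The phrase list is an assumption made for illustration, not a validated lexicon, and a flag only tells the reviewer where to look.

```python
import re

# Illustrative phrase list; these patterns are assumptions, not a validated lexicon.
HEDGES = re.compile(
    r"\b(?:intends? to|plans? to|will develop|will be developed|to be developed"
    r"|to be confirmed|subject to|is expected to)\b",
    re.IGNORECASE,
)


def hedging_phrases(text):
    """Return hedging phrases found in the text, lowercased, in order."""
    return [m.group(0).lower() for m in HEDGES.finditer(text)]


intent = "The programme intends to develop a benefits realisation plan."
action = "The programme has developed a benefits realisation plan."
caveat = "Subject to further analysis, the preferred option is Option 2."
for sentence in (intent, action, caveat):
    print(hedging_phrases(sentence), "-", sentence)
```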

It Requires Good Document Quality

The pipeline is only as good as what goes into it. Documents with poor OCR quality from scanning, heavily formatted documents where structure is embedded in visual layout rather than heading styles, and handwritten annotations or hand-drawn diagrams all limit assessment effectiveness.

This isn't unique to AI — poorly structured documents are harder for human reviewers too. But AI has less tolerance for structural ambiguity. A human can navigate a poorly formatted document through spatial awareness and pattern recognition. AI relies on structural cues that may not be present in a scanned PDF from 2004.

The practical implication: assessment quality improves when documents are well-structured, with clear headings, consistent formatting, and machine-readable text. This is another argument for assessment-informed document preparation (Chapter 8) — documents written with assessment in mind are inherently better structured.

It's Only As Good As Its Criteria

This is perhaps the most important limitation, because it's the one that organisations have the most control over.

Vague criteria produce vague assessments. If the criteria framework says "the programme should have a risk management approach," the assessment can only verify that some form of risk management approach is described. If the criteria framework says "the Risk Management Strategy should include: (a) risk identification methodology, (b) risk categorisation framework, (c) quantified risk appetite statement, (d) escalation thresholds, (e) risk review cycle," the assessment can verify each component individually and produce specific, actionable findings.
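
The five-component version of the criterion can be scored as a worked example. The score mapping (Yes = 100%, Partial = 70%, No = 0%) is the one given in the glossary. The RAG thresholds of 80 and 50 are assumptions for illustration, since this guide does not state the actual values, and the answers are invented.

```python
# Score mapping taken from the glossary: Yes=100%, Partial=70%, No=0%.
SCORES = {"yes": 100, "partial": 70, "no": 0}


def coverage(answers):
    """Mean sub-question score as a percentage."""
    return sum(SCORES[a] for a in answers.values()) / len(answers)


def rag(coverage_pct, green_at=80.0, amber_at=50.0):
    """Map coverage to a RAG rating. Both thresholds are assumed values."""
    if coverage_pct >= green_at:
        return "GREEN"
    return "AMBER" if coverage_pct >= amber_at else "RED"


# Invented answers for the five components (a) to (e).
answers = {
    "risk identification methodology": "yes",
    "risk categorisation framework": "yes",
    "quantified risk appetite statement": "no",
    "escalation thresholds": "partial",
    "risk review cycle": "yes",
}
score = coverage(answers)  # (100 + 100 + 0 + 70 + 100) / 5
rating = rag(score)
findings = [q for q, a in answers.items() if a != "yes"]
print(score, rating, findings)
```

The vague version of the criterion would have produced a single yes or no. The specific version names the two components that need work.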

The quality of the input framework directly determines the quality of the output assessment. Organisations that invest in precise, specific criteria frameworks get precise, specific assessments. Organisations that rely on vague, high-level criteria frameworks get vague, high-level assessments.

The Guardrails Can Override Correct LLM Judgements

The deterministic guardrails described in Chapter 5 are the system's most powerful quality mechanism. They are also a source of occasional false corrections.

The guardrails apply fixed arithmetic rules to sub-question evidence. In most cases, this produces more accurate ratings than the LLM alone. But there are edge cases where the LLM's reasoning was actually correct, and the guardrail overrides it incorrectly. A criterion where the LLM rates AMBER because it detected hedging language and ambiguous evidence might be forced to GREEN by Guardrail A if the sub-question coverage percentage is high enough. The guardrail can't detect hedging language. It counts sub-question scores.

This is a known trade-off. The guardrails produce more accurate ratings across the population of criteria than the LLM alone. But any individual criterion might be rated incorrectly by the guardrail override. This is why human oversight remains non-negotiable: the expert reviewer is the final arbiter, and the system surfaces its reasoning — including when guardrails fired — to support that review.
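
The trade-off is easier to see in code. The sketch below is an assumed form of a coverage-based override: the guide names Guardrail A but does not give its rule, so the 85% floor and the function itself are hypothetical. What it does illustrate is the principle of keeping the original LLM rating alongside the final one, so the reviewer can see that an override happened.

```python
def apply_guardrail_a(llm_rating, coverage_pct, green_floor=85.0):
    """Force GREEN when sub-question coverage is high (assumed rule and threshold).

    The original LLM rating is kept so a reviewer can see what was overridden.
    """
    fired = coverage_pct >= green_floor and llm_rating != "GREEN"
    return {
        "final_rating": "GREEN" if fired else llm_rating,
        "llm_rating": llm_rating,
        "guardrail_fired": "A" if fired else None,
    }


# The LLM saw hedging language and chose AMBER; the arithmetic cannot see that.
overridden = apply_guardrail_a("AMBER", coverage_pct=90.0)
untouched = apply_guardrail_a("AMBER", coverage_pct=74.0)
print(overridden)
print(untouched)
```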

Extended Tier Adds Cost and Time

The Extended Tier (CRAG, citation verification, self-critique, debate) produces more reliable assessments. It also takes longer and costs more per criterion. For a 218-criterion IPA assessment, the Extended Tier can add 15-25 minutes of processing time and significantly higher API costs compared to the Standard Tier.

This is not a universal improvement. For straightforward criteria with clear evidence, the Standard Tier produces accurate results. The Extended Tier adds value primarily on criteria where evidence is ambiguous, terminology is inconsistent, or the rating is near a boundary. Some organisations will run Extended Tier on all criteria for maximum rigour. Others will configure it selectively for high-stakes criteria. The choice depends on the use case, the budget, and the consequences of getting it wrong.
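
Selective configuration might look like the routing rule below. It sends a criterion to the Extended Tier when it is marked high-stakes, when confidence is below 70% (the self-critique trigger given in the glossary), or when coverage sits close to a rating boundary. The boundary values, the five-point margin, and the function name are assumptions for illustration.

```python
def choose_tier(coverage_pct, confidence, high_stakes,
                boundaries=(50.0, 80.0), margin=5.0, min_confidence=0.70):
    """Pick 'standard' or 'extended' for one criterion (assumed routing rule)."""
    near_boundary = any(abs(coverage_pct - b) <= margin for b in boundaries)
    if high_stakes or near_boundary or confidence < min_confidence:
        return "extended"
    return "standard"


print(choose_tier(95.0, 0.90, high_stakes=False))  # clear evidence
print(choose_tier(78.0, 0.90, high_stakes=False))  # close to a boundary
print(choose_tier(95.0, 0.60, high_stakes=False))  # low confidence
print(choose_tier(95.0, 0.90, high_stakes=True))   # high-stakes criterion
```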

"Any vendor who tells you their AI can fully replace expert review is either lying or doesn't understand the domain. The value is in augmentation — making experts faster, more consistent, and more thorough. Anyone who claims more than that hasn't sat in the room when the findings are read out."

Summary: Where AI Adds Value and Where It Doesn't

| AI Does Well | AI Does Poorly |
| --- | --- |
| Systematic coverage checking against specific criteria | Judging quality of argument or reasoning |
| Finding and citing specific evidence across large document sets | Identifying what's missing when it's not in any criteria framework |
| Consistent scoring across multiple documents and assessments | Interpreting ambiguous or subjective criteria |
| Cross-reference verification | Reading organisational context and politics |
| Identifying gaps where evidence is absent | Evaluating whether hedging language is a concern |
| Producing structured, traceable reports at scale | Spotting contradictions across loosely related sections |

Understanding this table is the foundation for using AI assessment effectively. The organisations that get the most value from it are the ones that deploy it where it's strong and supplement it with human expertise where it's weak. Not the ones that expect it to do everything.

Chapter 11

The Future of Document Assessment

The current state of structured AI assessment — what this guide describes — is the beginning, not the end. The field is moving quickly, and the organisations that understand where it's heading will make better decisions about adoption today.

From Point-in-Time to Continuous Assessment

Today, most document assessment happens at discrete points: before a gateway review, during a tender evaluation, when a regulator asks for evidence. The assessment is a snapshot. It tells you where you stood when the assessment was run, not where you stand now.

The future is continuous. Documents connected to the assessment platform are monitored as they're updated. When a new version of the Benefits Management Strategy is uploaded, the relevant criteria are re-assessed automatically. The programme team has a continuous view of readiness, not a point-in-time snapshot that's out of date by the time it's read.
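
The mechanism behind this is what the glossary calls evidence lineage and carry-forward. A minimal sketch, assuming lineage is tracked at document level rather than chunk level and using hypothetical criterion IDs and file names:

```python
def plan_reassessment(lineage, changed_documents):
    """Split criteria into those to re-run and those to carry forward.

    lineage: criterion ID -> set of document names its evidence came from.
    """
    changed = set(changed_documents)
    rerun = sorted(c for c, docs in lineage.items() if docs & changed)
    carry_forward = sorted(c for c in lineage if c not in rerun)
    return rerun, carry_forward


lineage = {
    "BEN-01": {"benefits_management_strategy.docx"},
    "BEN-02": {"benefits_management_strategy.docx", "business_case.docx"},
    "RISK-01": {"risk_management_strategy.docx"},
}
rerun, carry_forward = plan_reassessment(
    lineage, ["benefits_management_strategy.docx"]
)
print("re-run:", rerun)
print("carry forward:", carry_forward)
```

One caveat with this simple version: a criterion that found no evidence last time has an empty lineage and would never be re-run. A fuller design would also re-assess those criteria whenever a new or changed document arrives.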

This changes the management model. Instead of mobilising for readiness assessments, programme teams monitor readiness as a dashboard metric. Gaps are identified and closed as they emerge, not discovered in a rush before a review.

The Regulator Signal

In summer 2026, the UK Office for Nuclear Regulation will begin piloting AI-enhanced processes for its own Safety Assessment Principles review. When the nuclear regulator — one of the most cautious, evidence-driven regulatory bodies in the world — adopts AI-enhanced assessment, it sends an unambiguous signal about where the field is heading.

The implication for regulated organisations is straightforward: if the regulator is using AI to assess your submissions, you should be using AI to prepare them. Not because the regulator requires it, but because the quality standard the regulator expects will be calibrated against what AI-enhanced assessment can identify. The bar is rising.

This pattern will repeat. As regulators, review bodies, and assurance functions adopt structured AI assessment, the organisations they regulate and review will need to keep pace. Not to use the same tools, but to meet the quality standard that those tools establish.

Assessment Data as a Strategic Asset

When assessment is manual, the data stays in the reviewer's head. When assessment is structured and systematic, the data becomes an organisational asset. Patterns emerge across assessments. Common gaps are identified across programmes. The organisation learns, systematically, where its documentation practices are strong and where they need improvement.

This enables benchmarking — not just against the criteria framework, but against the organisation's own track record and against anonymised peer data. "Our business cases consistently score well on the Economic Case but underperform on the Management Case" is an insight that transforms how an organisation invests in capability development.
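
Once results are stored in a structured form, an insight like that is a simple aggregation. The sketch below averages scores by case across past assessments. The scores are invented for illustration and the data shape is an assumption.

```python
from collections import defaultdict
from statistics import mean


def mean_by_case(assessments):
    """Average score per case across a list of assessment results."""
    buckets = defaultdict(list)
    for assessment in assessments:
        for case, score in assessment.items():
            buckets[case].append(score)
    return {case: mean(scores) for case, scores in buckets.items()}


# Invented scores for two business cases, for illustration only.
history = [
    {"Economic": 82, "Management": 55},
    {"Economic": 78, "Management": 61},
]
benchmark = mean_by_case(history)
weakest = min(benchmark, key=benchmark.get)
print(benchmark, "weakest:", weakest)
```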

What Good Looks Like in Three to Five Years

The Practitioner's Role Evolves

The role of the experienced practitioner doesn't diminish. It changes. The mechanical work — reading documents, checking criteria, verifying cross-references — is increasingly handled by AI. The intellectual work — interpreting findings, assessing arguments, understanding context, making decisions — becomes more prominent.

This is an elevation, not a displacement. The programme controls professional who spent 60% of their time reading documents and 40% interpreting what they read now spends 80% of their time on interpretation and decisions. Their expertise becomes more valuable, not less, because it's applied to the work that only expertise can do.

"The question isn't whether AI will transform document assessment. It's whether your organisation will be leading that transformation or catching up. The window to lead is open now. It won't be open forever."

Appendices

Reference Material

Framework reference, glossary, and about Programme Insights.

Appendix A

Framework Reference

The following frameworks are referenced throughout this guide. Each represents a structured criteria set against which document assessment can be conducted.

| Framework | Issuing Body | Scope | Criteria Count |
| --- | --- | --- | --- |
| IPA Gate Review | Infrastructure and Projects Authority | Major programme readiness assessment across the Five Case Model at key decision points | 218 |
| Green Book Five Case | HM Treasury | Business case quality assessment covering Strategic, Economic, Commercial, Financial, and Management cases | 98 |
| ONR Licence Conditions | Office for Nuclear Regulation | Nuclear site licence compliance covering safety management, operations, and decommissioning | 36 |
| CDM 2015 | Health and Safety Executive | Construction safety documentation requirements for duty holders | 48 |
| NEC4 | NEC Board (Institution of Civil Engineers) | Contract compliance covering procedural obligations, programme management, and payment | 62 |
| APFP Regulations / DCO | Planning Inspectorate | Development Consent Order application documentation requirements for NSIPs | Variable by scheme |
| ITT Evaluation | Varies by procuring body | Tender evaluation criteria for quality assessment of bid submissions | Variable by procurement |

Notes: Criteria counts are indicative and reflect the current version of each framework at time of publication. Counts vary by assessment stage (e.g., IPA criteria differ between Gate 0 and Gate 5). Organisations should verify current criteria with the issuing body.

Appendix B

Glossary

| Term | Definition |
| --- | --- |
| Adversarial Debate | An Extended Tier step where multiple virtual specialist personas argue opposing positions on AMBER boundary cases to stress-test the rating. |
| AMBER/RED | IPA gateway rating indicating successful delivery is in doubt, with major risks or issues apparent in multiple areas. |
| Board-Ready | Output formatted and structured to the standard required for programme board or investment committee review. |
| Carry-Forward | The mechanism by which criteria unaffected by document changes reuse results from a previous assessment run, enabling incremental re-assessment. |
| CDM 2015 | Construction (Design and Management) Regulations 2015. UK health and safety regulations covering construction projects. |
| Citation Verification | The process of confirming that cited evidence exists in the source document and says what the assessment claims. |
| Confidence Calibration | The system's assessment of its own certainty, flagging areas where findings involve interpretation rather than clear evidence. |
| CRAG | Corrective Retrieval-Augmented Generation. An Extended Tier step that classifies retrieval quality and re-retrieves via decomposed sub-queries if initial evidence was poor or ambiguous. |
| Criteria Framework | A structured set of assessment criteria against which documents are evaluated. May be published (IPA, Green Book) or proprietary. |
| DCO | Development Consent Order. The planning consent mechanism for Nationally Significant Infrastructure Projects (NSIPs) under the Planning Act 2008. |
| Deterministic Guardrails | Rules-based overrides that correct the LLM's rating when the sub-question coverage evidence disagrees with the model's judgement. Four guardrail rules enforce consistency between evidence and ratings. |
| Deterministic Scoring | Scoring methodology where results are calculated from evidence found, not from probabilistic model outputs. Sub-questions scored Yes=100%, Partial=70%, No=0% with fixed RAG thresholds. |
| Evidence Lineage | The tracking of which criteria used which document chunks, enabling incremental re-assessment when documents change. |
| FBC | Full Business Case. The final stage of business case development under the Five Case Model. |
| Five Case Model | HM Treasury's framework for business case development: Strategic, Economic, Commercial, Financial, and Management cases. |
| Fuzzy Matching | Search technique that finds evidence using different terminology than the criteria. Matches concepts, not just keywords. |
| Gateway Review | A peer review of a programme or project at a key decision point, conducted by an independent review team under the IPA framework. |
| Green Book | HM Treasury's guidance on appraisal and evaluation in central government. Defines the Five Case Model for business cases. |
| Hallucination | When an AI system generates information that appears plausible but is factually incorrect or fabricated. |
| IPA | Infrastructure and Projects Authority. The UK government body responsible for assuring major projects and programmes. |
| Multi-Agent | An AI architecture where multiple specialised agents work in coordination, each handling a specific part of the assessment process. |
| NEC4 | New Engineering Contract, fourth edition. A suite of standard construction contracts published by the Institution of Civil Engineers. |
| NPV | Net Present Value. The present value of expected benefits minus the present value of expected costs, using a discount rate. |
| OBC | Outline Business Case. The second stage of business case development under the Five Case Model. |
| ONR | Office for Nuclear Regulation. The independent regulator of nuclear safety and security in the UK. |
| RAG Rating | Red, Amber, Green rating system. The standard assessment rating system in UK programme management. |
| Reciprocal Rank Fusion | A technique that merges results from multiple search methods (semantic + keyword) by giving credit to documents ranked highly by either method. |
| SAPs | Safety Assessment Principles. ONR's framework for assessing safety cases submitted by nuclear licensees. |
| Self-Critique | An Extended Tier step where the system questions its own rating when confidence falls below 70%, using three targeted probing questions that can adjust the rating. |
| SOC | Strategic Outline Case. The first stage of business case development under the Five Case Model. |
| SRO | Senior Responsible Owner. The individual personally accountable for a programme's success, typically a senior civil servant or director. |
| Sub-Question Decomposition | The process of breaking complex assessment criteria into specific, testable sub-questions scored Yes (100%), Partial (70%), or No (0%). |
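
For readers who want to see one of these terms in concrete form, Reciprocal Rank Fusion is compact enough to sketch. In its usual formulation, each item scores the sum of 1 / (k + rank) across the ranked lists it appears in. The constant k = 60 is a common default and the chunk IDs are hypothetical; this is the general technique, not a description of how Programme Insights configures it.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists; each item scores the sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))


semantic_hits = ["chunk_3", "chunk_1", "chunk_2"]  # hypothetical chunk IDs
keyword_hits = ["chunk_1", "chunk_4", "chunk_3"]
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)
```

Here `chunk_1` comes out on top because it ranks highly in both lists, even though it is first in only one of them.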

About Programme Insights

Programme Insights is a purpose-built AI assessment platform for UK infrastructure programmes. It implements the structured assessment methodology described in this guide: criteria-first assessment with sub-question decomposition, hybrid semantic and keyword retrieval with CRAG, deterministic scoring with rules-based guardrails, citation verification, self-critique, adversarial debate on boundary cases, incremental re-assessment via evidence lineage, and board-ready report generation.

The platform supports assessment against IPA gateway criteria, HM Treasury Green Book, ONR licence conditions, CDM 2015, NEC4 contract compliance, DCO regulatory requirements, and custom criteria frameworks. It is configurable for consultancy proprietary methodologies without code changes.

Built in the UK, hosted on Azure UK South, ICO registered, and designed for UK government-grade data handling. Programme Insights serves programme teams, consultancies, and public sector organisations responsible for major project assurance.

Every mechanism described in this guide — the deterministic guardrails, the CRAG retrieval, the adversarial debate, the evidence lineage tracking — is operational in Programme Insights today. The guide stands alone as a reference, but the methodology described is not theoretical. It's running in production.

Two capabilities extend the assessment pipeline beyond findings into action:

Human-in-the-Loop Review. Every assessment finding can be queried, challenged, or overridden by the reviewer. Challenge a rating with your expertise and the system re-assesses that criterion with your input. Every interaction is logged as a traceable audit trail that flows into PDF reports. The review interaction is the professional work product, not a friction layer on top of it.

Content Generation. When the assessment identifies a gap with a known structure (a missing sensitivity analysis table, an incomplete RACI matrix, an absent escalation procedure), the platform drafts structured content using data already in the uploaded documents. Field-level confidence scores show which values are sourced and which need human verification.

Request a Demo

See how your documents score against the criteria that matter. A demo takes 30 minutes.

programmeinsights.com

Programme Insights

AI-powered document assessment for UK infrastructure programmes.

The platform handles the reading.
You handle the decisions.