The Mechanism Behind Evidence-Based Document Assessment
Why structured assessment matters — and why "AI read your document" is not the same as evidence-based assessment with guardrails.
Most AI document tools work like this: they feed your document into a language model, ask it to evaluate against some criteria, and return whatever the model says. No verification. No structure. No way to know if the answer is grounded in evidence or fabricated from training data.
That approach fails for serious assessment work. When a programme director needs to know whether their business case addresses all five HMT Green Book dimensions, or whether an IPA gateway submission covers the required delivery confidence criteria, "the AI thinks it looks fine" is not an acceptable answer.
Programme Insights uses a multi-step assessment pipeline that separates evidence retrieval from judgement, applies deterministic guardrails over AI scoring, and optionally runs adversarial verification on borderline results. Every rating traces back to specific passages in your documents.
The system does not trust the AI. It retrieves evidence, scores it with structured output, then applies rule-based guardrails that can override the AI's rating. Every claim maps to a citation. Borderline results get debated by specialist personas.
This guide explains each step of that pipeline: what it does, why it makes assessment more accurate, and what goes wrong without it. It is written for technical evaluators who need to understand the mechanism, not just the marketing.
Three layers work together: orchestration for reliability, standard assessment for every criterion, and extended verification for high-stakes results.
| Layer | Steps | Purpose |
|---|---|---|
| Orchestration | 10 steps | Run management, crash recovery, progress streaming, criteria loading |
| Standard Tier | 2 steps per criterion | Evidence retrieval + scoring with deterministic guardrails |
| Extended Tier | 4 additional steps | Corrective RAG, citation verification, self-critique, adversarial debate |
The pipeline uses three models:
| Model | Used For | Why This Model |
|---|---|---|
| GPT-4o | Scoring, summary, debate | Strongest reasoning for judgement tasks. Temperature 0.0 for reproducibility |
| GPT-4.1-mini | CRAG, critique, citation verification | Fast, cost-effective for classification and verification tasks |
| text-embedding-3-small | Vector embeddings | High-quality semantic search at low latency |
The Standard Tier runs for every criterion in every assessment. Two steps: retrieve evidence, then score it.
The Extended Tier activates when results need deeper scrutiny. Four additional verification steps.
The Orchestration layer is the infrastructure that makes assessments reliable, resumable, and observable before a single criterion is evaluated.
An assessment run against a complex framework can take several minutes. Documents may be large, criteria sets can include 50+ items, and network or API failures happen. The orchestration layer handles all of this so that the assessment logic never has to.
Semantic search (0.7 weight) catches evidence expressed in different words — "cost-benefit analysis" matching "economic appraisal". Keyword search (0.3 weight) catches exact terms that semantic models sometimes miss — specific clause numbers, acronyms, regulatory references. The 0.35 threshold filters out noise without discarding borderline-relevant evidence.
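As a rough illustration of how the weights and threshold quoted above could interact, here is a minimal sketch. It assumes both search scores are already normalised to the 0-1 range and that the 0.35 threshold applies to the weighted sum; neither detail is stated in this guide, and `ChunkScores`, `hybrid_score`, and `filter_relevant` are hypothetical names.

```python
from dataclasses import dataclass

SEMANTIC_WEIGHT = 0.7        # from the text above
KEYWORD_WEIGHT = 0.3         # from the text above
RELEVANCE_THRESHOLD = 0.35   # from the text above


@dataclass(frozen=True)
class ChunkScores:
    chunk_id: str
    semantic: float  # ASSUMPTION: normalised to [0, 1]
    keyword: float   # ASSUMPTION: normalised to [0, 1]


def hybrid_score(chunk: ChunkScores) -> float:
    """Weighted sum of the two search scores."""
    return SEMANTIC_WEIGHT * chunk.semantic + KEYWORD_WEIGHT * chunk.keyword


def filter_relevant(chunks: list[ChunkScores]) -> list[tuple[str, float]]:
    """Drop chunks below the threshold and sort the rest, best first."""
    scored = [(c.chunk_id, hybrid_score(c)) for c in chunks]
    kept = [(cid, s) for cid, s in scored if s >= RELEVANCE_THRESHOLD]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, a chunk with a strong keyword match but almost no semantic similarity can still fall below the threshold, because keyword search carries less than half the weight.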
Steps 7-8 handle post-assessment work. The Summary Generator creates a Delivery Confidence Assessment (DCA) rating and an Executive Summary by aggregating individual criterion scores. The Finalise Run step marks the run as complete, records timing metrics, and triggers any downstream notifications.
Steps 9-10 provide structured error handling: per-criterion timeouts (300s standard, 480s extended), graceful degradation when individual criteria fail, and comprehensive error logging for debugging.
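The per-criterion timeout and graceful degradation described above might look something like the following sketch. It uses Python's standard `asyncio.wait_for`; the function names, the result dictionary shape, and the choice to run criteria concurrently are assumptions for illustration, not a description of the actual implementation.

```python
import asyncio
from typing import Any, Awaitable, Callable

STANDARD_TIMEOUT_S = 300.0  # from the text above
EXTENDED_TIMEOUT_S = 480.0  # from the text above


async def assess_with_timeout(
    criterion_id: str,
    assess: Callable[[str], Awaitable[Any]],
    timeout_s: float,
) -> dict:
    """Run one criterion; a timeout or error is recorded, not raised."""
    try:
        result = await asyncio.wait_for(assess(criterion_id), timeout=timeout_s)
        return {"criterion": criterion_id, "status": "ok", "result": result}
    except asyncio.TimeoutError:
        return {"criterion": criterion_id, "status": "timeout", "result": None}
    except Exception as exc:  # keep the run alive and log the failure
        return {
            "criterion": criterion_id,
            "status": "error",
            "result": None,
            "error": repr(exc),
        }


async def run_all(
    criterion_ids: list[str],
    assess: Callable[[str], Awaitable[Any]],
    timeout_s: float = STANDARD_TIMEOUT_S,
) -> list[dict]:
    """Assess every criterion; one failure does not stop the others."""
    tasks = [assess_with_timeout(cid, assess, timeout_s) for cid in criterion_ids]
    return await asyncio.gather(*tasks)
```

Because each failure is converted into a status record, the run can still be finalised and the failed criteria reported alongside the successful ones.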
Step 1 of every criterion assessment. Finding the right evidence is harder than judging it — and most AI tools get this wrong.
If you ask a basic retrieval pipeline "does this document address stakeholder engagement?", it looks for passages close to that exact wording. But the evidence might be described as "community consultation", "public participation", or "interested party management". A single query misses evidence that exists but uses different language.
For each criterion, the system generates 2-3 query variants that capture different ways the same concept might be expressed. A criterion about "risk management" might expand to queries about "risk register", "mitigation strategies", and "threat assessment".
Each query variant runs against two search systems simultaneously:
| Search Type | Technology | What It Catches | What It Misses |
|---|---|---|---|
| Semantic | pgvector embeddings | Conceptually similar text, paraphrases, domain synonyms | Exact terms, clause numbers, acronyms |
| Keyword | Full-text search | Precise terms, regulatory references, specific numbers | Rephrased concepts, alternative terminology |
Results from all query variants and both search types are combined using Reciprocal Rank Fusion (RRF). This algorithm merges ranked lists by giving higher scores to chunks that appear in multiple result sets. A chunk that ranks #3 in semantic search and #5 in keyword search gets a combined score higher than a chunk that ranks #1 in only one system.
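The fusion step can be sketched in a few lines. This is the standard unweighted rank-fusion formula, where each list contributes 1 / (k + rank) to a chunk's score. The constant k = 60 is the conventional default and an assumption here, as is the function name; the guide does not say which constant or weighting the product uses.

```python
from collections import defaultdict

RRF_K = 60  # ASSUMPTION: conventional default, not stated in this guide


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]],
    k: int = RRF_K,
    top_n: int = 15,
) -> list[tuple[str, float]]:
    """Merge ranked lists of chunk ids; each list adds 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:top_n]
```

With k = 60, a chunk ranked #3 and #5 scores 1/63 + 1/65 ≈ 0.031, while a chunk ranked #1 in a single list scores 1/61 ≈ 0.016, which matches the behaviour described above.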
The retrieval step produces 15 ranked evidence chunks, each with a relevance score, source document reference, and page location. These chunks — not the full documents — are what the scoring step evaluates.
Single-query, single-method search typically finds 40-60% of relevant evidence. That means the scoring step is making judgements based on incomplete information. Your document might thoroughly address a criterion, but if the search only finds half the evidence, it gets scored as partially compliant. Query expansion + hybrid search + RRF consistently retrieves 85-95% of relevant evidence.
Step 2 of every criterion assessment. The AI scores the evidence — but deterministic rules have the final word.
GPT-4o evaluates the 15 retrieved evidence chunks against the criterion using structured output mode: the model returns a defined JSON schema, not free text. Temperature is set to 0.0, so the same evidence should produce the same score on repeated runs, although hosted models do not guarantee bit-identical output. No creative interpretation. No hallucinated confidence.
For each criterion, the model produces two independent ratings:
| Rating | What It Measures | Scale |
|---|---|---|
| Coverage | Does the evidence address the criterion? How thoroughly? | RED / AMBER / GREEN |
| Delivery Confidence | Is the approach credible? Are the plans realistic and achievable? | RED / AMBER / GREEN |
Coverage answers "is it there?" Delivery confidence answers "is it good?" A business case might thoroughly cover financial projections (GREEN coverage) but with unrealistic assumptions (RED delivery confidence). These are different problems requiring different responses.
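To make the "defined schema, not free text" idea concrete, here is a minimal sketch of how application code might validate the model's JSON before using it. The field names and the `CriterionScore` type are hypothetical; the guide does not publish the actual schema.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Rag(str, Enum):
    RED = "RED"
    AMBER = "AMBER"
    GREEN = "GREEN"


@dataclass(frozen=True)
class CriterionScore:
    # ASSUMPTION: illustrative field names, not the product's real schema
    coverage: Rag
    delivery_confidence: Rag
    justification: str


def parse_score(raw_json: str) -> CriterionScore:
    """Parse model output; unknown ratings or missing fields raise."""
    data = json.loads(raw_json)
    return CriterionScore(
        coverage=Rag(data["coverage"]),                        # ValueError if not RED/AMBER/GREEN
        delivery_confidence=Rag(data["delivery_confidence"]),  # KeyError if missing
        justification=str(data["justification"]),
    )
```

Keeping the two ratings as separate typed fields means downstream rules can reason about coverage and delivery confidence independently.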
The scoring prompt instructs GPT-4o to adopt the role of a specialist assessor appropriate to the framework being used. For IPA gateway assessments, it acts as an experienced IPA reviewer. For Green Book appraisals, it acts as a Treasury-trained economist. This persona framing improves the relevance and rigour of the assessment without changing the deterministic guardrails that override the AI's output.
A document that mentions every required topic (GREEN coverage) but with vague, aspirational language and no concrete plans (RED delivery confidence) needs a different remediation strategy than one with solid plans that simply forgot to address a criterion (RED coverage). The dual rating tells you what kind of work is needed.
The system does not trust the AI. It verifies. These rules are hard-coded — no prompt engineering can override them.
After GPT-4o produces its structured score, the result passes through a set of deterministic rules. These are not AI suggestions. They are if/then logic that runs in application code. The AI cannot talk its way past them.
Each criterion can be broken into sub-questions. The system scores each sub-question individually, then calculates a coverage percentage:
| Sub-Question Result | Score Assigned | Example |
|---|---|---|
| Yes — fully addressed | 100% | Risk register exists with mitigations and owners |
| Partial — mentioned but incomplete | 70% | Risks listed but no mitigation strategies |
| No — not addressed | 0% | No mention of risk management |
The sub-question scores are averaged into a coverage percentage, which maps to a RAG rating:
| Coverage % | Rating | Meaning |
|---|---|---|
| 70% and above | GREEN | Criterion is well addressed. Evidence is comprehensive. |
| 40% – 69% | AMBER | Partially addressed. Gaps exist that need attention. |
| Below 40% | RED | Not adequately addressed. Significant work required. |
These thresholds are not tunable per-run. Every assessment uses the same boundaries. This makes results comparable across documents, across time, and across assessors. When a criterion moves from AMBER to GREEN between versions, you know the evidence genuinely improved — the goalposts did not move.
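The coverage arithmetic is simple enough to sketch directly. The sub-question scores and the 70% and 40% boundaries come from the tables above; the function names are hypothetical, and treating values between 69% and 70% as AMBER is an assumption about how the boundary is handled.

```python
SUB_QUESTION_SCORES = {"yes": 100.0, "partial": 70.0, "no": 0.0}  # from the table above


def coverage_percentage(answers: list[str]) -> float:
    """Average the per-sub-question scores into one percentage."""
    if not answers:
        raise ValueError("a criterion needs at least one sub-question")
    return sum(SUB_QUESTION_SCORES[a] for a in answers) / len(answers)


def coverage_rating(pct: float) -> str:
    """Map a coverage percentage to a RAG rating using fixed thresholds."""
    if pct >= 70.0:
        return "GREEN"
    if pct >= 40.0:  # ASSUMPTION: anything from 40 up to just under 70 is AMBER
        return "AMBER"
    return "RED"
```

For example, `coverage_percentage(["yes", "partial", "yes", "no"])` gives 67.5, which `coverage_rating` maps to AMBER.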
Four rules that can override the AI's rating. They catch the cases where LLMs are systematically wrong.
LLMs have known failure modes in assessment work: they are overconfident about vague text, they miss the difference between aspirational language and concrete plans, and they sometimes rate highly when retrieval found very little evidence. These four guardrails catch those failures.
Guardrail A. Trigger: Coverage is GREEN AND negative evidence is 20% or less AND delivery confidence is AMBER.
Action: Override to GREEN.
Why: The AI is being cautious about minor delivery concerns when the evidence is overwhelmingly positive. A few hedged phrases should not downgrade a well-evidenced criterion.
Guardrail B. Trigger: Coverage is GREEN AND delivery confidence is RED.
Action: Override to .
Why: The document talks about the topic extensively but the plans are not credible. The AI might rate GREEN because the words are there, but the substance is not.
Guardrail C. Trigger: Coverage is RED AND delivery confidence is GREEN.
Action: Override to AMBER.
Why: The evidence is sparse but what exists is strong. This often means the document addresses the criterion well but in a different section or using unexpected terminology. Worth flagging for review rather than marking as a failure.
Guardrail D. Trigger: Retrieval quality is flagged as poor AND delivery confidence is GREEN.
Action: Override to .
Why: If the system could not find much evidence but the AI still rates GREEN, something is wrong. The AI may be drawing on training data rather than your document. This guardrail forces a conservative rating when the evidence base is thin.
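The four rules are plain if/then logic, which a sketch makes easier to see. Treat every rating value below as an assumption: the specific ratings in each trigger and action are inferred from the "Why" explanations rather than quoted from the product, the A to D labels follow the order of the rules, and the order in which the rules are checked is also a guess. `ScoredCriterion` and `apply_guardrails` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

# ASSUMPTIONS: override targets inferred from each rule's rationale.
A_TARGET = "GREEN"   # well-evidenced criterion should not be downgraded
B_TARGET = "AMBER"   # words are there, substance is not
C_TARGET = "AMBER"   # flag for review rather than mark as failure
D_TARGET = "AMBER"   # conservative rating when evidence is thin
NEGATIVE_EVIDENCE_MAX_PCT = 20.0  # "20% or less", from the text


@dataclass(frozen=True)
class ScoredCriterion:
    coverage: str               # "RED" | "AMBER" | "GREEN"
    delivery_confidence: str    # "RED" | "AMBER" | "GREEN"
    negative_evidence_pct: float
    retrieval_poor: bool
    ai_rating: str              # rating proposed by the scoring model


def apply_guardrails(s: ScoredCriterion) -> tuple[str, Optional[str]]:
    """Return (final rating, label of the guardrail that fired or None)."""
    # ASSUMED precedence: thin evidence (D) is checked before anything else.
    if s.retrieval_poor and s.delivery_confidence == "GREEN":
        return D_TARGET, "D"
    if s.coverage == "GREEN":
        if (s.delivery_confidence == "AMBER"
                and s.negative_evidence_pct <= NEGATIVE_EVIDENCE_MAX_PCT):
            return A_TARGET, "A"
        if s.delivery_confidence == "RED":
            return B_TARGET, "B"
    if s.coverage == "RED" and s.delivery_confidence == "GREEN":
        return C_TARGET, "C"
    return s.ai_rating, None  # no rule fired; keep the model's rating
```

Because the function is pure application code with no model call, the same inputs always produce the same override, whatever the prompt said.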
Retrieved evidence: 11 of 15 chunks are relevant. Document has a dedicated stakeholder section.
Sub-questions: Stakeholder mapping (Yes, 100%) + Engagement methods (Partial, 70%) + Feedback mechanisms (Yes, 100%) + Communication plan (No, 0%)
Coverage score: (100 + 70 + 100 + 0) / 4 = 67.5% = AMBER
AI delivery confidence: GREEN, "the approach is well-structured"
Guardrails check: No guardrail triggers (coverage is AMBER, not RED/GREEN extremes)
Final rating: AMBER. The missing communication plan drops this below GREEN despite strong delivery.
When the retrieval step fails to find enough evidence, CRAG diagnoses why and tries again with better queries.
Even with query expansion and hybrid search, retrieval can fail. The document might use highly specialised terminology, the relevant section might be embedded in an appendix, or the criterion might span multiple disconnected sections. The standard retrieval step does not know whether it succeeded — it returns 15 chunks regardless of quality.
CRAG (Corrective Retrieval-Augmented Generation) adds a quality check after retrieval. GPT-4.1-mini classifies the retrieval result into one of three categories:
| Classification | What It Means | What Happens Next |
|---|---|---|
| Correct | Retrieved chunks are relevant and sufficient | Proceed to scoring with current evidence |
| Ambiguous | Some relevant chunks, but gaps exist | Decompose criterion into sub-queries, re-retrieve for gaps |
| Incorrect | Retrieved chunks are mostly irrelevant | Generate entirely new queries, re-retrieve from scratch |
When classified as ambiguous, CRAG breaks the criterion into more specific sub-queries. For example, a criterion about "project governance arrangements" might decompose into:
Each sub-query runs through the same hybrid search pipeline. The results are merged with the original retrieval using RRF, producing a richer evidence set for scoring.
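The three-way routing in the table above can be sketched as a small dispatcher. The classifier, query generators, search, and fusion are passed in as plain callables because their real implementations are not described here; all names and signatures are hypothetical.

```python
from enum import Enum
from typing import Callable


class RetrievalGrade(Enum):
    CORRECT = "correct"
    AMBIGUOUS = "ambiguous"
    INCORRECT = "incorrect"


def corrective_retrieve(
    criterion: str,
    chunks: list[str],
    grade: RetrievalGrade,
    decompose: Callable[[str], list[str]],   # criterion -> sub-queries
    rewrite: Callable[[str], list[str]],     # criterion -> entirely new queries
    search: Callable[[str], list[str]],      # query -> ranked chunk ids
    fuse: Callable[[list[list[str]]], list[str]],  # ranked lists -> one list
) -> list[str]:
    """Route on the retrieval grade, following the table above."""
    if grade is RetrievalGrade.CORRECT:
        return chunks  # proceed to scoring with current evidence
    if grade is RetrievalGrade.AMBIGUOUS:
        # Keep the original results and add targeted results for the gaps.
        extra = [search(q) for q in decompose(criterion)]
        return fuse([chunks, *extra])
    # INCORRECT: discard the original results and start again.
    fresh = [search(q) for q in rewrite(criterion)]
    return fuse(fresh)
```

The key difference between the two corrective paths is whether the original chunks are kept in the fused set or discarded.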
Without CRAG, poor retrieval flows silently into scoring. The scoring model rates based on whatever was retrieved, producing confident-sounding assessments of the wrong evidence. Guardrail D catches some of these cases, but CRAG prevents the problem at source rather than patching it downstream.
Every claim the system makes must trace back to your document. This step ensures it does.
Stanford research found that commercial AI legal research tools hallucinated on 17-33% of test queries. The model generates text that sounds authoritative but does not appear in the source material. In assessment work, this means a criterion might be rated GREEN based on evidence the AI invented rather than evidence your document contains.
This is not a theoretical risk. It happens routinely with off-the-shelf AI tools applied to document assessment.
After scoring, GPT-4.1-mini performs a verbatim check. For each claim in the assessment output, it:
Semantic similarity is not enough. A claim that "the document addresses environmental impact" could be semantically similar to a passage about "ecological considerations" — but if the passage does not actually discuss environmental impact in the way the criterion requires, the citation is misleading. Verbatim checking ensures the evidence genuinely supports the claim.
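The guide describes a model-based verbatim check. As a simplified stand-in, the sketch below shows the deterministic core of the idea: a quoted passage either appears in the cited chunk or it does not. The claim dictionary shape and function names are hypothetical, and a plain substring test is a deliberate simplification of whatever GPT-4.1-mini is asked to do.

```python
import re


def _normalise(text: str) -> str:
    """Collapse whitespace and ignore case so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def quote_is_verbatim(quote: str, source_chunk: str) -> bool:
    """True only if the quoted text literally appears in the cited chunk."""
    q = _normalise(quote)
    return bool(q) and q in _normalise(source_chunk)


def flag_unsupported(claims: list[dict], chunks_by_id: dict[str, str]) -> list[str]:
    """Return the claims whose quoted evidence is not found in the cited chunk."""
    flagged = []
    for claim in claims:
        chunk = chunks_by_id.get(claim["chunk_id"], "")
        if not quote_is_verbatim(claim["quote"], chunk):
            flagged.append(claim["claim"])
    return flagged
```

A paraphrase that is merely similar to the source fails this test, which is the point: only text that is actually in the document counts as a citation.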
Every assessment result includes a citation chain: criterion → rating → supporting evidence → source document → page number. A reviewer can follow any rating back to the exact text that supports it. If the citation verification step flagged any issues, those flags are visible in the results.
When the system is uncertain, it challenges its own reasoning before committing to a score.
Self-critique triggers when the scoring model's confidence falls below 70%. This is not a random threshold — it corresponds to the boundary where LLM scoring reliability drops measurably. Above 70%, the model's structured output is generally consistent and well-grounded. Below 70%, the model is uncertain, and its rating becomes less reliable.
When triggered, GPT-4.1-mini asks the scoring output three diagnostic questions:
"Is the evidence retrieved actually sufficient to support this rating, or is the model filling gaps with assumptions?"
Catches cases where the model scored based on what it expected to find rather than what was actually retrieved.
"Does the justification logically support the rating, or could the same evidence support a different rating?"
Catches circular reasoning — where the justification restates the rating rather than explaining it.
"Is there evidence in the retrieved chunks that contradicts this rating? Was it acknowledged or ignored?"
Catches confirmation bias — where the model focused on supporting evidence and ignored contradicting evidence.
If the self-critique identifies issues, the scoring step is re-run with the critique findings included as additional context. The model sees its own weaknesses and can correct for them. If the re-scored result still has low confidence, the criterion is flagged for human review in the final report.
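The trigger-and-retry flow described above can be sketched as follows. The scoring and critique steps are passed in as callables standing in for the model calls, and the result dictionary shape is hypothetical. One case is not covered by the text: what happens when the critique finds no issues. The sketch assumes the first score is kept unflagged.

```python
from typing import Callable, Optional

CONFIDENCE_THRESHOLD = 0.70  # from the text above


def score_with_self_critique(
    score: Callable[[list[str], Optional[list[str]]], dict],
    critique: Callable[[dict, list[str]], list[str]],
    evidence: list[str],
) -> dict:
    """Score once; if confidence is low, critique and re-score with findings."""
    first = score(evidence, None)
    if first["confidence"] >= CONFIDENCE_THRESHOLD:
        return {**first, "rescored": False, "needs_human_review": False}

    findings = critique(first, evidence)
    if not findings:
        # ASSUMPTION: with no issues found, the first score stands.
        return {**first, "rescored": False, "needs_human_review": False}

    # Re-run scoring with the critique findings as additional context.
    second = score(evidence, findings)
    still_low = second["confidence"] < CONFIDENCE_THRESHOLD
    return {**second, "rescored": True, "needs_human_review": still_low}
```

The `needs_human_review` flag is what surfaces in the final report when the re-scored result remains below the threshold.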
Self-critique leverages a well-documented property of LLMs: they are better at evaluating reasoning than generating it. The scoring model might produce a shaky justification, but the critique model can reliably identify that it is shaky. The combination produces more honest uncertainty signals than either step alone.
AMBER results get stress-tested by specialist personas arguing opposing positions. The boundary between AMBER and GREEN (or AMBER and RED) is where assessments matter most.
GREEN and RED ratings are usually unambiguous. The evidence clearly addresses the criterion, or it clearly does not. AMBER is where the difficult judgement calls live: is this "partially addressed" or "nearly complete"? Is this a genuine gap or a terminology mismatch?
These boundary decisions are exactly where human assessors disagree with each other, and where AI assessors are least reliable. The debate step forces the system to stress-test its AMBER ratings before finalising them.
GPT-4o runs a multi-persona adversarial debate. The personas depend on the framework:
| Framework | Debate Panel | Perspectives |
|---|---|---|
| IPA Gateway | 3 IPA specialists | Delivery confidence, commercial viability, governance & assurance |
| General frameworks | 2 generalists | One argues for upgrade (GREEN), one argues for downgrade (RED) |
Without adversarial debate, AMBER ratings are accepted without scrutiny. In practice, this means roughly 30% of AMBER ratings should be GREEN (the evidence is there but was under-weighted) and 10% should be RED (the evidence was over-interpreted). The debate step resolves these misclassifications before they reach the user.
Individual criterion scores roll up into a Delivery Confidence Assessment and an Executive Summary that tells the story behind the numbers.
Criteria are not treated as a flat list. They are grouped into dimensions — the higher-level categories that a framework uses. For IPA assessments, these might be: Strategic Case, Economic Case, Commercial Case, Financial Case, and Management Case. For custom frameworks, the grouping comes from the module pack or user-defined structure.
Each dimension gets its own summary score, derived from its constituent criteria. This reveals patterns: a business case might score GREEN on Strategic and Economic cases but RED on Management Case, signalling a specific weakness rather than overall failure.
Some criteria are flagged as showstoppers in the module pack. These are non-negotiable requirements where a RED rating means the assessment cannot receive an overall GREEN regardless of how well other criteria score. Examples include:
The overall Delivery Confidence Assessment combines dimension scores, showstopper flags, and a holistic judgement. It follows a rules + holistic model:
Any RED showstopper forces overall AMBER or RED. All dimensions GREEN and no showstoppers allows overall GREEN. These are non-negotiable.
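One way to read these rules is as a filter on which overall ratings the holistic step is allowed to choose. The sketch below takes that reading. It assumes that GREEN is permitted only when every dimension is GREEN and no showstopper is RED, and that a disallowed proposal falls back to AMBER; both are interpretations, and the function names are hypothetical.

```python
def allowed_overall(
    dimension_ratings: dict[str, str],
    showstopper_ratings: dict[str, str],
) -> set[str]:
    """Overall ratings the rules permit before any holistic judgement."""
    red_showstopper = any(r == "RED" for r in showstopper_ratings.values())
    all_green = bool(dimension_ratings) and all(
        r == "GREEN" for r in dimension_ratings.values()
    )
    if not red_showstopper and all_green:
        return {"GREEN", "AMBER", "RED"}
    # ASSUMPTION: without all-GREEN dimensions, GREEN is off the table.
    return {"AMBER", "RED"}


def clamp_overall(
    proposed: str,
    dimension_ratings: dict[str, str],
    showstopper_ratings: dict[str, str],
) -> str:
    """Keep the proposed rating if allowed, else fall back to AMBER (assumed)."""
    allowed = allowed_overall(dimension_ratings, showstopper_ratings)
    return proposed if proposed in allowed else "AMBER"
```

The narrative synthesis then operates inside whatever set this returns, so a persuasive summary cannot lift a run with a RED showstopper to GREEN.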
Within the boundaries set by rules, GPT-4o synthesises the criterion-level findings into a narrative. It identifies themes ("governance is strong but delivery planning is underdeveloped"), highlights the 3-5 most impactful findings, and recommends priority actions. This is the executive summary that programme leaders actually read.
When your documents change, the system re-assesses only what is affected — not everything from scratch.
Documents evolve between assessment runs. A section gets rewritten, new appendices are added, figures are updated. A full re-assessment every time is wasteful and makes it impossible to track what actually changed. You need to know: "this criterion moved from AMBER to GREEN because the stakeholder engagement section was expanded", not just "it is GREEN now".
The system maintains an evidence_links table that records which document chunks contributed to which criterion scores. When a document is re-uploaded, the system identifies which chunks changed by comparing embeddings and content hashes.
At the start of each criterion assessment in a new run, the system checks: has any evidence linked to this criterion changed since the last run? If not, the previous score carries forward without re-assessment. This is step 6 of the orchestration layer — the carry-forward check runs before the agentic pipeline.
Only criteria with changed evidence go through the full pipeline (retrieval, scoring, guardrails, and optionally the extended tier). The result is a new run that includes both carried-forward scores and freshly assessed scores, all under a single version number.
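The carry-forward decision can be sketched with content hashes alone. The guide mentions comparing embeddings as well; that part is omitted here. The sketch assumes chunk ids stay stable between uploads and models the evidence_links table as a mapping from criterion id to the set of chunk ids that supported its score. All function names are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_chunk_ids(old_chunks: dict[str, str], new_chunks: dict[str, str]) -> set[str]:
    """Chunk ids that were added, removed, or edited between two uploads."""
    old_hashes = {cid: content_hash(t) for cid, t in old_chunks.items()}
    new_hashes = {cid: content_hash(t) for cid, t in new_chunks.items()}
    all_ids = old_hashes.keys() | new_hashes.keys()
    return {cid for cid in all_ids if old_hashes.get(cid) != new_hashes.get(cid)}


def criteria_to_reassess(
    evidence_links: dict[str, set[str]],  # criterion id -> linked chunk ids
    changed: set[str],
) -> set[str]:
    """Criteria with at least one changed evidence chunk; the rest carry forward."""
    return {crit for crit, chunk_ids in evidence_links.items() if chunk_ids & changed}
```

A limitation of this simplified version is that a brand-new chunk with no existing link would not trigger re-assessment of any criterion, so a real system needs some way to match new content to criteria.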
| Scenario | Full Re-Assessment | Incremental |
|---|---|---|
| 50-criterion framework, 3 sections changed | ~25 minutes, 50 LLM calls | ~4 minutes, 8 LLM calls |
| Version comparison | Compare two full runs manually | System highlights exactly what changed and why |
| Audit trail | Two independent snapshots | Linked lineage showing score progression per criterion |
Infrastructure programmes iterate documents through multiple review cycles. Being able to say "we addressed 12 of the 15 AMBER findings from the last assessment, and here is the evidence for each" is transformative for governance meetings. It turns assessment from a point-in-time snapshot into a continuous improvement tool.
Not all AI document assessment is created equal. Here is how Programme Insights compares to the alternatives.
| Capability | Programme Insights | ChatGPT / Generic AI | Manual Assessment | Checkbox Tools |
|---|---|---|---|---|
| Evidence retrieval | Hybrid search + query expansion + RRF | Single-pass context window | Human reading | Keyword matching |
| Scoring method | Structured output + deterministic guardrails | Free-text opinion | Expert judgement | Binary pass/fail |
| Hallucination control | Citation verification + source tracing | None | N/A (human reads source) | None |
| Borderline handling | Adversarial debate with specialist personas | None | Panel discussion (if available) | None |
| Reproducibility | Temp 0.0 + fixed guardrails = same input, same output | Variable (temperature, prompt drift) | Low (assessor variability) | High (but shallow) |
| Audit trail | Criterion → evidence → source → page | None | Written report (if created) | Checkbox record |
| Incremental updates | Evidence lineage, targeted re-assessment | Full re-run required | Full re-read required | Full re-check required |
| Time for 50 criteria | 8-15 minutes | Not feasible (context limits) | 3-5 days | 1-2 hours (shallow) |
| Framework awareness | IPA, Green Book, NEC, nuclear, custom | General knowledge only | Depends on assessor | Template-dependent |
| Cost per assessment | Minutes of compute | Subscription + time | Consultant day rates | Licence + time |
Generic AI tools give you an opinion. Programme Insights gives you an evidence-based assessment with an audit trail. The opinion might be right. The evidence-based assessment shows you why it is right — and gives you something you can defend in a governance meeting.
A demo takes 30 minutes. Bring a document you are working on — we will assess it live against the framework of your choice.
programmeinsights.com
This guide explains the mechanism. The demo shows the results. If you assess documents against frameworks, whether IPA gateway reviews, Green Book appraisals, NEC compliance, or your own internal standards, this is built for you.
We built Programme Insights because we spent years watching good programmes get poor assessments due to time pressure, and watching AI tools produce confident nonsense. There is a better way.