The AI Assessor Is the Wrong Idea

Why AI Should Shape Decisions,
Not Just Assess Applications

A White Paper by Geoffrey Clow, Founder, Expert Grant Program Advisory

EXECUTIVE SUMMARY

The grant sector is under pressure to modernise assessment. Governments and funders are being pitched AI tools that promise faster processing, greater consistency, and reduced administrative burden. Many are buying. Almost none are asking the right questions.

The tools currently on offer operate almost entirely at the application layer: eligibility checking, document summarisation, preliminary scoring. This is first-generation thinking. It solves a workflow problem while leaving the integrity, equity, and intelligence problems in grantmaking completely untouched.

This paper argues that the AI assessor, a tool that reads and scores applications, is not the destination. It is a distraction. The real opportunity lies in AI that operates across five distinct layers of the grantmaking system: application processing, assessor cognition support, portfolio intelligence, equity and bias auditing, and programme learning across rounds. Almost no current product addresses more than the first of these. Almost no funder is asking vendors which layers they actually touch.

The core argument is simple. AI should improve human judgment, not simulate it. It belongs where pattern recognition, consistency checking, and institutional memory can strengthen decisions. It does not belong where values are traded off, where accountability is required, or where a funding decision must be explained to the people it affects.

Most tools improve throughput. Almost none improve judgment.

This paper provides a five-layer framework for thinking about AI in grantmaking, documents the evidence for why equity and bias auditing is the sector’s most urgent unmet need, and closes with a practical set of questions that every funder and programme manager should put to a vendor before signing anything.

Funders being pitched AI assessment tools are not simply being asked to adopt a technology. They are being asked to adopt a design philosophy: a decision about how power is exercised, not just about which product to buy.

 

SECTION ONE: THE PROBLEM AND THE LANDSCAPE

The wrong question

The grant sector has built AI tools that process applications. It should have built AI tools that improve decisions.

Those are not the same thing.

The difference is not technical. It is structural. And it is where most of the money, time, and genuine potential in AI for grantmaking is currently being lost.

This is not an argument against AI. It is an argument against a specific, widely shared mistake: treating the application as the unit of analysis, building technology to process it faster, and calling that progress.

The application is the most visible surface of a grantmaking system. It is not the system.

The system is the judgment happening under pressure, which is to say the point in a long assessment day where consistency quietly collapses and nobody notices. The portfolio that emerges from dozens of small, inconsistent decisions. The criteria that reward fluency in funder language while filtering out the very communities the programme was designed to reach.

AI that operates only at the application layer does not improve this system. It accelerates it.

 

Where we are now

The current state of AI in grantmaking can be described in a single sentence: most tools cluster in a narrow slice of the process and almost no one is thinking across the full system.

The dominant pattern remains manual review supplemented by light automation. Panels of assessors, working against rubrics of variable quality, produce recommendations that programme managers exercise discretion over. Bias checks, where they exist at all, are ad hoc. Learning between rounds is largely informal, carried in the heads of people who may not be in the room next time. AI has entered this system at its most visible point, the application, without touching the layers where the real decisions are shaped.

A brief map of what currently exists:

Fluxx positions its AI features around distilling complex application data into actionable summaries and streamlining reviewer workflows. Primarily Layer One, with some Layer Two workflow support.¹

Submittable and Foundant GLM, the dominant platforms for high-volume processing and community foundation grantmaking respectively, have introduced AI-assisted intake, automated eligibility screening, and reviewer management tools. Layer One.

Instrumentl’s Apply module generates proposal drafts from existing funder data and helps applicants understand alignment with funder criteria. An applicant-side Layer One tool.

SmartyGrants, Australia’s leading grants management platform, is developing a product it has named Tessa the Assessor, explicitly positioned around streamlining the assessment of grant applications. Layer One by design and by intent.²

Post-hoc DEI analytics tools, emerging in the philanthropic sector primarily in North America, provide retrospective reporting on funded portfolio demographics. These approach Layer Four but operate after decisions are made, not within the assessment process itself.

The pattern is clear. The market has built faster roads to the same destination. No current mainstream product asks what kind of portfolio the applications should produce. None analyse whether the assessment criteria are generating useful signal. None surface whether the programme’s funding history contains patterns of structural disadvantage worth examining before the next round opens.

Existing tools cluster in narrow slices and almost no one is thinking across all five layers.

 

A different frame

With that landscape sketched, here is a different way to think about where AI belongs in assessment. Not as a tool that processes what applicants submit, but as a capability that operates across the whole system, from the moment an assessor picks up a form to the moment a programme manager asks whether this round’s decisions look like the programme’s stated intent. The five layers below describe what that looks like in practice.

 

SECTION TWO: THE FIVE LAYERS

Where AI can actually change something

The question is no longer whether to use AI in grantmaking. It is what kind of intelligence we want our grant systems to have.

The five layers are: application processing, assessor cognition support, portfolio intelligence, equity and bias auditing, and programme learning. Each is more consequential than the one before it. The framework below defines what operating at each layer actually means.

Think of grantmaking less like a conveyor belt and more like a kitchen. The application form is what arrives at the pass. But long before it gets there, someone designed the menu, wrote the recipes, trained the line, and decided what “done” looks like. You can put a robot on the pass. It will plate the food faster. It will not fix the menu, the recipe, or the kitchen losing its best people.

The five layers below describe where AI can add genuine intelligence, not just processing speed. They are not a product taxonomy. They are a design framework.

 

Layer One: Application Processing

Eligibility checking. Completeness flagging. Document summarisation. Basic scoring assistance. Done well, this layer reduces administrative burden meaningfully. It is also the only layer most current tools occupy. That is the problem this paper is about.

 

Layer Two: Assessor Cognition Support

The assessor is not the problem to be automated away. She is the most valuable part of the process. She carries domain knowledge no training dataset fully captures, and ethical judgment no rubric fully encodes. What she lacks, usually, is time, consistency tools, and a way to sense-check her own reasoning against the broader pattern of the round.

An AI copilot for assessors is completely different from an AI that processes applications. A reviewer highlights a paragraph; the copilot identifies which criteria it speaks to, suggests a provisional score, and flags where this passage conflicts with how similar passages were scored in previous rounds. Not to override the assessor. To give that judgment better material to work with.

An argument decomposer restructures the case for support as a set of claims, evidence, assumptions, and risks. Not a summary. A logic map. The assessor can now interrogate whether the argument actually holds, rather than navigating prose that may be eloquent but hollow, or rough-hewn but sound.
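To make the idea concrete, a logic map can be as simple as a structured record the assessor can interrogate. The sketch below (in Python; the field names and helper are illustrative assumptions, not a product specification) shows the shape of such an output:

    from dataclasses import dataclass, field

    @dataclass
    class Claim:
        text: str                                              # the claim as stated in the application
        evidence: list[str] = field(default_factory=list)      # passages offered in support
        assumptions: list[str] = field(default_factory=list)   # unstated premises the claim relies on
        risks: list[str] = field(default_factory=list)         # ways the claim could fail in practice

    @dataclass
    class LogicMap:
        application_id: str
        claims: list[Claim] = field(default_factory=list)

        def unsupported_claims(self) -> list[Claim]:
            # Claims with no evidence attached are the first thing to interrogate.
            return [c for c in self.claims if not c.evidence]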

Then there is the counterfactual assistant, which may be the most practically powerful idea in this paper. It allows an assessor to ask in real time: if I lower this score from four to three, which other applications in this round are now treated differently, and does that feel consistent? That question is not asked today. Assessors work through applications sequentially, and nobody watches for the drift that accumulates across a long day of scoring. A tool that makes portfolio-level consistency visible during the assessment process, not after, changes the quality of judgment in ways no post-hoc audit can.
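A minimal sketch of the consistency check underneath such a tool, assuming criterion scores are held in a simple table (the structure, names, and example are illustrative assumptions):

    from dataclasses import dataclass

    @dataclass
    class CriterionScore:
        application_id: str
        criterion: str
        score: int     # e.g. 1 to 5
        note: str      # assessor's one-line rationale

    def affected_by_change(scores: list[CriterionScore], app_id: str,
                           criterion: str, new_score: int) -> list[CriterionScore]:
        # Surface other applications whose existing score on the same criterion
        # sits inside the band the proposed change crosses, so the assessor can
        # ask whether the difference is justified or just drift.
        current = next(s for s in scores
                       if s.application_id == app_id and s.criterion == criterion)
        lo, hi = sorted((current.score, new_score))
        return [s for s in scores
                if s.criterion == criterion
                and s.application_id != app_id
                and lo <= s.score <= hi]

If Application 14's evidence score drops from four to three, every other application scored three or four on that criterion is listed for a quick consistency check before the change is committed.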

 

Layer Three: Portfolio Intelligence

Most grant programmes are funded on intent. The strategy says something about equity, regional spread, emerging organisations, evidence-based practice. Then the round closes, the decisions are made, and someone in the strategy team looks at the funded list and notices that, once again, metropolitan organisations with full-time grant-writers took the majority of the funding.

There is usually a conversation. There is rarely a mechanism.

A portfolio-first model inverts the usual logic. Instead of assessing each application in isolation and aggregating whatever portfolio emerges, the programme defines the portfolio it is trying to build first, and assessment works toward that shape. A target-seeking optimiser takes the programme’s stated parameters across geography, organisation type, thematic focus, evidence base, and risk tolerance, and generates the combination of applications that best approximates that target. Where the data cannot satisfy all parameters simultaneously, it produces explicit trade-off explanations. The programme manager now has a conversation grounded in evidence rather than instinct.
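In heavily simplified form, a target-seeking optimiser can be sketched as a greedy selection that trades assessed quality against distance from a target portfolio share. The sketch below uses a single regional-funding target and an arbitrary penalty weight; these are illustrative assumptions, and a real product would run proper constrained optimisation over many parameters at once:

    from dataclasses import dataclass

    @dataclass
    class Application:
        app_id: str
        score: float    # assessed quality score
        amount: float   # requested funding
        region: str     # "metro" or "regional" in this toy example

    def build_portfolio(apps, budget, target_regional_share, penalty=2.0):
        # Repeatedly pick the application whose quality score, minus a penalty for
        # pushing the portfolio away from the target regional share, is highest.
        chosen, spent, regional = [], 0.0, 0.0
        pool = list(apps)
        while pool:
            def adjusted(a):
                new_spent = spent + a.amount
                new_regional = regional + (a.amount if a.region == "regional" else 0)
                return a.score - penalty * abs(new_regional / new_spent - target_regional_share)
            pool.sort(key=adjusted, reverse=True)
            best = pool.pop(0)
            if spent + best.amount > budget:
                continue  # cannot afford this one; consider the rest
            chosen.append(best)
            spent += best.amount
            regional += best.amount if best.region == "regional" else 0
        share = regional / spent if spent else 0.0
        tradeoff = (f"Regional funding share achieved: {share:.0%} against a target of "
                    f"{target_regional_share:.0%}. Any gap is the explicit trade-off to discuss.")
        return chosen, tradeoff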

Diversity stress-testing goes further. Before recommendations are locked, the system simulates thousands of alternative panel compositions and criteria weightings and identifies where results are fragile: where the outcome depends heavily on one reviewer’s preferences, or where a small change in weighting flips the funded cohort significantly. If your portfolio is one tired assessor away from looking completely different, you should know that.
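A stress test of this kind can be sketched as a simple Monte Carlo over criteria weightings. The jitter range and run count below are illustrative assumptions, and a real tool would also vary panel composition, not just weights:

    import random

    def stress_test(scores, weights, funded_slots, runs=2000, jitter=0.25):
        # scores: {app_id: {criterion: score}}; weights: {criterion: weight}.
        # Perturb the weights many times and count how often each application
        # moves in or out of the funded set relative to the baseline.
        def funded_set(w):
            totals = {app: sum(w[c] * s for c, s in per_crit.items())
                      for app, per_crit in scores.items()}
            ranked = sorted(totals, key=totals.get, reverse=True)
            return set(ranked[:funded_slots])

        baseline = funded_set(weights)
        flips = {app: 0 for app in scores}
        for _ in range(runs):
            perturbed = {c: w * random.uniform(1 - jitter, 1 + jitter)
                         for c, w in weights.items()}
            alternative = funded_set(perturbed)
            for app in scores:
                if (app in baseline) != (app in alternative):
                    flips[app] += 1

        # Applications that flip often are the fragile edge of the portfolio.
        return {app: n / runs for app, n in sorted(flips.items(), key=lambda kv: -kv[1])}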

The missing applications radar may be the most politically courageous tool in this layer. Using open data, community mapping, and historical round records, it identifies which communities, geographies, or issue areas should be present in the application pool but are not. That absence is not a neutral fact. It is a signal about barriers to access. Surfaced before the next round opens, it becomes an outreach brief. Ignored, it becomes a perpetuated inequity.
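At its core, the radar is a comparison between where applications should plausibly come from and where they actually have come from. A minimal sketch, assuming a reference list of areas drawn from open data and application records grouped by round:

    from collections import Counter

    def missing_applications(reference_areas, applications_by_round, min_expected=1):
        # reference_areas: every area the programme should plausibly reach.
        # applications_by_round: one list of applicant areas per historical round.
        # Returns areas that have never, or almost never, appeared in the pool.
        seen = Counter(area for round_apps in applications_by_round for area in round_apps)
        gaps = {area: seen.get(area, 0) for area in reference_areas
                if seen.get(area, 0) < min_expected}
        # Each entry is a prompt for outreach, not an accusation: the absence is the signal.
        return dict(sorted(gaps.items(), key=lambda kv: kv[1]))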

 

Layer Four: Equity and Bias Auditing

Most funding inequity is not the result of bad decisions. It is the result of systems that were never designed to notice it.

This is the layer that makes funders uncomfortable, which is precisely why it matters most.

The evidence is not speculative. It is documented, consistently, across some of the best-resourced funding systems in the world. Research into US National Institutes of Health grant outcomes found that applications from Black and African-American principal investigators were funded at significantly lower rates than those from white investigators, a disparity that persisted after controlling for measurable differences in application quality.³ Analysis of the US National Science Foundation found that in some research directorates, white principal investigators experienced a funding rate advantage of 1.7 times over Black and African-American applicants.⁴ Canadian health research data showed that female applicants with equivalent past success rates to their male counterparts received lower reviewer scores, and less experienced male applicants were funded at higher rates.⁵ A separate analysis of Canada’s Natural Sciences and Engineering Research Council found systemic bias targeting applicants from smaller institutions, consistent across three separate assessment criteria and multiple funding years.⁶

These are not anomalies. They are patterns. And they are invisible inside any single assessment round because they only become visible at scale, across time, when someone is actually looking for them.

One technique is borrowed directly from credit risk auditing, where questions of fairness and systemic bias are actively measured: run a machine learning model trained on historical funding decisions in shadow mode alongside current assessment, and compare its recommendations with panel outcomes to locate systematic patterns of disadvantage. The stakes in grantmaking are no less serious. Fairness testing tools allow programme managers to run scenarios: what happens to our funding distribution if we reduce the weight placed on organisational track record for first-time applicants? What changes if we stop using writing quality as a proxy for organisational capacity?
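At its simplest, a shadow-mode audit reduces to comparing funding rates by applicant group under the panel's actual decisions and under the model's recommendations. A minimal sketch, assuming decisions and recommendations have been accumulated across rounds into simple records (the field names are assumptions):

    from collections import defaultdict

    def shadow_audit(records):
        # records: dicts with 'group' (e.g. geography or organisation type),
        # 'panel_funded' (bool) and 'model_recommended' (bool).
        by_group = defaultdict(lambda: {"n": 0, "panel": 0, "model": 0})
        for r in records:
            g = by_group[r["group"]]
            g["n"] += 1
            g["panel"] += r["panel_funded"]
            g["model"] += r["model_recommended"]
        # Report funding rates by group for panel and model, and the gap between them,
        # so systematic divergence can be located rather than guessed at.
        return {group: {"panel_rate": g["panel"] / g["n"],
                        "model_rate": g["model"] / g["n"],
                        "gap": g["panel"] / g["n"] - g["model"] / g["n"]}
                for group, g in by_group.items()}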

Bias-aware triage confines AI assistance to the clearly ineligible and obviously uncompetitive ends of the pool, while continuously auditing false negatives by applicant group, geography, and organisation type. If your triage tool is disproportionately eliminating applications from rural organisations, from culturally and linguistically diverse groups, or from organisations under three years old, that pattern should surface before it becomes a funding record.
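The auditing half of bias-aware triage is straightforward to sketch: compare each group's screen-out rate against the overall rate and flag outliers. The ten-percentage-point margin below is an arbitrary illustrative threshold, not a recommended standard:

    from collections import defaultdict

    def triage_audit(triaged, margin=0.10):
        # triaged: dicts with 'group' (e.g. rural, CALD, under three years old)
        # and 'screened_out' (bool, removed by the triage step).
        overall = sum(t["screened_out"] for t in triaged) / len(triaged)
        by_group = defaultdict(lambda: [0, 0])          # [screened_out, total]
        for t in triaged:
            by_group[t["group"]][0] += t["screened_out"]
            by_group[t["group"]][1] += 1
        flags = {}
        for group, (out, n) in by_group.items():
            rate = out / n
            if rate > overall + margin:
                flags[group] = {"screen_out_rate": rate, "overall_rate": overall, "n": n}
        return flags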

One thing needs to be said plainly: AI trained on historical data will, without careful design, reproduce the biases embedded in that data. Automation is not a fairness technology by default. It becomes one only when the programme explicitly designs for fairness measurement and builds the human accountability structures to act on what it finds. The goal here is not to launder inequitable outcomes through an algorithm. The goal is to make bias visible, measurable, and addressable. That is a design choice, not a product feature. No vendor can make it for you.

 

SECTION THREE: LAYER FIVE AND THE LEARNING SYSTEM

Programme Learning

The hardest thing to change in grantmaking is the form.

Not because forms are sacred, but because changing them is work, and the people with the institutional memory to know which questions have stopped generating useful signal are rarely the people with the authority and bandwidth to rewrite the guidelines. So programmes run the same questions round after round. Applicants learn to answer them. The criteria drift away from what the programme actually cares about. Nobody can quite articulate why the funded cohort looks slightly wrong each time.

A self-tuning rubric analysis looks at every criterion after every round and asks: did this discriminate? Did it separate strong proposals from weak ones, or did it reward fluency in funder language regardless of underlying quality? Low-signal elements are flagged for revision before the next guidelines cycle. The programme now has a feedback loop it currently lacks.
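A basic version of that signal analysis needs nothing more exotic than the scores the programme already holds. The sketch below reports, for each criterion, its score variance and the gap between funded and unfunded applications; what counts as low signal is a programme-level judgment and is deliberately not hard-coded here:

    import statistics

    def criterion_signal(scores, funded_ids):
        # scores: {app_id: {criterion: score}}; funded_ids: set of funded applications.
        # Assumes at least one funded and one unfunded application per round.
        criteria = next(iter(scores.values())).keys()
        report = {}
        for c in criteria:
            all_scores = [s[c] for s in scores.values()]
            funded = [s[c] for app, s in scores.items() if app in funded_ids]
            unfunded = [s[c] for app, s in scores.items() if app not in funded_ids]
            report[c] = {
                "variance": statistics.pvariance(all_scores),
                "funded_vs_unfunded_gap": statistics.mean(funded) - statistics.mean(unfunded),
            }
        # Low variance or a near-zero gap suggests the criterion is not separating
        # strong proposals from weak ones.
        return report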

Synthetic application modelling generates plausible edge cases at the margins of the criteria, trained on historical applications, to stress-test assessment logic before real applicants encounter its failure points. And for jurisdictions running multiple programmes, the cross-programme meta-assessor looks across schemes to identify duplication, conflicting incentives, and gaps where clearly emerging community needs fall between all current schemes. Not to centralise decision-making. To make the architecture of public grantmaking visible to the people responsible for designing it.

 

What it looks like when Layer Five doesn’t exist

Consider a community infrastructure programme — the kind that exists in dozens of jurisdictions, administered by a state or territory agency, running annual rounds with a modest pool of between two and five million dollars. The criteria were written in year one by a small policy team working under time pressure. They were sensible criteria: demonstrated community need, organisational capacity, evidence of partnerships, value for money. Nobody disagreed with them. Nobody tested them.

Six rounds later, the programme has funded forty-three projects. Thirty-one of them have gone to twelve organisations. Nine of those twelve are based in the three largest metropolitan areas. The programme’s stated intent includes regional equity and support for emerging community organisations. The funded portfolio reflects neither.

Nobody made a decision to produce that outcome. No assessor chose to favour metropolitan organisations. No panel meeting concluded that first-time applicants were less worthy. What happened was quieter and harder to see: the criteria rewarded track record, because track record felt like evidence of capacity. Track record favoured established organisations, because established organisations had track record. Established organisations were concentrated in metropolitan areas, because that is where organisations tend to become established. The logic was circular from the beginning. It just took six rounds for the circle to close completely.

At some point a new programme manager looks at the cumulative data and asks the question that should have been asked in year two: what if the problem is the criteria, not the applicants?

That question is Layer Five. Not as an insight, which programme managers occasionally have, but as a designed capability: measurement built in from round one, signal analysis after every round, criteria review before the next guidelines cycle opens. Without that infrastructure, the question arrives too late, after the portfolio has fossilised and the excluded communities have stopped applying.⁷

 

SECTION FOUR: THE RED LINES

What AI should never be allowed to do

The funding decision belongs to a human being who can be named, questioned, and held accountable. That is not a legal formality. It is the load-bearing wall of the whole structure. Remove it and everything above it becomes a mechanism for obscuring the exercise of power behind a model’s output.

Here are the lines that should not be crossed, stated without softening.

AI should not make funding decisions. Not recommend with such confidence that the human reviewer is functionally a rubber stamp. Not operate at such volume or speed that meaningful human review is impossible. The human in the decision seat needs enough information, enough independence, and enough time to exercise genuine judgment. If those conditions don’t exist, the governance is fiction.

AI outputs must be explainable to an unsuccessful applicant. If a model cannot produce a plain-language account of why an application was deprioritised, it cannot be used in any process where applicants have a right to reasons. In most public and philanthropic grantmaking, they do.

The communities most affected by a programme’s decisions should have a say in what fairness means before an AI is designed to measure it. Fairness is not a technical parameter. It is a values question, and it belongs to the people with the most at stake, not the people with the most compute.

The most dangerous version of AI in grantmaking is not the one that makes bad decisions. It is the one that launders decisions already being made, giving them the appearance of objectivity while the actual logic stays opaque and uncontested. That is worse than a panel room with a tired assessor. At least that assessor can be asked why.

 

SECTION FIVE: DESIGN PRINCIPLES AND VENDOR QUESTIONS

What to build, and how to build it right

Start with the question, not the tool. What are you actually trying to improve? Faster processing is a layer-one problem. Better decisions, fairer reach, and institutional learning are layers two through five. Know which problem you have before you buy anything.

Build the governance before you build the model. Who owns the AI output? Who can challenge it? Who is responsible when it is wrong? These questions must be answered before deployment, not after the first complaint.

Audit for equity from round one. Don’t wait until a pattern is visible to the naked eye. Define what fairness means for your programme’s specific context, build measurement in from the beginning, and design the system to surface deviations.

Treat the training data as a policy document. Historical decisions encode years of values, pressures, and errors. Before you train anything on that data, understand what it contains. An uncritical model is a mechanism for repeating the past more efficiently.

Demand an audit trail. Every AI-assisted decision should produce a record of what the system recommended, what the human decided, and where those diverged. If the product doesn’t produce that record, you are flying blind.
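What such a record can look like is easy to sketch. The fields below are illustrative assumptions rather than a schema recommendation; the essential properties are that the log is append-only and that divergence between recommendation and decision is visible at review time:

    from dataclasses import dataclass, asdict
    import json

    @dataclass
    class AuditRecord:
        application_id: str
        system_recommendation: str   # e.g. "fund", "do not fund", "refer"
        system_rationale: str        # plain-language account the system produced
        human_decision: str
        human_rationale: str
        decided_by: str              # the named, accountable decision-maker
        timestamp: str               # ISO 8601

    def log_decision(record: AuditRecord, path: str = "audit_trail.jsonl") -> None:
        # Append-only log; divergence is recorded explicitly so it can be reviewed.
        entry = {**asdict(record),
                 "diverged": record.system_recommendation != record.human_decision}
        with open(path, "a") as f:
            f.write(json.dumps(entry) + "\n")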

 

What to demand from vendors

Which of the five layers does your product actually operate on? A product that summarises applications and checks eligibility is a layer-one tool. That is fine, but it should be said clearly. It is not an AI assessment solution.

How does your product handle applicants who write in English as a second language, or who come from communities with lower formal grant-writing capacity? If the vendor hasn’t thought about this, keep asking until they have.

Can your product explain a funding recommendation in plain language to an unsuccessful applicant? If not, it cannot be used in any process where applicants have a right to reasons.

How is bias measured and reported across rounds? Bias mitigation in training data is not the same as ongoing bias auditing. Ask what the difference is. Watch for confusion.

What happens when your product is wrong? Where does the error go? Who reviews it? What is the remediation path for an applicant disadvantaged by a model error?

Can we see the audit trail? If the answer is no, the answer to the contract should also be no.

 

CONCLUSION: THE ROOM THAT NEEDS TO CHANGE

Governments and funders are under real pressure to modernise grant assessment. That pressure is legitimate. Assessment processes are slow, inconsistent, and, as the evidence now clearly shows, systematically inequitable in ways that no amount of reviewer training or diversity statement has managed to fix. The demand for better tools is not misplaced. The tools being offered in response to that demand are.

The AI assessor, a product that reads and scores applications faster and more consistently than a tired panel, addresses a symptom. The five-layer framework in this paper addresses the system. That is the distinction that matters, and it is the distinction that funders, programme managers, and policy leads need to be able to make when a vendor arrives with a pitch deck and a product called something like Tessa.

The contribution of this paper is a vocabulary and a framework. The five-layer model gives the sector language for a conversation it is not yet having. The vendor questions in Section Five give procurement decision-makers a set of tests that separate serious products from polished pitches. The equity evidence in Layer Four gives programme managers the empirical grounding to make the bias-auditing case internally, not as a values argument but as a documented pattern with measurable consequences.

The call to action is practical. Before adopting any AI product for grant assessment, commission a layered design review that maps the proposed tool against all five layers and names which ones it touches and which ones it leaves untouched. Run a Layer Five audit of your current programme: map your funded portfolio against your stated intent across every round you have data for, and ask what the gap reveals. Insist that every vendor you engage with can answer the six questions in Section Five completely and specifically.

Picture the panel room. The coffee is mediocre. The stack of applications is real. The exhausted assessor has a flight to catch.

But here is what could be different. The assessor copilot flags a contradiction between an application’s evidence claims and how similar claims scored last round. The portfolio dashboard shows three of the strongest-scoring applications this round come from the same two postcodes, and the programme target says something different. The missing applications radar generated a targeted outreach campaign six weeks before the round closed, and two applications arrived from communities that have never engaged before. The rubric has been reweighted based on last round’s signal analysis, and two questions that consistently rewarded writing fluency over organisational capability have been removed.

The human still decides. The human should still decide. But the human now has something they have never had before: the full picture.

That is what AI in grantmaking can be when it is designed as a systems tool. It does not replace the judgment in that room. It makes the judgment worth having.

The sector that builds toward that, rather than settling for the AI assessor, will fund better things, reach further into communities that need capital, and learn from its own history instead of repeating it.

That is worth building toward.

Expert Grant Program Advisory, founded by Geoffrey Clow, specialises in grant programme design, assessment architecture, and the application of intelligent systems to public and philanthropic grantmaking.

NOTES

¹ Fluxx AI product documentation. fluxx.io/ai-grants-management-software. Accessed March 2026.

² Schulz, M. (2026, March 24). SmartyGrants is getting smarter with the way it uses artificial intelligence. Institute of Grants Management. Group Managing Director Denis Moriarty names the intended AI use cases as summarising complex applications, assisting with assessment, identifying anomalies or compliance breaches, analysing large datasets, and improving reporting and programme evaluation. All five sit at Layer One or early Layer Two.

³ Ginther, D.K. et al. (2011). Race, Ethnicity, and NIH Research Awards. Science, 333(6045), 1015–1019. See also: eLife (2021). Equity, Diversity and Inclusion: Racial inequity in grant funding from the US National Institutes of Health. elifesciences.org/articles/65697.

⁴ Taffe, M.A. & Gilpin, N.W. (2022). Systemic racial disparities in funding rates at the National Science Foundation. eLife, 11, e83071.

⁵ Tamblyn, R. et al. (2018). The association between gender and review of grant applications by the Canadian Institutes of Health Research. CMAJ, 190(16).

⁶ Lavergne, M. & Malacrino, D. (2016). Bias in Research Grant Evaluation Has Dire Consequences for Small Universities. PLOS ONE, 11(6). doi:10.1371/journal.pone.0155876.

⁷ For a framework on how evaluation systems can be designed to detect and correct criteria drift, see: Open Research Funders Group & Health Research Alliance. (2023). Exploring Unconscious Bias in Grant Review. Open & Equitable Model Funding Program. orfg.org.
