M&E Blog: July 2026

“Thou shalt use AI responsibly”

The current debate about AI often gets stuck in a fairly familiar place.

You can use AI, but you need to use it “well”.
Do not pass AI-generated work off as your own.
Do not let it do your analysis for you.
Do not let it write your report for you.
Be careful: it hallucinates references.

All of this is true. But I think it misses something more subtle.

AI, chocolate and cognitive offloading

For many evaluators, researchers and knowledge workers, the real risk is not that we deliberately ask AI to do work we know we should be doing ourselves. The real risk is that, when we are tired, under pressure, cognitively overloaded or facing a difficult piece of judgement, we slowly let the tool carry more of the thinking than we intended.

The issue is cognitive offloading.

Cognitive offloading is not new. Humans have always used external tools to reduce mental effort: notebooks, calculators, checklists, templates, search engines, colleagues, diagrams and frameworks. In that sense, AI is part of a long history of tools that extend our thinking.

But generative AI changes the nature of the offloading. It does not only store information or speed up a calculation. It can summarise, classify, infer, draft, compare, frame and argue. In other words, it can begin to look like analysis.

And here is the uncomfortable part: even when we know the risks, we are still human.

For anyone who has ever tried to diet while sitting next to chocolate, this is not hard to understand. You may have a clear rule. You may genuinely believe in the rule. You may even have defended the rule publicly. But when you are tired, stressed or depleted, the chocolate becomes much harder to resist.

AI is a bit like that.

The concern is not only: Did you use AI?

The more important question is: At what point did the judgement move?

This is where I think Guba and Lincoln have something surprisingly useful to offer.

Guba and Lincoln enter the room…

In a meeting recently, someone made a throwaway comment about Guba and Lincoln. I responded with a small verbal protest. This was probably lost on anyone who has wisely avoided the old evaluation “paradigm wars”, including the Lincoln–Sechrest debate, in which Lincoln stood for constructivist, stakeholder-shaped evaluation and Sechrest defended disciplined evidence, validity and causal reasoning.

Niche evaluation lore? Definitely.

Still useful for thinking about AI? Surprisingly, yes.

Afterwards, I kept thinking about it. Guba and Lincoln’s Fourth Generation Evaluation challenged the idea that evaluation quality should be judged only by the conventional research criteria of validity, reliability, generalisability and objectivity. For constructivist, qualitative and naturalistic inquiry, they proposed a different language of quality: credibility, transferability, dependability and confirmability.

These criteria have since become familiar shorthand for trustworthiness in qualitative research. And trustworthiness, it seems to me, is exactly where the AI debate needs to go.

From “Thou shalt not” to practical questions

Telling people not to offload their thinking is probably not enough.

It is a bit like saying:

Do not be biased.
Do not make weak inferences.
Do not overclaim.
Do not eat the chocolate.

The instruction is correct, but not very operational.

Guba and Lincoln help translate the concern into practical checks. Let me start with one of their criteria: dependability.

Dependability: is the process traceable?

The dependability criterion asks whether the inquiry process is coherent, logical and traceable.

In qualitative research, this usually means that the process should be documented well enough for someone else to understand how the findings were produced. It does not require mechanical replication, but it does require an audit trail: records of key decisions, changes in focus, coding choices, analytic memos, and the movement from raw data to interpretation.

For AI use, I find dependability especially useful because it helps me ask:

Is the human analytic process still visible, or has the route from evidence to claim become blurred?

Application 1: using AI to strengthen a coding framework

When I use AI in qualitative analysis, I do not start by asking it to “find the themes”. I begin by reading through the data myself, making notes on possible themes, codes, tensions and patterns.

Only then do I use AI as a thinking partner. For example, I might ask:

“Let’s develop a deductive coding framework. Based on the themes I have identified, suggest a coding memo for each theme, possible sub-themes, and five quotes I can use to test whether the code is working.”

At that point, the work becomes iterative. I might respond:

“I like the distinction you are making here, but I think these two codes overlap.”
“This code is too broad.”
“This seems to miss what participants are saying about implementation conditions.”
“Let’s test this theme against more quotes.”

I may then move into a more inductive round:

“I want to see what the data says about this theme. Pull together all quotes that seem related to it, including quotes that complicate or contradict it.”

The coding framework is not dependable simply because AI helped produce it. It becomes more dependable because it has been tested, revised and documented through several rounds of analytic questioning.

AI helps speed up the iteration, but the analytic judgement remains mine.

Application 2: checking whether an AI-supported executive summary is dependable

I also use the dependability criterion when AI helps me draft or refine an executive summary.

An executive summary can become polished very quickly. That is useful, but also risky. The more fluent the summary becomes, the easier it is to lose sight of where each claim came from.

So I ask AI to help me rebuild the audit trail:

“Please create a table listing each claim in the executive summary. For each claim, indicate which document, section, dataset or finding it is based on.”

I then check the table against the source material. I correct weak links, remove claims that are not well supported, and adjust wording where the claim has become too strong.
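For readers who keep such a table in a script rather than a document, the check described above could be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not a description of any particular tool: the names ClaimRecord and needs_attention, the field names and the example rows are all hypothetical. The one design choice worth noting is that a claim only counts as checked once a human has read the source; AI filling in the source column does not count.

```python
from dataclasses import dataclass


@dataclass
class ClaimRecord:
    """One row of the claim-to-evidence table (all field names are illustrative)."""
    claim: str
    source: str = ""                 # document, dataset or finding the claim rests on
    section: str = ""                # where in the source the evidence sits
    checked_by_human: bool = False   # set only after reading the source yourself


def needs_attention(records):
    """Return claims that have no stated source or have not yet been checked."""
    return [r for r in records if not r.source.strip() or not r.checked_by_human]


# Hypothetical rows, for illustration only.
table = [
    ClaimRecord("Claim A", source="Interview dataset", section="Theme 2",
                checked_by_human=True),
    ClaimRecord("Claim B", source="Survey report", section="Section 4"),
    ClaimRecord("Claim C"),
]

for record in needs_attention(table):
    print(f"Revisit: {record.claim}")
```

In this sketch, Claim B is flagged because nobody has yet verified its stated source, and Claim C because it has no source at all. Whether to soften, support or remove a flagged claim remains a human judgement.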

I may then ask:

“Based on the information I provided, are there any findings, stakeholder views or pieces of evidence that might contradict or complicate these claims?”

This helps me check whether the summary has become too neat. If there are counterclaims or important qualifications, I decide whether the executive summary needs to acknowledge them, soften the wording, or adjust the argument.

Only after that do I use AI for board-ready refinement:

“Please simplify this into a clearer board-ready message.”
“Make the tone more direct.”
“Reduce repetition.”
“Sharpen the main implication.”

Again, the key issue is not whether AI was used. The key issue is whether I can still trace the movement from evidence, to claim, to final wording.

For me, that is what dependability looks like in an AI workflow: not avoiding AI, but using it in a way that keeps the analytic path visible.

And then, where it matters, I put the evidence trail into annexes.

Often many, many pages of annexes.

Will everyone read them? Almost certainly not. But that is not the only point. The annexes make the reasoning visible and verifiable. They show how claims were built, what evidence they rest on, and where the qualifications sit.

They also create a written record that future machine synthesis will probably love: structured, source-linked, explicit reasoning that can be checked, reused and interrogated.

So yes, the annexes may be unloved by human readers.

But they are part of the dependability infrastructure.

Beyond dependability

I have focused here on dependability because it is the criterion that most directly helps me manage AI use in my workflow. It asks whether the process is traceable: whether the path from evidence, to interpretation, to final wording is still visible.

But the same logic applies to the other trustworthiness criteria.

For credibility, I use AI to test whether claims are well supported, whether counter-evidence has been considered, and whether the final account is believable in relation to the data.

For transferability, I check whether the context has survived the synthesis: whether the specific conditions, boundaries and setting are still visible, rather than being smoothed into a generic lesson.

For confirmability, I ask whether I can still distinguish between what came from the evidence, what came from my judgement, and what AI helped me formulate.

The bottom line

So for me, Guba and Lincoln are more than a nostalgic evaluation reference. They offer a practical way of asking a very current question:

Not “Did you use AI?” but “Was the use of AI trustworthy?”**

And if the answer is yes, can you show how?

**************************************************************

**Yes, Charlie and I wrote this blog post together. The giveaway?

The "Not this, but that" structure (also known as contrastive negation or antithetical parallelism) is a highly recognizable rhetorical pattern heavily favored by ChatGPT and other LLMs.

*** Photo by June Gathercole on Unsplash

Friday, July 03, 2026

What Guba and Lincoln offer to the use of AI debate