“Thou shalt use AI responsibly”
The current debate about AI often gets stuck in a fairly
familiar place.
You can use AI, but you need to use it “well”.
Do not pass AI-generated work off as your own.
Do not let it do your analysis for you.
Do not let it write your report for you.
Be careful: it hallucinates references.
All of this is true. But I think it misses something more
subtle.
AI, chocolate and cognitive offloading
For many evaluators, researchers and knowledge workers, the
real risk is not that we deliberately ask AI to do work we know we should be
doing ourselves. The real risk is that, when we are tired, under pressure,
cognitively overloaded or facing a difficult piece of judgement, we slowly let
the tool carry more of the thinking than we intended.
The issue is cognitive offloading.
Cognitive offloading is not new. Humans have always used
external tools to reduce mental effort: notebooks, calculators, checklists,
templates, search engines, colleagues, diagrams and frameworks. In that sense,
AI is part of a long history of tools that extend our thinking.
But generative AI changes the nature of the offloading. It
does not only store information or speed up a calculation. It can summarise,
classify, infer, draft, compare, frame and argue. In other words, it can begin
to look like analysis.
And here is the uncomfortable part: even when we know the
risks, we are still human.
For anyone who has ever tried to diet while sitting next to
chocolate, this is not hard to understand. You may have a clear rule. You may
genuinely believe in the rule. You may even have defended the rule publicly.
But when you are tired, stressed or depleted, the chocolate becomes much harder
to resist.
AI is a bit like that.
The concern is not only: Did you use AI?
The more important question is: At what point did the
judgement move?
This is where I think Guba and Lincoln have something
surprisingly useful to offer.
Guba and Lincoln enter the room…
In a meeting recently, someone made a throwaway comment about Guba and Lincoln. I responded with a small verbal protest. This was probably lost on anyone who has wisely avoided the old evaluation “paradigm wars”, including the Lincoln–Sechrest debate, where Lincoln stood for constructivist, stakeholder-shaped evaluation and Sechrest defended disciplined evidence, validity and causal reasoning.
Niche evaluation lore? Definitely.
But afterwards I kept thinking about it.
Still useful for thinking about AI? Surprisingly, yes.
Guba and Lincoln’s Fourth
Generation Evaluation challenged the idea that evaluation quality should be
judged only by the conventional research criteria of validity, reliability,
generalisability and objectivity. For constructivist, qualitative and
naturalistic inquiry, they proposed a different language of quality: credibility,
transferability, dependability and confirmability.
These criteria have since become familiar shorthand for
trustworthiness in qualitative research. And trustworthiness, it seems to me,
is exactly where the AI debate needs to go.

From “Thou shalt not” to practical questions
Telling people not to offload their thinking is probably not
enough.
It is a bit like saying:
Do not be biased.
Do not make weak inferences.
Do not overclaim.
Do not eat the chocolate.
The instruction is correct, but not very operational.
Guba and Lincoln help translate the concern into practical
checks. Let me start with one of their criteria: dependability.
Dependability: is the process traceable?
The dependability criterion asks whether the inquiry process
is coherent, logical and traceable.
In qualitative research, this usually means that the process
should be documented well enough for someone else to understand how the
findings were produced. It does not require mechanical replication, but it does
require an audit trail: records of key decisions, changes in focus, coding
choices, analytic memos, and the movement from raw data to interpretation.
For AI use, I find dependability especially useful because
it helps me ask:
Is the human analytic process still visible, or has the
route from evidence to claim become blurred?
Application 1: using AI to strengthen a coding framework
When I use AI in qualitative analysis, I do not start by
asking it to “find the themes”. I begin by reading through the data myself,
making notes on possible themes, codes, tensions and patterns.
Only then do I use AI as a thinking partner. For example, I
might ask:
“Let’s develop a deductive coding framework. Based on the
themes I have identified, suggest a coding memo for each theme, possible
sub-themes, and five quotes I can use to test whether the code is working.”
At that point, the work becomes iterative. I might respond:
“I like the distinction you are making here, but I think
these two codes overlap.”
“This code is too broad.”
“This seems to miss what participants are saying about implementation
conditions.”
“Let’s test this theme against more quotes.”
I may then move into a more inductive round:
“I want to see what the data says about this theme. Pull
together all quotes that seem related to it, including quotes that complicate
or contradict it.”
This process does not make the coding framework dependable
because AI produced it. It becomes more dependable because the framework has
been tested, revised and documented through several rounds of analytic
questioning.
AI helps speed up the iteration, but the analytic judgement
remains mine.
Application 2: checking whether an AI-supported executive summary is dependable
I also use the dependability criterion when AI helps me
draft or refine an executive summary.
An executive summary can become polished very quickly. That
is useful, but also risky. The more fluent the summary becomes, the easier it
is to lose sight of where each claim came from.
So I ask AI to help me rebuild the audit trail:
“Please create a table listing each claim in the executive
summary. For each claim, indicate which document, section, dataset or finding
it is based on.”
I then check the table against the source material. I
correct weak links, remove claims that are not well supported, and adjust
wording where the claim has become too strong.
I may then ask:
“Based on the information I provided, are there any
findings, stakeholder views or pieces of evidence that might contradict or
complicate these claims?”
This helps me check whether the summary has become too neat.
If there are counterclaims or important qualifications, I decide whether the
executive summary needs to acknowledge them, soften the wording, or adjust the
argument.
Only after that do I use AI for board-ready refinement:
“Please simplify this into a clearer board-ready message.”
“Make the tone more direct.”
“Reduce repetition.”
“Sharpen the main implication.”
Again, the key issue is not whether AI was used. The key
issue is whether I can still trace the movement from evidence, to claim, to
final wording.
For me, that is what dependability looks like in an AI
workflow: not avoiding AI, but using it in a way that keeps the analytic path
visible.
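For readers who think in structures, the claim-by-claim audit trail described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any particular tool: the names ClaimRecord and needs_attention, the field names and the example entries are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ClaimRecord:
    """One row of a hypothetical claim-to-evidence audit table."""
    claim: str
    sources: list[str] = field(default_factory=list)           # documents, sections, datasets
    counter_evidence: list[str] = field(default_factory=list)  # findings that complicate the claim
    human_checked: bool = False                                 # has a person verified the links?


def needs_attention(records: list[ClaimRecord]) -> list[str]:
    """Return claims that have no source, or that no person has checked yet."""
    return [r.claim for r in records if not r.sources or not r.human_checked]


# Hypothetical entries, for illustration only.
audit_trail = [
    ClaimRecord("Hypothetical claim A", sources=["Annex 2, section 3"], human_checked=True),
    ClaimRecord("Hypothetical claim B"),
    ClaimRecord(
        "Hypothetical claim C",
        sources=["Interview dataset"],
        counter_evidence=["Annex 4"],
    ),
]

flagged = needs_attention(audit_trail)
print(flagged)
```

The design choice that matters here is the human_checked field: a claim with a source attached still counts as unfinished until a person has compared it against the source material.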
And then, where it matters, I put the evidence trail into
annexes.
Often many, many pages of annexes.
Will everyone read them? Almost certainly not. But that is
not the only point. The annexes make the reasoning visible and verifiable. They
show how claims were built, what evidence they rest on, and where the
qualifications sit.
They also create a written record that future machine
synthesis will probably love: structured, source-linked, explicit reasoning
that can be checked, reused and interrogated.
So yes, the annexes may be unloved by human readers.
But they are part of the dependability infrastructure.
Beyond dependability
I have focused here on dependability because it is the
criterion that most directly helps me manage AI use in my workflow. It asks
whether the process is traceable — whether the path from evidence, to
interpretation, to final wording is still visible.
But the same logic applies to the other trustworthiness
criteria.
For credibility, I use AI to test whether claims are
well supported, whether counter-evidence has been considered, and whether the
final account is believable in relation to the data.
For transferability, I check whether the context has
survived the synthesis: whether the specific conditions, boundaries and setting
are still visible, rather than being smoothed into a generic lesson.
For confirmability, I ask whether I can still
distinguish between what came from the evidence, what came from my judgement,
and what AI helped me formulate.
The bottom line
So for me, Guba and Lincoln are more than a nostalgic
evaluation reference. They offer a practical way of asking a very current
question:
Not “Did you use AI?” but “Was the use of AI
trustworthy?”**
**************************************************************
**Yes, Charlie and I wrote this blog post together. The giveaway?
The “Not this, but that” structure (also known
as contrastive negation or antithetical parallelism) is a highly
recognizable rhetorical pattern heavily favored by ChatGPT and other LLMs.
*** Photo by June Gathercole on Unsplash
