Friday, July 03, 2026

What Guba and Lincoln offer to the debate on the use of AI

“Thou shalt use AI responsibly”





The current debate about AI often gets stuck in a fairly familiar place.

You can use AI, but you need to use it “well”.
Do not pass AI-generated work off as your own.
Do not let it do your analysis for you.
Do not let it write your report for you.
Be careful: it hallucinates references.

All of this is true. But I think it misses something more subtle.

AI, chocolate and cognitive offloading

For many evaluators, researchers and knowledge workers, the real risk is not that we deliberately ask AI to do work we know we should be doing ourselves. The real risk is that, when we are tired, under pressure, cognitively overloaded or facing a difficult piece of judgement, we slowly let the tool carry more of the thinking than we intended.

The issue is cognitive offloading.

Cognitive offloading is not new. Humans have always used external tools to reduce mental effort: notebooks, calculators, checklists, templates, search engines, colleagues, diagrams and frameworks. In that sense, AI is part of a long history of tools that extend our thinking.

But generative AI changes the nature of the offloading. It does not only store information or speed up a calculation. It can summarise, classify, infer, draft, compare, frame and argue. In other words, it can begin to look like analysis.

And here is the uncomfortable part: even when we know the risks, we are still human.

For anyone who has ever tried to diet while sitting next to chocolate, this is not hard to understand. You may have a clear rule. You may genuinely believe in the rule. You may even have defended the rule publicly. But when you are tired, stressed or depleted, the chocolate becomes much harder to resist.

AI is a bit like that.

The concern is not only: Did you use AI?

The more important question is: At what point did the judgement move?

This is where I think Guba and Lincoln have something surprisingly useful to offer.

 

Guba and Lincoln enter the room…

In a meeting recently, someone made a throwaway comment about Guba and Lincoln. I responded with a small verbal protest. This was probably lost on anyone who has wisely avoided the old evaluation “paradigm wars” — including the Lincoln–Sechrest debate, where Lincoln stood for constructivist, stakeholder-shaped evaluation, and Sechrest defended disciplined evidence, validity and causal reasoning.

Niche evaluation lore? Definitely. 

Still useful for thinking about AI? Surprisingly, yes.

But afterwards I kept thinking about it. Guba and Lincoln’s Fourth Generation Evaluation challenged the idea that evaluation quality should be judged only by the conventional research criteria of validity, reliability, generalisability and objectivity. For constructivist, qualitative and naturalistic inquiry, they proposed a different language of quality: credibility, transferability, dependability and confirmability.

These criteria have since become familiar shorthand for trustworthiness in qualitative research. And trustworthiness, it seems to me, is exactly where the AI debate needs to go.


 From “Thou shalt not” to practical questions

Telling people not to offload their thinking is probably not enough.

It is a bit like saying:

Do not be biased.
Do not make weak inferences.
Do not overclaim.
Do not eat the chocolate.

The instruction is correct, but not very operational.

Guba and Lincoln help translate the concern into practical checks. Let me start with one of their criteria: dependability.

Dependability: is the process traceable?

The dependability criterion asks whether the inquiry process is coherent, logical and traceable.

In qualitative research, this usually means that the process should be documented well enough for someone else to understand how the findings were produced. It does not require mechanical replication, but it does require an audit trail: records of key decisions, changes in focus, coding choices, analytic memos, and the movement from raw data to interpretation.

For AI use, I find dependability especially useful because it helps me ask:

Is the human analytic process still visible, or has the route from evidence to claim become blurred?

Application 1: using AI to strengthen a coding framework

When I use AI in qualitative analysis, I do not start by asking it to “find the themes”. I begin by reading through the data myself, making notes on possible themes, codes, tensions and patterns.

Only then do I use AI as a thinking partner. For example, I might ask:

“Let’s develop a deductive coding framework. Based on the themes I have identified, suggest a coding memo for each theme, possible sub-themes, and five quotes I can use to test whether the code is working.”

At that point, the work becomes iterative. I might respond:

“I like the distinction you are making here, but I think these two codes overlap.”
“This code is too broad.”
“This seems to miss what participants are saying about implementation conditions.”
“Let’s test this theme against more quotes.”

I may then move into a more inductive round:

“I want to see what the data says about this theme. Pull together all quotes that seem related to it, including quotes that complicate or contradict it.”

This process does not make the coding framework dependable because AI produced it. It becomes more dependable because the framework has been tested, revised and documented through several rounds of analytic questioning.

AI helps speed up the iteration, but the analytic judgement remains mine.

Application 2: checking whether an AI-supported executive summary is dependable

I also use the dependability criterion when AI helps me draft or refine an executive summary.

An executive summary can become polished very quickly. That is useful, but also risky. The more fluent the summary becomes, the easier it is to lose sight of where each claim came from.

So I ask AI to help me rebuild the audit trail:

“Please create a table listing each claim in the executive summary. For each claim, indicate which document, section, dataset or finding it is based on.”

I then check the table against the source material. I correct weak links, remove claims that are not well supported, and adjust wording where the claim has become too strong.
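For readers who like to see the structure, here is a minimal sketch of what such a claim-to-source table could look like if kept as data rather than as a document. It is an illustration under my own assumptions, not a description of any particular tool: the names `Claim`, `needs_attention` and `audit_table` are hypothetical, and the example entries are placeholders rather than real findings.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """One claim from the executive summary and the evidence it rests on."""
    text: str
    sources: list[str] = field(default_factory=list)  # e.g. "Annex B, section 2"
    checked_by_human: bool = False  # set to True only after a person verifies the link


def needs_attention(claims: list[Claim]) -> list[Claim]:
    """Return claims that list no source or have not yet been verified by a person."""
    return [c for c in claims if not c.sources or not c.checked_by_human]


def audit_table(claims: list[Claim]) -> str:
    """Render the audit trail as a plain-text table, one row per claim."""
    rows = ["Claim | Sources | Checked"]
    for c in claims:
        sources = "; ".join(c.sources) if c.sources else "NO SOURCE"
        checked = "yes" if c.checked_by_human else "no"
        rows.append(f"{c.text} | {sources} | {checked}")
    return "\n".join(rows)


# Hypothetical placeholder entries, not taken from any real report.
claims = [
    Claim("Example claim A", ["Interview set 1", "Annex B, section 2"], checked_by_human=True),
    Claim("Example claim B"),
    Claim("Example claim C", ["Survey dataset"]),
]

flagged = needs_attention(claims)
print(audit_table(claims))
```

The design choice that matters is the separate `checked_by_human` flag: a source suggested by AI does not count as part of the audit trail until I have checked it against the material myself.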

I may then ask:

“Based on the information I provided, are there any findings, stakeholder views or pieces of evidence that might contradict or complicate these claims?”

This helps me check whether the summary has become too neat. If there are counterclaims or important qualifications, I decide whether the executive summary needs to acknowledge them, soften the wording, or adjust the argument.

Only after that do I use AI for board-ready refinement:

“Please simplify this into a clearer board-ready message.”
“Make the tone more direct.”
“Reduce repetition.”
“Sharpen the main implication.”

Again, the key issue is not whether AI was used. The key issue is whether I can still trace the movement from evidence, to claim, to final wording.

For me, that is what dependability looks like in an AI workflow: not avoiding AI, but using it in a way that keeps the analytic path visible.

And then, where it matters, I put the evidence trail into annexes.

Often many, many pages of annexes.

Will everyone read them? Almost certainly not. But that is not the only point. The annexes make the reasoning visible and verifiable. They show how claims were built, what evidence they rest on, and where the qualifications sit.

They also create a written record that future machine synthesis will probably love: structured, source-linked, explicit reasoning that can be checked, reused and interrogated.

So yes, the annexes may be unloved by human readers.

But they are part of the dependability infrastructure.

Beyond dependability

I have focused here on dependability because it is the criterion that most directly helps me manage AI use in my workflow. It asks whether the process is traceable — whether the path from evidence, to interpretation, to final wording is still visible.

But the same logic applies to the other trustworthiness criteria.

For credibility, I use AI to test whether claims are well supported, whether counter-evidence has been considered, and whether the final account is believable in relation to the data.

For transferability, I check whether the context has survived the synthesis: whether the specific conditions, boundaries and setting are still visible, rather than being smoothed into a generic lesson.

For confirmability, I ask whether I can still distinguish between what came from the evidence, what came from my judgement, and what AI helped me formulate.

The bottom line

So for me, Guba and Lincoln are more than a nostalgic evaluation reference. They offer a practical way of asking a very current question:

Not “Did you use AI?” but “Was the use of AI trustworthy?”**

 And if the answer is yes, can you show how?

**************************************************************

**Yes, Charlie and I wrote this blog post together. The giveaway?

The "Not this, but that" structure (also known as contrastive negation or antithetical parallelism) is a highly recognizable rhetorical pattern heavily favored by ChatGPT and other LLMs.

*** Photo by June Gathercole on Unsplash



Thursday, March 21, 2024

A visualization on statistics choices

Another lovely resource: a decision tree to help choose the appropriate statistic to apply.
And if you want step-by-step instructions on how to create it in SPSS, see this SAGE resource.





Wednesday, March 20, 2024

A visualisation of data visualization choices!

I always love it when people can simplify the complicated decision-making processes that we have in our heads into a simple decision tree. A visualisation of data visualization choices!


 

 



Tuesday, March 05, 2024

Applying systems theoretical concepts to understand sustainability of education intervention outcomes

 

This Master’s dissertation addresses the research question: To what extent can the systems concept ‘extended dynamic sustainability’ be used to explain why some results of a donor-funded education development intervention were sustained ten years after its conclusion?



To address that question, the researcher identified a specific case to explore with systems thinking: an ex-post evaluation conducted in 2016 and commissioned by an international donor, the United States Agency for International Development (USAID). That ex-post evaluation confirmed that an education development intervention, the Kimberly Thusanang Programme (KTP), implemented between 1998 and 2006, resulted in sustained outcomes, which were directly linked to the KTP’s goal of improving school governance in the Francis Baard education district of the Northern Cape.
The Master’s research builds on the ex-post evaluation’s analysis. Using qualitative data analysis, the researcher identified the types of sustainability found in the ex-post evaluation data set. Then, by applying Stockmann’s (1993a) ‘extended dynamic sustainability’ concept, the Master’s research found that the KTP intervention and some of its benefits were dynamically sustained through the general causal sustainability mechanisms of problem-solving, modelling and multiplication.
These findings are likely to be useful for researching intervention sustainability, designing sustainable development interventions, and evaluating intervention success. These general sustainability mechanisms need further exploration to determine whether they are generalisable to other development interventions and their sustained outcomes.

EvalEdge Podcast about Storytelling in Evaluation

 



In this episode of EvalEdge, Asgar Bhikoo and I talk about Storytelling in Evaluation Practice. The episode explores current lessons on the use of storytelling as an innovation in evaluation practice in Africa. For more information, check out Digital Stories for Impact and Social Impact Storytelling: Using Impact Data to Drive Change.

Here is the impact story tool referenced in the podcast, also available here on the Civicus repository.