Cresta's Shin et al. (2025) KDD paper proposed the 3D paradigm for factuality evaluation in an LLM-as-judge framework.
The paradigm was specifically designed to assess analytical claims against conversational reference texts, claims that are not explicitly verifiable because they are implied by the conversation rather than explicitly stated in it.
Here, we try to build on this approach by directing the LLM to process conversational data through the Question Under Discussion (QUD) framework. Whereas Shin et al.'s prompt focuses on dissecting the to-be-verified claim and leaves the transcript itself untouched, our QUD-based approach transforms the transcript and leaves the claim untouched.
We use the synthetic datasets published alongside the original paper to compare the original 3D paradigm to the novel QUD-based approach, focusing only on the OpenAI models.
As in the original paper, we compare F1 scores on the task of detecting False claims.
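For reference, each benchmark run below is scored by reading a per-class F1 score from a JSON file. Here is a minimal sketch of how such a file could be produced; the use of scikit-learn's classification_report and the file naming scheme (inferred from the regex in the loading code further down) are assumptions, not the paper's actual pipeline.

```python
# Minimal sketch (assumed scoring step): dump per-class metrics for one run
# so that the loading code below can read results['False']['f1-score'].
import json
from sklearn.metrics import classification_report

def save_run_metrics(y_true, y_pred, out_path):
    """y_true / y_pred are lists of 'True'/'False' verdict strings."""
    report = classification_report(y_true, y_pred, output_dict=True)
    with open(out_path, "w") as f:
        json.dump(report, f, indent=2)

# Toy example: six verified claims, two judge mistakes.
save_run_metrics(
    ["False", "False", "True", "True", "False", "True"],
    ["False", "True", "True", "True", "False", "False"],
    "fect_benchmark_results_gpt-4.1-2025-04-14_QUD_NO_TTC_20250101_120000.json",
)
```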
Given a conversation and a short answer, verify the short answer by referencing the conversation. First, break down the short answer into claims using the `A Step to Extract Claims` below. Next, verify each part of the claim and the relation between each part of the claim using `Steps to Evaluate Each Claim` below.

## A Step to Extract Claims ##

Step 1: Identify claims from the short answer. Example: "Customer was annoyed about slow delivery" -> "There was a delivery", "The delivery was slow", "Customer was annoyed", "Customer was annoyed specifically about slow delivery"

## Steps to Evaluate Each Claim ##

Step 2: In each claim, identify words that have concrete meanings. Example: "There was a delivery" -> "delivery". Verify those words by finding explicit mentions or references. When a word or a phrase can be interpreted in more than one way, see if at least one interpretation can be verified. Example: If a conversation includes discussions of receiving email notifications, this verifies one meaning of "delivery".

Step 3: In each claim, identify words that subjectively describe other words having concrete meanings. These words often describe a product or a service. Example: "The delivery was slow" -> "slow". Verify these words loosely with the context of the conversation.

Step 4: In each claim, identify words that are about subjective interpretation of the conversation. These words often describe sentiments and emotions from a third-person point of view. Example: "Customer was annoyed" -> "annoyed". Verify these words by finding minimal implicit evidence. Example: "annoyed" is verified with implicit evidence reflecting negative sentiment.

Step 5: In each claim, verify the relation between words. Focus on verifying the relation between words, while ignoring the verifications of the words themselves in this step. Verify the relation with explicit evidence or by inferring the reason behind an action or a message. Example: "Customer was annoyed specifically about slow delivery" -> Verify that the source of a customer's sentiment was indeed the "slow delivery" while ignoring the verifications of "slow" and "annoyed". If a customer asks about filing a complaint after discussing slow delivery without explicitly expressing a negative sentiment, the customer must have been annoyed by the slow delivery. This inferred reason behind the customer's action verifies the relation.

## Output for Each Claim as JSON ##

1. claim: The exact claim (as identified in Step 1).
2. reasoning: A detailed reasoning for whether the claim was verified or not.
3. is_claim_verified: True if the claim was verified in Steps 2, 3, 4 and 5; otherwise False.

## Output Format as JSON ##

claims: list of all the claims generated above in the mentioned format.
reasoning: A concise summary of the reasoning for the final answer.
answer: True or False (True if short_answer is verified; otherwise, False).
Given a conversation and a short answer, verify the short answer by referencing the conversation. First, transform the conversation into a structured representation where each utterance is annotated with the Question Under Discussion (QUD) it addresses, using the `Steps to Transform the Conversation` below. Next, verify the claim with reference to the transformed conversation.

## Steps to Transform the Conversation ##

Step 1: Read the conversation carefully and break it down into individual utterances. Keep speaker labels (e.g., Agent, Customer).

Step 2: Identify explicit and implicit questions guiding the conversation. Questions can be literal (e.g., "When does the store close?") or inferred from the context (e.g., a greeting implies the QUD: "How do we open this call politely?").

Step 3: Assign a QUD to each utterance. If the utterance introduces a new question, note it as "Introduces QUD X." If the utterance answers or acknowledges a question, note which QUD it resolves or addresses. Include implicit social QUDs (e.g., greetings, closings).

Step 4: Output the conversation in a structured format like this:
```
Speaker: Utterance
- QUD: [Describe the current question under discussion or note introduction/resolution]
```

Step 5: Create a final hierarchy of QUDs at the end that shows how questions are nested or related, as in the following example:
```
QUD 1: How to open the call?
→ QUD 2: What does the customer need?
→ QUD 3: When are the events?
```

Example Output (Simplified):
```
Agent: Good afternoon!
- QUD: Opening the call politely.
Customer: Hi, I'm interested in upcoming events.
- QUD: Introduces main goal — What events are available?
Agent: May I put you on hold while I check?
- QUD: Introduces sub-question — Is it okay to put the customer on hold?
Customer: Sure, I don't mind waiting.
- QUD: Resolves sub-question about being placed on hold.
```

Final Hierarchy:
```
QUD 1: How to open the call?
→ QUD 2: What events are available?
→ QUD 3: May I put the customer on hold?
```

## Output Format as JSON ##

reasoning: A concise summary of the reasoning for the final answer.
answer: True or False (True if short_answer is verified; otherwise, False).
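Both prompts were evaluated in the same LLM-as-judge harness. For concreteness, here is a minimal sketch of how either prompt could be wired up; the message layout, the use of JSON-mode output, and the verdict parsing are my assumptions rather than the exact setup from the paper.

```python
# Minimal sketch (assumed harness): judge one conversation/short-answer pair
# with a given system prompt (3D or QUD) and return the boolean verdict.
import json
from openai import OpenAI

client = OpenAI()

def judge(system_prompt: str, conversation: str, short_answer: str,
          model: str = "gpt-4.1-2025-04-14") -> bool:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Conversation:\n{conversation}\n\n"
                                        f"Short answer:\n{short_answer}"},
        ],
        response_format={"type": "json_object"},  # both prompts specify JSON output
    )
    verdict = json.loads(response.choices[0].message.content)
    # Both prompts end with an `answer` field that is True or False.
    answer = verdict["answer"]
    return answer if isinstance(answer, bool) else str(answer).lower() == "true"
```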
from pathlib import Path
import json
import re

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Collect (model, prompt, ttc, f1_score) tuples from each benchmark results file.
data = []
for p in Path("../data").resolve().glob("*.json"):
    model, prompt, ttc, date, time = re.search(
        r'fect_benchmark_results_(.*?)_([0-9A-Z]*)_([0-9A-Z]*_[0-9A-Z]*)_([0-9A-Z]*)_([0-9A-Z]*)\.json',
        p.name,
    ).groups()
    with open(p, 'r') as f:
        # F1 score for the "False" class, i.e. detecting false claims.
        f1_score = json.load(f)['False']['f1-score']
    data.append((model, prompt, ttc, f1_score))

df = pd.DataFrame(data, columns=['model', 'prompt', 'ttc', 'f1_score'])
print(df.sort_values('f1_score', ascending=False).round(2))
                      model prompt       ttc  f1_score
1        gpt-4.1-2025-04-14     3D  WITH_TTC      0.86
4        gpt-4.1-2025-04-14    QUD    NO_TTC      0.86
5        gpt-4.1-2025-04-14    QUD  WITH_TTC      0.85
3        gpt-4.1-2025-04-14  BASIC  WITH_TTC      0.82
0        gpt-4.1-2025-04-14     3D    NO_TTC      0.79
2        gpt-4.1-2025-04-14  BASIC    NO_TTC      0.79
6   gpt-4.1-mini-2025-04-14     3D    NO_TTC      0.70
11  gpt-4.1-mini-2025-04-14    QUD  WITH_TTC      0.66
7   gpt-4.1-mini-2025-04-14     3D  WITH_TTC      0.63
10  gpt-4.1-mini-2025-04-14    QUD    NO_TTC      0.63
8   gpt-4.1-mini-2025-04-14  BASIC    NO_TTC      0.59
9   gpt-4.1-mini-2025-04-14  BASIC  WITH_TTC      0.53
12  gpt-4.1-nano-2025-04-14     3D    NO_TTC      0.52
16  gpt-4.1-nano-2025-04-14    QUD    NO_TTC      0.49
14  gpt-4.1-nano-2025-04-14  BASIC    NO_TTC      0.39
13  gpt-4.1-nano-2025-04-14     3D  WITH_TTC      0.35
15  gpt-4.1-nano-2025-04-14  BASIC  WITH_TTC      0.31
17  gpt-4.1-nano-2025-04-14    QUD  WITH_TTC      0.24
# Create facets for each model
g = sns.FacetGrid(df, col='model', col_wrap=3, height=4, aspect=1)

# Map barplot to each facet
def plot_bars(data, **kwargs):
    ax = plt.gca()
    sns.barplot(
        data=data,
        x='prompt',
        y='f1_score',
        hue='ttc',
        alpha=0.8,
        ax=ax
    )
    ax.set_title(f"Model: {data['model'].iloc[0]}")
    ax.set_xlabel('Prompt')
    ax.set_ylabel('F1 Score')

g.map_dataframe(plot_bars)
g.add_legend(title='Test-Time Compute (TTC)', bbox_to_anchor=(1.0, 0.8), frameon=True, edgecolor='black')
plt.tight_layout()
plt.show()
The initial comparison suggests that we might be onto something here: the QUD prompt generally outperforms the BASIC prompt, especially on the two larger models, and is comparable to the 3D paradigm prompt.
Note that the QUD prompt didn't dissect the claim at all, so it doesn't benefit from identifying sub-claims like the 3D paradigm does.
Directing the LLM to identify the (often implicit) QUDs behind each utterance in the transcript appears to help it verify implicit claims. Since much of Cresta's product suite (beyond the AI Analyst being evaluated here) applies downstream models to conversational transcripts, whether complete or unfolding in real time, these results suggest that augmenting raw transcripts by parsing them within the QUD framework may benefit parts of the system far beyond the evaluation of claims made by the AI Analyst.
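As a rough illustration, such an augmentation could live in a single preprocessing step whose output is shared by downstream consumers; the function name, prompt wording, and plain-text return format below are hypothetical.

```python
# Hypothetical sketch: annotate a raw transcript with QUDs once, so downstream
# consumers (analytics, summarization, claim verification, ...) can reuse the
# structured version instead of the raw utterances.
from openai import OpenAI

client = OpenAI()

QUD_TRANSFORM_PROMPT = (
    "Transform the conversation into a structured representation where each "
    "utterance is annotated with the Question Under Discussion (QUD) it "
    "addresses, and finish with a hierarchy of QUDs."
)

def annotate_with_quds(transcript: str, model: str = "gpt-4.1-2025-04-14") -> str:
    """Return the transcript with a QUD annotation per utterance."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": QUD_TRANSFORM_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content
```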
We compared a new QUD-based prompt to the 3D paradigm proposed by Shin et al. (2025) and found comparable F1 scores across three non-reasoning LLMs. While this is encouraging, it is important to note that the QUD prompt breaks the alignment between the LLM and the human annotators, who were instructed to follow the 3D paradigm.
However, according to the theory behind the QUD prompt, human comprehenders automatically interpret every natural language utterance as the answer to an often implicit QUD, so instructing the LLM to do the same may in theory increase alignment in its own right. (It would be interesting to see whether QUD-based instructions to human annotators could actually raise inter-annotator agreement; I wouldn't be surprised!)
To shed light on the issue of alignment, I will analyze the LLM's test-time compute reasoning, specifically comparing the 3D and QUD prompts on the largest model I tested (gpt-4.1-2025-04-14).