← Back to Projects

Coherent Value Systems: AITA Experiment

August 2, 2025

1. Introduction

I saw this tweet a little while back:

This made me curious about the value systems that LLMs have, specifically when it comes to assigning individuals blame in real situations. I decided to set aside some time over the weekend to vibecode and build out a small experiment to try and understand this a little better.

This experiment uses the "Am I the Asshole?" (AITA) subreddit as a testing ground for AI moral judgment. AITA provides thousands of real-world moral disputes where people vote on who's right and wrong - perfect for comparing AI decisions against human consensus.

GitHub Repository: latent-ethics-llms

The experiment tests AI models in three different ways:

  1. Direct judgment accuracy – does the model's first-pass verdict match the crowd majority
  2. Adversarial debate resilience – when two copies of the model argue opposite sides, and a third acts as judge, does deliberation converge toward the human consensus or away from it?
  3. Framing sensitivity – do verdicts flip when the dilemma is rewritten from "I did X" to "someone did X to me," holding facts constant?

2. Framework

2.1 Moral Philosophical Scope

Contemporary ethics in practicality are typically propped up by three primary pillars:

None of these philosophical frameworks perfectly capture how people actually make moral decisions in everyday life. Psychology research shows that most people rely on intuitive feelings about fairness, harm, loyalty, and similar values. This project isn't trying to determine absolute moral truth - it's simply comparing how AI models behave versus what happens when lots of real people judge the same situation.

The AITA dataset forms a descriptive-utilitarian proxy. Posts on Am I the Asshole? (AITA) subreddit receives thousands of crowd votes assigning one of four labels:

Although the vote does not constitute an objective moral truth, it approximates a descriptive-utilitarian stance: the winning label is the option that maximises expressed social approval within the crowd.

2.1.1 Assumptions and Limitations

This experiment relies on two key assumptions:

  1. Human consensus as baseline: The AITA majority vote serves as the reference point for "aligned" behavior. When AI models disagree with this consensus, it suggests potential misalignment, regardless of what philosophical theory might justify the AI's different judgment.

  2. Crowd wisdom: When thousands of people vote, individual biases tend to average out, giving a reasonably good picture of social consensus. However, this is still limited to Reddit users (mostly US-based) and may not represent global moral views.

3. Dataset

I built the dataset by scraping posts from r/AmItheAsshole using Reddit's API. The selection includes popular posts from different time periods to get a good mix of topics and scenarios.

The final dataset contains 2,973 unique AITA posts with 81,608 total comments (about 27 comments per post on average). Each post gets labeled based on the most upvoted judgments from Reddit users:

Judgment CategoryNumber of Comments
Not the Asshole (NTA)47,574
You're the Asshole (YTA)7,553
Everyone Sucks Here (ESH)2,329
No Assholes Here (NAH)1,543
Additional Information Requested (INFO)2,020

Each entry includes the original post, metadata like comment counts, and the breakdown of how people voted. This setup makes it possible to compare AI judgments directly with human consensus and spot patterns in how AI systems handle moral reasoning.

4. Methodology

4.1 Model Suite and Baselines

Currently, this study focuses on GPT-4o (OpenAI) as the primary evaluation target. The analysis presented in this work represents results from GPT-4o across all experimental conditions.

Future Plans:

To get a fuller picture of how different AI systems handle moral reasoning, I plan to test additional models with different designs and training approaches:

These models come from different companies with different training philosophies and constraints, which should reveal how various approaches to AI development affect moral reasoning. Each model will go through the same four tests to ensure fair comparisons.

4.2 Experiment 1 - Direct Moral Judgement

What I'm Testing

How well does GPT-4o match human moral judgments when given raw AITA scenarios?

How It Works

How I Measure Success

A=i=1N1(JLLM,i=JReddit,i)NA = \frac{\sum_{i=1}^{N} \mathbf{1}(J_{LLM,i} = J_{Reddit,i})}{N}

4.3 Experiment 2 - Adversarial Reasoning & Debate

What I'm Testing

Does debate between AI models lead to better or worse moral judgments? When AIs argue both sides of a case and a third AI judges, do they get closer to or further from human consensus?

How It Works

I set up three roles:

The process:

  1. Pro and Con each write their best arguments for their assigned side
  2. The Judge reads both arguments and decides who was right
  3. I track both the quality of arguments and whether the final decision matches what humans said

How I Measure Success

Argument Quality: Human evaluators rate each argument from 1-5

Overall Debate Quality: Average of Pro and Con argument scores

Convergence Rate: How often the Judge agrees with Reddit majority

C=i=1N1(JJudge,i=JReddit,i)NC = \frac{\sum_{i=1}^{N} \mathbf{1}(J_{Judge,i} = J_{Reddit,i})}{N}

4.4 Experiment 3 - Framing Effects

What I'm Testing

How much does storytelling perspective affect AI moral judgments? If I take the same situation but tell it from a "victim" perspective instead of an "actor" perspective, does the AI's judgment change?

How It Works

I use GPT-4o to rewrite AITA posts in two different ways while keeping all the facts the same:

Original Framing: Standard AITA style - "I did X" (focusing on the poster's actions) Victim Framing: "This happened to me" - focusing on how others' actions affected the poster

The process:

  1. Transform each AITA post using the victim framing prompt
  2. Show both versions (original and victim-framed) to GPT-4o
  3. Record what judgment it gives for each version
  4. Compare the results

How I Measure Success

Framing Sensitivity: How often does changing the perspective change the judgment?

F=i=1N1(Jvictim,iJoriginal,i)NF = \frac{\sum_{i=1}^{N} \mathbf{1}(J_{victim,i} \neq J_{original,i})}{N}

Bias Direction: Does victim framing make the AI more lenient or more harsh on average? (Using a scale where NTA=1, NAH=2, ESH=3, YTA=4)

Human Alignment: How often do victim-framed judgments match the original Reddit consensus?

Cvictim=i=1N1(Jvictim,i=JReddit,i)NC_{victim} = \frac{\sum_{i=1}^{N} \mathbf{1}(J_{victim,i} = J_{Reddit,i})}{N}

The process:

  1. Transform each AITA post using the empathy enhancement prompt
  2. Show both versions (original and empathy-enhanced) to GPT-4o
  3. Record what judgment it gives for each version
  4. Compare the results

5. Results

5.1 Experiment 1 — Direct Moral Judgment

The bottom line: GPT-4o agreed with Reddit's majority judgment 83.4% of the time - pretty good overall, but with some clear patterns.

How well it matched humans by judgment type:

What the AI actually said vs. what Reddit said:

The Pattern: GPT-4o really doesn't like calling individual people assholes. Instead, it either says everyone sucks or no one does.

5.2 Experiment 2 — Adversarial Reasoning & Debate

The bottom line: When AIs debated and then judged, they got WORSE at matching human consensus - only 39.7% agreement (compared to 83.4% for direct judgment).

Argument quality: The debates were actually pretty good - Pro arguments averaged 4.18/5, Con arguments 3.90/5, overall debate quality 4.04/5.

But the outcomes were weird:

How often the Judge agreed with Reddit by category:

The Problem: Debate made the AI judge even more likely to say "everyone sucks here" regardless of what the situation actually was. Good arguments, bad conclusions.

5.3 Experiment 3 — Framing Effects (Victim Reframing)

The Bottom Line: Just changing how the story was told (same facts, different perspective) changed GPT-4o's judgment 36.8% of the time. That's more than 1 in 3 cases.

The Damage to Accuracy:

What changed (Original → Victim perspective):

The pattern: When told from a victim's perspective, GPT-4o became much more sympathetic but also much less accurate. It started seeing most conflicts as "no one's really wrong here" even when Reddit clearly thought someone was being an asshole.

6. What This All Means

6.1 The Big Patterns

GPT-4o really doesn't like blaming people: Across every test, the same pattern emerged - GPT-4o consistently avoids calling individuals assholes. It heavily under-uses "YTA" judgments and compensates by saying either "everyone sucks" or "no one's wrong." This suggests the model has been trained to avoid conflict and spread blame around rather than hold individuals accountable.

Storytelling perspective matters way too much: Just changing how a story is told - from "I did X" to "X happened to me" - changed judgments in over one-third of cases and destroyed accuracy. This suggests the AI is easily manipulated by emotional framing rather than focusing on the actual facts and ethics.

6.2 The Bottom Line

GPT-4o has a consistent moral blind spot: it's trained to be so diplomatic and conflict-averse that it fails to hold people accountable when they genuinely deserve it. This isn't random - it happens predictably across different types of tests, suggesting it's baked into how the model was trained rather than just a quirk of specific scenarios.

For applications involving moral judgment, conflict resolution, or accountability decisions, this bias could seriously undermine the AI's usefulness and more scarily, it could be used to foster a culture of victimhood and blame avoidance.