1. Introduction

I saw this tweet a little while back:

View Tweet
— Billy Markus (@BillyM2k) January 16, 2025

This made me curious about the value systems that LLMs have, specifically when it comes to assigning individuals blame in real situations. I decided to set aside some time over the weekend to vibecode and build out a small experiment to try and understand this a little better.

This experiment uses the "Am I the Asshole?" (AITA) subreddit as a testing ground for AI moral judgment. AITA provides thousands of real-world moral disputes where people vote on who's right and wrong - perfect for comparing AI decisions against human consensus.

GitHub Repository: latent-ethics-llms

The experiment tests AI models in three different ways:

Direct judgment accuracy – does the model's first-pass verdict match the crowd majority
Adversarial debate resilience – when two copies of the model argue opposite sides, and a third acts as judge, does deliberation converge toward the human consensus or away from it?
Framing sensitivity – do verdicts flip when the dilemma is rewritten from "I did X" to "someone did X to me," holding facts constant?

2. Framework

2.1 Moral Philosophical Scope

Contemporary ethics in practicality are typically propped up by three primary pillars:

Utilitarism (rightness = aggregate welfare)
Deontology (rightness = rule or duty)
Virtue ethics (rightness = cultivation of admirable character traits)

None of these philosophical frameworks perfectly capture how people actually make moral decisions in everyday life. Psychology research shows that most people rely on intuitive feelings about fairness, harm, loyalty, and similar values. This project isn't trying to determine absolute moral truth - it's simply comparing how AI models behave versus what happens when lots of real people judge the same situation.

The AITA dataset forms a descriptive-utilitarian proxy. Posts on Am I the Asshole? (AITA) subreddit receives thousands of crowd votes assigning one of four labels:

Not the Asshole (NTA)
You're the Asshole (YTA)
Everyone Sucks Here (ESH)
No Assholes Here (NAH)

Although the vote does not constitute an objective moral truth, it approximates a descriptive-utilitarian stance: the winning label is the option that maximises expressed social approval within the crowd.

2.1.1 Assumptions and Limitations

This experiment relies on two key assumptions:

Human consensus as baseline: The AITA majority vote serves as the reference point for "aligned" behavior. When AI models disagree with this consensus, it suggests potential misalignment, regardless of what philosophical theory might justify the AI's different judgment.
Crowd wisdom: When thousands of people vote, individual biases tend to average out, giving a reasonably good picture of social consensus. However, this is still limited to Reddit users (mostly US-based) and may not represent global moral views.

3. Dataset

I built the dataset by scraping posts from r/AmItheAsshole using Reddit's API. The selection includes popular posts from different time periods to get a good mix of topics and scenarios.

The final dataset contains 2,973 unique AITA posts with 81,608 total comments (about 27 comments per post on average). Each post gets labeled based on the most upvoted judgments from Reddit users:

Judgment Category	Number of Comments
Not the Asshole (NTA)	47,574
You're the Asshole (YTA)	7,553
Everyone Sucks Here (ESH)	2,329
No Assholes Here (NAH)	1,543
Additional Information Requested (INFO)	2,020

Each entry includes the original post, metadata like comment counts, and the breakdown of how people voted. This setup makes it possible to compare AI judgments directly with human consensus and spot patterns in how AI systems handle moral reasoning.

4. Methodology

4.1 Model Suite and Baselines

Currently, this study focuses on GPT-4o (OpenAI) as the primary evaluation target. The analysis presented in this work represents results from GPT-4o across all experimental conditions.

Future Plans:

To get a fuller picture of how different AI systems handle moral reasoning, I plan to test additional models with different designs and training approaches:

Claude Sonnet (Anthropic)
Grok (xAI)
DeepSeek (DeepSeek AI)

These models come from different companies with different training philosophies and constraints, which should reveal how various approaches to AI development affect moral reasoning. Each model will go through the same four tests to ensure fair comparisons.

4.2 Experiment 1 - Direct Moral Judgement

What I'm Testing

How well does GPT-4o match human moral judgments when given raw AITA scenarios?

How It Works

Give the AI clear instructions about the AITA format and judgment categories
Feed it real AITA posts without any modifications
Record what judgment it makes: NTA, YTA, ESH, or NAH

How I Measure Success

Agreement rate between AI and Reddit majority:

A = \frac{\sum_{i=1}^{N} \mathbf{1}(J_{LLM,i} = J_{Reddit,i})}{N}

4.3 Experiment 2 - Adversarial Reasoning & Debate

What I'm Testing

Does debate between AI models lead to better or worse moral judgments? When AIs argue both sides of a case and a third AI judges, do they get closer to or further from human consensus?

How It Works

I set up three roles:

Pro: Argues that the original poster is NOT the asshole
Con: Argues that the original poster IS the asshole
Judge: Reads both arguments and makes the final call

The process:

Pro and Con each write their best arguments for their assigned side
The Judge reads both arguments and decides who was right
I track both the quality of arguments and whether the final decision matches what humans said

How I Measure Success

Argument Quality: Human evaluators rate each argument from 1-5

Overall Debate Quality: Average of Pro and Con argument scores

Convergence Rate: How often the Judge agrees with Reddit majority

C = \frac{\sum_{i=1}^{N} \mathbf{1}(J_{Judge,i} = J_{Reddit,i})}{N}

4.4 Experiment 3 - Framing Effects

What I'm Testing

How much does storytelling perspective affect AI moral judgments? If I take the same situation but tell it from a "victim" perspective instead of an "actor" perspective, does the AI's judgment change?

How It Works

I use GPT-4o to rewrite AITA posts in two different ways while keeping all the facts the same:

Original Framing: Standard AITA style - "I did X" (focusing on the poster's actions) Victim Framing: "This happened to me" - focusing on how others' actions affected the poster

The process:

Transform each AITA post using the victim framing prompt
Show both versions (original and victim-framed) to GPT-4o
Record what judgment it gives for each version
Compare the results

How I Measure Success

Framing Sensitivity: How often does changing the perspective change the judgment?

F = \frac{\sum_{i=1}^{N} \mathbf{1}(J_{victim,i} \neq J_{original,i})}{N}

Bias Direction: Does victim framing make the AI more lenient or more harsh on average? (Using a scale where NTA=1, NAH=2, ESH=3, YTA=4)

Human Alignment: How often do victim-framed judgments match the original Reddit consensus?

C_{victim} = \frac{\sum_{i=1}^{N} \mathbf{1}(J_{victim,i} = J_{Reddit,i})}{N}

The process:

Transform each AITA post using the empathy enhancement prompt
Show both versions (original and empathy-enhanced) to GPT-4o
Record what judgment it gives for each version
Compare the results

5. Results

5.1 Experiment 1 — Direct Moral Judgment

The bottom line: GPT-4o agreed with Reddit's majority judgment 83.4% of the time - pretty good overall, but with some clear patterns.

How well it matched humans by judgment type:

NTA (Not the Asshole): 90.9% - almost always agreed when Reddit said someone wasn't wrong
ESH (Everyone Sucks Here): 71.4% - decent agreement on complex situations
YTA (You're the Asshole): 18.6% - really struggled to call people assholes
NAH (No Assholes Here): 0.0% - never agreed when Reddit said no one was wrong

What the AI actually said vs. what Reddit said:

NTA: Reddit 88.7% → AI 85.0% (close match)
YTA: Reddit 9.4% → AI 2.4% (way too lenient)
ESH: Reddit 1.5% → AI 9.8% (blamed everyone way more often)
NAH: Reddit 0.4% → AI 2.8% (saw neutral situations more often)

The Pattern: GPT-4o really doesn't like calling individual people assholes. Instead, it either says everyone sucks or no one does.

5.2 Experiment 2 — Adversarial Reasoning & Debate

The bottom line: When AIs debated and then judged, they got WORSE at matching human consensus - only 39.7% agreement (compared to 83.4% for direct judgment).

Argument quality: The debates were actually pretty good - Pro arguments averaged 4.18/5, Con arguments 3.90/5, overall debate quality 4.04/5.

But the outcomes were weird:

ESH (Everyone Sucks): 50.5% of all judgments - way overused
NTA: 35.7%
NAH: 8.9%
YTA: 4.8% - even more reluctant to blame individuals after debate

How often the Judge agreed with Reddit by category:

ESH: 85.7% - good when Reddit actually said everyone sucks
NAH: 50.0% - okay on neutral cases
NTA: 40.0% - worse than direct judgment
YTA: 27.9% - terrible at identifying individual blame

The Problem: Debate made the AI judge even more likely to say "everyone sucks here" regardless of what the situation actually was. Good arguments, bad conclusions.

5.3 Experiment 3 — Framing Effects (Victim Reframing)

The Bottom Line: Just changing how the story was told (same facts, different perspective) changed GPT-4o's judgment 36.8% of the time. That's more than 1 in 3 cases.

The Damage to Accuracy:

Original framing: 81.7% agreement with Reddit
Victim framing: 53.8% agreement with Reddit
Net result: 29.4% of cases got WORSE, only 1.5% got better

What changed (Original → Victim perspective):

NTA: 82.8% → 54.5% (way fewer clear "not the asshole" calls)
NAH: 7.2% → 39.4% (massive increase in "no one's wrong")
ESH: 7.4% → 5.4% (slightly less "everyone sucks")
YTA: 2.6% → 0.7% (almost never called anyone an asshole)

The pattern: When told from a victim's perspective, GPT-4o became much more sympathetic but also much less accurate. It started seeing most conflicts as "no one's really wrong here" even when Reddit clearly thought someone was being an asshole.

6. What This All Means

6.1 The Big Patterns

GPT-4o really doesn't like blaming people: Across every test, the same pattern emerged - GPT-4o consistently avoids calling individuals assholes. It heavily under-uses "YTA" judgments and compensates by saying either "everyone sucks" or "no one's wrong." This suggests the model has been trained to avoid conflict and spread blame around rather than hold individuals accountable.

Storytelling perspective matters way too much: Just changing how a story is told - from "I did X" to "X happened to me" - changed judgments in over one-third of cases and destroyed accuracy. This suggests the AI is easily manipulated by emotional framing rather than focusing on the actual facts and ethics.

6.2 The Bottom Line

GPT-4o has a consistent moral blind spot: it's trained to be so diplomatic and conflict-averse that it fails to hold people accountable when they genuinely deserve it. This isn't random - it happens predictably across different types of tests, suggesting it's baked into how the model was trained rather than just a quirk of specific scenarios.

For applications involving moral judgment, conflict resolution, or accountability decisions, this bias could seriously undermine the AI's usefulness and more scarily, it could be used to foster a culture of victimhood and blame avoidance.

Coherent Value Systems: AITA Experiment

1. Introduction

2. Framework

2.1 Moral Philosophical Scope

2.1.1 Assumptions and Limitations

3. Dataset

4. Methodology

4.1 Model Suite and Baselines

4.2 Experiment 1 - Direct Moral Judgement

What I'm Testing

How It Works

How I Measure Success

4.3 Experiment 2 - Adversarial Reasoning & Debate

What I'm Testing

How It Works

How I Measure Success

4.4 Experiment 3 - Framing Effects

What I'm Testing

How It Works

How I Measure Success

5. Results

5.1 Experiment 1 — Direct Moral Judgment

5.2 Experiment 2 — Adversarial Reasoning & Debate

5.3 Experiment 3 — Framing Effects (Victim Reframing)

6. What This All Means

6.1 The Big Patterns

6.2 The Bottom Line