
Behavioral fidelity leaderboard

This benchmark shows which models actually hold up against published studies of human choice.


Evidence before adoption

Why this benchmark exists

Most benchmarks reward style, speed, or gaming of the benchmark itself. We care about whether a system can reproduce the structure of human choice.

That means comparing simulated outcomes to published studies and estimating the same decision models the researchers used in the original work.

The point is not to crown a winner for a week. The point is to show what these systems can actually support.

If you cannot measure behavioral fidelity, you cannot responsibly deploy a behavioral system.

Replicated study runs: 417
Research domains: 5
Accuracy vs real human decisions: 93%
Published methodology: Open
Signal | Value | Interpretation
Replicated study runs | 417 | Published and reconstructed experiments currently represented in the benchmark dataset
Domains covered | 5 | Public Health, Consumer Research, Economics, Agricultural Sciences, and Business Administration
Top weighted Spearman correlation | 0.55 | Best current cross-domain rank-order alignment in the dataset
Top weighted parameter sign agreement | 75.4% | Best current match on coefficient direction across benchmarked studies
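
As a rough illustration of the rank-correlation and sign-agreement signals, the sketch below computes both for each study and then averages across studies. The coefficient values are made up for illustration, and weighting each study by its number of coefficients is an assumption; this page does not state which weights the benchmark uses.

```python
from typing import Sequence


def ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def sign_agreement(human: Sequence[float], simulated: Sequence[float]) -> float:
    """Share of coefficients whose sign matches (zero counts as non-positive)."""
    hits = sum((h > 0) == (s > 0) for h, s in zip(human, simulated))
    return hits / len(human)


def weighted_mean(scores: Sequence[float], weights: Sequence[float]) -> float:
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Hypothetical coefficients, not benchmark data.
studies = [
    {"human": [-0.8, 0.5, 1.2, 0.1], "simulated": [-0.4, 0.9, 0.7, -0.2]},
    {"human": [0.3, -1.1, 0.6], "simulated": [0.2, -0.5, 0.9]},
]

per_study_rho = [spearman(s["human"], s["simulated"]) for s in studies]
per_study_sign = [sign_agreement(s["human"], s["simulated"]) for s in studies]
weights = [len(s["human"]) for s in studies]  # assumed weighting scheme

weighted_rho = weighted_mean(per_study_rho, weights)
weighted_sign = weighted_mean(per_study_sign, weights)
print(f"weighted Spearman: {weighted_rho:.3f}")
print(f"weighted sign agreement: {100 * weighted_sign:.1f}%")
```

Rank correlation rewards getting the ordering of preferences right even when magnitudes differ, while sign agreement only checks the direction of each effect, which is why the two signals are reported separately.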

Transparent measurement

Benchmark methodology

The leaderboard is built by replicating published studies of human choice.

We reconstruct the original survey, match the respondent profile, estimate the same models, and compare the results back to human data.

Every score is attached to a documented protocol.

Live rankings

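Model | Domain | Score | Coverage | Match | Studies
Claude Sonnet (Databricks) | Consumer Research | 56.78 | 40.07 | 82.63 | 19
GPT-4 (Azure OpenAI) | Consumer Research | 55.62 | 46.02 | 78.21 | 19
Claude Sonnet (Databricks) | Public Health | 54.63 | 45.79 | 70.95 | 23
Gemini Flash (Google) | Consumer Research | 54.63 | 23.84 | 81.59 | 19
GPT-4 (Azure OpenAI) | Business Administration | 53.87 | 35.27 | 73.22 | 2
Claude Haiku (Databricks) | Consumer Research | 53.62 | 39.92 | 76.60 | 19
GPT-3 Instruct (Azure OpenAI) | Consumer Research | 52.79 | 41.50 | 75.97 | 18
GPT-4 (Azure OpenAI) | Economics | 52.67 | 34.92 | 78.74 | 9
Gemini (Google) | Economics | 51.09 | 33.46 | 76.83 | 9
Claude Sonnet (Databricks) | Economics | 50.82 | 28.54 | 77.55 | 9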
Similarity: Parameter proximity
Coverage: Confidence overlap
Match: Effect direction
Rank Corr.: Preference ordering
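
The page labels Coverage only as "confidence overlap", so the exact definition is not stated here. One plausible reading, sketched below under that assumption, is the percentage of parameters whose human and simulated 95% confidence intervals overlap. The estimates and standard errors are hypothetical.

```python
def confidence_interval(estimate: float, std_error: float, z: float = 1.96):
    """Normal-approximation interval; z = 1.96 gives roughly 95% coverage."""
    return (estimate - z * std_error, estimate + z * std_error)


def intervals_overlap(a, b) -> bool:
    """True when closed intervals a = (lo, hi) and b = (lo, hi) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]


def coverage(human, simulated) -> float:
    """Percent of parameters whose human and simulated intervals overlap.

    Each input is a list of (estimate, standard_error) pairs in the same order.
    This definition is an assumption, not the benchmark's published formula.
    """
    hits = 0
    for (h_est, h_se), (s_est, s_se) in zip(human, simulated):
        h_ci = confidence_interval(h_est, h_se)
        s_ci = confidence_interval(s_est, s_se)
        if intervals_overlap(h_ci, s_ci):
            hits += 1
    return 100.0 * hits / len(human)


# Hypothetical (estimate, standard error) pairs, not benchmark data.
human = [(-0.80, 0.10), (0.50, 0.15), (1.20, 0.20)]
simulated = [(-0.40, 0.05), (0.90, 0.20), (0.70, 0.10)]

print(f"coverage: {coverage(human, simulated):.2f}")
```

Under this reading a model can match the direction of every effect and still score low on Coverage if its estimates are confidently far from the human ones, which would be consistent with Coverage sitting well below Match in the tables.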

Domain breakdowns

Each table below preserves the domain-level rankings from the legacy leaderboard.

Public Health

8 models ranked

Model | Score | Coverage | Match | Studies
Claude Sonnet (Databricks) | 54.63 | 45.79 | 70.95 | 23
Claude Haiku (Databricks) | 50.30 | 44.93 | 65.83 | 23
GPT-4 (Azure OpenAI) | 50.23 | 43.12 | 67.59 | 23
Gemini Flash (Google) | 49.48 | 25.37 | 68.93 | 22
o1 (Azure OpenAI) | 48.49 | 25.67 | 68.72 | 20
Gemini (Google) | 39.71 | 40.47 | 55.42 | 23
GPT-3 Instruct (Azure OpenAI) | 38.94 | 41.12 | 52.27 | 22
o3-mini (Azure OpenAI) | 37.51 | 21.05 | 55.21 | 22

Consumer Research

8 models ranked

Model | Score | Coverage | Match | Studies
Claude Sonnet (Databricks) | 56.78 | 40.07 | 82.63 | 19
GPT-4 (Azure OpenAI) | 55.62 | 46.02 | 78.21 | 19
Gemini Flash (Google) | 54.63 | 23.84 | 81.59 | 19
Claude Haiku (Databricks) | 53.62 | 39.92 | 76.60 | 19
GPT-3 Instruct (Azure OpenAI) | 52.79 | 41.50 | 75.97 | 18
o1 (Azure OpenAI) | 50.65 | 25.33 | 75.20 | 19
Gemini (Google) | 50.01 | 37.12 | 74.81 | 19
o3-mini (Azure OpenAI) | 33.11 | 17.91 | 56.59 | 19

Economics

8 models ranked

Model | Score | Coverage | Match | Studies
GPT-4 (Azure OpenAI) | 52.67 | 34.92 | 78.74 | 9
Gemini (Google) | 51.09 | 33.46 | 76.83 | 9
Claude Sonnet (Databricks) | 50.82 | 28.54 | 77.55 | 9
Gemini Flash (Google) | 47.47 | 23.81 | 77.87 | 9
GPT-3 Instruct (Azure OpenAI) | 45.03 | 21.58 | 75.79 | 9
Claude Haiku (Databricks) | 43.92 | 31.17 | 73.57 | 9
o1 (Azure OpenAI) | 38.40 | 18.67 | 65.19 | 9
o3-mini (Azure OpenAI) | 33.19 | 4.77 | 63.35 | 9

Agricultural Sciences

4 models ranked

Model | Score | Coverage | Match | Studies
GPT-4 (Azure OpenAI) | 43.18 | 24.30 | 53.48 | 2
Claude Sonnet (Databricks) | 38.52 | 29.08 | 47.40 | 2
Claude Haiku (Databricks) | 32.56 | 28.30 | 43.32 | 2
Gemini (Google) | 29.32 | 28.98 | 44.80 | 2

Business Administration

4 models ranked

Model | Score | Coverage | Match | Studies
GPT-4 (Azure OpenAI) | 53.87 | 35.27 | 73.22 | 2
Claude Sonnet (Databricks) | 44.64 | 15.63 | 68.58 | 2
Claude Haiku (Databricks) | 44.00 | 18.51 | 62.30 | 2
Gemini (Google) | 43.12 | 29.16 | 60.10 | 2

Scoring methodology

How we score

The method follows the original studies closely enough to preserve the research question, not the surface language alone.

01 :: Original study reconstruction

We begin by reconstructing the original survey using the same set of attributes reported in the source study.
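
A minimal sketch of what reconstructing a choice survey from reported attributes can look like. The attributes, levels, and the random pairing of profiles are all hypothetical; a real replication would follow the experimental design reported in the source study, which is often a fractional or efficient design and not random draws.

```python
import itertools
import random

# Hypothetical attributes and levels, standing in for those a paper reports.
attributes = {
    "price": ["$5", "$10", "$15"],
    "delivery": ["same day", "3 days"],
    "brand": ["store brand", "national brand"],
}


def full_factorial(attrs: dict[str, list[str]]) -> list[dict[str, str]]:
    """Every combination of attribute levels, one dict per product profile."""
    names = list(attrs)
    return [dict(zip(names, combo)) for combo in itertools.product(*attrs.values())]


def build_choice_tasks(attrs, n_tasks: int, alternatives: int = 2, seed: int = 0):
    """Draw n_tasks choice sets, each with distinct profiles (assumed design)."""
    rng = random.Random(seed)
    profiles = full_factorial(attrs)
    return [rng.sample(profiles, alternatives) for _ in range(n_tasks)]


tasks = build_choice_tasks(attributes, n_tasks=8)
for number, task in enumerate(tasks, start=1):
    print(number, task)
```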

02 :: Demographic profile matching

We recreate the demographic profile of the original respondents as closely as possible from the published paper.

03 :: Matched respondent simulation

The model is prompted to answer as if it were a participant with the corresponding demographic characteristics.
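
Steps 02 and 03 can be sketched together: draw a simulated respondent from the published demographic shares, then phrase the choice task as a prompt for that persona. The shares, trait names, and prompt wording below are hypothetical, and sampling each trait independently is a simplifying assumption that ignores correlations between traits. No model API is called; the sketch stops at the prompt text.

```python
import random
import string

# Hypothetical marginal shares, standing in for a paper's sample description.
profile_marginals = {
    "age_group": {"18-34": 0.30, "35-54": 0.40, "55+": 0.30},
    "gender": {"female": 0.52, "male": 0.48},
    "income": {"low": 0.25, "middle": 0.50, "high": 0.25},
}


def sample_respondent(marginals, rng: random.Random) -> dict[str, str]:
    """Draw one level per trait, with probability equal to its published share."""
    return {
        trait: rng.choices(list(shares), weights=list(shares.values()), k=1)[0]
        for trait, shares in marginals.items()
    }


def build_prompt(respondent: dict[str, str], task: list[dict[str, str]]) -> str:
    """Render a persona plus one choice task as plain prompt text."""
    persona = ", ".join(f"{k.replace('_', ' ')}: {v}" for k, v in respondent.items())
    lines = [
        f"Answer as a survey participant with this profile: {persona}.",
        "Which option would you choose?",
    ]
    for label, option in zip(string.ascii_uppercase, task):
        description = ", ".join(f"{k}: {v}" for k, v in option.items())
        lines.append(f"Option {label}: {description}")
    lines.append("Reply with the option letter only.")
    return "\n".join(lines)


rng = random.Random(42)
respondent = sample_respondent(profile_marginals, rng)
example_task = [
    {"price": "$5", "delivery": "3 days"},
    {"price": "$10", "delivery": "same day"},
]
prompt = build_prompt(respondent, example_task)
print(prompt)
```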

04 :: Human model parity

We estimate the same statistical model used in the original research on the AI-generated responses.
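
The page does not name the statistical model, and it will differ by study. As an illustration only, the sketch below assumes a conditional logit, a common choice in stated-preference research, and fits it to a handful of made-up responses with plain gradient ascent. A real replication would use the estimator named in the source paper, normally through an established statistics package.

```python
import math


def choice_probabilities(beta, alternatives):
    """Softmax over linear utilities, shifted by the max for numerical stability."""
    utilities = [sum(b * x for b, x in zip(beta, alt)) for alt in alternatives]
    top = max(utilities)
    exps = [math.exp(u - top) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


def fit_conditional_logit(tasks, n_params, steps=2000, learning_rate=0.1):
    """Maximize the conditional logit log-likelihood by gradient ascent."""
    beta = [0.0] * n_params
    for _ in range(steps):
        grad = [0.0] * n_params
        for alternatives, chosen in tasks:
            probs = choice_probabilities(beta, alternatives)
            for k in range(n_params):
                expected = sum(p * alt[k] for p, alt in zip(probs, alternatives))
                # Gradient term: chosen attribute value minus its expectation.
                grad[k] += alternatives[chosen][k] - expected
        beta = [b + learning_rate * g / len(tasks) for b, g in zip(beta, grad)]
    return beta


# Hypothetical responses: each task is ([(price, quality), ...], chosen_index).
responses = [
    ([(1.0, 0.0), (2.0, 1.0)], 0),
    ([(1.0, 1.0), (2.0, 0.0)], 0),
    ([(1.0, 0.0), (3.0, 1.0)], 0),
    ([(2.0, 0.0), (1.0, 1.0)], 1),
    ([(1.0, 0.0), (1.5, 1.0)], 1),
    ([(2.0, 1.0), (1.0, 0.0)], 0),
    ([(3.0, 1.0), (1.0, 1.0)], 0),
    ([(1.0, 1.0), (2.0, 1.0)], 0),
]

price_coef, quality_coef = fit_conditional_logit(responses, n_params=2)
print(f"price: {price_coef:.2f}, quality: {quality_coef:.2f}")
```

Fitting the same specification to the simulated responses is what makes the resulting coefficients directly comparable to the published human estimates.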

05 :: Comparison to human data

Estimated parameters are compared back to the original human data to assess how well the model replicates real decision-making.

06 :: Transparent publication

Results are summarized with enough methodological detail for research, legal, and procurement stakeholders to understand what is being claimed.

Leaderboard details

Frequently asked questions

Questions about the leaderboard? Contact us at hello@subconscious.ai

01 :: Is this the full interactive leaderboard?

This page presents the benchmark method and the highest-signal summary. We can share deeper methodological detail directly with qualified partners.

02 :: Can we audit the methodology?

Yes. The benchmark is intentionally designed to be explainable to research, legal, procurement, and executive stakeholders.

03 :: Why stated preference studies?

They are a rigorous way to compare how humans make trade-offs across defined attributes, which makes them useful for testing simulated decision-makers.

04 :: What does a good score mean?

A good score means the model is preserving important structural properties of human choice in that domain. It does not mean the model is universally reliable without context.

05 :: The homepage says 93%, but the leaderboard's top score is 56.78. Which is right?

Both. 93% is replication accuracy: how often simulated studies reproduce the direction and outcome of the original human studies. The leaderboard reports stricter cross-domain statistics (rank correlation and parameter-sign agreement) for each foundation model. We publish both because a benchmark you can't interrogate isn't a benchmark.