Behavioral fidelity leaderboard

This benchmark shows which models actually hold up against published studies of human choice.

Evidence before adoption

Why this benchmark exists

Most benchmarks reward style, speed, or gaming the test itself. We care about whether a system can reproduce the structure of human choice.

That means comparing simulated outcomes to published studies and estimating the same decision models the researchers used in the original work.

The point is not to crown a winner for a week. The point is to show what these systems can actually support.

If you cannot measure behavioral fidelity, you cannot responsibly deploy a behavioral system.

417 Replicated study runs

5 Research domains

93% Accuracy vs real human decisions

Open Published methodology

Signal	Value	Interpretation
Replicated study runs	417	Published and reconstructed experiments currently represented in the benchmark dataset
Domains covered	5	Public Health, Consumer Research, Economics, Agricultural Sciences, and Business Administration
Top weighted Spearman correlation	0.55	Best current cross-domain rank-order alignment in the dataset
Top weighted parameter sign agreement	75.4%	Best current match on coefficient direction across benchmarked studies

Transparent measurement

Benchmark methodology

The leaderboard is built by replicating published studies of human choice.

We reconstruct the original survey, match the respondent profile, estimate the same models, and compare the results back to human data.

Every score is attached to a documented protocol.

Live rankings

Model	Domain	Score	Coverage	Match	Studies
Claude Sonnet (Databricks)	Consumer Research	56.78	40.07	82.63	19
GPT-4 (Azure OpenAI)	Consumer Research	55.62	46.02	78.21	19
Claude Sonnet (Databricks)	Public Health	54.63	45.79	70.95	23
Gemini Flash (Google)	Consumer Research	54.63	23.84	81.59	19
GPT-4 (Azure OpenAI)	Business Administration	53.87	35.27	73.22	2
Claude Haiku (Databricks)	Consumer Research	53.62	39.92	76.60	19
GPT-3 Instruct (Azure OpenAI)	Consumer Research	52.79	41.50	75.97	18
GPT-4 (Azure OpenAI)	Economics	52.67	34.92	78.74	9
Gemini (Google)	Economics	51.09	33.46	76.83	9
Claude Sonnet (Databricks)	Economics	50.82	28.54	77.55	9

Similarity: parameter proximity

Coverage: confidence overlap

Match: effect direction

Rank Corr.: preference ordering
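As a rough illustration of the Match and Rank Corr. metrics listed above, the sketch below computes sign agreement and a Spearman rank correlation between human and simulated coefficients. The exact formulas behind the leaderboard are not stated on this page, so the function names, the example coefficients, and the treatment of zero as non-positive are all assumptions.

```python
from typing import List, Sequence


def sign_agreement(human: Sequence[float], simulated: Sequence[float]) -> float:
    """Share of coefficients whose sign matches (zero counts as non-positive)."""
    matches = sum(1 for h, s in zip(human, simulated) if (h > 0) == (s > 0))
    return matches / len(human)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(human: Sequence[float], simulated: Sequence[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rh, rs = average_ranks(human), average_ranks(simulated)
    n = len(rh)
    mean_h, mean_s = sum(rh) / n, sum(rs) / n
    cov = sum((a - mean_h) * (b - mean_s) for a, b in zip(rh, rs))
    var_h = sum((a - mean_h) ** 2 for a in rh)
    var_s = sum((b - mean_s) ** 2 for b in rs)
    return cov / (var_h * var_s) ** 0.5


# Hypothetical coefficients from one study: human estimates vs. simulated estimates.
human_beta = [-0.8, 0.5, 1.2, -0.1]
simulated_beta = [-0.6, 0.9, 0.7, 0.2]

match = sign_agreement(human_beta, simulated_beta)   # 3 of 4 signs agree
rank_corr = spearman(human_beta, simulated_beta)
```

Sign agreement only checks direction, while the rank correlation checks whether the simulated respondents order the attributes by importance the way humans did, so a model can score well on one and poorly on the other.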

Domain breakdowns

Each table below reproduces the domain-level rankings from the legacy leaderboard.

Public Health

8 models ranked

Model	Score	Coverage	Match	Studies
Claude Sonnet (Databricks)	54.63	45.79	70.95	23
Claude Haiku (Databricks)	50.30	44.93	65.83	23
GPT-4 (Azure OpenAI)	50.23	43.12	67.59	23
Gemini Flash (Google)	49.48	25.37	68.93	22
o1 (Azure OpenAI)	48.49	25.67	68.72	20
Gemini (Google)	39.71	40.47	55.42	23
GPT-3 Instruct (Azure OpenAI)	38.94	41.12	52.27	22
o3-mini (Azure OpenAI)	37.51	21.05	55.21	22

Consumer Research

8 models ranked

Model	Score	Coverage	Match	Studies
Claude Sonnet (Databricks)	56.78	40.07	82.63	19
GPT-4 (Azure OpenAI)	55.62	46.02	78.21	19
Gemini Flash (Google)	54.63	23.84	81.59	19
Claude Haiku (Databricks)	53.62	39.92	76.60	19
GPT-3 Instruct (Azure OpenAI)	52.79	41.50	75.97	18
o1 (Azure OpenAI)	50.65	25.33	75.20	19
Gemini (Google)	50.01	37.12	74.81	19
o3-mini (Azure OpenAI)	33.11	17.91	56.59	19

Economics

8 models ranked

Model	Score	Coverage	Match	Studies
GPT-4 (Azure OpenAI)	52.67	34.92	78.74	9
Gemini (Google)	51.09	33.46	76.83	9
Claude Sonnet (Databricks)	50.82	28.54	77.55	9
Gemini Flash (Google)	47.47	23.81	77.87	9
GPT-3 Instruct (Azure OpenAI)	45.03	21.58	75.79	9
Claude Haiku (Databricks)	43.92	31.17	73.57	9
o1 (Azure OpenAI)	38.40	18.67	65.19	9
o3-mini (Azure OpenAI)	33.19	4.77	63.35	9

Agricultural Sciences

4 models ranked

Model	Score	Coverage	Match	Studies
GPT-4 (Azure OpenAI)	43.18	24.30	53.48	2
Claude Sonnet (Databricks)	38.52	29.08	47.40	2
Claude Haiku (Databricks)	32.56	28.30	43.32	2
Gemini (Google)	29.32	28.98	44.80	2

Business Administration

4 models ranked

Model	Score	Coverage	Match	Studies
GPT-4 (Azure OpenAI)	53.87	35.27	73.22	2
Claude Sonnet (Databricks)	44.64	15.63	68.58	2
Claude Haiku (Databricks)	44.00	18.51	62.30	2
Gemini (Google)	43.12	29.16	60.10	2

// Scoring methodology

How we score

The method follows the original studies closely enough to preserve the research question, not the surface language alone.

01 ::

Original study reconstruction

We begin by reconstructing the original survey using the same set of attributes reported in the source study.
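A minimal sketch of what reconstruction could look like in code, assuming a study with three attributes. The attribute names and levels are hypothetical, and the full-factorial design with random pairing is a simplification: a faithful replication would copy the experimental design reported in the source study.

```python
import itertools
import random

# Hypothetical attributes and levels; a real replication would take these
# from the source study.
ATTRIBUTES = {
    "price": [5, 10, 15],
    "wait_days": [1, 7],
    "provider": ["clinic", "pharmacy"],
}


def full_factorial(attributes):
    """Every combination of attribute levels, one dict per alternative."""
    names = list(attributes)
    return [dict(zip(names, combo)) for combo in itertools.product(*attributes.values())]


def build_choice_tasks(alternatives, n_tasks, per_task=2, seed=0):
    """Draw distinct alternatives for each task (sampling without replacement)."""
    rng = random.Random(seed)
    return [rng.sample(alternatives, per_task) for _ in range(n_tasks)]


alternatives = full_factorial(ATTRIBUTES)   # 3 * 2 * 2 = 12 alternatives
tasks = build_choice_tasks(alternatives, n_tasks=8)
```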

02 ::

Demographic profile matching

We recreate the demographic profile of the original respondents as closely as possible from the published paper.
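One way to sketch this step, assuming the paper reports only marginal shares for each demographic trait. The traits and shares below are hypothetical, and drawing each trait independently ignores correlations between traits that the original sample may have had.

```python
import random

# Hypothetical marginal shares, standing in for a demographics table
# in a published paper.
MARGINALS = {
    "age_group": {"18-34": 0.30, "35-54": 0.40, "55+": 0.30},
    "gender": {"female": 0.52, "male": 0.48},
}


def sample_respondents(marginals, n, seed=0):
    """Draw n synthetic respondents, each trait sampled from its marginal shares."""
    rng = random.Random(seed)
    respondents = []
    for _ in range(n):
        person = {}
        for trait, shares in marginals.items():
            levels = list(shares)
            weights = list(shares.values())
            person[trait] = rng.choices(levels, weights=weights, k=1)[0]
        respondents.append(person)
    return respondents


respondents = sample_respondents(MARGINALS, n=200)
```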

03 ::

Matched respondent simulation

The model is prompted to answer as if it were a participant with the corresponding demographic characteristics.
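The prompt construction might look something like the sketch below. The wording of the prompt, the profile fields, and the `build_prompt` helper are illustrative assumptions, not the prompts used for the leaderboard; sending the prompt to a model is left out.

```python
import string


def build_prompt(respondent, task):
    """Render one choice task as a prompt for a simulated respondent."""
    persona = ", ".join(f"{trait}: {value}" for trait, value in respondent.items())
    lines = [
        f"Answer as a survey participant with this profile: {persona}.",
        "Pick the option you would choose and reply with its letter only.",
        "",
    ]
    for letter, alternative in zip(string.ascii_uppercase, task):
        described = ", ".join(f"{name} = {level}" for name, level in alternative.items())
        lines.append(f"Option {letter}: {described}")
    return "\n".join(lines)


# Hypothetical respondent and two-alternative task.
respondent = {"age_group": "35-54", "gender": "female"}
task = [
    {"price": 5, "wait_days": 7},
    {"price": 15, "wait_days": 1},
]
prompt = build_prompt(respondent, task)
```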

04 ::

Human model parity

We estimate the same statistical model used in the original research on the AI-generated responses.
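As an illustration only, the sketch below fits a binary logit to paired choices by gradient ascent on the log-likelihood. The model form is an assumption: original studies may use conditional, mixed, or other choice models, and a replication would estimate whichever one the paper reports. The data here are generated from hypothetical "true" weights so the fit can be sanity-checked.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_paired_logit(diffs, chose_a, steps=1000, lr=0.5):
    """Fit P(choose A) = sigmoid(beta . (x_A - x_B)) by gradient ascent.

    diffs[i] holds attribute differences (A minus B) for task i;
    chose_a[i] is 1 if option A was chosen, else 0.
    """
    k, n = len(diffs[0]), len(diffs)
    beta = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for x, y in zip(diffs, chose_a):
            p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
            for j in range(k):
                grad[j] += (y - p) * x[j]
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


# Synthetic responses generated from hypothetical weights (price, quality).
rng = random.Random(0)
TRUE_BETA = [-1.5, 1.0]
diffs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(400)]
chose_a = [
    1 if rng.random() < sigmoid(sum(b * x for b, x in zip(TRUE_BETA, d))) else 0
    for d in diffs
]
estimates = fit_paired_logit(diffs, chose_a)
```

In practice a statistics package would also report standard errors, which the comparison step needs; the hand-rolled optimizer here is only meant to show what "the same model" means for simulated responses.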

05 ::

Comparison to human data

Estimated parameters are compared back to the original human data to assess how well the model replicates real decision-making.
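One plausible reading of a confidence-overlap comparison is sketched below: for each parameter, check whether the 95% interval from the simulated responses overlaps the interval from the human data. This definition, the normal-approximation intervals, and the example estimates are assumptions; the page does not spell out how the Coverage column is computed.

```python
def confidence_interval(estimate, std_error, z=1.96):
    """Normal-approximation interval around a coefficient estimate."""
    return (estimate - z * std_error, estimate + z * std_error)


def intervals_overlap(a, b):
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def interval_overlap_share(human, simulated):
    """Share of parameters whose human and simulated intervals overlap.

    Each input is a list of (estimate, std_error) pairs in the same order.
    """
    hits = sum(
        1
        for h, s in zip(human, simulated)
        if intervals_overlap(confidence_interval(*h), confidence_interval(*s))
    )
    return hits / len(human)


# Hypothetical (estimate, std_error) pairs for three parameters.
human_params = [(-0.8, 0.10), (0.5, 0.20), (1.2, 0.15)]
simulated_params = [(-0.6, 0.10), (0.9, 0.10), (0.4, 0.10)]

overlap_share = interval_overlap_share(human_params, simulated_params)  # 2 of 3 overlap
```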

06 ::

Transparent publication

Results are summarized with enough methodological detail for research, legal, and procurement stakeholders to understand what is being claimed.

Frequently asked questions

// Leaderboard details

Questions about the leaderboard? Contact us at hello@subconscious.ai

01 ::

Is this the full interactive leaderboard?

This page presents the benchmark method and the highest-signal summary. We can share deeper methodological detail directly with qualified partners.

02 ::

Can we audit the methodology?

Yes. The benchmark is intentionally designed to be explainable to research, legal, procurement, and executive stakeholders.

03 ::

Why stated preference studies?

They are a rigorous way to compare how humans make trade-offs across defined attributes, which makes them useful for testing simulated decision-makers.

04 ::

What does a good score mean?

A good score means the model is preserving important structural properties of human choice in that domain. It does not mean the model is universally reliable without context.

05 ::

The homepage says 93% - the leaderboard's top score is 56.78. Which is right?

Both. 93% is replication accuracy: how often simulated studies reproduce the direction and outcome of the original human studies. The leaderboard reports stricter cross-domain statistics - rank correlation and parameter-sign agreement - for each foundation model. We publish both because a benchmark you can't interrogate isn't a benchmark.
