LLM Model Performance Leaderboard

A comparative analysis of LLMs across research domains, scored on four statistical metrics: Manhattan similarity, coverage probability, directional effect matching, and Spearman correlation.

Methodology

The leaderboard is based on replicating published research studies that use stated preference surveys, which are tools designed to understand how people make choices based on specific attributes.

We begin by reconstructing the original survey with the same set of attributes as the study. We also recreate the demographic profile of the original respondents as closely as possible, based on the information reported in the paper.

We then prompt a large language model (LLM) to simulate responses by adopting the perspectives of individuals with matching demographic characteristics. The LLM generates responses to the survey as if it were a participant in the original study.
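The prompting step might look roughly like the sketch below. It is illustrative only: the `Persona` fields, the `Alternative` structure, the prompt wording, and the example attribute levels are all assumptions made for this example, not the leaderboard's actual survey or prompt template.

```python
from dataclasses import dataclass


@dataclass
class Persona:
    """Hypothetical respondent profile; the fields are illustrative only."""
    age: int
    gender: str
    income_bracket: str


@dataclass
class Alternative:
    """One option in a choice task."""
    label: str
    attributes: dict[str, str]  # attribute name -> level shown to the respondent


def build_prompt(persona: Persona, alternatives: list[Alternative]) -> str:
    """Render one choice task as a prompt asking the model to answer in character."""
    lines = [
        f"You are a {persona.age}-year-old {persona.gender} "
        f"with a household income in the {persona.income_bracket} bracket.",
        "Answer the survey question below as this person would.",
        "",
        "Which option do you prefer?",
    ]
    for alt in alternatives:
        levels = ", ".join(f"{name}: {level}" for name, level in alt.attributes.items())
        lines.append(f"Option {alt.label}: {levels}")
    lines.append("Reply with the option label only.")
    return "\n".join(lines)


# Made-up persona and attribute levels, purely for demonstration.
prompt = build_prompt(
    Persona(age=42, gender="woman", income_bracket="middle"),
    [
        Alternative("A", {"price": "$3.50", "origin": "local"}),
        Alternative("B", {"price": "$2.90", "origin": "imported"}),
    ],
)
print(prompt)
```

Repeating this for every simulated respondent and every choice task yields a synthetic response set with the same shape as the original survey data.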

Finally, we estimate the same model used in the original research based on the AI-generated responses and compare the results to the original human data. This allows us to assess the extent to which AI can replicate human decision-making in stated preference contexts.
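This page does not give formulas for the metrics, so the sketch below shows only one plausible reading of three of them, applied to coefficient estimates from the human and LLM-based models. The definitions are assumptions: directional effect matching as the share of coefficients with the same sign, coverage probability as the share of LLM estimates that fall inside the human 95% confidence interval, and Spearman correlation as the standard rank correlation. Manhattan similarity is left out because its normalization is not stated here. All numbers are made up.

```python
from math import sqrt


def directional_match(human, llm):
    """Percent of coefficients whose sign agrees (assumed definition)."""
    hits = sum(1 for h, l in zip(human, llm) if (h > 0) == (l > 0))
    return 100.0 * hits / len(human)


def coverage(human, human_se, llm, z=1.96):
    """Percent of LLM estimates inside the human estimate +/- z * SE (assumed definition)."""
    inside = sum(1 for h, se, l in zip(human, human_se, llm) if abs(l - h) <= z * se)
    return 100.0 * inside / len(human)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical coefficient estimates for four survey attributes.
human_coefs = [0.8, -0.5, 0.3, -0.1]
human_ses = [0.1, 0.1, 0.1, 0.1]
llm_coefs = [0.6, -0.45, -0.2, 0.1]

print(directional_match(human_coefs, llm_coefs))
print(coverage(human_coefs, human_ses, llm_coefs))
print(100.0 * spearman(human_coefs, llm_coefs))  # scaled to percent, as the tables report it
```

With definitions like these, all three scores are percentages and higher values mean the LLM-based estimates track the human ones more closely.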

| LLM Model | Domain | Manhattan Similarity (%) | Coverage Probability (%) | Directional Effect Matching (%) | Spearman Correlation (%) |
|---|---|---|---|---|---|
| gpt4 | Agricultural Sciences | 51.45 🥇 | 24.30 | 53.48 🥇 | 43.50 🥇 |
| gemini flash | Public Health | 50.51 🥇 | 25.37 | 68.93 🥈 | 53.09 🥈 |
| sonnet | Business Administration | 50.35 🥇 | 15.63 | 68.58 🥈 | 44.00 🥉 |
| o1 | Economics | 50.19 🥇 | 18.67 | 65.19 | 19.56 |
| gemini flash | Consumer Research | 49.45 🥇 | 23.84 | 81.59 🥈 | 63.63 🥇 |
| sonnet | Public Health | 49.23 🥈 | 45.79 🥇 | 70.95 🥇 | 52.57 🥉 |
| haiku | Business Administration | 49.20 🥈 | 18.51 🥉 | 62.30 🥉 | 46.00 🥈 |
| gemini | Business Administration | 48.70 🥉 | 29.16 🥈 | 60.10 | 34.50 |
| sonnet | Agricultural Sciences | 48.60 🥈 | 29.08 🥇 | 47.40 🥈 | 29.00 🥈 |
| gpt3-instruct | Consumer Research | 48.43 🥈 | 41.50 🥈 | 75.97 | 45.28 |

🥇 Top performer
🥈 Second best
🥉 Third best

Showing entries 1 - 10 of 32, sorted by Manhattan Similarity in descending order. Medals rank models within each research domain.

Domain-Specific Leaderboards

Detailed performance breakdowns for each research domain.

Public Health

| LLM Model | Manhattan Similarity (%) | Coverage Probability (%) | Directional Effect Matching (%) | Spearman Correlation (%) |
|---|---|---|---|---|
| gemini flash | 50.51 🥇 | 25.37 | 68.93 🥈 | 53.09 🥈 |
| sonnet | 49.23 🥈 | 45.79 🥇 | 70.95 🥇 | 52.57 🥉 |
| haiku | 47.97 🥉 | 44.93 🥈 | 65.83 | 42.48 |
| o1 | 46.22 | 25.67 | 68.72 🥉 | 53.35 🥇 |
| gpt3-instruct | 44.50 | 41.12 | 52.27 | 17.86 |
| gpt4 | 43.17 | 43.12 🥉 | 67.59 | 47.04 |
| gpt o3 | 41.22 | 21.05 | 55.21 | 32.55 |
| gemini | 40.19 | 40.47 | 55.42 | 22.78 |

Best Performing LLMs in Public Health

Overall Best Model: 🥇 sonnet (top performer in 2 metrics)

Manhattan Similarity (%): 🥇 gemini flash, 50.51
Coverage Probability (%): 🥇 sonnet, 45.79
Directional Effect Matching (%): 🥇 sonnet, 70.95
Spearman Correlation (%): 🥇 o1, 53.35
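The page does not say how the overall best model is chosen. A minimal sketch, assuming it is simply the model that takes first place in the most metrics, reproduces the Public Health result from the scores in the table above. Only the selection rule is an assumption; the numbers are copied from that table.

```python
from collections import Counter

METRICS = [
    "Manhattan Similarity (%)",
    "Coverage Probability (%)",
    "Directional Effect Matching (%)",
    "Spearman Correlation (%)",
]

# Public Health scores, in the same order as METRICS.
scores = {
    "gemini flash": (50.51, 25.37, 68.93, 53.09),
    "sonnet": (49.23, 45.79, 70.95, 52.57),
    "haiku": (47.97, 44.93, 65.83, 42.48),
    "o1": (46.22, 25.67, 68.72, 53.35),
    "gpt3-instruct": (44.50, 41.12, 52.27, 17.86),
    "gpt4": (43.17, 43.12, 67.59, 47.04),
    "gpt o3": (41.22, 21.05, 55.21, 32.55),
    "gemini": (40.19, 40.47, 55.42, 22.78),
}

# Best model per metric: the highest score in each column.
winners = {
    metric: max(scores, key=lambda model: scores[model][i])
    for i, metric in enumerate(METRICS)
}

# Assumed rule: the overall best is the model with the most first places.
overall, first_places = Counter(winners.values()).most_common(1)[0]

for metric, model in winners.items():
    print(f"{metric}: {model}")
print(f"Overall: {overall} (top performer in {first_places} metrics)")
```

The same counting rule is consistent with the other domain summaries on this page, though ties would need an explicit tie-break that the page does not describe.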

Consumer Research

| LLM Model | Manhattan Similarity (%) | Coverage Probability (%) | Directional Effect Matching (%) | Spearman Correlation (%) |
|---|---|---|---|---|
| gemini flash | 49.45 🥇 | 23.84 | 81.59 🥈 | 63.63 🥇 |
| gpt3-instruct | 48.43 🥈 | 41.50 🥈 | 75.97 | 45.28 |
| haiku | 48.18 🥉 | 39.92 | 76.60 | 49.79 |
| sonnet | 46.71 | 40.07 🥉 | 82.63 🥇 | 57.74 🥈 |
| o1 | 46.03 | 25.33 | 75.20 | 56.05 🥉 |
| gpt4 | 44.62 | 46.02 🥇 | 78.21 🥉 | 53.63 |
| gemini | 42.92 | 37.12 | 74.81 | 45.21 |
| gpt o3 | 35.48 | 17.91 | 56.59 | 22.47 |

Best Performing LLMs in Consumer Research

Overall Best Model: 🥇 gemini flash (top performer in 2 metrics)

Manhattan Similarity (%): 🥇 gemini flash, 49.45
Coverage Probability (%): 🥇 gpt4, 46.02
Directional Effect Matching (%): 🥇 sonnet, 82.63
Spearman Correlation (%): 🥇 gemini flash, 63.63

Economics

| LLM Model | Manhattan Similarity (%) | Coverage Probability (%) | Directional Effect Matching (%) | Spearman Correlation (%) |
|---|---|---|---|---|
| o1 | 50.19 🥇 | 18.67 | 65.19 | 19.56 |
| gpt3-instruct | 47.97 🥈 | 21.58 | 75.79 | 34.78 |
| gemini | 47.29 🥉 | 33.46 🥈 | 76.83 | 46.78 🥉 |
| sonnet | 46.54 | 28.54 | 77.55 🥉 | 50.67 🥈 |
| gpt4 | 45.46 | 34.92 🥇 | 78.74 🥇 | 51.56 🥇 |
| gemini flash | 44.31 | 23.81 | 77.87 🥈 | 43.89 |
| haiku | 39.17 | 31.17 🥉 | 73.57 | 31.78 |
| gpt o3 | 37.20 | 4.77 | 63.35 | 27.44 |

Best Performing LLMs in Economics

Overall Best Model: 🥇 gpt4 (top performer in 3 metrics)

Manhattan Similarity (%): 🥇 o1, 50.19
Coverage Probability (%): 🥇 gpt4, 34.92
Directional Effect Matching (%): 🥇 gpt4, 78.74
Spearman Correlation (%): 🥇 gpt4, 51.56

Agricultural Sciences

| LLM Model | Manhattan Similarity (%) | Coverage Probability (%) | Directional Effect Matching (%) | Spearman Correlation (%) |
|---|---|---|---|---|
| gpt4 | 51.45 🥇 | 24.30 | 53.48 🥇 | 43.50 🥇 |
| sonnet | 48.60 🥈 | 29.08 🥇 | 47.40 🥈 | 29.00 🥈 |
| haiku | 46.10 🥉 | 28.30 🥉 | 43.32 | 12.50 🥉 |
| gemini | 43.50 | 28.98 🥈 | 44.80 🥉 | 0.00 |

Best Performing LLMs in Agricultural Sciences

Overall Best Model: 🥇 gpt4 (top performer in 3 metrics)

Manhattan Similarity (%): 🥇 gpt4, 51.45
Coverage Probability (%): 🥇 sonnet, 29.08
Directional Effect Matching (%): 🥇 gpt4, 53.48
Spearman Correlation (%): 🥇 gpt4, 43.50

Business Administration

| LLM Model | Manhattan Similarity (%) | Coverage Probability (%) | Directional Effect Matching (%) | Spearman Correlation (%) |
|---|---|---|---|---|
| sonnet | 50.35 🥇 | 15.63 | 68.58 🥈 | 44.00 🥉 |
| haiku | 49.20 🥈 | 18.51 🥉 | 62.30 🥉 | 46.00 🥈 |
| gemini | 48.70 🥉 | 29.16 🥈 | 60.10 | 34.50 |
| gpt4 | 48.00 | 35.27 🥇 | 73.22 🥇 | 59.00 🥇 |

Best Performing LLMs in Business Administration

Overall Best Model: 🥇 gpt4 (top performer in 3 metrics)

Manhattan Similarity (%): 🥇 sonnet, 50.35
Coverage Probability (%): 🥇 gpt4, 35.27
Directional Effect Matching (%): 🥇 gpt4, 73.22
Spearman Correlation (%): 🥇 gpt4, 59.00