LLM Model Performance Leaderboard
Comparative analysis of LLMs across research domains, measuring their performance on key statistical metrics.
Methodology
The leaderboard is based on replicating published research studies that use stated preference surveys, tools designed to understand how people make choices based on specific attributes.
We begin by reconstructing the original survey with the same set of attributes as the study. We also recreate the demographic profile of the original respondents as closely as possible, based on the information reported in the paper.
We then prompt a large language model (LLM) to simulate responses by adopting the perspectives of individuals with matching demographic characteristics. The LLM generates responses to the survey as if it were a participant in the original study.
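The prompting step could be sketched as follows. This is only an illustration: the `Persona` fields, the `build_prompt` helper, and the prompt wording are assumptions made for this example, not the leaderboard's actual prompts, and the call to the LLM itself is left out.

```python
from dataclasses import dataclass


@dataclass
class Persona:
    # Hypothetical demographic fields; a real study would use whatever
    # characteristics the original paper reports.
    age: int
    gender: str
    income_bracket: str


def build_prompt(persona, alternatives):
    """Build one choice-task prompt for a simulated respondent.

    `alternatives` maps an option label to a dict of attribute levels.
    """
    lines = [
        f"You are a {persona.age}-year-old {persona.gender} respondent "
        f"with a household income in the {persona.income_bracket} bracket.",
        "Choose the option you prefer and reply with its label only.",
    ]
    for label, attributes in alternatives.items():
        described = ", ".join(
            f"{name}: {level}" for name, level in attributes.items()
        )
        lines.append(f"Option {label}: {described}")
    return "\n".join(lines)


# Illustrative choice task with made-up attributes and levels.
example_prompt = build_prompt(
    Persona(age=34, gender="female", income_bracket="middle"),
    {
        "A": {"price": "$5", "origin": "local"},
        "B": {"price": "$3", "origin": "imported"},
    },
)
```

Repeating this for every persona and every choice task yields a synthetic response set with the same structure as the original survey data.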
Finally, we estimate the same model used in the original research based on the AI-generated responses and compare the results to the original human data. This allows us to assess the extent to which AI can replicate human decision-making in stated preference contexts.
Domain-Specific Leaderboards
Detailed performance breakdowns for each research domain.
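The "Overall Best Model" label in each domain appears to go to the model that tops the most metrics. The page does not state this rule, so the sketch below is an assumption; `overall_best` is a hypothetical helper, and the input reuses the Public Health winners listed on this page.

```python
from collections import Counter


def overall_best(metric_winners):
    """Return the model that wins the most metrics and its win count.

    Ties are broken by whichever model appears first in the input.
    """
    counts = Counter(metric_winners.values())
    model, wins = counts.most_common(1)[0]
    return model, wins


# Per-metric winners for Public Health, as listed on this page.
public_health = {
    "Manhattan Similarity (%)": "gemini flash",
    "Coverage Probability (%)": "sonnet",
    "Directional Effect Matching (%)": "sonnet",
    "Spearman Correlation (%)": "o1",
}

best_model, win_count = overall_best(public_health)
```

For Public Health this gives sonnet with 2 wins, which matches the label shown for that domain.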
Public Health
Best Performing LLMs in Public Health
Overall Best Model: sonnet (top performer in 2 metrics)
Manhattan Similarity (%): gemini flash, 50.51
Coverage Probability (%): sonnet, 45.79
Directional Effect Matching (%): sonnet, 70.95
Spearman Correlation (%): o1, 53.35
Consumer Research
Best Performing LLMs in Consumer Research
Overall Best Model: gemini flash (top performer in 2 metrics)
Manhattan Similarity (%): gemini flash, 49.45
Coverage Probability (%): gpt4, 46.02
Directional Effect Matching (%): sonnet, 82.63
Spearman Correlation (%): gemini flash, 63.63
Economics
Best Performing LLMs in Economics
Overall Best Model: gpt4 (top performer in 3 metrics)
Manhattan Similarity (%): o1, 50.19
Coverage Probability (%): gpt4, 34.92
Directional Effect Matching (%): gpt4, 78.74
Spearman Correlation (%): gpt4, 51.56
Agricultural Sciences
Best Performing LLMs in Agricultural Sciences
Overall Best Model: gpt4 (top performer in 3 metrics)
Manhattan Similarity (%): gpt4, 51.45
Coverage Probability (%): sonnet, 29.08
Directional Effect Matching (%): gpt4, 53.48
Spearman Correlation (%): gpt4, 43.50
Business Administration
Best Performing LLMs in Business Administration
Overall Best Model: gpt4 (top performer in 3 metrics)
Manhattan Similarity (%): sonnet, 50.35
Coverage Probability (%): gpt4, 35.27
Directional Effect Matching (%): gpt4, 73.22
Spearman Correlation (%): gpt4, 59.00