AI Forecasting

Hewitt, L.*, Ashokkumar, A.*, Ghezae, I., & Willer, R.
Working Paper
To evaluate whether large language models (LLMs) can be leveraged to predict the results of social science experiments, we built an archive of 70 pre-registered, nationally representative survey experiments conducted in the United States, involving 476 experimental treatment effects and 105,165 participants. We prompted an advanced, publicly available LLM (GPT-4) to simulate how representative samples of Americans would respond to the stimuli from these experiments. Predictions derived from simulated responses correlate strikingly with actual treatment effects (r = 0.85), equaling or surpassing the predictive accuracy of human forecasters. Accuracy remained high for unpublished studies that could not appear in the model’s training data (r = 0.90). We further assessed predictive accuracy across demographic subgroups, various disciplines, and in nine recent megastudies featuring an additional 346 treatment effects. Together, our results suggest LLMs can augment experimental methods in science and practice, but also highlight important limitations and risks of misuse.
Media Coverage: HAI: LLM-Aided Social Science
Experimental Results Forecaster Demo
This demo accompanies the paper Prediction of Social Science Experimental Results Using Large Language Models and can be used to predict experimental treatment effects on U.S. adults. To manage the costs of hosting this demo publicly, it uses GPT-4o-mini rather than GPT-4.

Park, J. S., Zou, C. Q., Shaw, A., Hill, B. M., Cai, C., Morris, M. R., Willer, R., Liang, P., & Bernstein, M. S.
Working Paper
The promise of human behavioral simulation—general-purpose computational agents that replicate human behavior across domains—could enable broad applications in policymaking and social science. We present a novel agent architecture that simulates the attitudes and behaviors of 1,052 real individuals—applying large language models to qualitative interviews about their lives, then measuring how well these agents replicate the attitudes and behaviors of the individuals that they represent. The generative agents replicate participants' responses on the General Social Survey 85% as accurately as participants replicate their own answers two weeks later, and perform comparably in predicting personality traits and outcomes in experimental replications. Our architecture reduces accuracy biases across racial and ideological groups compared to agents given demographic descriptions. This work provides a foundation for new tools that can help investigate individual and collective behavior.

Jackson, M. O., Mei, Q., Wang, S., Xie, Y., Yuan, W., Benzell, S. G., Brynjolfsson, E., Camerer, C. F., Evans, J. A., Jabarian, B., Kleinberg, J., Meng, J., Mullainathan, S., Ozdaglar, A. E., Pfeiffer, T., Tennenholtz, M., Willer, R., Yang, D., & Ye, T.
Working Paper
We discuss the three main areas comprising the new and emerging field of "AI Behavioral Science". This includes not only how AI can enhance research in the behavioral sciences, but also how the behavioral sciences can be used to study and better design AI, and to understand how the world will change as AI and humans interact in increasingly layered and complex ways.