The model student: GPT-4 performance on graduate biomedical science exams

  • Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL, 32610, USA. ds@ufl.edu.

Summary

This summary is machine-generated.

Large language models like GPT-4 show strong performance on graduate biomedical science exams, outscoring the student average on most. However, limitations in interpreting figures and the potential for plagiarism require careful consideration before academic use.

Area Of Science

  • Biomedical Sciences
  • Artificial Intelligence
  • Educational Technology

Background

  • Large language models (LLMs) like GPT-4 and ChatGPT are increasingly capable text generators.
  • GPT-4 has demonstrated proficiency on standardized tests, but its reliability across diverse knowledge domains still needs evaluation.
  • Assessing AI performance in specialized fields like biomedical sciences is crucial for understanding its potential and limitations.

Purpose Of The Study

  • To evaluate the performance and accuracy of the GPT-4 large language model on graduate-level biomedical science examinations.
  • To identify specific question formats and content types where GPT-4 excels or struggles.
  • To inform the future design of academic assessments in the context of advanced AI tools.

Main Methods

  • GPT-4 was tested on nine graduate-level biomedical science exams, including seven that were blinded.
  • Performance was analyzed across different question types: fill-in-the-blank, short-answer, essay, and questions involving figures.
  • Responses were assessed for accuracy, plagiarism, and instances of hallucination.

Main Results

  • GPT-4 surpassed the student average score in seven out of nine exams and exceeded all student scores in four exams.
  • The model performed well on text-based questions and on questions involving figures taken from published manuscripts.
  • It performed poorly on questions with figures containing simulated data and on questions requiring hand-drawn answers; plagiarism and hallucinations were noted in some responses.

Conclusions

  • GPT-4 demonstrates significant capabilities in answering graduate-level biomedical science questions, often outperforming human students.
  • The model's limitations, particularly with visual data and potential for generating inaccurate or plagiarized content, highlight the need for careful integration into academic settings.
  • Future academic assessments may need adaptation to account for AI capabilities and mitigate potential misuse.