Assessing Large Language Models in Building a Structured Dataset From AskDocs Subreddit Data: Methodological Study
Summary
This summary is machine-generated. Large language models (LLMs) extracted health information from social media posts with accuracy comparable to that of human annotators. The findings support the use of LLMs for analyzing digital health communications and online user behavior.
Area Of Science
- Digital Health
- Natural Language Processing
- Computational Social Science
Background
- The subreddit r/AskDocs is a key platform for digital health consultations.
- Analyzing unstructured user-generated content from forums like r/AskDocs is challenging.
- Large language models (LLMs) offer advanced tools for extracting health information from social media.
Purpose Of The Study
- To evaluate the efficacy of LLMs in transforming unstructured r/AskDocs data into a structured format.
- To compare LLM data extraction performance against human annotators.
- To assess the alignment of LLM-based data extraction with human cognitive processes.
Main Methods
- Data extraction from 2800 r/AskDocs posts using human annotators (medical students) and LLMs.
- Human annotation included demographics, inquiry type, proxy relationship, chronic conditions, and consultation status.
- LLM data extraction used engineered prompts (JSON output, few-shot examples) with models such as Llama 3, Gemma, and GPT; Cohen κ was used to assess inter-annotator reliability.
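As a rough illustration of the prompting approach described in the methods, the sketch below assembles a few-shot prompt that requests JSON output and validates a model reply. It is not the study's actual prompt: the field names in `FIELDS`, the helper names `build_prompt` and `parse_response`, and the example post are all hypothetical, chosen only to mirror the annotation categories listed above. No model is called; the reply is a hand-written string.

```python
import json

# Hypothetical field names, loosely based on the annotation categories
# in this summary; the study's real schema is not given here.
FIELDS = ["age", "sex", "inquiry_type", "proxy_relationship",
          "chronic_condition", "consulted_doctor"]


def build_prompt(post, examples):
    """Assemble a few-shot prompt that asks for one JSON object per post."""
    lines = [
        "Extract the following fields from the Reddit post and answer "
        "with a single JSON object using exactly these keys: "
        + ", ".join(FIELDS) + ". Use null when a field is not stated.",
        "",
    ]
    # Each few-shot example is a (post text, labelled dict) pair.
    for example_post, example_labels in examples:
        lines.append("Post: " + example_post)
        lines.append("JSON: " + json.dumps(example_labels))
        lines.append("")
    lines.append("Post: " + post)
    lines.append("JSON:")
    return "\n".join(lines)


def parse_response(text):
    """Return the parsed dict, or None if the reply is not valid JSON
    with exactly the expected keys."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(FIELDS):
        return None
    return data


# Invented example for illustration only; not taken from r/AskDocs.
EXAMPLES = [(
    "32F with asthma since childhood. New rash on my arm, have not seen anyone yet.",
    {"age": 32, "sex": "F", "inquiry_type": "symptom",
     "proxy_relationship": None, "chronic_condition": "asthma",
     "consulted_doctor": False},
)]

prompt = build_prompt("My son (5M) has had a fever for two days.", EXAMPLES)
reply = json.dumps(EXAMPLES[0][1])  # stand-in for a model reply
parsed = parse_response(reply)
```

Rejecting replies that lack the expected keys keeps malformed model output out of the structured dataset; whether the study handled such cases this way is not stated in this summary.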
Main Results
- Llama 3 70B (7 few-shot examples) and GPT-4 (2 few-shot examples) achieved the highest accuracy (87.4%) against the human-annotated gold standard.
- Llama 3 70B demonstrated superior performance in coding health-related content.
- GPT-4 excelled in extracting demographic information from unstructured posts.
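The two agreement measures named in this summary, accuracy against a gold standard and Cohen κ, can be computed as in the minimal sketch below. The label lists are invented toy data, not values from the study, and the function names are illustrative.

```python
from collections import Counter


def accuracy(predicted, gold):
    """Fraction of items where the predicted label equals the gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Toy labels for illustration only.
gold = ["self", "proxy", "self", "self"]
pred = ["self", "proxy", "proxy", "self"]

print(accuracy(pred, gold))     # 0.75
print(cohen_kappa(gold, pred))  # 0.5
```

Accuracy counts raw matches, while κ discounts the agreement that would be expected by chance given how often each label is used, which is why the two values differ on the same toy data.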
Conclusions
- LLMs demonstrate comparable performance to human annotators in extracting demographic and health information from social media health forums.
- This study validates LLMs as reliable tools for analyzing digital health communications.
- LLMs show potential for advancing methodologies in digital research by understanding online behaviors and interactions.

