DataAtlas: automatic generation of data dictionaries using large language models.

Raffaele Giancotti1,2, Rajna Fani1,3, Rafi Al Attrach1,3

  • 1Laboratory for Computational Physiology, MIT Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA 02139, United States.

JAMIA Open | June 29, 2026
Summary

This summary is machine-generated.

DataAtlas automates data dictionary creation for tabular datasets, improving data interpretation and reuse. This system enhances data accessibility and analytical performance for clinical data.

Area of Science:

  • Data Science
  • Bioinformatics
  • Clinical Informatics

Background:

  • Dataset reuse is hampered by poor documentation, limiting secondary analysis.
  • Automated data dictionaries are needed to improve data interpretability and accessibility.

Purpose of the Study:

  • Develop DataAtlas, an open-source system for automated data dictionary generation from tabular datasets.
  • Enhance the accessibility, reproducibility, and reuse of clinical data through improved documentation.

Main Methods:

  • DataAtlas integrates structural profiling and large language model (LLM)-based semantic inference.
  • It generates descriptions using column metadata, statistical summaries, and sample values.
  • The system was evaluated on clinical datasets using validation, expert review, and task performance.
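
The profiling-then-prompting flow described above can be sketched roughly as follows. This is a minimal illustration, not the DataAtlas implementation: the function names `profile_column` and `build_prompt`, the prompt wording, and the choice of statistics are all assumptions made for the example, and no LLM is actually called.

```python
import statistics


def profile_column(name, values, n_samples=3):
    """Structural profile of one column (hypothetical helper, not from DataAtlas)."""
    # Treat empty strings and None as missing.
    present = [v for v in values if v not in ("", None)]

    # Guess the type: numeric only if every present value parses as a float.
    numeric = []
    for v in present:
        try:
            numeric.append(float(v))
        except (TypeError, ValueError):
            numeric = None
            break

    profile = {
        "name": name,
        "n_rows": len(values),
        "n_missing": len(values) - len(present),
        "n_distinct": len(set(present)),
    }
    if numeric:
        profile["dtype"] = "numeric"
        profile["min"] = min(numeric)
        profile["max"] = max(numeric)
        profile["mean"] = statistics.fmean(numeric)
    else:
        profile["dtype"] = "text"

    # Sample values: the first few distinct values, in order of appearance.
    profile["samples"] = list(dict.fromkeys(present))[:n_samples]
    return profile


def build_prompt(table, profile):
    """Assemble column metadata, summary statistics, and sample values into a prompt."""
    lines = [
        f"Table: {table}",
        f"Column: {profile['name']} ({profile['dtype']})",
        f"Missing: {profile['n_missing']} of {profile['n_rows']} rows",
        f"Distinct values: {profile['n_distinct']}",
    ]
    if profile["dtype"] == "numeric":
        lines.append(
            f"Range: {profile['min']} to {profile['max']}, mean {profile['mean']:.2f}"
        )
    lines.append("Sample values: " + ", ".join(map(str, profile["samples"])))
    lines.append("Write a one-sentence data dictionary description of this column.")
    return "\n".join(lines)
```

The resulting prompt string would then be sent to whichever LLM is in use; including the sample values in the prompt reflects the summary's point that they help ground the generated description.
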

Keywords:

  • clinical data
  • data dictionary
  • generative AI
  • large language models

Main Results:

  • Generated descriptions were often preferred over official documentation, especially when the official documentation was incomplete.
  • Human expert review confirmed high accuracy and low hallucination rates for LLM-generated descriptions.
  • Augmenting database schemas with generated data dictionaries significantly improved text-to-SQL execution accuracy (0.52 to 0.88).
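
To make the text-to-SQL result concrete, the sketch below shows one way a schema could be augmented with dictionary entries and how execution accuracy is commonly computed. The summary does not describe the paper's exact evaluation protocol, so both functions are assumptions: `augment_schema` attaches descriptions as SQL comments, and `execution_accuracy` counts a predicted query as correct when it returns the same rows as the reference query, ignoring row order.

```python
import sqlite3


def augment_schema(table, columns, dictionary):
    """Render CREATE TABLE with dictionary entries as SQL comments (hypothetical format)."""
    body = []
    for i, (name, sqltype) in enumerate(columns):
        comma = "," if i < len(columns) - 1 else ""
        desc = dictionary.get(name)
        comment = f"  -- {desc}" if desc else ""
        body.append(f"  {name} {sqltype}{comma}{comment}")
    return f"CREATE TABLE {table} (\n" + "\n".join(body) + "\n);"


def run_query(conn, sql):
    """Return the query's rows in a canonical order, or None if the query fails."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) query pairs whose result sets match."""
    if not pairs:
        return 0.0
    hits = 0
    for predicted, gold in pairs:
        result = run_query(conn, predicted)
        # A failing predicted query never counts as a match.
        if result is not None and result == run_query(conn, gold):
            hits += 1
    return hits / len(pairs)
```

In such a setup, the augmented schema text would be placed in the text-to-SQL prompt in place of the bare schema, and the generated queries would then be scored against reference queries on the same database.
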

Conclusions:

  • Automated data dictionary generation enhances dataset interpretability and downstream analytical performance.
  • Column-level metadata, particularly sample values, is crucial for grounding LLM descriptions.
  • DataAtlas offers a practical solution for generating structured data dictionaries, boosting clinical data reuse.