A novel fine-tuning and evaluation methodology for large language models on IoT raw data summaries (LLM-RawDMeth): A joint perspective in diabetes care

  • 1Department of Computer Science, University of Jaén, Jaén, 23071, Spain. Electronic address: jgaitan@ujaen.es.
  • 2Department of Languages and Computer Systems, University of Granada, E.T.S. de Ingenierías Informática y de Telecomunicación, Granada, 18071, Spain. Electronic address: cmcruz@ugr.es.
  • 3Department of Computer Science, University of Jaén, Jaén, 23071, Spain. Electronic address: mestevez@ujaen.es.
  • 4Department of Computer Science, University of Jaén, Jaén, 23071, Spain. Electronic address: ddjimene@ujaen.es.
  • 5Department of Computer Science, University of Jaén, Jaén, 23071, Spain. Electronic address: llopez@ujaen.es.

Abstract

BACKGROUND AND OBJECTIVE

Diabetes is a global health concern, affecting millions of adults worldwide, and its prevalence continues to grow. Management of the disease relies heavily on continuous glucose monitoring, yet the dense and complex data streams produced by these electronic devices pose significant challenges for efficient interpretation. Large Language Models are widely applied across domains for their ability to generate human-like text, but they still fall short when producing accurate and meaningful text from raw data. To address this limitation, this study proposes a fine-tuning methodology tailored specifically to glucose data, yet scalable to other expert-guided domains, enabling the models to generate concise, relevant and safe summaries and bridging the gap between raw data and efficient medical attention.

METHODS

This study introduces a novel continuous glucose monitoring framework in which GPT models are fine-tuned on structured datasets generated through expert-guided data modeling based on Fuzzy Logic, with prompt engineering used for task contextualization. A new evaluation methodology is defined to assess the performance of Large Language Models across critical domains where expert knowledge is fundamental to characterizing temporally dependent data and ensuring valuable insights.
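To make the expert-guided data modeling step more concrete, the following minimal sketch shows one way raw continuous glucose readings could be characterized with fuzzy membership functions and turned into a structured, domain-guided prompt. This is not the authors' implementation: the glycemic thresholds, membership shapes, function names and prompt wording are illustrative assumptions only.

```python
# Minimal sketch, assuming illustrative glycemic thresholds and prompt wording;
# NOT the authors' implementation, only an example of fuzzy, expert-guided
# characterization of raw CGM readings used for prompt construction.
from statistics import mean

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership; +/-inf shoulders give open-ended ranges."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

def fuzzy_memberships(glucose_mg_dl):
    """Membership degrees for common glycemic states (thresholds are assumptions)."""
    inf = float("inf")
    return {
        "hypoglycemia":  trapezoid(glucose_mg_dl, -inf, -inf, 70, 80),
        "in_range":      trapezoid(glucose_mg_dl, 70, 80, 170, 180),
        "hyperglycemia": trapezoid(glucose_mg_dl, 170, 180, inf, inf),
    }

def characterize_window(readings):
    """Aggregate per-reading memberships into window-level descriptors."""
    labels = ("hypoglycemia", "in_range", "hyperglycemia")
    agg = {lab: round(mean(fuzzy_memberships(g)[lab] for g in readings), 2)
           for lab in labels}
    return {"mean_glucose_mg_dl": round(mean(readings), 1),
            "memberships": agg,
            "dominant_state": max(agg, key=agg.get)}

def build_prompt(window_summary):
    """Domain-guided prompt contextualizing the summarization task."""
    return ("You are a diabetes-care assistant. Summarize the following CGM window "
            "for a clinician, flagging any risk of hypo- or hyperglycemia.\n"
            f"Structured input: {window_summary}")

if __name__ == "__main__":
    cgm_window = [62, 68, 75, 90, 130, 185, 210, 190, 150, 110]  # mg/dL, 5-min samples
    print(build_prompt(characterize_window(cgm_window)))
```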

RESULTS

Fine-tuned GPT-4o achieved the highest performance, with an average score of 96% across all metrics. GPT-4o-mini followed with a score of 76%, while GPT-3.5 scored 72%. Fuzzy knowledge-based prompts proved more effective in scenarios with full data availability, or with simplified data availability when the models were not fine-tuned; domain-guided prompts improved output relevance and stability in fine-tuned models with reduced data availability.
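For clarity, the short sketch below shows how per-metric scores from such an evaluation could be averaged into the single percentages reported above; the metric names and individual values are purely hypothetical, and this is not the paper's scoring code.

```python
# Illustrative only: aggregating hypothetical per-metric scores (0-1) into the
# kind of averaged percentage reported above; not the paper's scoring code.
from statistics import mean

def average_score(metric_scores):
    """Unweighted mean of normalized metric scores, expressed as a percentage."""
    return round(100 * mean(metric_scores.values()), 1)

if __name__ == "__main__":
    # Hypothetical per-metric results for a single model (assumed metric names).
    hypothetical_gpt4o = {"relevance": 0.97, "conciseness": 0.95,
                          "safety": 0.96, "accuracy": 0.96}
    print(f"Average: {average_score(hypothetical_gpt4o)}%")  # -> 96.0%
```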

CONCLUSIONS

These results indicate the capability of our methods to align Large Language Models with the task of generating human-like text from raw data, highlighting their potential to support diabetes management through the interpretation of complex glucose patterns while alleviating the burden on healthcare systems.