Validation of a Dermatology-Focused Multimodal Large Language Model in Classification of Pigmented Skin Lesions

Date
2025-11-06
Language
American English
Found At
MDPI
Abstract

Background: Artificial intelligence (AI) has shown significant promise in augmenting diagnostic capabilities across medical specialties. Recent advancements in generative AI allow for synthesis and interpretation of complex clinical data including imaging and patient history to assess disease risk.

Objective: To evaluate the diagnostic performance of a dermatology-trained multimodal large language model (DermFlow, Delaware, USA) in assessing malignancy risk of pigmented skin lesions.

Methods: This retrospective study utilized data from 59 patients with 68 biopsy-proven pigmented skin lesions seen at Indiana University clinics from February 2023 to May 2025. De-identified patient histories and clinical images were input into DermFlow, and clinical images only were input into Claude Sonnet 4 (Claude) to generate differential diagnoses. Clinician pre-operative diagnoses were extracted from the clinical note. Assessments were compared to histopathologic diagnoses (gold standard).

Results: Among 68 clinically concerning pigmented lesions, DermFlow achieved 47.1% top diagnosis accuracy and 92.6% any-diagnosis accuracy, with F1 = 0.948, sensitivity 93.9%, and specificity 89.5% (balanced accuracy 91.7%). Claude had 8.8% top diagnosis and 73.5% any-diagnosis accuracy, F1 = 0.816, sensitivity 81.6%, specificity 52.6% (balanced accuracy 67.1%). Clinicians achieved 38.2% top diagnosis and 72.1% any-diagnosis accuracy, F1 = 0.776, sensitivity 67.3%, specificity 84.2% (balanced accuracy 75.8%). DermFlow recommended biopsy in 95.6% of cases vs. 82.4% for Claude, with multiple pairwise differences favoring DermFlow (p < 0.05).

Conclusions: DermFlow demonstrated comparable or superior diagnostic performance to clinicians and superior performance to Claude in evaluating pigmented skin lesions. Although additional data must be gathered to further validate the model in real clinical settings, these initial findings suggest potential utility for dermatology-trained AI models in clinical practice, particularly in settings with limited dermatologist availability.
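
The summary metrics in the Results can be related to one another with a little arithmetic. The sketch below is illustrative only and is not the authors' analysis code. Balanced accuracy is taken as the mean of sensitivity and specificity, which reproduces the reported values. The confusion-matrix counts are an assumption: the abstract does not state them, and they are simply one set of integers (49 malignant and 19 benign lesions out of 68, treating malignant as the positive class) that happens to be consistent with the reported sensitivity, specificity, and F1 figures.

```python
def balanced_accuracy(sensitivity, specificity):
    """Mean of sensitivity and specificity (same units as the inputs)."""
    return (sensitivity + specificity) / 2


def f1_from_counts(tp, fp, fn):
    """F1 score expressed directly in confusion-matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)


# ASSUMPTION: these counts are NOT given in the abstract. They were inferred
# as integers consistent with the reported percentages for 68 lesions, with
# "positive" meaning a malignant histopathologic diagnosis.
ASSUMED_COUNTS = {
    "DermFlow":   {"tp": 46, "fn": 3,  "tn": 17, "fp": 2},
    "Claude":     {"tp": 40, "fn": 9,  "tn": 10, "fp": 9},
    "Clinicians": {"tp": 33, "fn": 16, "tn": 16, "fp": 3},
}

for rater, c in ASSUMED_COUNTS.items():
    sensitivity = 100 * c["tp"] / (c["tp"] + c["fn"])  # true positive rate, %
    specificity = 100 * c["tn"] / (c["tn"] + c["fp"])  # true negative rate, %
    print(
        f"{rater}: sensitivity {sensitivity:.1f}%, "
        f"specificity {specificity:.1f}%, "
        f"balanced accuracy {balanced_accuracy(sensitivity, specificity):.1f}%, "
        f"F1 {f1_from_counts(c['tp'], c['fp'], c['fn']):.3f}"
    )
```

Because the counts are inferred rather than reported, the full article should be treated as the authority on the underlying confusion matrices.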

Cite As
Mijares J, Jairath N, Zhang A, Que SKT. Validation of a Dermatology-Focused Multimodal Large Language Model in Classification of Pigmented Skin Lesions. Diagnostics (Basel). 2025;15(21):2808. Published 2025 Nov 6. doi:10.3390/diagnostics15212808
Journal
Diagnostics
Source
PMC
Type
Article
Version
Final published version