Fine-tuning Large Language Models on historical causes of death data
Permanent link
https://hdl.handle.net/10037/34160
Date
2024-05-15
Type
Master thesis
Author
Wilhelmsen, Kristoffer Berg
Abstract
This thesis assesses the impact of fine-tuning and retrieval-augmented generation (RAG) on the accuracy of large language models (LLMs) in assigning ICD-10 codes to historical causes of death. Using funeral records from Trondheim, Norway (1830-1920), we fine-tuned Llama 3 and Mistral on 2000 records. Twelve experiments were conducted on 2000 additional records to evaluate the accuracy of each knowledge-injection technique, as well as a combination of the two.
The results indicate that fine-tuning as a standalone knowledge-injection technique achieved the highest accuracy, generating 88% full matches and 2% partial matches for ICD-10 codes, compared with 58% full matches and 25% partial matches in previous research. However, concerns remain about memorization of training data, given the lack of diversity in the available dataset. Moreover, combining RAG with fine-tuning decreased accuracy, and a RAG-only approach decreased it even further. These findings serve as a proof of concept for the automatic assignment of ICD-10 codes to historical causes of death, paving the way for future research.
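To make the full-match and partial-match percentages concrete, the following is a minimal sketch of how such rates could be computed. This record does not define a partial match, so the sketch assumes it means that the predicted and expected codes share the same three-character ICD-10 category (for example, A16.9 against A16.2). The function names `normalize`, `score`, and `match_rates` are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def normalize(code: str) -> str:
    """Upper-case a code and drop whitespace and the dot, e.g. ' a16.2 ' -> 'A162'."""
    return code.strip().upper().replace(".", "")


def score(predicted: str, expected: str) -> str:
    """Classify one prediction as 'full', 'partial', or 'none'.

    ASSUMPTION: 'partial' means the same three-character ICD-10 category.
    The thesis may define partial matches differently.
    """
    p, e = normalize(predicted), normalize(expected)
    if not p or not e:
        return "none"  # an empty prediction or label never counts as a match
    if p == e:
        return "full"
    if p[:3] == e[:3]:
        return "partial"
    return "none"


def match_rates(pairs):
    """Return the share of full, partial, and non-matching predictions."""
    if not pairs:
        raise ValueError("at least one (predicted, expected) pair is required")
    counts = Counter(score(p, e) for p, e in pairs)
    return {kind: counts[kind] / len(pairs) for kind in ("full", "partial", "none")}


# Illustrative pairs only; these are not records from the thesis dataset.
example_pairs = [
    ("A16.2", "A16.2"),
    ("A16.9", "A16.2"),
    ("J18.9", "A16.2"),
    ("I50.9", "I50.9"),
]
rates = match_rates(example_pairs)
```

Under this assumed definition, a result such as "88% full matches and 2% partial matches" would correspond to `rates["full"]` and `rates["partial"]` computed over the 2000 evaluation records.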
Publisher
UiT The Arctic University of Norway
Copyright 2024 The Author(s)