Instruction-guided deidentification with synthetic test cases for Norwegian clinical text

Lund, Jørgen Aarmo; Burman, Per Joel; Woldaregay, Ashenafi Zebene; Jenssen, Robert; Mikalsen, Karl Øyvind

dc.contributor.author	Lund, Jørgen Aarmo
dc.contributor.author	Burman, Per Joel
dc.contributor.author	Woldaregay, Ashenafi Zebene
dc.contributor.author	Jenssen, Robert
dc.contributor.author	Mikalsen, Karl Øyvind
dc.date.accessioned	2024-11-13T13:12:39Z
dc.date.available	2024-11-13T13:12:39Z
dc.date.issued	2024
dc.description.abstract	Deidentification methods, which remove directly identifying information, can be useful tools to mitigate the privacy risks associated with sharing healthcare data. However, benchmarks for evaluating deidentification methods are often derived from real clinical data, which makes the benchmarks themselves sensitive and therefore harder to share and apply. Given the rapid advances in generative language modelling, we would like to leverage large language models to construct freely available deidentification benchmarks, and to assist in the deidentification process. We apply the GPT-4 language model to construct, for the first time, a publicly available dataset of synthetic Norwegian discharge summaries with annotated identifying details, consisting of 1200 summaries averaging 100 words each. In our sample of documents, we find that the generated annotations agree closely with human annotations, with an F1 score of 0.983. We then examine whether large language models can perform deidentification directly, proposing methods where an instruction-tuned language model is prompted to either annotate or redact identifying details. Comparing the methods on our synthetic dataset and the NorSynthClinical-PHI dataset, we find that GPT-4 underperforms the baseline method proposed by Bråthen et al. [1], suggesting that named entity recognition problems are still challenging for instruction-tuned language models.	en_US
dc.description	Source at <a href=https://proceedings.mlr.press/v233/>https://proceedings.mlr.press/v233/</a>	en_US
dc.identifier.citation	Lund, Burman, Woldaregay, Jenssen, Mikalsen. Instruction-guided deidentification with synthetic test cases for Norwegian clinical text. Proceedings of Machine Learning Research (PMLR). 2024;233:145-152	en_US
dc.identifier.cristinID	FRIDAID 2300235
dc.identifier.issn	2640-3498
dc.identifier.uri	https://hdl.handle.net/10037/35695
dc.language.iso	eng	en_US
dc.publisher	PMLR	en_US
dc.relation.journal	Proceedings of Machine Learning Research (PMLR)
dc.relation.projectID	UiT Norges arktiske universitet: 303514	en_US
dc.relation.projectID	Norges forskningsråd: 327520	en_US
dc.rights.accessRights	openAccess	en_US
dc.rights.holder	Copyright 2024 The Author(s)	en_US
dc.rights.uri	https://creativecommons.org/licenses/by/4.0	en_US
dc.rights	Attribution 4.0 International (CC BY 4.0)	en_US
dc.title	Instruction-guided deidentification with synthetic test cases for Norwegian clinical text	en_US
dc.type.version	publishedVersion	en_US
dc.type	Journal article	en_US
dc.type	Tidsskriftartikkel	en_US
dc.type	Peer reviewed	en_US

File(s) in this item

Name: article.pdf
Size: 319.8Kb
Format: PDF

This item appears in the following collection(s)

Artikler, rapporter og annet (UB) [3275]

Except where otherwise noted, this item's license is described as Attribution 4.0 International (CC BY 4.0)