dc.contributor.authorLund, Jørgen Aarmo
dc.contributor.authorBurman, Per Joel
dc.contributor.authorWoldaregay, Ashenafi Zebene
dc.contributor.authorJenssen, Robert
dc.contributor.authorMikalsen, Karl Øyvind
dc.date.accessioned2024-11-13T13:12:39Z
dc.date.available2024-11-13T13:12:39Z
dc.date.issued2024
dc.description.abstractDeidentification methods, which remove directly identifying information, can be useful tools to mitigate the privacy risks associated with sharing healthcare data. However, benchmarks for evaluating deidentification methods are often derived from real clinical data, which makes them sensitive in their own right and therefore harder to share and apply. Given the rapid advances in generative language modelling, we would like to leverage large language models to construct freely available deidentification benchmarks, and to assist in the deidentification process. We apply the GPT-4 language model to construct, for the first time, a publicly available dataset of synthetic Norwegian discharge summaries with annotated identifying details, consisting of 1200 summaries averaging 100 words each. In our sample of documents, we find that the generated annotations agree closely with human annotations, with an F1 score of 0.983. We then examine whether large language models can be applied directly to perform deidentification, proposing methods where an instruction-tuned language model is prompted to either annotate or redact identifying details. Comparing the methods on our synthetic dataset and the NorSynthClinical-PHI dataset, we find that GPT-4 underperforms the baseline method proposed by Bråthen et al. [1], suggesting that named entity recognition problems are still challenging for instruction-tuned language models.en_US
dc.descriptionSource at <a href="https://proceedings.mlr.press/v233/">https://proceedings.mlr.press/v233/</a>en_US
dc.identifier.citationLund, Burman, Woldaregay, Jenssen, Mikalsen. Instruction-guided deidentification with synthetic test cases for Norwegian clinical text. Proceedings of Machine Learning Research (PMLR). 2024;233:145-152en_US
dc.identifier.cristinIDFRIDAID 2300235
dc.identifier.issn2640-3498
dc.identifier.urihttps://hdl.handle.net/10037/35695
dc.language.isoengen_US
dc.publisherPMLRen_US
dc.relation.journalProceedings of Machine Learning Research (PMLR)
dc.relation.projectIDUiT Norges arktiske universitet: 303514en_US
dc.relation.projectIDNorges forskningsråd: 327520en_US
dc.rights.accessRightsopenAccessen_US
dc.rights.holderCopyright 2024 The Author(s)en_US
dc.rights.urihttps://creativecommons.org/licenses/by/4.0en_US
dc.rightsAttribution 4.0 International (CC BY 4.0)en_US
dc.titleInstruction-guided deidentification with synthetic test cases for Norwegian clinical texten_US
dc.type.versionpublishedVersionen_US
dc.typeJournal articleen_US
dc.typeTidsskriftartikkelen_US
dc.typePeer revieweden_US



Except where otherwise noted, this item's license is described as Attribution 4.0 International (CC BY 4.0)