Evaluation of Synthetic Categorical Data Generation Techniques for Predicting Cardiovascular Diseases and Post-Hoc Interpretability of the Risk Factors

García-Vicente, Clara; Chushig-Muzo, David; Mora-Jiménez, Inmaculada; Fabelo, Himar; Gram, Inger Torhild; Løchen, Maja-Lisa; Granja, Conceição; Soguero-Ruiz, Cristina

dc.contributor.author	García-Vicente, Clara
dc.contributor.author	Chushig-Muzo, David
dc.contributor.author	Mora-Jiménez, Inmaculada
dc.contributor.author	Fabelo, Himar
dc.contributor.author	Gram, Inger Torhild
dc.contributor.author	Løchen, Maja-Lisa
dc.contributor.author	Granja, Conceição
dc.contributor.author	Soguero-Ruiz, Cristina
dc.date.accessioned	2023-09-01T12:15:05Z
dc.date.available	2023-09-01T12:15:05Z
dc.date.issued	2023-03-23
dc.description.abstract	Machine Learning (ML) methods have become important for enhancing the performance of decision-support predictive models. However, class imbalance is one of the main challenges for developing ML models, because it may bias the learning process and the model generalization ability. In this paper, we consider oversampling methods for generating synthetic categorical clinical data aiming to improve the predictive performance in ML models, and the identification of risk factors for cardiovascular diseases (CVDs). We performed a comparative study of several categorical synthetic data generation methods, including Synthetic Minority Oversampling Technique Nominal (SMOTEN), Tabular Variational Autoencoder (TVAE) and Conditional Tabular Generative Adversarial Networks (CTGANs). Then, we assessed the impact of combining oversampling strategies and linear and nonlinear supervised ML methods. Lastly, we conducted a post-hoc model interpretability analysis based on the importance of the risk factors. Experimental results show the potential of GAN-based models for generating high-quality categorical synthetic data, yielding probability mass functions that are very close to those provided by real data, maintaining relevant insights, and contributing to increasing the predictive performance. The GAN-based model and a linear classifier outperform other oversampling techniques, improving the area under the curve by 2%. These results demonstrate the capability of synthetic data to help with both determining risk factors and building models for CVD prediction.	en_US
dc.identifier.citation	García-Vicente, Chushig-Muzo, Mora-Jiménez, Fabelo, Gram, Løchen, Granja, Soguero-Ruiz. Evaluation of Synthetic Categorical Data Generation Techniques for Predicting Cardiovascular Diseases and Post-Hoc Interpretability of the Risk Factors. Applied Sciences. 2023;13(7)	en_US
dc.identifier.cristinID	FRIDAID 2145735
dc.identifier.doi	10.3390/app13074119
dc.identifier.issn	2076-3417
dc.identifier.uri	https://hdl.handle.net/10037/30624
dc.language.iso	eng	en_US
dc.publisher	MDPI	en_US
dc.relation.journal	Applied Sciences
dc.rights.accessRights	openAccess	en_US
dc.rights.holder	Copyright 2023 The Author(s)	en_US
dc.rights.uri	https://creativecommons.org/licenses/by/4.0	en_US
dc.rights	Attribution 4.0 International (CC BY 4.0)	en_US
dc.title	Evaluation of Synthetic Categorical Data Generation Techniques for Predicting Cardiovascular Diseases and Post-Hoc Interpretability of the Risk Factors	en_US
dc.type.version	publishedVersion	en_US
dc.type	Journal article	en_US
dc.type	Tidsskriftartikkel	en_US
dc.type	Peer reviewed	en_US

Associated file(s)

Name: article.pdf
Size: 1.677Mb
Format: PDF

This item appears in the following collection(s)

Articles, reports and other (community medicine) [1522]

Unless otherwise stated, this item's license is described as Attribution 4.0 International (CC BY 4.0)