Abstract
In healthcare, vast amounts of data are stored digitally in electronic health records (EHRs). EHRs represent a largely untapped source of clinically relevant information which, combined with advances in machine learning, has the potential to move healthcare in a more data-driven direction. However, due to the complexity and poor quality of EHRs, data-driven healthcare faces many challenges. In this thesis, we address the challenge posed by the lack of ground-truth labels and provide methodological solutions to challenges related to missing data, temporality, and high dimensionality. To that end, we present four lines of work in which we develop novel unsupervised and weakly supervised learning methodology. The first work presents a kernel for multivariate time series with missing values, which frequently occur in EHRs. Key components of the method are clustering and ensemble learning. Experiments on benchmark datasets demonstrate that the proposed kernel is robust to hyper-parameter choices and performs well in the presence of missing data. Next, we present a dimensionality reduction method designed to account for many of the challenges that data-driven healthcare faces. One of these is high dimensionality; in addition, the method is capable of exploiting noisy and partially labeled multi-label data. We provide a case study of patients suffering from chronic diseases. In the third work, we present a kernel capable of exploiting informative missingness in multivariate time series, as well as a novel semi-supervised kernel. The effectiveness of the proposed methods is demonstrated via experiments on benchmark data and a case study of patients suffering from infectious postoperative complications.
In the last work, we perform phenotyping of patients with postoperative delirium using a weakly supervised learning framework, in which clinical knowledge is used to generate a noisily labeled training set that is in turn used to train classifiers. Experiments on a dataset collected from a Norwegian university hospital demonstrate the efficiency of the framework.
Has part(s)
Paper I: Mikalsen, K.Ø., Bianchi, F.M., Soguero-Ruiz, C. & Jenssen, R. (2018). Time series cluster kernel for learning similarities between multivariate time series with missing data. Pattern Recognition, 76, 569-581. The article is available in the thesis introduction. Also available at https://doi.org/10.1016/j.patcog.2017.11.030. Accepted manuscript available at http://hdl.handle.net/10037/13578.
Paper II: Mikalsen, K.Ø., Soguero-Ruiz, C., Bianchi, F.M. & Jenssen, R. Noisy multi-label semi-supervised dimensionality reduction (Submitted manuscript). Published version in Pattern Recognition, 90, 257-270 available at https://doi.org/10.1016/j.patcog.2019.01.033.
Paper III: Mikalsen, K.Ø., Soguero-Ruiz, C., Bianchi, F.M., Revhaug, A. & Jenssen, R. Time series cluster kernels to exploit informative missingness and incomplete label information (Submitted manuscript).
Paper IV: Mikalsen, K.Ø., Soguero-Ruiz, C., Jensen, K., Hindberg, K., Gran, M., Revhaug, A. … Jenssen, R. (2017). Using anchors from free text in electronic health records to diagnose postoperative delirium. Computer Methods and Programs in Biomedicine, 152, 105-114. The article is available in the thesis introduction. Also available at https://doi.org/10.1016/j.cmpb.2017.09.014.