
dc.contributor.advisor: Bongo, Lars Ailo
dc.contributor.advisor: Svendsen, Kristian
dc.contributor.advisor: Holsbø, Einar
dc.contributor.author: Skar, Tengel Ekrem
dc.date.accessioned: 2019-07-17T10:36:15Z
dc.date.available: 2019-07-17T10:36:15Z
dc.date.issued: 2019-06-01
dc.description.abstract: The potential for knowledge discovery is currently underutilized in pharmacoepidemiologic data sets. A big data set enables finding and assessing rare drug consumption patterns that are associated with adverse drug reactions causing hospitalization or death. To enable such exploration of big pharmacoepidemiology data, four key issues need to be addressed. First, to ingest, transform, preprocess and analyze population-scale data, we require large computation power and storage capabilities, and therefore a distributed computing framework. Second, to expose patterns between drug consumption and end-points such as hospitalization, we need to develop feature extraction and preprocessing algorithms that represent drug consumption and hospitalization in a numerical format. Third, to detect these patterns, we require models from libraries for statistics and machine learning. To interpret performance metrics, we also require visualization libraries. Fourth, to enable rapid development of data exploration methods, we require an interactive system that makes the frameworks, libraries and methods for explorative analyses available in a single, cohesive environment. We make three contributions. First, we present the design and implementation of a system with a live coding environment, which enables use of Apache Spark, our choice of big data framework. It provides scikit-learn and TensorFlow with Keras for machine learning, and Matplotlib and Plotly for visualization. All libraries and frameworks are made available by the interactive environment, which enables rapid development, and Spark enables workloads to scale. Second, to enable machine learning methods, we provide algorithms for feature extraction of drug consumption. We observe drug consumption in hospitalized and unhospitalized patient groups, and label the patients according to their group. This results in a data set that we use in supervised learning.
Third, we assess the performance of hospitalization prediction on the data set. We also estimate which drugs are over-represented in hospitalized patients. The results are available in an executable notebook format, and the implementations are modifiable so that researchers can re-purpose the preprocessing algorithms and analyses for their needs. To predict hospitalization, a logistic regression achieved an area under the receiver operating characteristic curve (AUC) of 0.758, and a neural network achieved an AUC of 0.771. We bootstrapped logistic regressions to identify the 200 (of 900) drugs for which the regression obtains stable estimates. The estimates for the omitted 700 drugs had high variance, which indicates that these drugs are under-represented in our data. Neither model predicted hospitalization well. From the bootstrap analysis we identified which drugs occur frequently enough in our data, and which do not. We believe that improved data cleaning can improve both models' prediction performance, and that more data will enable more accurate log-odds estimates for the remaining 700 drugs. We learned that good prediction of hospitalization from drug consumption is not possible with our current preprocessing, but we also learned which drugs are most and least likely to be usable for prediction. [en_US]
dc.identifier.uri: https://hdl.handle.net/10037/15776
dc.language.iso: eng [en_US]
dc.publisher: UiT Norges arktiske universitet [en_US]
dc.publisher: UiT The Arctic University of Norway [en_US]
dc.rights.accessRights: openAccess [en_US]
dc.rights.holder: Copyright 2019 The Author(s)
dc.rights.uri: https://creativecommons.org/licenses/by-nc-sa/4.0 [en_US]
dc.rights: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) [en_US]
dc.subject.courseID: INF-3981
dc.subject: VDP::Matematikk og Naturvitenskap: 400::Informasjons- og kommunikasjonsvitenskap: 420::Kommunikasjon og distribuerte systemer: 423 [en_US]
dc.subject: VDP::Mathematics and natural science: 400::Information and communication science: 420::Communication and distributed systems: 423 [en_US]
dc.title: Scalable exploration of population-scale drug consumption data [en_US]
dc.type: Master thesis [en_US]
dc.type: Mastergradsoppgave [en_US]



Unless otherwise stated, this item is licensed under Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).