Scalable exploration of population-scale drug consumption data

Skar, Tengel Ekrem

dc.contributor.advisor	Bongo, Lars Ailo
dc.contributor.advisor	Svendsen, Kristian
dc.contributor.advisor	Holsbø, Einar
dc.contributor.author	Skar, Tengel Ekrem
dc.date.accessioned	2019-07-17T10:36:15Z
dc.date.available	2019-07-17T10:36:15Z
dc.date.issued	2019-06-01
dc.description.abstract	The potential for knowledge discovery in pharmacoepidemiologic data sets is currently underutilized. A big dataset enables finding and assessing rare drug consumption patterns that are associated with adverse drug reactions causing hospitalization or death. To enable such exploration of big pharmacoepidemiology data, four key issues need to be addressed. First, to ingest, transform, preprocess and analyze population-scale data, we require large computation power and storage capabilities, and therefore a distributed computing framework. Second, to expose patterns between drug consumption and end-points such as hospitalization, we need to develop feature extraction and preprocessing algorithms that represent the drug consumption and hospitalization in a numerical format. Third, to detect these patterns, we require models from libraries for statistics and machine learning. To interpret performance metrics, we also require visualization libraries. Fourth, to enable rapid development of data exploration methods, we require an interactive system that makes the frameworks, libraries and methods for explorative analyses available in a single, cohesive environment. We make three contributions. First, we present the design and implementation of a system with a live coding environment, which enables use of Apache Spark, our choice of big data framework. It provides scikit-learn and TensorFlow with Keras for machine learning, and matplotlib and Plotly for visualization. All libraries and frameworks are made available by the interactive environment, which enables rapid development, and Spark enables workloads to scale. Second, to enable machine learning methods, we provide algorithms for feature extraction of drug consumption. We observe drug consumption in hospitalized and unhospitalized patient groups, and label the patients according to their group. This results in a data set that we use in supervised learning. 
Third, we assess the performance in prediction of hospitalization on the data set. We also estimate over-represented drugs in hospitalized patients. The results are available in an executable notebook format, and the implementations are modifiable so that researchers can re-purpose the preprocessing algorithms and analyses for their needs. To predict hospitalization, a logistic regression achieved an Area Under the receiver operating characteristic Curve (AUC) of 0.758, and a neural network achieved an AUC of 0.771. We bootstrapped logistic regressions to obtain a list of 200 (of 900) drugs for which the regression obtains stable estimates. The omitted 700 drugs had high variance, which indicates that they are under-represented in our data altogether. Neither model predicted hospitalization well. From the bootstrap analysis we identified which drugs occur frequently enough in our data, and which do not. We believe that improved data cleaning can improve both models' prediction performance, and that more data will enable more accurate log-odds estimates for the remaining 700 drugs. We learned that good prediction of hospitalization from drug consumption is not possible with our current preprocessing, but we also learned which drugs are most and least likely to be usable for prediction.	en_US
dc.identifier.uri	https://hdl.handle.net/10037/15776
dc.language.iso	eng	en_US
dc.publisher	UiT Norges arktiske universitet	en_US
dc.publisher	UiT The Arctic University of Norway	en_US
dc.rights.accessRights	openAccess	en_US
dc.rights.holder	Copyright 2019 The Author(s)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-sa/4.0	en_US
dc.rights	Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)	en_US
dc.subject.courseID	INF-3981
dc.subject	VDP::Matematikk og Naturvitenskap: 400::Informasjons- og kommunikasjonsvitenskap: 420::Kommunikasjon og distribuerte systemer: 423	en_US
dc.subject	VDP::Mathematics and natural science: 400::Information and communication science: 420::Communication and distributed systems: 423	en_US
dc.title	Scalable exploration of population-scale drug consumption data	en_US
dc.type	Master thesis	en_US
dc.type	Mastergradsoppgave	en_US