Small data: practical modeling issues in human-model -omic data

Holsbø, Einar Jakobsen

dc.contributor.advisor	Bongo, Lars Ailo
dc.contributor.author	Holsbø, Einar Jakobsen
dc.date.accessioned	2019-02-08T14:28:48Z
dc.date.available	2019-02-08T14:28:48Z
dc.date.issued	2019-02-08
dc.description.abstract	Human-model data are very valuable and important in biomedical research. Ethical and economic constraints limit access to such data, and consequently these datasets rarely comprise more than a few hundred observations. As measurements are comparatively cheap, the tendency is to measure as many things as possible for the few, valuable participants in a study. With -omics technologies it is cheap and simple to make hundreds of thousands of measurements simultaneously. This few observations–many measurements setting is a high-dimensional problem in the technical language. Most gene expression experiments measure the expression levels of 10 000–15 000 genes for fewer than 100 subjects. I refer to this as the small data setting. This dissertation is an exercise in practical data analysis as it happens in a large epidemiological cohort study. It comprises three main projects: (i) predictive modeling of breast cancer metastasis from whole-blood transcriptomics measurements; (ii) standardizing a microarray data quality assessment in the Norwegian Women and Cancer (NOWAC) postgenome cohort; and (iii) shrinkage estimation of rates. These three are all small data analyses for various reasons. Predictive modeling in the small data setting is very challenging. There are several modern methods built to tackle high-dimensional data, but there is a need to evaluate these methods against one another when analyzing data in practice. Through the metastasis prediction work we learned first-hand that common practices in machine learning can be inefficient or harmful, especially for small data. I will outline some of the more important issues. In a large project such as NOWAC there is a need to centralize and disseminate knowledge and procedures. The standardization of NOWAC quality assessment was a project born of necessity.
The standard operating procedure for outlier removal was developed so that preprocessing of the NOWAC microarray material should happen in the same way every time. We take this procedure from an archaic R-script that resided in people's email inboxes to a well-documented, open-source R-package and present the NOWAC guidelines for microarray quality control. The procedure is built around the inherent high value of a single observation. Small data are plagued by high variance. When working with small data, it is usually profitable to bias models by shrinkage or borrowing of information from elsewhere. We present a pseudo-Bayesian estimator of rates in an informal crime rate study. We exhibit the value of such procedures in a small data setting and demonstrate some novel considerations about the coverage properties of such a procedure. In short, I gather some common practices in predictive modeling as applied to small data and assess their practical implications. I argue that with more focus on human-based datasets in biomedicine there is a need for particular consideration of these data in a small data paradigm to allow for reliable analysis. I will present what I believe to be sensible guidelines.	en_US
dc.description.doctoraltype	ph.d.	en_US
dc.description.popularabstract	Data derived from humans are very valuable in biomedical research. Access to human participants in research projects is limited by costs and ethical considerations. It is comparatively cheap to make many measurements of the participants we have recruited. Modern gene sequencing technologies enable us to take thousands to hundreds of thousands of measurements for the tens to hundreds of participants of a research project. This presents unique data analysis challenges. I explore these challenges and make guidelines for what I call the "small data" regime.	en_US
dc.description	<p>This thesis is based on the following articles: <p>Chapter 2: Holsbø, E., Perduca, V., Bongo, L.A., Lund, E. & Birmelé, E. (Manuscript). Stratified time-course gene preselection shows a pre-diagnostic transcriptomic signal for metastasis in blood cells: a proof of concept from the NOWAC study. Available at <a href= https://doi.org/10.1101/141325>https://doi.org/10.1101/141325</a>. <p>Chapter 3: Bøvelstad, H.M., Holsbø, E., Bongo, L.A. & Lund, E. (Manuscript). A Standard Operating Procedure For Outlier Removal In Large-Sample Epidemiological Transcriptomics Datasets. Available at <a href= https://doi.org/10.1101/144519>https://doi.org/10.1101/144519</a>. <p>Chapter 4: Holsbø, E. & Perduca, V. (2018). Shrinkage estimation of rate statistics. <i>Case Studies in Business, Industry and Government Statistics 7</i>(1), 14-25. Also available at <a href=http://hdl.handle.net/10037/14678>http://hdl.handle.net/10037/14678</a>.	en_US
dc.identifier.isbn	978-82-8236-330-3 (print) and 978-82-8236-331-0 (pdf)
dc.identifier.uri	https://hdl.handle.net/10037/14660
dc.language.iso	eng	en_US
dc.publisher	UiT Norges arktiske universitet	en_US
dc.publisher	UiT The Arctic University of Norway	en_US
dc.rights.accessRights	openAccess	en_US
dc.rights.holder	Copyright 2019 The Author(s)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-sa/3.0	en_US
dc.rights	Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)	en_US
dc.subject	VDP::Mathematics and natural science: 400::Mathematics: 410::Statistics: 412	en_US
dc.subject	VDP::Matematikk og Naturvitenskap: 400::Matematikk: 410::Statistikk: 412	en_US
dc.subject	VDP::Mathematics and natural science: 400::Information and communication science: 420	en_US
dc.subject	VDP::Matematikk og Naturvitenskap: 400::Informasjons- og kommunikasjonsvitenskap: 420	en_US
dc.subject	VDP::Mathematics and natural science: 400::Basic biosciences: 470::Bioinformatics: 475	en_US
dc.subject	VDP::Matematikk og Naturvitenskap: 400::Basale biofag: 470::Bioinformatikk: 475	en_US
dc.subject	VDP::Mathematics and natural science: 400::Basic biosciences: 470::Genetics and genomics: 474	en_US
dc.subject	VDP::Matematikk og Naturvitenskap: 400::Basale biofag: 470::Genetikk og genomikk: 474	en_US
dc.title	Small data: practical modeling issues in human-model -omic data	en_US
dc.type	Doctoral thesis	en_US
dc.type	Doktorgradsavhandling	en_US

File(s) in this item

Name: thesis.pdf
Size: 7.460Mb
Format: PDF

Name: license.txt
Size: 1.402Kb
Format: Text file

This item appears in the following collection(s)

Doktorgradsavhandlinger (NT-fak) [322]

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)

Related items

Geometric Modeling- and Sensor Technology Applications for Engineering Problems

Engineering methods for enhancing railway geometry and winter road assessment: A safety and maintenance perspective

Iceberg Drift-Trajectory Modelling and Probability Distributions of the Predictions