dc.contributor.advisorBongo, Lars Ailo
dc.contributor.authorHolsbø, Einar Jakobsen
dc.date.accessioned2019-02-08T14:28:48Z
dc.date.available2019-02-08T14:28:48Z
dc.date.issued2019-02-08
dc.description.abstractHuman-model data are very valuable and important in biomedical research. Ethical and economic constraints limit access to such data, and consequently these datasets rarely comprise more than a few hundred observations. As measurements are comparatively cheap, the tendency is to measure as many things as possible for the few, valuable participants in a study. With -omics technologies it is cheap and simple to make hundreds of thousands of measurements simultaneously. This few observations–many measurements setting is a high-dimensional problem in the technical language. Most gene expression experiments measure the expression levels of 10 000–15 000 genes for fewer than 100 subjects. I refer to this as the small data setting. This dissertation is an exercise in practical data analysis as it happens in a large epidemiological cohort study. It comprises three main projects: (i) predictive modeling of breast cancer metastasis from whole-blood transcriptomics measurements; (ii) standardizing a microarray data quality assessment in the Norwegian Women and Cancer (NOWAC) postgenome cohort; and (iii) shrinkage estimation of rates. These three are all small data analyses for various reasons. Predictive modeling in the small data setting is very challenging. There are several modern methods built to tackle high-dimensional data, but there is a need to evaluate these methods against one another when analyzing data in practice. Through the metastasis prediction work we learned first-hand that common practices in machine learning can be inefficient or harmful, especially for small data. I will outline some of the more important issues. In a large project such as NOWAC there is a need to centralize and disseminate knowledge and procedures. The standardization of NOWAC quality assessment was a project born of necessity. 
The standard operating procedure for outlier removal was developed so that preprocessing of the NOWAC microarray material should happen in the same way every time. We take this procedure from an archaic R-script that resided in people's email inboxes to a well-documented, open-source R-package and present the NOWAC guidelines for microarray quality control. The procedure is built around the inherent high value of a single observation. Small data are plagued by high variance. When working with small data, it is usually profitable to bias models by shrinkage or borrowing of information from elsewhere. We present a pseudo-Bayesian estimator of rates in an informal crime rate study. We exhibit the value of such procedures in a small data setting and demonstrate some novel considerations about the coverage properties of such a procedure. In short, I gather some common practices in predictive modeling as applied to small data and assess their practical implications. I argue that with more focus on human-based datasets in biomedicine there is a need for particular consideration of these data in a small data paradigm to allow for reliable analysis. I will present what I believe to be sensible guidelines.en_US
dc.description.doctoraltypeph.d.en_US
dc.description.popularabstractData derived from humans are very valuable in biomedical research. Access to human participants in research projects is limited by costs and ethical considerations. It is comparatively cheap to make many measurements of the participants we have recruited. Modern gene sequencing technologies enable us to take thousands to hundreds of thousands of measurements for the tens to hundreds of participants of a research project. This presents unique data analysis challenges. I explore these challenges and make guidelines for what I call the "small data" regime.en_US
dc.descriptionThis thesis is based on the following articles: Chapter 2: Holsbø, E., Perduca, V., Bongo, L.A., Lund, E. & Birmelé, E. (Manuscript). Stratified time-course gene preselection shows a pre-diagnostic transcriptomic signal for metastasis in blood cells: a proof of concept from the NOWAC study. Available at https://doi.org/10.1101/141325. Chapter 3: Bøvelstad, H.M., Holsbø, E., Bongo, L.A. & Lund, E. (Manuscript). A Standard Operating Procedure For Outlier Removal In Large-Sample Epidemiological Transcriptomics Datasets. Available at https://doi.org/10.1101/144519. Chapter 4: Holsbø, E. & Perduca, V. (2018). Shrinkage estimation of rate statistics. Case Studies in Business, Industry and Government Statistics 7(1), 14-25. Also available at http://hdl.handle.net/10037/14678.en_US
dc.identifier.isbn978-82-8236-330-3 (print) and 978-82-8236-331-0 (pdf)
dc.identifier.urihttps://hdl.handle.net/10037/14660
dc.language.isoengen_US
dc.publisherUiT Norges arktiske universiteten_US
dc.publisherUiT The Arctic University of Norwayen_US
dc.rights.accessRightsopenAccessen_US
dc.rights.holderCopyright 2019 The Author(s)
dc.rights.urihttps://creativecommons.org/licenses/by-nc-sa/3.0en_US
dc.rightsAttribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)en_US
dc.subjectVDP::Mathematics and natural science: 400::Mathematics: 410::Statistics: 412en_US
dc.subjectVDP::Matematikk og Naturvitenskap: 400::Matematikk: 410::Statistikk: 412en_US
dc.subjectVDP::Mathematics and natural science: 400::Information and communication science: 420en_US
dc.subjectVDP::Matematikk og Naturvitenskap: 400::Informasjons- og kommunikasjonsvitenskap: 420en_US
dc.subjectVDP::Mathematics and natural science: 400::Basic biosciences: 470::Bioinformatics: 475en_US
dc.subjectVDP::Matematikk og Naturvitenskap: 400::Basale biofag: 470::Bioinformatikk: 475en_US
dc.subjectVDP::Mathematics and natural science: 400::Basic biosciences: 470::Genetics and genomics: 474en_US
dc.subjectVDP::Matematikk og Naturvitenskap: 400::Basale biofag: 470::Genetikk og genomikk: 474en_US
dc.titleSmall data: practical modeling issues in human-model -omic dataen_US
dc.typeDoctoral thesisen_US
dc.typeDoktorgradsavhandlingen_US


Except where otherwise noted, this item's license is described as Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)