dc.contributor.author: Thorvaldsen, Steinar
dc.contributor.author: Hössjer, Ola
dc.date.accessioned: 2024-09-24T07:23:59Z
dc.date.available: 2024-09-24T07:23:59Z
dc.date.issued: 2024-06-12
dc.description.abstract: A large hindrance to analyzing information in genetic or protein sequence data has been a lack of a mathematical framework for doing so. In this paper, we present a multinomial probability space X as a general foundation for multicategory discrete data, where categories refer to variants/alleles of biosequences. The external information that is infused in order to generate a sample of such data is quantified as a distance on X between the prior distribution of data and the empirical distribution of the sample. A number of distances on X are treated. All of them have an information theoretic interpretation, reflecting the information that the sampling mechanism provides about which variants have a selective advantage and therefore appear more frequently compared to prior expectations. This includes distances on X based on mutual information, conditional mutual information, active information, and functional information. The functional information distance is singled out as particularly useful. It is simple and has intuitive interpretations in terms of 1) a rejection sampling mechanism, where functional entities are retained, whereas non-functional categories are censored, and 2) evolutionary waiting times. The functional information is also a quasi-metric on X, with information being measured in an asymmetric, mountainous landscape. This quasi-metric property is also retained for a robustified version of the functional information distance that allows for mutations in the sampling mechanism. The functional information quasi-metric has been applied with success to bioinformatics data sets, for proteins and sequence alignment of protein families. [en_US]
dc.identifier.citation: Thorvaldsen, Hössjer. Use of directed quasi-metric distances for quantifying the information of gene families. Biosystems (Amsterdam. Print). 2024 [en_US]
dc.identifier.cristinID: FRIDAID 2279886
dc.identifier.doi: 10.1016/j.biosystems.2024.105256
dc.identifier.issn: 0303-2647
dc.identifier.issn: 1872-8324
dc.identifier.uri: https://hdl.handle.net/10037/34834
dc.language.iso: eng [en_US]
dc.publisher: Elsevier [en_US]
dc.relation.journal: Biosystems (Amsterdam. Print)
dc.rights.accessRights: openAccess [en_US]
dc.rights.holder: Copyright 2024 The Author(s) [en_US]
dc.rights.uri: https://creativecommons.org/licenses/by/4.0 [en_US]
dc.rights: Attribution 4.0 International (CC BY 4.0) [en_US]
dc.title: Use of directed quasi-metric distances for quantifying the information of gene families [en_US]
dc.type.version: publishedVersion [en_US]
dc.type: Journal article [en_US]
dc.type: Tidsskriftartikkel [en_US]
dc.type: Peer reviewed [en_US]



Except where otherwise noted, this item's license is described as Attribution 4.0 International (CC BY 4.0)