Clustering genomic words in human DNA using peaks and trends of distributions

Bibliographic Details
Main Author:	Tavares, Ana Helena (author)
Other Authors:	Raymaekers, Jakob (author), Rousseeuw, Peter J. (author), Brito, Paula (author), Afreixo, Vera (author)
Format:	article
Language:	eng
Published:	2021
Subjects:	Classification; Pattern recognition; Robustness; Word distances
Online Access:	http://hdl.handle.net/10773/30267
Country:	Portugal
Oai:	oai:ria.ua.pt:10773/30267

Description
Summary:	In this work we seek clusters of genomic words in human DNA by studying their inter-word lag distributions. Due to the particularly spiked nature of these histograms, a clustering procedure is proposed that first decomposes each distribution into a baseline and a peak distribution. An outlier-robust fitting method is used to estimate the baseline distribution (the ‘trend’), and a sparse vector of detrended data captures the peak structure. A simulation study demonstrates the effectiveness of the clustering procedure in grouping distributions with similar peak behavior and/or baseline features. The procedure is applied to investigate similarities between the distribution patterns of genomic words of lengths 3 and 5 in the human genome. These experiments demonstrate the potential of the new method for identifying words with similar distance patterns.
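The decomposition step described in the summary can be illustrated with a minimal sketch. It is not the authors' estimator: the running-median baseline, the MAD-based threshold, the function name `decompose`, and its parameters `window` and `threshold` are all assumptions chosen only to show the idea of splitting a spiked lag histogram into a robust trend plus a sparse peak vector.

```python
import numpy as np

def decompose(hist, window=11, threshold=3.0):
    """Split a lag histogram into a baseline and a sparse peak vector.

    Hypothetical helper: a running median stands in for the robust fit,
    which the summary does not specify.
    """
    hist = np.asarray(hist, dtype=float)
    half = window // 2
    # Repeat the edge values so the window is defined at both ends.
    padded = np.pad(hist, half, mode="edge")
    # Running median: a simple trend estimate that isolated spikes barely move.
    baseline = np.array(
        [np.median(padded[i:i + window]) for i in range(hist.size)]
    )
    resid = hist - baseline
    # Robust residual scale from the median absolute deviation (MAD).
    mad = np.median(np.abs(resid - np.median(resid)))
    # Assumption: fall back to unit scale when the MAD is exactly zero.
    scale = 1.4826 * mad if mad > 0 else 1.0
    # Keep only large positive residuals; all other entries are set to zero.
    peaks = np.where(resid > threshold * scale, resid, 0.0)
    return baseline, peaks

# Synthetic histogram: exponentially decaying trend with two added spikes.
lags = np.arange(1, 101)
hist = 100.0 * np.exp(-lags / 30.0)
hist[19] += 50.0
hist[59] += 40.0

baseline, peaks = decompose(hist)
peak_positions = np.nonzero(peaks)[0]
print(peak_positions)
```

Under these assumptions, each word would be represented by its baseline and its sparse peak vector, and words could then be compared on either component or on both.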
