Clustering genomic words in human DNA using peaks and trends of distributions
Main author: | |
---|---|
Other authors: | |
Format: | article |
Language: | eng |
Published in: | 2021 |
Subjects: | |
Full text: | http://hdl.handle.net/10773/30267 |
Country: | Portugal |
OAI: | oai:ria.ua.pt:10773/30267 |
Abstract: | In this work we seek clusters of genomic words in human DNA by studying their inter-word lag distributions. Due to the particularly spiked nature of these histograms, a clustering procedure is proposed that first decomposes each distribution into a baseline and a peak distribution. An outlier-robust fitting method is used to estimate the baseline distribution (the ‘trend’), and a sparse vector of detrended data captures the peak structure. A simulation study demonstrates the effectiveness of the clustering procedure in grouping distributions with similar peak behavior and/or baseline features. The procedure is applied to investigate similarities between the distribution patterns of genomic words of lengths 3 and 5 in the human genome. These experiments demonstrate the potential of the new method for identifying words with similar distance patterns. |
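The decomposition described in the abstract, an inter-word lag histogram split into a smooth baseline and a sparse peak vector, can be illustrated with a minimal sketch. This is not the authors' procedure: the record does not specify their outlier-robust fitting method, so a running median stands in for the baseline fit, and the MAD-based threshold, the function names `lag_histogram` and `decompose`, and the parameters `half_window` and `k` are all assumptions made for illustration. The clustering step itself is omitted.

```python
from statistics import median


def lag_histogram(seq, word, max_lag):
    """Relative frequencies of distances between consecutive occurrences of `word`."""
    positions = []
    start = seq.find(word)
    while start != -1:
        positions.append(start)
        start = seq.find(word, start + 1)  # step by 1 so overlapping hits count
    counts = [0] * (max_lag + 1)
    for a, b in zip(positions, positions[1:]):
        lag = b - a
        if lag <= max_lag:
            counts[lag] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * (max_lag + 1)


def decompose(hist, half_window=2, k=3.0):
    """Split a histogram into a baseline and a sparse peak vector.

    Assumption: a running median replaces the (unspecified) robust fit,
    and residuals above k scaled MADs are kept as peaks.
    """
    n = len(hist)
    baseline = [
        median(hist[max(0, i - half_window): i + half_window + 1])
        for i in range(n)
    ]
    resid = [h - b for h, b in zip(hist, baseline)]
    center = median(resid)
    mad = median(abs(r - center) for r in resid)
    threshold = k * 1.4826 * mad  # 1.4826 makes the MAD consistent for Gaussian noise
    peaks = [r if r > threshold else 0.0 for r in resid]
    return baseline, peaks


# Toy usage on a short made-up sequence (not real genomic data).
hist = lag_histogram("ACGTACGTAACG", "ACG", max_lag=6)
baseline, peaks = decompose(hist)
```

Under these assumptions, two words could then be compared by a distance on their baselines, on their peak vectors, or on both, which is the kind of input a clustering step would take.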