Clustering genomic signatures

J. Gustafsson and E. Norlander

Master's Thesis, Chalmers University of Technology, Jun 2018.

Pathogens such as bacteria and viruses are leading causes of disease worldwide, which makes it essential to identify them in DNA samples. Instead of analysing raw DNA sequences, mathematical models based on Variable Length Markov Chains (VLMCs), known as genomic signatures, make it possible to classify DNA samples faster than with traditional alignment-based methods. To analyse a set of genomic signatures, we use clustering, an unsupervised machine-learning method. Clustering VLMCs requires a similarity measure (distance function) that is both accurate and fast. To evaluate distance functions and clusters, we define metrics based primarily on the taxonomic ranks of the underlying organisms. For the distance functions, we primarily analyse whether VLMCs within the same taxonomic rank are closest to each other. For the cluster analysis, we use the silhouette metric to determine how well separated the clusters are, and we define the average percentages, sensitivity, and specificity of the captured taxonomic ranks. We present a new distance function for VLMCs, called Frobenius-intersection, which correlates closely with the well-known Kullback-Leibler distance function while being several orders of magnitude faster. We use average-link clustering together with the Frobenius-intersection distance to cluster data sets of known viruses and bacteria with relatively short DNA sequences. The clusters of VLMCs correspond accurately to the Baltimore types of the viruses as well as to the taxonomic families of both the viruses and the bacteria. However, most of the classifications of viruses are also subdivided into multiple clusters. Moreover, when the sets of bacteria and viruses are combined, the clusters start to mix viruses and bacteria before all of the taxonomic families have been found. The clustering of the genomic signatures is accurate with respect to, for instance, taxonomic ordering, and it can therefore help in identifying unclassified pathogens. Future research may reveal other causes of similarity between the genomic signatures.
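
As a rough illustration of the average-link clustering and silhouette metric mentioned in the abstract, the following minimal sketch clusters items from a precomputed distance matrix. It is not the thesis implementation: the function names `average_link` and `silhouette` are hypothetical, and the matrix `D` holds made-up placeholder values rather than Frobenius-intersection distances between real VLMCs.

```python
from itertools import combinations

def average_link(dist, k):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest average pairwise distance until only k clusters remain."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters

def silhouette(dist, clusters):
    """Mean silhouette score; assumes at least two clusters."""
    scores = []
    for c in clusters:
        for i in c:
            if len(c) == 1:
                scores.append(0.0)  # common convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster
            a = sum(dist[i][j] for j in c if j != i) / (len(c) - 1)
            # b: mean distance to the nearest other cluster
            b = min(sum(dist[i][j] for j in o) / len(o)
                    for o in clusters if o is not c)
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Placeholder distances (assumption): items 0-2 are close, items 3-4 are close.
D = [
    [0.00, 0.10, 0.20, 0.90, 0.80],
    [0.10, 0.00, 0.15, 0.85, 0.90],
    [0.20, 0.15, 0.00, 0.80, 0.95],
    [0.90, 0.85, 0.80, 0.00, 0.10],
    [0.80, 0.90, 0.95, 0.10, 0.00],
]

clusters = average_link(D, 2)
score = silhouette(D, clusters)
print(clusters)  # [[0, 1, 2], [3, 4]]
print(score)
```

A silhouette score near 1 indicates well-separated clusters, while a score near 0 indicates overlapping ones. In practice the distance matrix would be filled by evaluating the chosen distance function on every pair of genomic signatures.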

A reprint is available as PDF.

Further publications by Joel Gustafsson, Erik Norlander.