The Transformation Distance
The PhD thesis : "Concepts and algorithms for the comparison of genetic sequences: an information theory approach" (in french)
The Context
Evolution operates molecular alterations of two types: point mutations, namely insertion, deletion or substitution of one residue, and segment-based modifications: duplication, inversion, insertion, etc of a segment of the sequence. Genome level mutations operate also on large pieces of DNA and can thus be included in segment-based modifications. To our knowledge, no measure attempts to quantify dissimilarity by assessing segment-based differences, and by describing the differences between two sequences with an edit-script containing such segment-based operations. Consequently, sequence comparison is usually performed on similar parts of the sequences, like structurally or functionally related domains of proteins. Even if they correspond to complete biological entities like whole gene or protein, entire sequences are not compared, or only in case of high similarity. With such restrictions, one misses some evolutionary meaningful information written in the molecules.
The Method
We propose a new distance which evaluates segment-based dissimilarity between two sequences. This measure relates to the process of constructing a target sequence T using segments from a source sequence S. We consider a set of two segment based operations: the copy of a segment of S into T and the insertion of a new segment. Constructing T is done by applying a list of such operations, which we name a script. A characteristic of our definition is the way operations are weighted. As a target T can always be obtained from a single insertion of the segment T itself, it is useless to weight a script by its number of operations. From the biological point of view, it exists no satisfactory probabilistic model which applies to those operations on a sequence. A natural, theoretically well-founded solution comes from the Algorithmic Information Theory (AIT), which explains how fair a priori weights can be assigned in the absence of specific knowledge. Following the Minimum Description Length Principle (MDLP), we weight an operation by its description length and a script by the sum of its operations weights. The transformation distance is defined as the minimal weight among all possible scripts. In the AIT, the description length is a measure of the information content of a sequence of characters. The set of operations (which can be chosen different from the one we use here), defines a model of sequence construction. In this model, the transformation distance may be interpreted as the most efficient way to design T using information from S. It approximates the relative information content of T knowing S. For a general model, the AIT defines similarly an "information distance". The information content as a criterion for sequence analysis has already been used in many different contexts: formal machine learning, economy and complexity theory. In biology, several researches focus on compression of the genetic sequences in order to discover significant structures like: direct repeats and tandem repeats. Although suggested by Grumbach and Tahi, this criterion nor the information distance has never been used to perform direct sequence comparisons ; this is what we propose with the transformation distance.
Publications
Software
The TD program written in C is freely available and allows the computation of the transformation distance for a set of DNA sequences. Download and explanations about installation and utilization can be found on this page.
2/12/2006 Jean-Stéphane Varré