The Context
Evolution operates molecular alterations of two types: point
mutations, namely insertion, deletion or substitution of one
residue, and segment-based modifications: duplication,
inversion, insertion, etc of a segment of the sequence. Genome
level mutations operate also on large pieces of DNA and can thus
be included in segment-based modifications. To our knowledge,
no measure attempts to quantify dissimilarity by assessing
segment-based differences, and by describing the differences
between two sequences with an edit-script containing such
segment-based operations. Consequently, sequence comparison is
usually performed on similar parts of the sequences, like
structurally or functionally related domains of proteins. Even
if they correspond to complete biological entities like whole
gene or protein, entire sequences are not compared, or only in
case of high similarity. With such restrictions, one misses some
evolutionary meaningful information written in the molecules.
The Method
We propose a new distance which evaluates segment-based
dissimilarity between two sequences. This measure relates to the
process of constructing a target sequence T using segments from
a source sequence S. We consider a set of two segment based
operations: the copy of a segment of S into T and the insertion
of a new segment. Constructing T is done by applying a list of
such operations, which we name a script.
A characteristic of our definition is the way operations are
weighted. As a target T can always be obtained from a single
insertion of the segment T itself, it is useless to weight a
script by its number of operations. From the biological point of
view, it exists no satisfactory probabilistic model which applies
to those operations on a sequence. A natural, theoretically
well-founded solution comes from the Algorithmic Information
Theory (AIT), which explains how fair a priori weights can be
assigned in the absence of specific knowledge. Following the
Minimum Description Length Principle (MDLP), we weight an
operation by its description length and a script by the sum of its
operations weights. The transformation distance is defined as the
minimal weight among all possible scripts.
In the AIT, the description length is a measure of the
information content of a sequence of characters. The set of
operations (which can be chosen different from the one we use
here), defines a model of sequence construction. In this model,
the transformation distance may be interpreted as the most
efficient way to design T using information from S. It
approximates the relative information content of T knowing
S. For a general model, the AIT defines similarly an
"information distance". The information content as a criterion
for sequence analysis has already been used in many different
contexts: formal machine learning, economy and complexity
theory. In biology, several researches focus on compression of
the genetic sequences in order to discover significant
structures like: direct repeats and tandem repeats. Although
suggested by Grumbach and Tahi, this criterion nor the
information distance has never been used to perform direct
sequence comparisons ; this is what we propose with the
transformation distance.
Publications
-
JS Varré, JP Delahaye, E Rivals, M Dauchet, Distance par copie de
facteurs, Journées Informatique et Biologie, Ecole Polytechnique,
Palaiseau - France, 1996
-
JS Varré, Distances
basées sur des notions de compression de données et application
à la phylogénie, Mémoire de DEA, Laboratoire d'Informatique
Fondamentale de Lille, 1996
-
JS Varré, JP Delahaye, E Rivals, The Transformation Distance, Mathematical
Analysis of Biological Sequences, Rouen - France, 1997
-
JS Varré, JP Delahaye, E Rivals, The
Transformation Distance, Genome Informatics Worshop,
Tokyo - Japan, 1997
-
JS Varré, JP Delahaye, E Rivals, La distance des
transformations et applications à la phylogénie,
Séminaires Algorithmique et Biologie, Phylogénie, Institut
Pasteur, Paris - France, 1998
-
JS Varré, JP Delahaye, E Rivals, The Transformation Distance
: A Dissimilarity Measure Based On Movements Of Segments, German
Conference on Bioinformatics, Koel - Germany, 1998
-
JS Varré, JP Delahaye, E Rivals, Transformation
Distances: a family of dissimilarity measures based on
movements of segments, Bioinformatics, vol. 15, no. 3, pp
194-202, 1999.
Software
The TD program written in C is freely available and allows the
computation of the transformation distance for a set of DNA
sequences. Download and explanations about installation and
utilization can be found on
this page.