## Strings Tutorial 3: Motif Finder

This page describes an experimental feature that is likely to change in future releases.
In this tutorial we'll implement a simple model for finding motifs in nucleotide sequences, an important problem in bioinformatics. The sample code, along with a Visual Studio project, can be found in the Samples\C#\MotifFinder folder.

## What is a motif finder

In genetics, a sequence motif is a widespread pattern in a set of nucleotide or amino-acid sequences that is likely to have some biological significance. The problem of motif finding is to discover such a common pattern in a given set of sequences that are usually known to share some common property. It can then be conjectured that the shared pattern contributes to implementing that property. The pattern can differ slightly from sequence to sequence due to variance introduced by biological replication mechanisms, but we nevertheless expect to see some common structure. For instance, for the set of strings
one can argue that the common pattern is TAT*G. Note that the 4th character of the pattern varies between strings, so it may seem at first that the length of the motif is 3. However, since the 5th character is always the same, the length of the motif is likely to be 5, with the 4th character being a point of variability.

In this tutorial, our job will be to define a generative model of nucleotide sequences (essentially, strings consisting of the characters 'A', 'C', 'T' and 'G') containing a motif, and then perform backward inference to determine the motif from the sequences.

## Basic model

The generative model we're going to use in this tutorial is as follows. For each sequence we'll first determine the position of the motif in it, assuming it's uniformly distributed across the possible positions. We will then sample the nucleotides corresponding to the motif from the motif model, and the rest of the sequence from the background model. For clarity, we'll also make a number of simplifying assumptions: all sequences will be of the same length, and the motif length will be assumed to be known in advance.
We can define the model of a motif by a position frequency matrix. For each position in the motif, this matrix stores the probability of encountering a particular nucleotide at that position, so it serves as a generative model for the motif. The matrix can also be thought of as defining the structure of the pattern we're looking for, so inferring it will give us the distribution over the patterns that are possible in light of the data.

Since we're going to infer the position frequency matrix, it is a random variable and therefore needs a prior. Every row of the matrix is a probability vector defining a discrete distribution over the nucleotides, so we can use a Dirichlet distribution as the prior of a row. The Dirichlet distribution is a distribution over probability vectors. When used as a prior, it effectively specifies how many times a certain outcome (in our case, a particular nucleotide at a given motif position) has been observed "in the past", before we got any data about the probability vector. These numbers, called pseudo-counts, can take any positive real value, not just integer values.
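To make the role of pseudo-counts more concrete, here is a minimal sketch in plain Python. It is an illustration only and does not use the Infer.NET API. The function names and the pseudo-count values are hypothetical and were chosen for this example. It samples a position frequency matrix from a Dirichlet prior, then shows how observed counts are added to the pseudo-counts to give the posterior mean of one row.

```python
import random

NUCLEOTIDES = "ACGT"  # column order used for every row below


def sample_dirichlet(pseudo_counts, rng):
    """Draw one probability vector from Dirichlet(pseudo_counts)."""
    # Independent Gamma(alpha, 1) draws, once normalised, follow a Dirichlet.
    draws = [rng.gammavariate(alpha, 1.0) for alpha in pseudo_counts]
    total = sum(draws)
    return [d / total for d in draws]


def sample_position_frequency_matrix(motif_length, pseudo_counts, rng):
    """One row per motif position; each row is a distribution over NUCLEOTIDES."""
    return [sample_dirichlet(pseudo_counts, rng) for _ in range(motif_length)]


def posterior_mean(pseudo_counts, observed_counts):
    """Mean of the Dirichlet posterior: observed counts are added to pseudo-counts."""
    total = sum(pseudo_counts) + sum(observed_counts)
    return [(a + n) / total for a, n in zip(pseudo_counts, observed_counts)]


rng = random.Random(0)
pseudo_counts = [0.5, 0.5, 0.5, 0.5]  # hypothetical values; non-integers are allowed
pfm = sample_position_frequency_matrix(5, pseudo_counts, rng)
for row in pfm:
    print(" ".join(f"{n}={p:.2f}" for n, p in zip(NUCLEOTIDES, row)))

# Observing 'A' twice and 'T' eight times at one position pulls that row towards 'T'.
print(posterior_mean(pseudo_counts, [2, 0, 0, 8]))
```

Small pseudo-counts such as these express a weak prior that is quickly dominated by the data, whereas larger values would keep the rows closer to the prior mean.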
Our next step is to define variables for the nucleotide sequences and for the motif position in every sequence.
Now we have all we need to define the generative process for a sequence. First, we will create a string variable containing the motif string for a particular sequence. We can do it by first sampling a character array using the position frequency matrix, and then converting this array to a string:
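The generative process for a single sequence can be sketched in plain Python as follows. This is not the tutorial's Infer.NET code, only a forward sampler under stated assumptions: the function name, the example matrix and the uniform background distribution are all hypothetical.

```python
import random

NUCLEOTIDES = "ACGT"  # column order of every probability vector below


def sample_sequence(pfm, sequence_length, background_probs, rng):
    """Sample one sequence and return it with the motif position that was used."""
    motif_length = len(pfm)
    # The motif may start anywhere that leaves room for all of its characters.
    position = rng.randrange(sequence_length - motif_length + 1)
    # One character per motif position, drawn from the matching matrix row,
    # then the character array is joined into a string.
    motif_chars = [rng.choices(NUCLEOTIDES, weights=row)[0] for row in pfm]
    motif = "".join(motif_chars)
    # Background characters fill the space before and after the motif.
    left = "".join(rng.choices(NUCLEOTIDES, weights=background_probs, k=position))
    right_length = sequence_length - motif_length - position
    right = "".join(rng.choices(NUCLEOTIDES, weights=background_probs, k=right_length))
    return left + motif + right, position


# Hypothetical matrix for a 5-character motif resembling TAT*G (columns: A, C, G, T).
pfm = [
    [0.05, 0.05, 0.05, 0.85],
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.85],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.05, 0.85, 0.05],
]
background = [0.25, 0.25, 0.25, 0.25]  # assumed uniform background model

rng = random.Random(1)
for _ in range(3):
    sequence, position = sample_sequence(pfm, 20, background, rng)
    print(position, sequence)
```

With a sequence length of 20 and a motif length of 5 there are 16 possible motif positions, which is why the position is drawn from `sequence_length - motif_length + 1` values.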
## Results

One way to test a model is to apply it to data that has been generated using that model. It is in fact a very important test, and one of the first a model developer should perform. Inference failure when the observed data is known to follow the model may indicate a bad schedule, a bug in the model definition, or even a bug in the implementation of the inference engine itself. Another possibility is that the model in question has some pathology caused by, say, non-identifiable variables.

Luckily, our motif finder doesn't have any such issues. The inference engine is able to reconstruct the position frequency matrix to a reasonable precision, and the inferred motif positions are mostly correct. The only error is caused by the background being more similar to the expected pattern than the motif itself.

## Motif presence and absence
String random variables, like all other Infer.NET modeling elements, can be used with control flow statements to define mixtures. To try this, let's additionally assume that we expect some known percentage of sequences not to contain the pattern we're looking for: maybe the function shared by the sequences isn't always implemented in the same way, or some sequences got into the set by mistake. We can easily change the model to handle that by defining another array of random variables, which will store whether the motif is present in a particular sequence. Now we can wrap the previous definition of a sequence in a conditional block and add an alternative definition for the case when a sequence contains no motif.
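The resulting mixture can be sketched as a forward sampler in plain Python. As before, this is an illustration and not the Infer.NET model definition. The function names, the `motif_presence_prob` parameter and the uniform background are assumptions made for this example.

```python
import random

NUCLEOTIDES = "ACGT"  # column order of every probability vector below


def sample_background(length, background_probs, rng):
    """Sample a string of the given length from the background model."""
    return "".join(rng.choices(NUCLEOTIDES, weights=background_probs, k=length))


def sample_sequence_with_optional_motif(pfm, sequence_length, background_probs,
                                        motif_presence_prob, rng):
    """Return (sequence, motif_present, position); position is None without a motif."""
    motif_present = rng.random() < motif_presence_prob
    if not motif_present:
        # Alternative branch: the whole sequence comes from the background model.
        return sample_background(sequence_length, background_probs, rng), False, None
    # Original branch: background on the left, then the motif, then background again.
    motif_length = len(pfm)
    position = rng.randrange(sequence_length - motif_length + 1)
    motif = "".join(rng.choices(NUCLEOTIDES, weights=row)[0] for row in pfm)
    left = sample_background(position, background_probs, rng)
    right = sample_background(sequence_length - motif_length - position,
                              background_probs, rng)
    return left + motif + right, True, position


# Hypothetical 3-character motif matrix (columns: A, C, G, T) and uniform background.
pfm = [
    [0.05, 0.05, 0.05, 0.85],
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.85, 0.05],
]
background = [0.25, 0.25, 0.25, 0.25]

rng = random.Random(2)
for _ in range(5):
    # 80% of sequences contain the motif; the rest are pure background.
    print(sample_sequence_with_optional_motif(pfm, 15, background, 0.8, rng))
```

Both branches produce strings of the same length, so an observed sequence alone does not reveal which branch generated it; that is what the inference has to work out.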
When we run inference on data sampled from the model itself, for instance with 20% of the sequences sampled from the background model, the number of errors increases slightly, because motif strings that happen not to follow the expected pattern precisely can now also be interpreted as background. Nevertheless, the position frequency matrix is still inferred to a reasonable precision.

In this tutorial we had a brief introduction to another area where probabilistic models involving strings can be useful: bioinformatics. We saw how to define a complex model combining strings, arrays, integer arithmetic and control flow statements.