## Strings Tutorial 1: Hello, Strings!

This tutorial introduces the basics of performing inference over string variables in Infer.NET. It shows how to define a generative process that includes strings and how to reason about variables involved in that process.

You can run the code in this tutorial either using the
Examples Browser or by opening the Tutorials solution in Visual Studio
and executing **HelloStrings.cs**.

### A generative model of text

The probabilistic models we've considered so far had variables of simple value types only, i.e. integers, floating-point numbers and booleans. In principle, however, there is no reason to restrict model variables to these domains: as long as the inference engine is able to handle variables of a certain type, such variables should be allowed in the modelling code. The Infer.NET inference engine, in particular, also supports variables of collection types, such as strings or lists. In this tutorial we will focus on string variables.

One way to define a string random variable is to specify a prior distribution over it:

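```
// Both strings get a uniform prior over all possible strings.
Variable<string> str1 = Variable.StringUniform().Named("str1");
Variable<string> str2 = Variable.StringUniform().Named("str2");
```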

See also: Creating variables

**Variable.StringUniform** creates a string random variable from a uniform distribution over all possible strings. So, both **str1** and **str2** can potentially take any value, and all values are equally likely. Note that since the number of all possible strings is infinite, this distribution is improper. In many models, however, improper priors are not a problem, since the posterior distribution over variables with an improper prior can still be proper.

Another way to obtain a string random variable is to invoke an operation that produces a string, such as, for instance, concatenation:

```
Variable<string> text = (str1 + " " + str2).Named("text");
```

So, **text** is defined to be a concatenation of **str1**, a string containing a single space, and **str2**.

The model we've just defined can be thought of as the following generative process: take any two strings and concatenate them, putting a space in between. Another possible interpretation is a parsing process that accepts only strings that contain at least one space. If you are familiar with regular expressions, such a process can be represented by an expression of the form ".* .*".
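The parsing interpretation can be checked with ordinary .NET regular expressions, without involving Infer.NET at all. The sketch below is only an illustration: the class `TextModelSupport` and the method `CouldBeGenerated` are hypothetical names introduced here, and the pattern is an anchored form of ".* .*" so that the whole string has to match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextModelSupport
{
    // Anchored form of ".* .*": <any string><one space><any string>.
    // Singleline lets '.' match newlines too, since str1 and str2 may be any string.
    private static readonly Regex Pattern =
        new Regex(@"^.* .*$", RegexOptions.Singleline);

    // True if the text could have been produced as str1 + " " + str2.
    // (Hypothetical helper, not part of Infer.NET.)
    public static bool CouldBeGenerated(string text) => Pattern.IsMatch(text);

    public static void Main()
    {
        var candidates = new[] { "Hello uncertain world", "Hello world", " ", "Hello" };
        foreach (var candidate in candidates)
        {
            Console.WriteLine($"'{candidate}' -> {CouldBeGenerated(candidate)}");
        }
    }
}
```

The first three candidates are accepted, including the single space, which corresponds to both **str1** and **str2** being empty. "Hello" is rejected because it contains no space.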

### Uncertain segmentation

Now that we have a model, we are ready to observe some data and perform inference on its variables. In particular, let us observe the value of **text** and try to figure out what the values of **str1** and **str2** are. To observe the value of **text**, we, as before, need to set its **ObservedValue** property:

```
text.ObservedValue = "Hello uncertain world";
```

Note that it's not clear from the value of **text** what **str1** and **str2** are: the space between **str1** and **str2** can correspond to either the first or the second space in the observed string. The segmentation of **text** into **str1** and **str2** is, therefore, subject to uncertainty. This is precisely the conclusion that the Infer.NET inference engine will reach if we run it on this model, as we will see soon.
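To make the ambiguity concrete, the candidate segmentations can be listed by brute force in plain C#. This is a minimal sketch that does not use Infer.NET and is not how the engine performs inference; `SegmentationSketch` and `Segmentations` are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class SegmentationSketch
{
    // Returns every (str1, str2) pair such that str1 + " " + str2 == text.
    // Each space in the text is one candidate split point.
    public static List<(string Str1, string Str2)> Segmentations(string text)
    {
        var result = new List<(string Str1, string Str2)>();
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == ' ')
            {
                result.Add((text.Substring(0, i), text.Substring(i + 1)));
            }
        }
        return result;
    }

    public static void Main()
    {
        foreach (var (str1, str2) in Segmentations("Hello uncertain world"))
        {
            Console.WriteLine($"str1 = '{str1}', str2 = '{str2}'");
        }
        // Prints:
        // str1 = 'Hello', str2 = 'uncertain world'
        // str1 = 'Hello uncertain', str2 = 'world'
    }
}
```

The observed text contains two spaces, so there are exactly two ways to split it, and nothing in the model favours one over the other.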

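```
var engine = new InferenceEngine();
// String inference is experimental, so lower the recommended quality band.
engine.Compiler.RecommendedQuality = QualityBand.Experimental;
Console.WriteLine("str1: {0}", engine.Infer(str1));
```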

See also: Quality bands

Two important things should be noted here. First, inference over strings is currently supported only with the expectation propagation algorithm. Second, it is currently considered an experimental feature, so the recommended quality level must be amended to prevent the model compiler from emitting warnings about the use of experimental components. Running this code will produce the following output:

```
str1: Hello[ uncertain]
```

Currently, when a distribution over strings is printed to the console (or **ToString** is called on it), the result is a compact representation of the set of all strings that are possible under that distribution, also known as the support of the distribution. In this representation, a part enclosed in square brackets is optional. So, if we print the posterior distributions over **str1** and **str2**, we can immediately see that, given the observed text, the value of **str1** used to produce it could have been "Hello" or "Hello uncertain", while **str2** could have been "world" or "uncertain world", as discussed above.

### StringAutomaton

Inspecting the support of the posterior is not, however, usually sufficient. For a more detailed analysis, such as seeing how likely a particular string is under the distribution, one needs to obtain a distribution object. For string random variables the corresponding object is always of type **StringDistribution**. The **StringDistribution** class is an implementation of a distribution over strings that internally represents uncertainty via a weighted finite state automaton. As with other distribution classes, it has methods for sampling and for retrieving the probability of a given string. Thus, we can write the following code:

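```
var distOfStr1 = engine.Infer<StringDistribution>(str1);
// GetProb returns the probability of a specific string under the distribution.
Console.WriteLine("P(str1 = '{0}') = {1}", "Hello", distOfStr1.GetProb("Hello"));
```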

And, as expected, it will produce this output:

```
P(str1 = 'Hello') = 0.5
```

We are now ready to move to the next tutorial, where we'll see some other supported operations over strings and use them to define a more sophisticated model.