Strings Tutorial 2: StringFormat Operation
This page describes an experimental feature that is likely to change in future releases.
In this tutorial we'll take a look at a powerful string operation supported in Infer.NET, StringFormat, and the sorts of models one can define with it.
You can run the code in this tutorial either using the Examples Browser or by opening the Tutorials solution in Visual Studio and executing StringFormat.cs.
Inferring an argument
The StringFormat operation supported in Infer.NET is very similar to the String.Format method in .NET in both signature and semantics. There are a few subtle differences, which we'll discuss in the next section. One way of using StringFormat in a probabilistic model is to define a piece of text as the result of a StringFormat call, and then work out what the arguments were, given the text. We will demonstrate this by inferring a person's name from a greeting. For now, let's assume that the form of the greeting is fixed. It is also natural to assume that a person's name starts with a capital letter, followed by a number of lowercase letters. That allows us to define the following (rather simple) model of some greeting text:
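A minimal sketch of such a model is shown below. It assumes the open-source Infer.NET package with its Microsoft.ML.Probabilistic.Models namespace, C# top-level statements, and an illustrative greeting form "My name is {0}."; none of these are fixed by the text above.

```csharp
using Microsoft.ML.Probabilistic.Models;

// Prior over the name: an uppercase letter followed by lowercase letters.
Variable<string> name = Variable.StringCapitalized();

// The greeting has a fixed form (an assumed example); {0} is replaced by the name.
Variable<string> text = Variable.StringFormat("My name is {0}.", name);
```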
The Variable.StringCapitalized method creates a string random variable from a uniform distribution over all strings that start with an uppercase letter, followed by one or more lowercase letters. As was the case with Variable.StringUniform, this is an improper distribution. Now we can run inference on a piece of text and infer the name:
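Observing the text and querying the name might look like the sketch below. The namespace, the greeting form, and the observed text "My name is John." are assumptions made for illustration, and the program uses C# top-level statements.

```csharp
using System;
using Microsoft.ML.Probabilistic.Models;

// Model: a capitalized name substituted into a fixed greeting (assumed form).
Variable<string> name = Variable.StringCapitalized();
Variable<string> text = Variable.StringFormat("My name is {0}.", name);

// Condition the model on an observed greeting (hypothetical example text).
text.ObservedValue = "My name is John.";

// Infer the posterior distribution over the name and print it.
var engine = new InferenceEngine();
Console.WriteLine("name is '{0}'", engine.Infer(name));
```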
This code will output
Inferring the template
So far we've fixed the form of the greeting. We can instead try to learn it by simultaneously working out what the template and the name are from the text. That will require us to specify a prior distribution over the template. It would be reasonable to say that the template should have a name placeholder somewhere, surrounded by non-word characters like a space, a full stop, or a comma. That leads us to the following model:
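One way to write this prior is sketched below, assuming the Microsoft.ML.Probabilistic.Models namespace and C# top-level statements. The template is built by concatenation: an arbitrary string, a non-word character, the placeholder {0}, another non-word character, and another arbitrary string.

```csharp
using Microsoft.ML.Probabilistic.Models;

// Prior over the name, as before.
Variable<string> name = Variable.StringCapitalized();

// Template prior: any string, a non-word character, the name placeholder,
// another non-word character, then any string.
Variable<string> template =
    Variable.StringUniform() + Variable.CharNonWord() + "{0}"
    + Variable.CharNonWord() + Variable.StringUniform();

// The text is the template with the name substituted for {0}.
Variable<string> text = Variable.StringFormat(template, name);
```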
Variable.CharNonWord creates a character random variable from a uniform distribution over all characters that cannot be a part of a word. As in C#, character variables can be concatenated with strings to produce other strings. If we now observe the text to be "Hello, mate! I'm Dave." and run inference, Infer.NET will work out that:
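The observation and the two queries described above could be sketched as follows. The namespace and the use of top-level statements are assumptions; the observed text is the one quoted in this section.

```csharp
using System;
using Microsoft.ML.Probabilistic.Models;

// Model: both the name and the template are unknown.
Variable<string> name = Variable.StringCapitalized();
Variable<string> template =
    Variable.StringUniform() + Variable.CharNonWord() + "{0}"
    + Variable.CharNonWord() + Variable.StringUniform();
Variable<string> text = Variable.StringFormat(template, name);

// Observe the greeting quoted in the tutorial text.
text.ObservedValue = "Hello, mate! I'm Dave.";

// Query the posteriors over the name and the template.
var engine = new InferenceEngine();
Console.WriteLine("name is '{0}'", engine.Infer(name));
Console.WriteLine("template is '{0}'", engine.Infer(template));
```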
"Hello" is not considered a possible name because "H" is the first character in the text and thus there is no non-word character before it. "I'm" also cannot be a name under our model because Variable.StringCapitalized only allows for letters. But in order to further understand the inference results, it's now worth stating how the Infer.NET StringFormat is different from the .NET String.Format. In the Infer.NET StringFormat
The first bullet is quite relevant to our example. If StringFormat allowed for the placeholder not to be present in the format string at all, it would have led to the possibility that the template is "Hello, mate! I'm Dave." and the text provides no information about the name at all, which is undesirable. Requiring each placeholder to be present in the format string exactly once helps to reduce the ambiguity of backward reasoning, and in our case allows us to unambiguously determine the template. Nevertheless, versions of StringFormat that relax this limitation may be added to Infer.NET in future releases.
Even with the placeholder presence restriction in place, the prior we've used for the template wouldn't always lead to unambiguous inference. For instance, if the observed text is "Hi! My name is John.", the results of the inference will be
That is because "My" starts with a capital letter and is surrounded by non-word characters, just like "John". To handle this ambiguity, we can either improve the prior over the template or provide more data:
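Providing more data can be sketched by sharing one template variable between two greetings. The second observed text, "Hi! My name is Tom.", and the variable names name2 and text2 are hypothetical, as are the namespace and the top-level-statement layout.

```csharp
using System;
using Microsoft.ML.Probabilistic.Models;

// Shared template prior: placeholder surrounded by non-word characters.
Variable<string> template =
    Variable.StringUniform() + Variable.CharNonWord() + "{0}"
    + Variable.CharNonWord() + Variable.StringUniform();

// First greeting, taken from the example above.
Variable<string> name = Variable.StringCapitalized();
Variable<string> text = Variable.StringFormat(template, name);
text.ObservedValue = "Hi! My name is John.";

// Second greeting built from the same template (hypothetical extra data).
Variable<string> name2 = Variable.StringCapitalized();
Variable<string> text2 = Variable.StringFormat(template, name2);
text2.ObservedValue = "Hi! My name is Tom.";

// Query both names and the shared template.
var engine = new InferenceEngine();
Console.WriteLine("name is '{0}'", engine.Infer(name));
Console.WriteLine("name2 is '{0}'", engine.Infer(name2));
Console.WriteLine("template is '{0}'", engine.Infer(template));
```

Because both texts must be explained by the same template, a placeholder position that fits only one of them is ruled out.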
Now Infer.NET can unambiguously work everything out:
Using the learned template
It is possible to combine forward and backward reasoning within the same model. For instance, we may want to generate a greeting text for yet another name without explicitly providing the template, inferring it from the data instead. To achieve this, all we have to do is define another text variable:
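A sketch of this idea follows. The template is learned from two observed greetings and then reused with a known name. The name "Boris", the second observed text, the variable name text3, and the namespace are all assumptions made for illustration.

```csharp
using System;
using Microsoft.ML.Probabilistic.Models;

// Template prior shared by all greetings.
Variable<string> template =
    Variable.StringUniform() + Variable.CharNonWord() + "{0}"
    + Variable.CharNonWord() + Variable.StringUniform();

// Backward reasoning: two observed greetings constrain the template.
Variable<string> name = Variable.StringCapitalized();
Variable<string> text = Variable.StringFormat(template, name);
text.ObservedValue = "Hi! My name is John.";

Variable<string> name2 = Variable.StringCapitalized();
Variable<string> text2 = Variable.StringFormat(template, name2);
text2.ObservedValue = "Hi! My name is Tom."; // hypothetical extra data

// Forward reasoning: apply the learned template to a known name.
Variable<string> text3 = Variable.StringFormat(template, Variable.Constant("Boris"));

var engine = new InferenceEngine();
Console.WriteLine("text3 is '{0}'", engine.Infer(text3));
```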
In this tutorial we saw how to use the StringFormat operation to both extract values from a piece of text and learn templates from one or more pieces of text. These are both very useful techniques for processing natural text strings. We can now move to the next tutorial, where we'll see how to combine integer, boolean, character array and string random variables to define a complex probabilistic model from bioinformatics.