Wednesday, November 25, 2015

Beyond Punctuation: Creating Custom Segmentation Rules in Studio

A freshly created TM in Studio comes with three standard segmentation rules, as shown below.


While these rules are enough in most cases, sometimes we'll open a file and wish it were segmented differently, as in this case:


A quick look at this file makes it clear that it would be a lot easier to handle if "Dry Time" and "Wait Time" were in their own separate segments, so a custom segmentation rule would come in handy.

This new rule won't be punctuation-related; instead, it will be content-related. In other words, I need Studio to create a new segment whenever the text "Wait Time:" or "Dry Time:" is found.

Adding a New Segmentation Rule

To access the segmentation rules, follow the path shown below.


This opens the Segmentation Rules window, where we will add the new rule.


Clicking Advanced View takes us to this window:


This is where we will tell Studio what we want to do. Before proceeding, let's spell out exactly what that is.


As shown above, we want to add a segment break (represented by the yellow line) right before "Wait Time:" and "Dry Time:", both of which are preceded by a space. In the window above, I need to tell Studio what pattern can be found before the (segment) break and after the break. So, in this example:

Before the break there is a space

and

After the break there is either "Wait Time:" or "Dry Time:"

To tell Studio what I want to do, I will need to use regular expressions, which for this example are not too complicated.


Explanation:

Before break
\s

  • \s is the regular expression character class for any whitespace character (space, tab, line break)


After break
(Wait|Dry) Time:


  • The | indicates alternation, so it's telling Studio to look for either "Wait" or "Dry"
  • The parentheses are used to group the two alternatives, as otherwise Studio would look for "Wait" or "Dry Time:", that is, it would not combine "Wait" and "Time:"

Note that I'm including the colon in the "After break" expression. This is to prevent unwanted breaks in sentences like "Wait Time cannot be longer than Dry Time."
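
To see how the two expressions behave outside Studio, here is a minimal Python sketch. It is only an approximation: Studio applies the "Before break" and "After break" patterns with its own engine, so I am assuming a break falls wherever the text just before it matches the first pattern and the text just after it matches the second. The segment helper and the sample strings are made up for illustration and are not part of Studio.

```python
import re

# The two expressions from the rule, copied as-is.
BEFORE_BREAK = r"\s"
AFTER_BREAK = r"(Wait|Dry) Time:"

# Assumption: a break sits where the text just before it matches BEFORE_BREAK
# and the text just after it matches AFTER_BREAK. Lookbehind and lookahead
# find that position without consuming any characters.
BREAK = re.compile(f"(?<={BEFORE_BREAK})(?={AFTER_BREAK})")


def segment(text):
    """Hypothetical helper: cut text at every break position."""
    pieces, start = [], 0
    for match in BREAK.finditer(text):
        pieces.append(text[start:match.start()].strip())
        start = match.start()
    pieces.append(text[start:].strip())
    return [piece for piece in pieces if piece]


# Made-up sample resembling the file in this post.
print(segment("Primer coat Dry Time: 30 min Wait Time: 2 h"))
# ['Primer coat', 'Dry Time: 30 min', 'Wait Time: 2 h']

# No colon after "Time", so no break is added.
print(segment("Wait Time cannot be longer than Dry Time."))
# ['Wait Time cannot be longer than Dry Time.']

# Without the parentheses, the alternatives are "Wait" and "Dry Time:".
print(re.fullmatch(r"Wait|Dry Time:", "Wait Time:"))  # None
print(bool(re.fullmatch(r"(Wait|Dry) Time:", "Wait Time:")))  # True
```

The last two lines show why the parentheses matter: the ungrouped pattern cannot match "Wait Time:" as a whole, while the grouped one can.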

After clicking OK, the rule is now included in the list of segmentation rules.



After closing all the open windows, the new rule is available to be applied during processing.

To apply it to my file, I will need to first remove the file from my project, add it again and process it as usual, as shown in this short video.



And that's all there is to it! After re-processing the file, I now have the segmentation I wanted.