2.3. Segmenting with regexes

This tutorial has already shown how to use the Segment widget to segment text into words, letters, or lines using the options in its drop-down menu.

Figure 1: Interface of the Segment widget, configured for word segmentation

In fact, these options in the interface of Segment rely internally on regular expressions. For instance, the Segment into words option uses the regex \w+, which divides each incoming segment into sequences of alphanumeric characters (and underscores); in our case, this amounts to segmenting a simple example into three words. Similarly, the regex \w is used to obtain a segmentation into letters (or, to be precise, alphanumeric characters or underscores).
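The behaviour of these two regexes can be checked outside of the widget. The following is a minimal sketch using Python's standard re module on the sample text; it only illustrates what the patterns match and is not meant to show how Segment is implemented.

```python
import re

# Sample text used throughout this tutorial.
text = "a simple example"

# \w+ matches maximal runs of alphanumeric characters or underscores ("words").
words = re.findall(r"\w+", text)

# \w matches a single alphanumeric character or underscore ("letters").
letters = re.findall(r"\w", text)

print(words)         # ['a', 'simple', 'example']
print(len(letters))  # 14
```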

With some knowledge of regular expressions, you can use the Use a regular expression option in the drop-down menu to perform more specific queries. If the relevant unit is the word, regexes will often use the \b anchor, which represents a word boundary. For instance, words that contain fewer than 4 characters can be retrieved with \b\w{1,3}\b, those ending in -tion with \b\w+tion\b, and the inflected forms of retrieve with \bretriev(e|es|ed|ing)\b.
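To see what these three patterns retrieve, here is a short sketch in plain Python (standard re module). The sample sentence and the helper function matches are made up for this illustration and are not part of Textable.

```python
import re

# Hypothetical sample sentence, chosen so that each pattern matches something.
text = "The station will retrieve data, as it retrieved it before; retrieving is an option."

def matches(pattern, string):
    """Return the full text of every match of pattern in string."""
    return [m.group(0) for m in re.finditer(pattern, string)]

short_words = matches(r"\b\w{1,3}\b", text)                  # words of 1 to 3 characters
tion_words = matches(r"\b\w+tion\b", text)                   # words ending in -tion
retrieve_forms = matches(r"\bretriev(e|es|ed|ing)\b", text)  # forms of "retrieve"

print(short_words)     # ['The', 'as', 'it', 'it', 'is', 'an']
print(tion_words)      # ['station', 'option']
print(retrieve_forms)  # ['retrieve', 'retrieved', 'retrieving']
```

The helper collects whole matches with m.group(0) because re.findall would return only the content of the parenthesized group for the third pattern.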

Figure 2: Using a Regular Expression (\b\w{1,3}\b) with the Segment widget

In these examples, the same result can be achieved by first using the built-in Segment into words option and then filtering the result with the Select widget (see Filtering segmentations using regexes). However, doing it in one step with Segment is more efficient in terms of computation time. It also makes it possible to capture text fragments that are larger than words, e.g. multi-word expressions.
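The equivalence between the two approaches, and the extra reach of the one-step approach, can be sketched in plain Python. This mimics the logic only; the sample text is invented, and the list comprehension merely stands in for what Select does.

```python
import re

# Hypothetical sample text for this illustration.
text = "a simple example of a regular expression in use"

# Two steps: segment into words, then keep only the short ones
# (the filtering step plays the role of the Select widget).
words = re.findall(r"\w+", text)
two_step = [w for w in words if re.fullmatch(r"\w{1,3}", w)]

# One step: a single regex with word boundaries.
one_step = re.findall(r"\b\w{1,3}\b", text)

print(two_step == one_step)  # True
print(one_step)              # ['a', 'of', 'a', 'in', 'use']

# A multi-word expression cannot be recovered from single-word segments,
# but a single regex can capture it directly.
multi_word = re.findall(r"\bregular expressions?\b", text)
print(multi_word)            # ['regular expression']
```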

To go further, you can apply several regexes at the same time by ticking the Advanced settings checkbox (see figure 3 below). Regexes can then be used to describe the resulting tokens, as in the basic mode, or, depending on your research goal, to describe the delimiters occurring between the resulting tokens (by selecting the Split mode instead of Tokenize). For more information, see Segment widget.
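The difference between describing tokens and describing delimiters can be approximated with re.findall and re.split from Python's standard library. This is only an analogy under that assumption: the widget's Tokenize and Split modes may treat edge cases (such as empty segments) differently, and the sample text is invented.

```python
import re

# Hypothetical sample text: items separated by commas.
text = "New York, Los Angeles, San Francisco"

# Tokenize-like: the regex describes the tokens themselves.
tokens = re.findall(r"\w+", text)

# Split-like: the regex describes the delimiters between tokens.
pieces = re.split(r",\s*", text)

print(tokens)  # ['New', 'York', 'Los', 'Angeles', 'San', 'Francisco']
print(pieces)  # ['New York', 'Los Angeles', 'San Francisco']
```

Describing the delimiter is often simpler when the units of interest have varied content but a regular separator, as with the comma-separated names here.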

Figure 3: Interface of the Segment widget with Advanced settings checked

In several other widgets (Text Files, URLs, and Recode), the Advanced settings checkbox likewise lets you switch from a single operation to a series of operations. In most widgets, it also gives access to additional useful options.

See also