2.3. Segmenting with regexes
This tutorial has already shown how to use the Segment widget to segment text into words, letters, or lines thanks to the drop-down menu options.

Figure 1: Interface of the Segment widget, configured for word segmentation
As a matter of fact, these options in the interface of Segment
rely internally on the use of regular expressions. For instance, the Segment
into words option uses regex \w+. It divides each incoming segment into
sequences of alphanumeric characters (and underscores), which in our case
amounts to segmenting a simple example into three words. Similarly, regex \w
is used to obtain a segmentation into letters (or, to be precise, alphanumeric
characters or underscores).
With some knowledge of regular expressions, you can exploit the Use a regular
expression option in the drop-down menu to do more specific queries. If the
relevant unit is the word, regexes will often use the \b anchor, which
represents a word boundary. For instance, words that contain fewer than 4
characters can be retrieved with \b\w{1,3}\b, those ending in -tion with
\b\w+tion\b, and the inflected forms of retrieve with
\bretriev(e|es|ed|ing)\b.
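To see what these three regexes match, they can be tried with Python's re module on a short sentence. This is a sketch under stated assumptions: the sample sentence and the helper function matches are hypothetical, and the code only mimics what the Segment widget would return for the same patterns.

```python
import re

# Hypothetical sample sentence, chosen so that each regex matches something.
text = "She retrieved the information, and he is retrieving a notion of it."


def matches(pattern, string):
    """Return every full match of pattern in string, in order."""
    # finditer + group(0) yields the whole match even when the
    # pattern contains a capturing group such as (e|es|ed|ing).
    return [m.group(0) for m in re.finditer(pattern, string)]


short_words = matches(r"\b\w{1,3}\b", text)
tion_words = matches(r"\b\w+tion\b", text)
retrieve_forms = matches(r"\bretriev(e|es|ed|ing)\b", text)

print(short_words)     # ['She', 'the', 'and', 'he', 'is', 'a', 'of', 'it']
print(tion_words)      # ['information', 'notion']
print(retrieve_forms)  # ['retrieved', 'retrieving']
```

Without the \b anchors, \w{1,3} would also match fragments of longer words (for instance "ret" in "retrieved"), which is why word-level queries usually need them.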

Figure 2: Using a regular expression (\b\w{1,3}\b) with the Segment widget
In these examples, the same result can be achieved by first using the built-in Segment into words option and then filtering the result with the Select widget (see Filtering segmentations using regexes). However, doing it in one step with Segment is more efficient in terms of computation time. Besides, it makes it possible to capture text fragments that are larger than words, e.g. multi-word expressions.
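The equivalence between the one-step and two-step approaches, and the extra reach of the one-step approach, can be sketched in plain Python. The sample sentence is hypothetical, and the list comprehension merely stands in for the Select widget; it is not how Textable implements filtering.

```python
import re

# Hypothetical sample sentence.
text = "It is as good as new, as well as cheap."

# One step: a single regex describes the short words directly.
one_step = re.findall(r"\b\w{1,3}\b", text)

# Two steps: segment into words first, then keep only the short ones.
words = re.findall(r"\w+", text)
two_steps = [w for w in words if re.fullmatch(r"\w{1,3}", w)]

print(one_step == two_steps)  # True

# Only the one-step approach can capture units larger than a word.
multiword = re.findall(r"\bas well as\b", text)
print(multiword)  # ['as well as']
```

Once the text has been cut into words, no filter applied to individual words can recover "as well as" as a single segment.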
To go further, you can add several regexes at the same time by ticking the Advanced settings checkbox (see figure 3 below). Regexes can then be used to describe the resulting tokens, as in the basic mode, or, depending on your research goal, to describe the delimiters occurring between the resulting tokens (by selecting mode Split instead of Tokenize). For more information, see Segment widget.
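The difference between describing tokens and describing delimiters has a close analogue in Python's re module, where findall returns what the regex matches and split returns what lies between the matches. The sketch below relies on that analogy only; the sample text and both regexes are assumptions, and the Tokenize and Split modes of Segment may differ in their details.

```python
import re

# Hypothetical sample text.
text = "salt and pepper, bread and butter"

# Tokenize-like: the regex describes the tokens themselves.
tokens = re.findall(r"\w+", text)

# Split-like: the regex describes the delimiters between tokens.
pieces = re.split(r",\s*", text)

print(tokens)  # ['salt', 'and', 'pepper', 'bread', 'and', 'butter']
print(pieces)  # ['salt and pepper', 'bread and butter']
```

Describing delimiters is often simpler when the units of interest are heterogeneous (here, phrases containing spaces) but the separator between them is regular.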

Figure 3: Interface of the Segment widget with Advanced settings checked
In several other widgets (Text Files, URLs, and Recode), Advanced settings also lets you switch from a single manipulation to a series of operations. In most widgets, it gives access to additional useful options.