Cold Help System: Programming: ColdC Reference Manual: Implementation: Regexps

ColdC Regular Expressions use Henry Spencer's Regular Expression package with further extensions similar to those Perl has implemented.

A Regular Expression is a pattern which describes the text to be matched. The simplest Regular Expression is a direct match, where the pattern is literal text. For instance, the string "that" exists within the string "this and that are here.", therefore "that" is a Regular Expression which matches that string.
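
The direct match described above can be sketched as follows. This is only an illustration: it uses Python's `re` module as a stand-in for the ColdC matcher (an assumption, as the ColdC functions are not shown here), and a literal pattern behaves the same way in both.

```python
import re

text = "this and that are here."

# A pattern with no special characters matches itself literally.
match = re.search("that", text)

found = match is not None
start = match.start() if match else -1  # zero-based offset of the match

print(found, start)
```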

However, regular expressions can be much more complex than this case, as there are many different ways in which text may need to be matched. Wildcard matching is a common way of matching more than a simple instance of text: a `*' in the wildcard generally matches any number of any characters. Although useful, it does have its restrictions. Wildcard matching is not used in Regular Expressions because it gives too little control over what is matched.

The following is a list of all characters which have special meaning in a regular expression:

\     Escape mechanism, changes the meaning of the next character
^     Match the beginning of a string or, if at the beginning of a range, logically invert the range (i.e. anything NOT in this range)
$     Match the end of a string
|     Branch separator
()    Grouping
[]    Range
.     Match any one character
*     Match zero or more of the previous
+     Match one or more of the previous
?     Match zero or one of the previous
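
A few of the characters in the list above can be tried out with a short sketch. Python's `re` module is assumed here as a stand-in for the ColdC matcher, and `matches` is a hypothetical helper written for this example only.

```python
import re

def matches(pattern, text):
    """Return True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

text = "this and that"

starts_with_this = matches("^this", text)  # "^" anchors to the beginning
starts_with_that = matches("^that", text)  # "that" is not at the beginning
ends_with_that = matches("that$", text)    # "$" anchors to the end
any_char = matches("th.t", text)           # "." stands in for the "a" of "that"
literal_dot = matches(r"th\.t", text)      # "\." only matches a real period

print(starts_with_this, starts_with_that, ends_with_that, any_char, literal_dot)
```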

ColdC Regular Expressions Explained

The first concept of Regular Expressions is the branch. There can be zero or more branches in a Regular Expression, separated by the pipe character ("|"). A Regular Expression will match anything which matches one of its branches. An example of this is:

this|that|there

This Regular Expression will match "this" OR "that" OR "there". It is easiest to think of branches as a logical OR.
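
As a small sketch of branching, the following checks a few words against three branches. Python's `re` module is assumed as a stand-in for the ColdC matcher, and the word list is made up for illustration.

```python
import re

# Three branches separated by "|"; any one of them may match.
pattern = re.compile("this|that|there")

words = ["this", "that", "there", "those"]
results = {word: pattern.search(word) is not None for word in words}

print(results)
```

Only "those" fails, since none of the three branches occurs in it.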

A branch is further defined as zero or more pieces joined together. A piece is an atom possibly followed by an asterisk, a plus sign, or a question mark ("*", "+", or "?"). The asterisk, plus sign, or question mark defines how many times the atom may be matched. An atom followed by an asterisk matches zero or more occurrences of the atom. An atom followed by a plus sign matches one or more occurrences of the atom. An atom followed by a question mark matches zero or one occurrence of the atom.
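
The difference between the three kinds of piece can be seen by applying each to the same atom. This sketch assumes Python's `re` module as a stand-in for the ColdC matcher; `full` is a hypothetical helper which requires the whole input to match.

```python
import re

def full(pattern, text):
    """Return True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

inputs = ["ac", "abc", "abbc"]  # zero, one and two occurrences of "b"

star = [full("ab*c", s) for s in inputs]      # zero or more "b"
plus = [full("ab+c", s) for s in inputs]      # one or more "b"
question = [full("ab?c", s) for s in inputs]  # zero or one "b"

print(star, plus, question)
```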

An atom is either a Regular Expression Group or Range (see below), or one of the following: a period ("."), a caret ("^"), a dollar sign ("$"), a back-slash ("\") followed by a single character, or a single character with no other significance. A period matches any single character in the input text, a caret matches the beginning of the input text, a dollar-sign matches the end of the input text, and a back-slash followed by a single character either has special significance--such as matching all white space or all digits--or it removes special significance from the following character. For instance, "\$" would match a dollar-sign in the input text, rather than matching the end of the input text (which is what the dollar-sign usually does).
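
The effect of the back-slash on a dollar-sign can be sketched as below, again assuming Python's `re` module as a stand-in for the ColdC matcher. The sample text is made up for the example.

```python
import re

text = "cost: $5"

# Escaped: "\$" matches the literal dollar-sign in the text.
literal = re.search(r"\$", text)

# Unescaped: "$" matches the (empty) position at the end of the text.
anchor = re.search("$", text)

literal_at = literal.start()
anchor_at = anchor.start()
anchor_text = anchor.group()

print(literal_at, anchor_at, repr(anchor_text))
```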

A Group is anything enclosed within a set of parentheses ("()"). Groups help to clarify how the Regular Expression should match, and also determine what results are returned from the Regular Expression. If one or more groups exist, the result of the regular expression will include the text matched by each group, possibly in addition to the entire area matched in the string (what is returned will depend upon the function).
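
How groups appear in the result can be sketched as follows. Python's `re` module is assumed as a stand-in here; the ColdC functions may return the groups and the entire match in a different form, as noted above.

```python
import re

text = "this and that are here."

# Two groups, each containing two branches.
match = re.search("(this|that) and (this|that)", text)

whole = match.group(0)   # the entire area matched
groups = match.groups()  # the text matched by each group, in order

print(whole, groups)
```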

A range is a sequence of characters enclosed in square brackets ("[]"). It normally matches any single character contained within the range sequence. Characters which normally have special significance (such as a dollar-sign, a period and a back-slash) lose that significance when enclosed in a range. However, a range has its own special characters. If the range begins with a caret ("^"), it matches any single character which is not in the sequence. If two characters in the sequence are separated by a dash ("-"), the full range of ASCII characters between the two is matched, including the two. For instance, "[0-9]" matches any single decimal digit from zero to nine. To include a literal closing square bracket ("]") in the sequence, do not use the back-slash to escape it (as the back-slash does not have special meaning when in a range); instead place it at the beginning of the range (following a possible "^"). To include a literal dash ("-"), place it at the start or end of the range.
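
The range rules above can be sketched with Python's `re` module, assumed here as a stand-in for the ColdC matcher. One known difference: Python still treats a back-slash inside a range as an escape, so the sketch avoids back-slashes altogether.

```python
import re

digits = re.findall("[0-9]", "a1b22")       # any single decimal digit
non_digits = re.findall("[^0-9]", "a1b22")  # leading "^" inverts the range
bracket = re.findall("[]x]", "a]x[")        # "]" placed first is literal
dash = re.findall("[a-]", "a-b")            # "-" placed last is literal
specials = re.findall("[$.]", "a$b.c")      # "$" and "." are literal in a range

print(digits, non_digits, bracket, dash, specials)
```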

Consider the following examples:

a?              Match zero or one "a" character
(this|that)*    Match zero or more occurrences of "this" or "that"
[a-z]+          Match one or more occurrences of any lowercase letter (a through z)
[^0-9]?         Match zero or one occurrence of any non-digit character
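
The four examples above can be checked with a short sketch, assuming Python's `re` module as a stand-in for the ColdC matcher. The helper `full` and the sample inputs are hypothetical; each pattern is required to match the whole input so that the counts are visible.

```python
import re

def full(pattern, text):
    """Return True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

optional_a = [full("a?", s) for s in ["", "a", "aa"]]
this_or_that = [full("(this|that)*", s) for s in ["", "thisthat", "thisthem"]]
lowercase = [full("[a-z]+", s) for s in ["abc", "", "ABC"]]
non_digit = [full("[^0-9]?", s) for s in ["", "x", "7"]]

print(optional_a, this_or_that, lowercase, non_digit)
```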

The following escape sequences have special meaning when matching (similar to Perl Regular Expressions):

"\w" Match a word character (alphanumeric plus "_")
"\W" Match a non-word character
"\s" Match a whitespace character
"\S" Match a non-whitespace character
"\d" Match a digit character
"\D" Match a non-digit character

Note: the above escape characters have not yet been integrated into the regular expression matcher.
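
Since these escapes are not yet available in the ColdC matcher, their intended meaning can only be sketched elsewhere. The following uses Python's `re` module, which implements the same Perl-style escapes; the `re.ASCII` flag is an assumption made to keep "\w" limited to ASCII alphanumerics plus "_".

```python
import re

A = re.ASCII  # restrict \w, \d and \s to their ASCII definitions

words = re.findall(r"\w+", "foo_bar 42!", A)  # runs of word characters
non_words = re.findall(r"\W", "a-b c", A)     # single non-word characters
fields = re.split(r"\s+", "a  b\tc", 0, A)    # split on runs of whitespace
chunks = re.findall(r"\S+", "a  b", A)        # runs of non-whitespace
digits = re.findall(r"\d", "a1b22", A)        # single digits
non_digits = re.findall(r"\D", "a1", A)       # single non-digits

print(words, non_words, fields, chunks, digits, non_digits)
```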
