1
by brian
clean slate |
1 |
.TH REGEX 7 "7 Feb 1994" |
2 |
.BY "Henry Spencer" |
|
3 |
.SH NAME |
|
4 |
regex \- POSIX 1003.2 regular expressions
|
|
5 |
.SH DESCRIPTION |
|
6 |
Regular expressions (``RE''s), |
|
7 |
as defined in POSIX 1003.2, come in two forms: |
|
8 |
modern REs (roughly those of |
|
9 |
.IR egrep ; |
|
10 |
1003.2 calls these ``extended'' REs) |
|
11 |
and obsolete REs (roughly those of |
|
12 |
.IR ed ; |
|
13 |
1003.2 ``basic'' REs). |
|
14 |
Obsolete REs mostly exist for backward compatibility in some old programs; |
|
15 |
they will be discussed at the end. |
|
16 |
1003.2 leaves some aspects of RE syntax and semantics open; |
|
17 |
`\(dg' marks decisions on these aspects that
|
|
18 |
may not be fully portable to other 1003.2 implementations. |
|
19 |
.PP
|
|
20 |
A (modern) RE is one\(dg or more non-empty\(dg \fIbranches\fR, |
|
21 |
separated by `|'. |
|
22 |
It matches anything that matches one of the branches. |
|
23 |
.PP
|
|
24 |
A branch is one\(dg or more \fIpieces\fR, concatenated. |
|
25 |
It matches a match for the first, followed by a match for the second, etc. |
|
26 |
.PP
|
|
27 |
A piece is an \fIatom\fR possibly followed |
|
28 |
by a single\(dg `*', `+', `?', or \fIbound\fR. |
|
29 |
An atom followed by `*' matches a sequence of 0 or more matches of the atom. |
|
30 |
An atom followed by `+' matches a sequence of 1 or more matches of the atom. |
|
31 |
An atom followed by `?' matches a sequence of 0 or 1 matches of the atom. |
|
32 |
.PP
|
|
33 |
A \fIbound\fR is `{' followed by an unsigned decimal integer, |
|
34 |
possibly followed by `,' |
|
35 |
possibly followed by another unsigned decimal integer, |
|
36 |
always followed by `}'. |
|
37 |
The integers must lie between 0 and RE_DUP_MAX (255\(dg) inclusive,
|
|
38 |
and if there are two of them, the first may not exceed the second. |
|
39 |
An atom followed by a bound containing one integer \fIi\fR |
|
40 |
and no comma matches |
|
41 |
a sequence of exactly \fIi\fR matches of the atom. |
|
42 |
An atom followed by a bound |
|
43 |
containing one integer \fIi\fR and a comma matches |
|
44 |
a sequence of \fIi\fR or more matches of the atom. |
|
45 |
An atom followed by a bound |
|
46 |
containing two integers \fIi\fR and \fIj\fR matches |
|
47 |
a sequence of \fIi\fR through \fIj\fR (inclusive) matches of the atom. |
|
48 |
.PP
|
|
49 |
An atom is a regular expression enclosed in `()' (matching a match for the |
|
50 |
regular expression), |
|
51 |
an empty set of `()' (matching the null string)\(dg,
|
|
52 |
a \fIbracket expression\fR (see below), `.' |
|
53 |
(matching any single character), `^' (matching the null string at the |
|
54 |
beginning of a line), `$' (matching the null string at the |
|
55 |
end of a line), a `\e' followed by one of the characters
|
|
56 |
`^.[$()|*+?{\e'
|
|
57 |
(matching that character taken as an ordinary character), |
|
58 |
a `\e' followed by any other character\(dg |
|
59 |
(matching that character taken as an ordinary character, |
|
60 |
as if the `\e' had not been present\(dg), |
|
61 |
or a single character with no other significance (matching that character). |
|
62 |
A `{' followed by a character other than a digit is an ordinary |
|
63 |
character, not the beginning of a bound\(dg.
|
|
64 |
It is illegal to end an RE with `\e'.
|
|
65 |
.PP
|
|
66 |
A \fIbracket expression\fR is a list of characters enclosed in `[]'. |
|
67 |
It normally matches any single character from the list (but see below). |
|
68 |
If the list begins with `^', |
|
69 |
it matches any single character |
|
70 |
(but see below) \fInot\fR from the rest of the list. |
|
71 |
If two characters in the list are separated by `\-', this is shorthand
|
|
72 |
for the full \fIrange\fR of characters between those two (inclusive) in the |
|
73 |
collating sequence, |
|
74 |
e.g. `[0-9]' in ASCII matches any decimal digit. |
|
75 |
It is illegal\(dg for two ranges to share an
|
|
76 |
endpoint, e.g. `a-c-e'. |
|
77 |
Ranges are very collating-sequence-dependent, |
|
78 |
and portable programs should avoid relying on them. |
|
79 |
.PP
|
|
80 |
To include a literal `]' in the list, make it the first character |
|
81 |
(following a possible `^'). |
|
82 |
To include a literal `\-', make it the first or last character,
|
|
83 |
or the second endpoint of a range. |
|
84 |
To use a literal `\-' as the first endpoint of a range,
|
|
85 |
enclose it in `[.' and `.]' to make it a collating element (see below). |
|
86 |
With the exception of these and some combinations using `[' (see next |
|
87 |
paragraphs), all other special characters, including `\e', lose their
|
|
88 |
special significance within a bracket expression. |
|
89 |
.PP
|
|
90 |
Within a bracket expression, a collating element (a character, |
|
91 |
a multi-character sequence that collates as if it were a single character, |
|
92 |
or a collating-sequence name for either) |
|
93 |
enclosed in `[.' and `.]' stands for the |
|
94 |
sequence of characters of that collating element. |
|
95 |
The sequence is a single element of the bracket expression's list. |
|
96 |
A bracket expression containing a multi-character collating element |
|
97 |
can thus match more than one character, |
|
98 |
e.g. if the collating sequence includes a `ch' collating element, |
|
99 |
then the RE `[[.ch.]]*c' matches the first five characters |
|
100 |
of `chchcc'. |
|
101 |
.PP
|
|
102 |
Within a bracket expression, a collating element enclosed in `[=' and |
|
103 |
`=]' is an equivalence class, standing for the sequences of characters |
|
104 |
of all collating elements equivalent to that one, including itself. |
|
105 |
(If there are no other equivalent collating elements, |
|
106 |
the treatment is as if the enclosing delimiters were `[.' and `.]'.) |
|
107 |
For example, if o and \o'o^' are the members of an equivalence class,
|
|
108 |
then `[[=o=]]', `[[=\o'o^'=]]', and `[o\o'o^']' are all synonymous. |
|
109 |
An equivalence class may not\(dg be an endpoint
|
|
110 |
of a range. |
|
111 |
.PP
|
|
112 |
Within a bracket expression, the name of a \fIcharacter class\fR enclosed |
|
113 |
in `[:' and `:]' stands for the list of all characters belonging to that |
|
114 |
class. |
|
115 |
Standard character class names are: |
|
116 |
.PP
|
|
117 |
.RS
|
|
118 |
.nf
|
|
119 |
.ta 3c 6c 9c |
|
120 |
alnum digit punct |
|
121 |
alpha graph space |
|
122 |
blank lower upper |
|
123 |
cntrl print xdigit |
|
124 |
.fi
|
|
125 |
.RE
|
|
126 |
.PP
|
|
127 |
These stand for the character classes defined in |
|
128 |
.IR ctype (3). |
|
129 |
A locale may provide others. |
|
130 |
A character class may not be used as an endpoint of a range. |
|
131 |
.PP
|
|
132 |
There are two special cases\(dg of bracket expressions:
|
|
133 |
the bracket expressions `[[:<:]]' and `[[:>:]]' match the null string at |
|
134 |
the beginning and end of a word respectively. |
|
135 |
A word is defined as a sequence of |
|
136 |
word characters |
|
137 |
which is neither preceded nor followed by |
|
138 |
word characters. |
|
139 |
A word character is an |
|
140 |
.I alnum |
|
141 |
character (as defined by |
|
142 |
.IR ctype (3)) |
|
143 |
or an underscore. |
|
144 |
This is an extension, |
|
145 |
compatible with but not specified by POSIX 1003.2, |
|
146 |
and should be used with |
|
147 |
caution in software intended to be portable to other systems. |
|
148 |
.PP
|
|
149 |
In the event that an RE could match more than one substring of a given |
|
150 |
string, |
|
151 |
the RE matches the one starting earliest in the string. |
|
152 |
If the RE could match more than one substring starting at that point, |
|
153 |
it matches the longest. |
|
154 |
Subexpressions also match the longest possible substrings, subject to |
|
155 |
the constraint that the whole match be as long as possible, |
|
156 |
with subexpressions starting earlier in the RE taking priority over |
|
157 |
ones starting later. |
|
158 |
Note that higher-level subexpressions thus take priority over |
|
159 |
their lower-level component subexpressions. |
|
160 |
.PP
|
|
161 |
Match lengths are measured in characters, not collating elements. |
|
162 |
A null string is considered longer than no match at all. |
|
163 |
For example, |
|
164 |
`bb*' matches the three middle characters of `abbbc', |
|
165 |
`(wee|week)(knights|nights)' matches all ten characters of `weeknights', |
|
166 |
when `(.*).*' is matched against `abc' the parenthesized subexpression |
|
167 |
matches all three characters, and |
|
168 |
when `(a*)*' is matched against `bc' both the whole RE and the parenthesized |
|
169 |
subexpression match the null string. |
|
170 |
.PP
|
|
171 |
If case-independent matching is specified, |
|
172 |
the effect is much as if all case distinctions had vanished from the |
|
173 |
alphabet. |
|
174 |
When an alphabetic that exists in multiple cases appears as an |
|
175 |
ordinary character outside a bracket expression, it is effectively |
|
176 |
transformed into a bracket expression containing both cases, |
|
177 |
e.g. `x' becomes `[xX]'. |
|
178 |
When it appears inside a bracket expression, all case counterparts |
|
179 |
of it are added to the bracket expression, so that (e.g.) `[x]' |
|
180 |
becomes `[xX]' and `[^x]' becomes `[^xX]'. |
|
181 |
.PP
|
|
182 |
No particular limit is imposed on the length of REs\(dg.
|
|
183 |
Programs intended to be portable should not employ REs longer |
|
184 |
than 256 bytes, |
|
185 |
as an implementation can refuse to accept such REs and remain |
|
186 |
POSIX-compliant. |
|
187 |
.PP
|
|
188 |
Obsolete (``basic'') regular expressions differ in several respects. |
|
189 |
`|', `+', and `?' are ordinary characters and there is no equivalent |
|
190 |
for their functionality. |
|
191 |
The delimiters for bounds are `\e{' and `\e}', |
|
192 |
with `{' and `}' by themselves ordinary characters. |
|
193 |
The parentheses for nested subexpressions are `\e(' and `\e)', |
|
194 |
with `(' and `)' by themselves ordinary characters. |
|
195 |
`^' is an ordinary character except at the beginning of the |
|
196 |
RE or\(dg the beginning of a parenthesized subexpression,
|
|
197 |
`$' is an ordinary character except at the end of the |
|
198 |
RE or\(dg the end of a parenthesized subexpression,
|
|
199 |
and `*' is an ordinary character if it appears at the beginning of the |
|
200 |
RE or the beginning of a parenthesized subexpression |
|
201 |
(after a possible leading `^'). |
|
202 |
Finally, there is one new type of atom, a \fIback reference\fR: |
|
203 |
`\e' followed by a non-zero decimal digit \fId\fR |
|
204 |
matches the same sequence of characters |
|
205 |
matched by the \fId\fRth parenthesized subexpression |
|
206 |
(numbering subexpressions by the positions of their opening parentheses, |
|
207 |
left to right), |
|
208 |
so that (e.g.) `\e([bc]\e)\e1' matches `bb' or `cc' but not `bc'. |
|
209 |
.SH SEE ALSO |
|
210 |
regex(3) |
|
211 |
.PP
|
|
212 |
POSIX 1003.2, section 2.8 (Regular Expression Notation). |
|
213 |
.SH BUGS |
|
214 |
Having two kinds of REs is a botch. |
|
215 |
.PP
|
|
216 |
The current 1003.2 spec says that `)' is an ordinary character in |
|
217 |
the absence of an unmatched `('; |
|
218 |
this was an unintentional result of a wording error, |
|
219 |
and change is likely. |
|
220 |
Avoid relying on it. |
|
221 |
.PP
|
|
222 |
Back references are a dreadful botch, |
|
223 |
posing major problems for efficient implementations. |
|
224 |
They are also somewhat vaguely defined |
|
225 |
(does |
|
226 |
`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?). |
|
227 |
Avoid using them. |
|
228 |
.PP
|
|
229 |
1003.2's specification of case-independent matching is vague. |
|
230 |
The ``one case implies all cases'' definition given above |
|
231 |
is current consensus among implementors as to the right interpretation. |
|
232 |
.PP
|
|
233 |
The syntax for word boundaries is incredibly ugly. |