.TH REGEX 3 "17 May 1993"
.BY "Henry Spencer"
.de ZR
.\" one other place knows this name: the SEE ALSO section
.IR regex (7) \\$1
..
.SH NAME
regcomp, regexec, regerror, regfree \- regular-expression library
.SH SYNOPSIS
.ft B
.\".na
#include <sys/types.h>
.br
#include <regex.h>
.HP 10
int\ regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags);
.HP
int\ regexec(const\ regex_t\ *preg, const\ char\ *string,
size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags);
.HP
size_t\ regerror(int\ errcode, const\ regex_t\ *preg,
char\ *errbuf, size_t\ errbuf_size);
.HP
void\ regfree(regex_t\ *preg);
.\".ad
.ft
.SH DESCRIPTION
These routines implement POSIX 1003.2 regular expressions (``RE''s);
see
.ZR .
.I Regcomp
compiles an RE written as a string into an internal form,
.I regexec
matches that internal form against a string and reports results,
.I regerror
transforms error codes from either into human-readable messages,
and
.I regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.PP
The header
.I <regex.h>
declares two structure types,
.I regex_t
and
.IR regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.IR regoff_t ,
and a number of constants with names starting with ``REG_''.
.PP
.I Regcomp
compiles the regular expression contained in the
.I pattern
string,
subject to the flags in
.IR cflags ,
and places the results in the
.I regex_t
structure pointed to by
.IR preg .
.I Cflags
is the bitwise OR of zero or more of the following flags:
.IP REG_EXTENDED \w'REG_EXTENDED'u+2n
Compile modern (``extended'') REs,
rather than the obsolete (``basic'') REs that
are the default.
.IP REG_BASIC
This is a synonym for 0,
provided as a counterpart to REG_EXTENDED to improve readability.
.IP REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the ``RE'' is a literal string.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
REG_EXTENDED and REG_NOSPEC may not be used
in the same call to
.IR regcomp .
.IP REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.ZR .
.IP REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.IP REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
`[^' bracket expressions and `.' never match newline,
a `^' anchor matches the null string after any newline in the string
in addition to its normal function,
and the `$' anchor matches the null string before any newline in the
string in addition to its normal function.
.IP REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.I re_endp
member of the structure pointed to by
.IR preg .
The
.I re_endp
member is of type
.IR const\ char\ * .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
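.PP
As a minimal sketch of how the portable compile flags combine,
the helper below compiles a pattern with caller-supplied flags plus REG_NOSUB
and reports only whether the text matched.
The name re_matches and its return convention are assumptions made for
illustration; they are not part of this library.

```c
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: returns 1 if text matches pattern, 0 if it does
 * not, and -1 if the pattern fails to compile or regexec reports an error.
 */
int re_matches(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only success or failure is wanted, not match offsets. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;

    /* With REG_NOSUB the pmatch argument is ignored, so pass 0 and NULL. */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```

Called as re_matches("^two$", "one\entwo\enthree", REG_EXTENDED | REG_NEWLINE),
the anchors are allowed to match around the embedded newlines;
without REG_NEWLINE the same call reports no match.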
.PP
When successful,
.I regcomp
returns 0 and fills in the structure pointed to by
.IR preg .
One member of that structure
(other than
.IR re_endp )
is publicized:
.IR re_nsub ,
of type
.IR size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
REG_NOSUB flag was used).
If
.I regcomp
fails, it returns a non-zero error code;
see DIAGNOSTICS.
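.PP
The following sketch reads re_nsub after a successful compile,
which is one way to learn how many pmatch entries a later regexec call
could fill in.
The helper name count_groups is hypothetical.

```c
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: returns the number of parenthesized
 * subexpressions in an extended RE, or -1 if it does not compile.
 */
long count_groups(const char *pattern)
{
    regex_t re;
    long n;

    /* REG_NOSUB is deliberately absent: it would leave re_nsub undefined. */
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;          /* compile failed, so there is nothing to free */

    n = (long)re.re_nsub;
    regfree(&re);
    return n;
}
```

An array of re_nsub + 1 regmatch_t structures is enough to hold the whole
match plus every subexpression.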
.PP
.I Regexec
matches the compiled RE pointed to by
.I preg
against the
.IR string ,
subject to the flags in
.IR eflags ,
and reports results using
.IR nmatch ,
.IR pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.IR regcomp .
The compiled form is not altered during execution of
.IR regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.PP
By default,
the NUL-terminated string pointed to by
.I string
is considered to be the text of an entire line, minus any terminating
newline.
The
.I eflags
argument is the bitwise OR of zero or more of the following flags:
.IP REG_NOTBOL \w'REG_STARTEND'u+2n
The first character of
the string
is not the beginning of a line, so the `^' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_NOTEOL
The NUL terminating
the string
does not end a line, so the `$' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_STARTEND
The string is considered to start at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
and to have a terminating NUL located at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
(there need not actually be a NUL at that location),
regardless of the value of
.IR nmatch .
See below for the definition of
.I pmatch
and
.IR nmatch .
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL;
REG_STARTEND affects only the location of the string,
not how it is matched.
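.PP
A common use of REG_NOTBOL is scanning one line for successive matches:
after the first match, the remaining text no longer starts a line.
The sketch below counts non-overlapping matches this way, using only the
portable flags;
the helper name count_matches and its handling of empty matches are
assumptions made for illustration.

```c
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: counts non-overlapping matches of an extended RE
 * in text. Returns -1 on a compile error or an unexpected regexec error.
 */
int count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    const char *p = text;
    int eflags = 0, count = 0, rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while ((rc = regexec(&re, p, 1, m, eflags)) == 0) {
        count++;
        /* Offsets are relative to p, the string given to this call. */
        if (m[0].rm_eo == m[0].rm_so) {
            /* Empty match: step over one character to guarantee progress. */
            if (p[m[0].rm_eo] == '\0')
                break;
            p += m[0].rm_eo + 1;
        } else {
            p += m[0].rm_eo;
        }
        /* The rest of the text is not at the beginning of a line. */
        eflags = REG_NOTBOL;
    }
    regfree(&re);
    return (rc == 0 || rc == REG_NOMATCH) ? count : -1;
}
```

With this approach count_matches("^a", "aaa") finds one match rather than
three, because `^' is not allowed to match on the later calls.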
.PP
See
.ZR
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.IR string .
.PP
Normally,
.I regexec
returns 0 for success and the non-zero code REG_NOMATCH for failure.
Other non-zero error codes may be returned in exceptional situations;
see DIAGNOSTICS.
.PP
If REG_NOSUB was specified in the compilation of the RE,
or if
.I nmatch
is 0,
.I regexec
ignores the
.I pmatch
argument (but see below for the case where REG_STARTEND is specified).
Otherwise,
.I pmatch
points to an array of
.I nmatch
structures of type
.IR regmatch_t .
Such a structure has at least the members
.I rm_so
and
.IR rm_eo ,
both of type
.I regoff_t
(a signed arithmetic type at least as large as an
.I off_t
and a
.IR ssize_t ),
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.I string
argument given to
.IR regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.PP
The 0th member of the
.I pmatch
array is filled in to indicate what substring of
.I string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.I i
reports subexpression
.IR i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array\(emcorresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
.I rm_so
and
.I rm_eo
set to \-1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note in particular that when the RE `(b*)+' matches `bbb',
the parenthesized subexpression matches each of the three `b's and then
an infinite number of empty strings following the last `b',
so the reported substring is one of the empties.)
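.PP
The offsets in pmatch can be turned into substrings as in the sketch below.
The helper extract_group and the MAX_GROUPS cap are hypothetical, chosen only
to keep the example short;
a real caller might size the array from re_nsub instead.

```c
#include <string.h>
#include <sys/types.h>
#include <regex.h>

#define MAX_GROUPS 10   /* arbitrary cap for this sketch */

/*
 * Hypothetical helper: copies what subexpression idx matched into out
 * (idx 0 is the whole match). Returns the number of bytes copied, or -1
 * on error, on no match, or if that subexpression did not participate.
 */
int extract_group(const char *pattern, const char *text,
                  size_t idx, char *out, size_t outsz)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    int len = -1;

    if (idx >= MAX_GROUPS || outsz == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* Unused entries come back with rm_so and rm_eo set to -1. */
    if (regexec(&re, text, MAX_GROUPS, pm, 0) == 0 && pm[idx].rm_so != -1) {
        /* rm_eo is the offset just past the substring. */
        size_t n = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);

        if (n >= outsz)
            n = outsz - 1;          /* truncate to fit, keep room for NUL */
        memcpy(out, text + pm[idx].rm_so, n);
        out[n] = '\0';
        len = (int)n;
    }
    regfree(&re);
    return len;
}
```

For the RE `(a)|(b)' matched against `b', entry 1 is unused and the helper
returns \-1 for it, while entry 2 yields `b'.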
.PP
If REG_STARTEND is specified,
.I pmatch
must point to at least one
.I regmatch_t
(even if
.I nmatch
is 0 or REG_NOSUB was specified),
to hold the input offsets for REG_STARTEND.
Use for output is still entirely controlled by
.IR nmatch ;
if
.I nmatch
is 0 or REG_NOSUB was specified,
the value of
.IR pmatch [0]
will not be changed by a successful
.IR regexec .
.PP
.I Regerror
maps a non-zero
.I errcode
from either
.I regcomp
or
.I regexec
to a human-readable, printable message.
If
.I preg
is non-NULL,
the error code should have arisen from use of
the
.I regex_t
pointed to by
.IR preg ,
and if the error code came from
.IR regcomp ,
it should have been the result from the most recent
.I regcomp
using that
.IR regex_t .
.RI ( Regerror
may be able to supply a more detailed message using information
from the
.IR regex_t .)
.I Regerror
places the NUL-terminated message into the buffer pointed to by
.IR errbuf ,
limiting the length (including the NUL) to at most
.I errbuf_size
bytes.
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of the buffer needed to hold the whole
message (including the terminating NUL).
If
.I errbuf_size
is 0,
.I errbuf
is ignored but the return value is still correct.
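.PP
Because the return value is the full size needed, a caller can ask for the
size first and then allocate exactly that much.
A minimal sketch follows;
the helper name compile_error is hypothetical, and the caller is assumed to
free the returned string.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: returns a malloc'd message explaining why pattern
 * failed to compile, or NULL if it compiled (or if malloc failed).
 */
char *compile_error(const char *pattern, int cflags)
{
    regex_t re;
    size_t need;
    char *msg;
    int rc = regcomp(&re, pattern, cflags);

    if (rc == 0) {
        regfree(&re);
        return NULL;
    }

    /* First call: errbuf_size of 0 means errbuf is ignored; only the
     * required size, including the terminating NUL, is returned. */
    need = regerror(rc, &re, NULL, 0);

    msg = malloc(need);
    if (msg != NULL)
        regerror(rc, &re, msg, need);   /* second call fills the buffer */
    return msg;
}
```

The same two-call pattern works for error codes returned by regexec.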
.PP
If the
.I errcode
given to
.I regerror
is first ORed with REG_ITOA,
the ``message'' that results is the printable name of the error code,
e.g. ``REG_NOMATCH'',
rather than an explanation thereof.
If
.I errcode
is REG_ATOI,
then
.I preg
shall be non-NULL and the
.I re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.I errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.PP
.I Regfree
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.IR preg .
The remaining
.I regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.I regexec
or
.I regerror
is undefined.
.PP
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.SH IMPLEMENTATION CHOICES
There are a number of decisions that 1003.2 leaves up to the implementor,
either by explicitly saying ``undefined'' or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.PP
See
.ZR
for a discussion of the definition of case-independent matching.
.PP
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See BUGS for one short RE using them
that will run almost any system out of memory.
.PP
A backslashed character other than one specifically given a magic meaning
by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs)
is taken as an ordinary character.
.PP
Any unmatched [ is a REG_EBRACK error.
.PP
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.PP
RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
.PP
A repetition operator (?, *, +, or bounds) cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow `^' or `|'.
.PP
`|' cannot appear first or last in a (sub)expression or after another `|',
i.e. an operand of `|' cannot be an empty subexpression.
An empty parenthesized subexpression, `()', is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.PP
A `{' followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A `{' \fInot\fR followed by a digit is considered an ordinary character.
.PP
`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
REs are anchors, not ordinary characters.
.SH SEE ALSO
grep(1), regex(7)
.PP
POSIX 1003.2, sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.SH DIAGNOSTICS
Non-zero error codes from
.I regcomp
and
.I regexec
include the following:
.PP
.nf
.ta \w'REG_ECOLLATE'u+3n
REG_NOMATCH	regexec() failed to match
REG_BADPAT	invalid regular expression
REG_ECOLLATE	invalid collating element
REG_ECTYPE	invalid character class
REG_EESCAPE	\e applied to unescapable character
REG_ESUBREG	invalid backreference number
REG_EBRACK	brackets [ ] not balanced
REG_EPAREN	parentheses ( ) not balanced
REG_EBRACE	braces { } not balanced
REG_BADBR	invalid repetition count(s) in { }
REG_ERANGE	invalid character range in [ ]
REG_ESPACE	ran out of memory
REG_BADRPT	?, *, or + operand invalid
REG_EMPTY	empty (sub)expression
REG_ASSERT	``can't happen''\(emyou found a bug
REG_INVARG	invalid argument, e.g. negative-length string
.fi
.SH HISTORY
Written by Henry Spencer at the University of Toronto,
henry@zoo.toronto.edu.
.SH BUGS
This is an alpha release with known defects.
Please report problems.
.PP
There is one known functionality bug.
The implementation of internationalization is incomplete:
the locale is always assumed to be the default one of 1003.2,
and only the collating elements etc. of that locale are available.
.PP
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.PP
.I Regexec
performance is poor.
This will improve with later releases.
.I Nmatch
exceeding 0 is expensive;
.I nmatch
exceeding 1 is worse.
.I Regexec
is largely insensitive to RE complexity \fIexcept\fR that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.PP
.I Regcomp
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
will (eventually) run almost any existing machine out of swap space.
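.PP
To see why nesting is so costly, assume (as a simplification, not a
description of the actual expansion code) that each level of nesting
multiplies the number of copies of the innermost atom by its upper bound.
The hypothetical helper below computes that rough upper bound.

```c
/*
 * Hypothetical helper: rough upper bound on copies of the innermost atom
 * when `depth` nested bounded repetitions, each with upper limit `max`,
 * are fully expanded. Assumes every level multiplies the copy count by
 * max. No overflow check: intended for small inputs such as (100, 5).
 */
unsigned long long expanded_copies(unsigned max, unsigned depth)
{
    unsigned long long n = 1;

    while (depth-- > 0)
        n *= max;
    return n;
}
```

Under that assumption the five-level RE above gives
expanded_copies(100, 5), which is 100 to the fifth power,
or ten billion copies of `a'.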
.PP
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.PP
Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is
a special character only in the presence of a previous unmatched `('.
This can't be fixed until the spec is fixed.
.PP
The standard's definition of back references is vague.
For example, does
`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
Until the standard is clarified,
behavior in such cases should not be relied on.
.PP
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.