.TH REGEX 3 "17 May 1993"
.\" one other place knows this name:  the SEE ALSO section
.SH NAME
regcomp, regexec, regerror, regfree \- regular-expression library
.SH SYNOPSIS
.ft B
#include <sys/types.h>
.br
#include <regex.h>
.HP 10
int regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags);
.HP
int\ regexec(const\ regex_t\ *preg, const\ char\ *string,
size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags);
.HP
size_t\ regerror(int\ errcode, const\ regex_t\ *preg,
char\ *errbuf, size_t\ errbuf_size);
.HP
void\ regfree(regex_t\ *preg);
.ft
.SH DESCRIPTION
These routines implement POSIX 1003.2 regular expressions (``RE''s);
see
.IR re_format (7).
.I Regcomp
compiles an RE written as a string into an internal form,
.I regexec
matches that internal form against a string and reports results,
.I regerror
transforms error codes from either into human-readable messages,
and
.I regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.PP
The header
.I <regex.h>
declares two structure types,
.I regex_t
and
.IR regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.IR regoff_t ,
and a number of constants with names starting with ``REG_''.
.PP
.I Regcomp
compiles the regular expression contained in the
.I pattern
string,
subject to the flags in
.IR cflags ,
and places the results in the
.I regex_t
structure pointed to by
.IR preg .
.I Cflags
is the bitwise OR of zero or more of the following flags:
.IP REG_EXTENDED \w'REG_EXTENDED'u+2n
Compile modern (``extended'') REs,
rather than the obsolete (``basic'') REs that
are the default.
.IP REG_BASIC
This is a synonym for 0,
provided as a counterpart to REG_EXTENDED to improve readability.
.IP REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the ``RE'' is a literal string.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
REG_EXTENDED and REG_NOSPEC may not be used
in the same call to
.IR regcomp .
.IP REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.IR re_format (7).
.IP REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.IP REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
`[^' bracket expressions and `.' never match newline,
a `^' anchor matches the null string after any newline in the string
in addition to its normal function,
and the `$' anchor matches the null string before any newline in the
string in addition to its normal function.
.IP REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.I re_endp
member of the structure pointed to by
.IR preg .
The
.I re_endp
member is of type
.IR const\ char\ * .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
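.PP
As an illustration of how compilation flags combine,
here is a minimal sketch that is not part of the library.
The helper name
.I re_search
is hypothetical,
and only the portable flags described above are used.
.nf
```c
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: compile `pattern` with REG_EXTENDED | REG_NOSUB
 * plus any extra compilation flags, then test it against `text`.
 * Returns 1 on a match, 0 on no match, -1 on any error.
 */
int re_search(const char *pattern, const char *text, int extra_cflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | extra_cflags) != 0)
        return -1;      /* nothing to free after a failed compile */

    /* Only success or failure is wanted, so no match array is passed. */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```
.fi
Passing REG_ICASE or REG_NEWLINE as the third argument changes the
result for the same pattern and text,
which is a convenient way to check a flag's effect in isolation.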
.PP
When successful,
.I regcomp
returns 0 and fills in the structure pointed to by
.IR preg .
One member of that structure
(other than
.IR re_endp )
is publicized:
.IR re_nsub ,
of type
.IR size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
REG_NOSUB flag was used).
If
.I regcomp
fails, it returns a non-zero error code;
see DIAGNOSTICS.
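.PP
The following sketch shows one way to read
.I re_nsub
after a successful compile.
The helper name
.I re_count_groups
is hypothetical and chosen only for this example.
.nf
```c
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: return the number of parenthesized
 * subexpressions in an extended RE, or -1 if it does not compile.
 * REG_NOSUB is deliberately not used, since it leaves re_nsub undefined.
 */
long re_count_groups(const char *pattern)
{
    regex_t re;
    long n;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    n = (long)re.re_nsub;
    regfree(&re);
    return n;
}
```
.fi
A caller could use the result to size the match array as
.I re_nsub
+ 1 entries, leaving room for entry 0.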
.PP
.I Regexec
matches the compiled RE pointed to by
.I preg
against the
.IR string ,
subject to the flags in
.IR eflags ,
and reports results using
.IR nmatch ,
.IR pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.IR regcomp .
The compiled form is not altered during execution of
.IR regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.PP
By default,
the NUL-terminated string pointed to by
.I string
is considered to be the text of an entire line, minus any terminating
newline.
The
.I eflags
argument is the bitwise OR of zero or more of the following flags:
.IP REG_NOTBOL \w'REG_STARTEND'u+2n
The first character of
the string
is not the beginning of a line, so the `^' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_NOTEOL
The NUL terminating
the string
does not end a line, so the `$' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_STARTEND
The string is considered to start at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
and to have a terminating NUL located at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
(there need not actually be a NUL at that location),
regardless of the value of
.IR nmatch .
See below for the definition of
.I pmatch
and
.IR nmatch .
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL;
REG_STARTEND affects only the location of the string,
not how it is matched.
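.PP
To make the effect of the execution flags concrete,
here is a small sketch using only the portable flags REG_NOTBOL and
REG_NOTEOL.
The helper name
.I re_match_eflags
is hypothetical.
.nf
```c
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: match a basic RE against `text` with the given
 * execution flags. Returns 1 on a match, 0 on no match, -1 on error.
 */
int re_match_eflags(const char *pattern, const char *text, int eflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, eflags);
    regfree(&re);

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```
.fi
A caller that hands
.I regexec
a fragment from the middle of a line would typically pass REG_NOTBOL,
and REG_NOTEOL when the fragment stops short of the end of the line.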
.PP
See
.IR re_format (7)
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.IR string .
.PP
Normally,
.I regexec
returns 0 for success and the non-zero code REG_NOMATCH for failure.
Other non-zero error codes may be returned in exceptional situations;
see DIAGNOSTICS.
.PP
If REG_NOSUB was specified in the compilation of the RE,
or if
.I nmatch
is 0,
.I regexec
ignores the
.I pmatch
argument (but see below for the case where REG_STARTEND is specified).
Otherwise,
.I pmatch
points to an array of
.I nmatch
structures of type
.IR regmatch_t .
Such a structure has at least the members
.I rm_so
and
.IR rm_eo ,
both of type
.I regoff_t
(a signed arithmetic type at least as large as an
.I off_t
and a
.IR ssize_t ),
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.I string
argument given to
.IR regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.PP
The 0th member of the
.I pmatch
array is filled in to indicate what substring of
.I string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.I i
reports subexpression
.IR i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array\(emcorresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
.I rm_so
and
.I rm_eo
set to \-1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note in particular that when the RE `(b*)+' matches `bbb',
the parenthesized subexpression matches each of the three `b's and then
an infinite number of empty strings following the last `b',
so the reported substring is one of the empties.)
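.PP
The reporting rules above can be exercised with a sketch like the
following.
The helper name
.I re_group_span
and the limit MAX_GROUPS are hypothetical choices made for this example.
.nf
```c
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

#define MAX_GROUPS 10   /* arbitrary limit chosen for this sketch */

/*
 * Hypothetical helper: match an extended RE and copy out the offsets
 * reported for subexpression `group` (0 means the entire match).
 * Returns 0 if the RE matched, 1 if it did not, and -1 on error or
 * if `group` is out of range. Unused entries come back as -1, -1.
 */
int re_group_span(const char *pattern, const char *text, size_t group,
                  long *start, long *end)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    int rc;

    if (group >= MAX_GROUPS)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, MAX_GROUPS, pm, 0);
    regfree(&re);

    if (rc == REG_NOMATCH)
        return 1;
    if (rc != 0)
        return -1;
    *start = (long)pm[group].rm_so;
    *end = (long)pm[group].rm_eo;
    return 0;
}
```
.fi
The length of the reported substring is
.I end
minus
.IR start ,
and the text itself begins at
.I text
+
.IR start .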
.PP
If REG_STARTEND is specified,
.I pmatch
must point to at least one
.I regmatch_t
(even if
.I nmatch
is 0 or REG_NOSUB was specified),
to hold the input offsets for REG_STARTEND.
Use for output is still entirely controlled by
.IR nmatch ;
if
.I nmatch
is 0 or REG_NOSUB was specified,
the value of
.IR pmatch [0]
will not be changed by a successful
.IR regexec .
.PP
.I Regerror
maps a non-zero
.I errcode
from either
.I regcomp
or
.I regexec
to a human-readable, printable message.
If
.I preg
is non-NULL,
the error code should have arisen from use of
the
.I regex_t
pointed to by
.IR preg ,
and if the error code came from
.IR regcomp ,
it should have been the result from the most recent
.I regcomp
using that
.IR regex_t .
.RI ( Regerror
may be able to supply a more detailed message using information
from the
.IR regex_t .)
.I Regerror
places the NUL-terminated message into the buffer pointed to by
.IR errbuf ,
limiting the length (including the NUL) to at most
.I errbuf_size
bytes.
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.I errbuf_size
is 0,
.I errbuf
is ignored but the return value is still correct.
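.PP
Because the return value reports the size needed,
a buffer can be sized with a first call that passes a zero
.IR errbuf_size .
The sketch below illustrates this;
the helper name
.I re_error_string
is hypothetical.
.nf
```c
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: return a malloc'd copy of the message for
 * `errcode`. A first regerror() call with errbuf_size 0 reports the
 * size needed, including the terminating NUL.
 * The caller frees the result; NULL means the allocation failed.
 */
char *re_error_string(int errcode, const regex_t *preg)
{
    size_t need = regerror(errcode, preg, NULL, 0);
    char *buf = malloc(need);

    if (buf != NULL)
        (void)regerror(errcode, preg, buf, need);
    return buf;
}
```
.fi
A fixed-size buffer also works when truncated messages are acceptable,
since the message is always NUL-terminated within
.I errbuf_size
bytes.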
.PP
If the
.I errcode
given to
.I regerror
is first ORed with REG_ITOA,
the ``message'' that results is the printable name of the error code,
e.g. ``REG_NOMATCH'',
rather than an explanation thereof.
If
.I errcode
is REG_ATOI,
then
.I preg
shall be non-NULL and the
.I re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.I errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.PP
.I Regfree
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.IR preg .
The remaining
.I regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.I regexec
or
.I regerror
is undefined.
.PP
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.SH IMPLEMENTATION CHOICES
There are a number of decisions that 1003.2 leaves up to the implementor,
either by explicitly saying ``undefined'' or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.PP
See
.IR re_format (7)
for a discussion of the definition of case-independent matching.
.PP
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See BUGS for one short RE using them
that will run almost any system out of memory.
.PP
A backslashed character other than one specifically given a magic meaning
by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs)
is taken as an ordinary character.
.PP
Any unmatched [ is a REG_EBRACK error.
.PP
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.PP
RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
.PP
A repetition operator (?, *, +, or bounds) cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow `^' or `|'.
.PP
`|' cannot appear first or last in a (sub)expression or after another `|',
i.e. an operand of `|' cannot be an empty subexpression.
An empty parenthesized subexpression, `()', is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.PP
A `{' followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A `{' \fInot\fR followed by a digit is considered an ordinary character.
.PP
`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
REs are anchors, not ordinary characters.
.SH SEE ALSO
grep(1), re_format(7)
.PP
POSIX 1003.2, sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.SH DIAGNOSTICS
Non-zero error codes from
.I regcomp
and
.I regexec
include the following:
.PP
.nf
.ta \w'REG_ECOLLATE'u+3n
REG_NOMATCH	regexec() failed to match
REG_BADPAT	invalid regular expression
REG_ECOLLATE	invalid collating element
REG_ECTYPE	invalid character class
REG_EESCAPE	\e applied to unescapable character
REG_ESUBREG	invalid backreference number
REG_EBRACK	brackets [ ] not balanced
REG_EPAREN	parentheses ( ) not balanced
REG_EBRACE	braces { } not balanced
REG_BADBR	invalid repetition count(s) in { }
REG_ERANGE	invalid character range in [ ]
REG_ESPACE	ran out of memory
REG_BADRPT	?, *, or + operand invalid
REG_EMPTY	empty (sub)expression
REG_ASSERT	``can't happen''\(emyou found a bug
REG_INVARG	invalid argument, e.g. negative-length string
.fi
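.PP
Where the REG_ITOA extension is unavailable,
a program can map codes to their names itself.
The following is a minimal sketch covering only the codes that POSIX
defines;
the helper name
.I re_code_name
is hypothetical,
and the extension codes REG_EMPTY, REG_ASSERT, and REG_INVARG are
omitted because other implementations may not define them.
.nf
```c
#include <regex.h>

/*
 * Hypothetical helper: return the symbolic name of a POSIX-defined
 * error code, or "unknown" for anything else.
 */
const char *re_code_name(int code)
{
    switch (code) {
    case REG_NOMATCH:  return "REG_NOMATCH";
    case REG_BADPAT:   return "REG_BADPAT";
    case REG_ECOLLATE: return "REG_ECOLLATE";
    case REG_ECTYPE:   return "REG_ECTYPE";
    case REG_EESCAPE:  return "REG_EESCAPE";
    case REG_ESUBREG:  return "REG_ESUBREG";
    case REG_EBRACK:   return "REG_EBRACK";
    case REG_EPAREN:   return "REG_EPAREN";
    case REG_EBRACE:   return "REG_EBRACE";
    case REG_BADBR:    return "REG_BADBR";
    case REG_ERANGE:   return "REG_ERANGE";
    case REG_ESPACE:   return "REG_ESPACE";
    case REG_BADRPT:   return "REG_BADRPT";
    default:           return "unknown";
    }
}
```
.fi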
.SH HISTORY
Written by Henry Spencer at University of Toronto,
henry@zoo.toronto.edu.
.SH BUGS
This is an alpha release with known defects.
Please report problems.
.PP
There is one known functionality bug.
The implementation of internationalization is incomplete:
the locale is always assumed to be the default one of 1003.2,
and only the collating elements etc. of that locale are available.
.PP
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.PP
.I Regexec
performance is poor.
This will improve with later releases.
.I Nmatch
exceeding 0 is expensive;
.I nmatch
exceeding 1 is worse.
.I Regexec
is largely insensitive to RE complexity \fIexcept\fR that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.PP
.I Regcomp
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
will (eventually) run almost any existing machine out of swap space.
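.PP
A back-of-the-envelope calculation suggests why nesting is so costly.
The sketch below rests on a simplifying assumption that is not taken
from the implementation:
expanding a bounded repetition with upper bound
.I n
is modelled as making about
.I n
copies of its operand,
so nested bounds multiply.
The function name
.I expanded_copies
is hypothetical.
.nf
```c
#include <limits.h>
#include <stddef.h>

/*
 * Rough model (an assumption, not the library's actual algorithm):
 * each nesting level copies its operand max_counts[i] times.
 * Returns the approximate number of copies of the innermost operand,
 * saturating at ULLONG_MAX if the product would overflow.
 */
unsigned long long expanded_copies(const unsigned *max_counts, size_t depth)
{
    unsigned long long total = 1;
    size_t i;

    for (i = 0; i < depth; i++) {
        if (max_counts[i] != 0 && total > ULLONG_MAX / max_counts[i])
            return ULLONG_MAX;
        total *= max_counts[i];
    }
    return total;
}
```
.fi
Under this model, five nested levels with an upper bound of 100 each
come to 100^5, or ten billion, copies of the innermost `a'.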
.PP
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.PP
Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is
a special character only in the presence of a previous unmatched `('.
This can't be fixed until the spec is fixed.
.PP
The standard's definition of back references is vague.
For example, does
`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
Until the standard is clarified,
behavior in such cases should not be relied on.
.PP
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.