~drizzle-trunk/drizzle/development

Viewing changes to regex/regex.3

Committer: Brian Aker
Date: 2008-07-08 21:36:11 UTC
mfrom: (77.1.34 codestyle)
Revision ID: brian@tangent.org-20080708213611-b0k2zy8eldttqct3

Merging up Monty's changes

files added:
AUTHORS

NEWS

README

support-files/libmysqlclient.pc.in

files removed:
.cvsignore

client/.cvsignore

dbug/.cvsignore

extra/.cvsignore

include/.cvsignore

libmysql/.cvsignore

libmysql/libmysql.def

mysql-test/mysql-test-run-shell.sh

mysys/.cvsignore

regex

regex/.cvsignore

regex/CHANGES

regex/COPYRIGHT

regex/Makefile.am

regex/README

regex/WHATSNEW

regex/cclass.h

regex/cname.h

regex/debug.c

regex/debug.ih

regex/engine.c

regex/engine.ih

regex/main.c

regex/main.ih

regex/make-ccc

regex/my_regex.h

regex/regcomp.c

regex/regcomp.ih

regex/regerror.c

regex/regerror.ih

regex/regex.3

regex/regex.7

regex/regex2.h

regex/regexec.c

regex/regexp.c

regex/regfree.c

regex/reginit.c

regex/split.c

regex/tests

regex/utils.h

scripts

scripts/.cvsignore

scripts/Makefile.am

scripts/fill_help_tables.sql

scripts/make_binary_distribution.sh

scripts/make_sharedlib_distribution.sh

scripts/mysql_config.pl.in

scripts/mysql_config.sh

scripts/mysql_convert_table_format.sh

scripts/mysql_find_rows.sh

scripts/mysql_fix_extensions.sh

scripts/mysql_install_db.sh

scripts/mysqld_multi.sh

scripts/mysqlhotcopy.sh

sql/.cvsignore

sql/share/.cvsignore

storage/heap/.cvsignore

storage/myisam/.cvsignore

strings/.cvsignore

support-files/.cvsignore

support-files/MySQL-shared-compat.spec.sh

support-files/binary-configure.sh

support-files/compiler_warnings.supp

support-files/magic

support-files/my-huge.cnf.sh

support-files/my-innodb-heavy-4G.cnf.sh

support-files/my-large.cnf.sh

support-files/my-medium.cnf.sh

support-files/my-small.cnf.sh

support-files/mysql-multi.server.sh

support-files/mysql.server-sys5.sh

support-files/mysql.spec.sh

support-files/mysqld_multi.server.sh

vio/.cvsignore

files renamed:
INSTALL-SOURCE => INSTALL

client/mysqltest.c => client/mysqltest.cc

mysql-test/install_test_db.sh => mysql-test/install_test_db.in

extra/resolve_stack_dump.c => mysql-test/resolve_stack_dump.c

sql/mysqld.cc => sql/drizzled.cc

scripts/mysqld_safe.sh => sql/drizzled_safe.in

scripts/mysqldumpslow.sh => sql/drizzledumpslow

support-files/mysql-log-rotate.sh => support-files/mysql-log-rotate.in

support-files/mysql.server.sh => support-files/mysql.server.in

files modified:
.bzrignore

Makefile.am

client/Makefile.am

config/autorun.sh

configure.ac

include/my_global.h

mysql-test/Makefile.am

mysql-test/resolve-stack

sql/Makefile.am

support-files/Makefile.am

regex/regex.3

.TH REGEX 3 "17 May 1993"

.BY "Henry Spencer"

.de ZR

.\" one other place knows this name: the SEE ALSO section

.IR regex (7) \\$1

..

.SH NAME

regcomp, regexec, regerror, regfree \- regular-expression library

.SH SYNOPSIS

.ft B

.\".na

#include <sys/types.h>

.br

#include <regex.h>

.HP 10

int regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags);

.HP

int\ regexec(const\ regex_t\ *preg, const\ char\ *string,

size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags);

.HP

size_t\ regerror(int\ errcode, const\ regex_t\ *preg,

char\ *errbuf, size_t\ errbuf_size);

.HP

void\ regfree(regex_t\ *preg);

.\".ad

.ft

.SH DESCRIPTION

These routines implement POSIX 1003.2 regular expressions (``RE''s);

see

.ZR .

.I Regcomp

compiles an RE written as a string into an internal form,

.I regexec

matches that internal form against a string and reports results,

.I regerror

transforms error codes from either into human-readable messages,

and

.I regfree

frees any dynamically-allocated storage used by the internal form

of an RE.

.PP

The header

.I <regex.h>

declares two structure types,

.I regex_t

and

.IR regmatch_t ,

the former for compiled internal forms and the latter for match reporting.

It also declares the four functions,

a type

.IR regoff_t ,

and a number of constants with names starting with ``REG_''.

.PP

.I Regcomp

compiles the regular expression contained in the

.I pattern

string,

subject to the flags in

.IR cflags ,

and places the results in the

.I regex_t

structure pointed to by

.IR preg .

.I Cflags

is the bitwise OR of zero or more of the following flags:

.IP REG_EXTENDED \w'REG_EXTENDED'u+2n

Compile modern (``extended'') REs,

rather than the obsolete (``basic'') REs that

are the default.

.IP REG_BASIC

This is a synonym for 0,

provided as a counterpart to REG_EXTENDED to improve readability.

.IP REG_NOSPEC

Compile with recognition of all special characters turned off.

All characters are thus considered ordinary,

so the ``RE'' is a literal string.

This is an extension,

compatible with but not specified by POSIX 1003.2,

and should be used with

caution in software intended to be portable to other systems.

REG_EXTENDED and REG_NOSPEC may not be used

in the same call to

.IR regcomp .

.IP REG_ICASE

Compile for matching that ignores upper/lower case distinctions.

See

.ZR .

.IP REG_NOSUB

Compile for matching that need only report success or failure,

not what was matched.

.IP REG_NEWLINE

Compile for newline-sensitive matching.

By default, newline is a completely ordinary character with no special

meaning in either REs or strings.

With this flag,

`[^' bracket expressions and `.' never match newline,

a `^' anchor matches the null string after any newline in the string

in addition to its normal function,

and the `$' anchor matches the null string before any newline in the

string in addition to its normal function.

.IP REG_PEND

The regular expression ends,

not at the first NUL,

but just before the character pointed to by the

.I re_endp

member of the structure pointed to by

.IR preg .

The

.I re_endp

member is of type

.IR const\ char\ * .

This flag permits inclusion of NULs in the RE;

they are considered ordinary characters.

This is an extension,

compatible with but not specified by POSIX 1003.2,

and should be used with

caution in software intended to be portable to other systems.
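.PP
As an illustration of the portable compilation flags above,
the following is a minimal sketch, not part of this library.
The helper name matches_with is hypothetical,
and only flags specified by POSIX 1003.2 are used
(REG_NOSPEC and REG_PEND are deliberately left out,
since other implementations may not provide them).

```c
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper (an assumption for illustration only):
 * returns 1 if `text' matches `pattern' compiled with `cflags',
 * 0 if it does not match, and -1 if the pattern fails to compile.
 */
int
matches_with(const char *pattern, int cflags, const char *text)
{
	regex_t re;
	int rc;

	/* REG_NOSUB: only success or failure is wanted here. */
	if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0 ? 1 : 0;
}
```

With this helper, `^world$' should match the second line of a two-line
string only when REG_NEWLINE is included in cflags,
and `.' should stop matching the newline under the same flag.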

.PP

When successful,

.I regcomp

returns 0 and fills in the structure pointed to by

.IR preg .

One member of that structure

(other than

.IR re_endp )

is publicized:

.IR re_nsub ,

of type

.IR size_t ,

contains the number of parenthesized subexpressions within the RE

(except that the value of this member is undefined if the

REG_NOSUB flag was used).

If

.I regcomp

fails, it returns a non-zero error code;

see DIAGNOSTICS.
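.PP
A minimal sketch of checking the result of regcomp and reading re_nsub
might look like the following.
The helper name count_subexpressions and its -1 return convention
are assumptions made for this example.

```c
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: compile an extended RE and return the number
 * of parenthesized subexpressions it contains, or -1 if regcomp()
 * fails.
 */
long
count_subexpressions(const char *pattern)
{
	regex_t re;
	long n;

	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;	/* non-zero code; see DIAGNOSTICS */
	n = (long)re.re_nsub;	/* defined because REG_NOSUB was not used */
	regfree(&re);
	return n;
}
```

The count follows opening parentheses, so nested groups are each counted once.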

.PP

.I Regexec

matches the compiled RE pointed to by

.I preg

against the

.IR string ,

subject to the flags in

.IR eflags ,

and reports results using

.IR nmatch ,

.IR pmatch ,

and the returned value.

The RE must have been compiled by a previous invocation of

.IR regcomp .

The compiled form is not altered during execution of

.IR regexec ,

so a single compiled RE can be used simultaneously by multiple threads.

.PP

By default,

the NUL-terminated string pointed to by

.I string

is considered to be the text of an entire line, minus any terminating

newline.

The

.I eflags

argument is the bitwise OR of zero or more of the following flags:

.IP REG_NOTBOL \w'REG_STARTEND'u+2n

The first character of

the string

is not the beginning of a line, so the `^' anchor should not match before it.

This does not affect the behavior of newlines under REG_NEWLINE.

.IP REG_NOTEOL

The NUL terminating

the string

does not end a line, so the `$' anchor should not match before it.

This does not affect the behavior of newlines under REG_NEWLINE.

.IP REG_STARTEND

The string is considered to start at

\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR

and to have a terminating NUL located at

\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR

(there need not actually be a NUL at that location),

regardless of the value of

.IR nmatch .

See below for the definition of

.IR pmatch

and

.IR nmatch .

This is an extension,

compatible with but not specified by POSIX 1003.2,

and should be used with

caution in software intended to be portable to other systems.

Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL;

REG_STARTEND affects only the location of the string,

not how it is matched.
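.PP
The effect of REG_NOTBOL and REG_NOTEOL can be sketched as follows.
This is an illustration under stated assumptions, not library code:
the helper name match_flags is hypothetical,
and REG_STARTEND is not exercised because it is an extension.

```c
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: returns 1 if `text' matches the extended RE
 * `pattern' when regexec() is given `eflags', 0 if it does not,
 * and -1 if the pattern fails to compile.
 */
int
match_flags(const char *pattern, const char *text, int eflags)
{
	regex_t re;
	int rc;

	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, eflags);
	regfree(&re);
	return rc == 0 ? 1 : 0;
}
```

A caller that feeds a long line to regexec in pieces would pass REG_NOTBOL
for every piece but the first and REG_NOTEOL for every piece but the last,
so that anchors match only at the true ends of the line.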

.PP

See

.ZR

for a discussion of what is matched in situations where an RE or a

portion thereof could match any of several substrings of

.IR string .

.PP

Normally,

.I regexec

returns 0 for success and the non-zero code REG_NOMATCH for failure.

Other non-zero error codes may be returned in exceptional situations;

see DIAGNOSTICS.

.PP

If REG_NOSUB was specified in the compilation of the RE,

or if

.I nmatch

is 0,

.I regexec

ignores the

.I pmatch

argument (but see below for the case where REG_STARTEND is specified).

Otherwise,

.I pmatch

points to an array of

.I nmatch

structures of type

.IR regmatch_t .

Such a structure has at least the members

.I rm_so

and

.IR rm_eo ,

both of type

.I regoff_t

(a signed arithmetic type at least as large as an

.I off_t

and a

.IR ssize_t ),

containing respectively the offset of the first character of a substring

and the offset of the first character after the end of the substring.

Offsets are measured from the beginning of the

.I string

argument given to

.IR regexec .

An empty substring is denoted by equal offsets,

both indicating the character following the empty substring.

.PP

The 0th member of the

.I pmatch

array is filled in to indicate what substring of

.I string

was matched by the entire RE.

Remaining members report what substring was matched by parenthesized

subexpressions within the RE;

member

.I i

reports subexpression

.IR i ,

with subexpressions counted (starting at 1) by the order of their opening

parentheses in the RE, left to right.

Unused entries in the array\(emcorresponding either to subexpressions that

did not participate in the match at all, or to subexpressions that do not

exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both

.I rm_so

and

.I rm_eo

set to \-1.

If a subexpression participated in the match several times,

the reported substring is the last one it matched.

(Note, as an example in particular, that when the RE `(b*)+' matches `bbb',

the parenthesized subexpression matches each of the three `b's and then

an infinite number of empty strings following the last `b',

so the reported substring is one of the empties.)
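.PP
The pmatch reporting described above can be sketched as follows.
The helper name match_offsets, the NMATCH constant,
and the use of long for the copied offsets
are assumptions made for this example only.

```c
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

#define NMATCH 3	/* whole match plus two subexpressions */

/*
 * Hypothetical helper: match `text' against the extended RE `pattern'
 * and, on success, copy the offsets of the whole match and of the
 * first two subexpressions into so[] and eo[].
 * Returns 0 on a match, otherwise the non-zero code from regcomp()
 * or regexec().
 */
int
match_offsets(const char *pattern, const char *text,
    long so[NMATCH], long eo[NMATCH])
{
	regex_t re;
	regmatch_t pm[NMATCH];
	size_t i;
	int rc;

	rc = regcomp(&re, pattern, REG_EXTENDED);
	if (rc != 0)
		return rc;
	rc = regexec(&re, text, NMATCH, pm, 0);
	if (rc == 0) {
		for (i = 0; i < NMATCH; i++) {
			so[i] = (long)pm[i].rm_so;	/* first character */
			eo[i] = (long)pm[i].rm_eo;	/* one past the end */
		}
	}
	regfree(&re);
	return rc;
}
```

For the RE `(a+)|(b+)' and the string `xxbbby',
entry 0 and entry 2 should both report offsets 2 and 5,
while entry 1 should hold \-1 in both members
because the first subexpression did not participate in the match.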

.PP

If REG_STARTEND is specified,

.I pmatch

must point to at least one

.I regmatch_t

(even if

.I nmatch

is 0 or REG_NOSUB was specified),

to hold the input offsets for REG_STARTEND.

Use for output is still entirely controlled by

.IR nmatch ;

if

.I nmatch

is 0 or REG_NOSUB was specified,

the value of

.IR pmatch [0]

will not be changed by a successful

.IR regexec .

.PP

.I Regerror

maps a non-zero

.I errcode

from either

.I regcomp

or

.I regexec

to a human-readable, printable message.

If

.I preg

is non-NULL,

the error code should have arisen from use of

the

.I regex_t

pointed to by

.IR preg ,

and if the error code came from

.IR regcomp ,

it should have been the result from the most recent

.I regcomp

using that

.IR regex_t .

.RI ( Regerror

may be able to supply a more detailed message using information

from the

.IR regex_t .)

.I Regerror

places the NUL-terminated message into the buffer pointed to by

.IR errbuf ,

limiting the length (including the NUL) to at most

.I errbuf_size

bytes.

If the whole message won't fit,

as much of it as will fit before the terminating NUL is supplied.

In any case,

the returned value is the size of buffer needed to hold the whole

message (including terminating NUL).

If

.I errbuf_size

is 0,

.I errbuf

is ignored but the return value is still correct.
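.PP
Because the return value is the size needed for the whole message,
a caller can ask for the size first and then allocate a buffer to fit.
The following is a minimal sketch of that two-call pattern;
the helper name regex_message is hypothetical.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: return a malloc()ed copy of the message for
 * `errcode', or NULL if memory runs out.  The caller frees it.
 */
char *
regex_message(int errcode, const regex_t *preg)
{
	size_t need;
	char *buf;

	/* With errbuf_size 0, errbuf is ignored; only the size comes back. */
	need = regerror(errcode, preg, NULL, 0);
	buf = malloc(need);
	if (buf != NULL)
		(void)regerror(errcode, preg, buf, need);
	return buf;
}
```

A fixed-size buffer also works when truncation is acceptable,
since the message is always NUL-terminated within errbuf_size bytes.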

.PP

If the

.I errcode

given to

.I regerror

is first ORed with REG_ITOA,

the ``message'' that results is the printable name of the error code,

e.g. ``REG_NOMATCH'',

rather than an explanation thereof.

If

.I errcode

is REG_ATOI,

then

.I preg

shall be non-NULL and the

.I re_endp

member of the structure it points to

must point to the printable name of an error code;

in this case, the result in

.I errbuf

is the decimal digits of

the numeric value of the error code

(0 if the name is not recognized).

REG_ITOA and REG_ATOI are intended primarily as debugging facilities;

they are extensions,

compatible with but not specified by POSIX 1003.2,

and should be used with

caution in software intended to be portable to other systems.

Be warned also that they are considered experimental and changes are possible.

.PP

.I Regfree

frees any dynamically-allocated storage associated with the compiled RE

pointed to by

.IR preg .

The remaining

.I regex_t

is no longer a valid compiled RE

and the effect of supplying it to

.I regexec

or

.I regerror

is undefined.

.PP

None of these functions references global variables except for tables

of constants;

all are safe for use from multiple threads if the arguments are safe.

.SH IMPLEMENTATION CHOICES

There are a number of decisions that 1003.2 leaves up to the implementor,

either by explicitly saying ``undefined'' or by virtue of them being

forbidden by the RE grammar.

This implementation treats them as follows.

.PP

See

.ZR

for a discussion of the definition of case-independent matching.

.PP

There is no particular limit on the length of REs,

except insofar as memory is limited.

Memory usage is approximately linear in RE size, and largely insensitive

to RE complexity, except for bounded repetitions.

See BUGS for one short RE using them

that will run almost any system out of memory.

.PP

A backslashed character other than one specifically given a magic meaning

by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs)

is taken as an ordinary character.

.PP

Any unmatched [ is a REG_EBRACK error.

.PP

Equivalence classes cannot begin or end bracket-expression ranges.

The endpoint of one range cannot begin another.

.PP

RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.

.PP

A repetition operator (?, *, +, or bounds) cannot follow another

repetition operator.

A repetition operator cannot begin an expression or subexpression

or follow `^' or `|'.

.PP

`|' cannot appear first or last in a (sub)expression or after another `|',

i.e. an operand of `|' cannot be an empty subexpression.

An empty parenthesized subexpression, `()', is legal and matches an

empty (sub)string.

An empty string is not a legal RE.

.PP

A `{' followed by a digit is considered the beginning of bounds for a

bounded repetition, which must then follow the syntax for bounds.

A `{' \fInot\fR followed by a digit is considered an ordinary character.

.PP

`^' and `$' beginning and ending subexpressions in obsolete (``basic'')

REs are anchors, not ordinary characters.

.SH SEE ALSO

grep(1), regex(7)

.PP

POSIX 1003.2, sections 2.8 (Regular Expression Notation)

and

B.5 (C Binding for Regular Expression Matching).

.SH DIAGNOSTICS

Non-zero error codes from

.I regcomp

and

.I regexec

include the following:

.PP

.nf

.ta \w'REG_ECOLLATE'u+3n

REG_NOMATCH	regexec() failed to match

REG_BADPAT	invalid regular expression

REG_ECOLLATE	invalid collating element

REG_ECTYPE	invalid character class

REG_EESCAPE	\e applied to unescapable character

REG_ESUBREG	invalid backreference number

REG_EBRACK	brackets [ ] not balanced

REG_EPAREN	parentheses ( ) not balanced

REG_EBRACE	braces { } not balanced

REG_BADBR	invalid repetition count(s) in { }

REG_ERANGE	invalid character range in [ ]

REG_ESPACE	ran out of memory

REG_BADRPT	?, *, or + operand invalid

REG_EMPTY	empty (sub)expression

REG_ASSERT	``can't happen''\(emyou found a bug

REG_INVARG	invalid argument, e.g. negative-length string

.fi

.SH HISTORY

Written by Henry Spencer at University of Toronto,

henry@zoo.toronto.edu.

.SH BUGS

This is an alpha release with known defects.

Please report problems.

.PP

There is one known functionality bug.

The implementation of internationalization is incomplete:

the locale is always assumed to be the default one of 1003.2,

and only the collating elements etc. of that locale are available.

.PP

The back-reference code is subtle and doubts linger about its correctness

in complex cases.

.PP

.I Regexec

performance is poor.

This will improve with later releases.

.I Nmatch

exceeding 0 is expensive;

.I nmatch

exceeding 1 is worse.

.I Regexec

is largely insensitive to RE complexity \fIexcept\fR that back

references are massively expensive.

RE length does matter; in particular, there is a strong speed bonus

for keeping RE length under about 30 characters,

with most special characters counting roughly double.

.PP

.I Regcomp

implements bounded repetitions by macro expansion,

which is costly in time and space if counts are large

or bounded repetitions are nested.

An RE like, say,

`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'

will (eventually) run almost any existing machine out of swap space.

.PP

There are suspected problems with response to obscure error conditions.

Notably,

certain kinds of internal overflow,

produced only by truly enormous REs or by multiply nested bounded repetitions,

are probably not handled well.

.PP

Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is

a special character only in the presence of a previous unmatched `('.

This can't be fixed until the spec is fixed.

.PP

The standard's definition of back references is vague.

For example, does

`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?

Until the standard is clarified,

behavior in such cases should not be relied on.

.PP

The implementation of word-boundary matching is a bit of a kludge,

and bugs may lurk in combinations of word-boundary matching and anchoring.
