A structure containing data for a charset+collation pair implementation.

Virtual functions which use this data are collected
into separate structures MY_CHARSET_HANDLER and
MY_COLLATION_HANDLER.

typedef struct charset_info_st
  MY_UNI_IDX *tab_from_uni;
  uint strxfrm_multiply;
  uint16 max_sort_char; /* For LIKE optimization */
  MY_CHARSET_HANDLER *cset;
  MY_COLLATION_HANDLER *coll;

CHARSET_INFO fields description:
================================

number - an ID uniquely identifying this charset+collation pair.

primary_number - ID of a charset+collation pair, which consists
  of the same character set and the default collation of this
  character set. Not really used now. Intended to optimize some
  parts of the code where we need to find the default collation
  using its non-default counterpart for the given character set.

binary_number - ID of a charset+collation pair, which consists
  of the same character set and the binary collation of this
  character set. Not really used now.

csname  - name of the character set for this charset+collation pair.
name    - name of the collation for this charset+collation pair.
comment - a text comment, displayed in the "Description" column of
          SHOW CHARACTER SET output.

ctype - pointer to array[257] of "type of characters"
        bit masks, one for each character, e.g. whether a
        character is a digit, a letter, a separator, etc.

        If you look at the macros, we use ctype[(char)+1].
        In most ctype libraries ctype[0] is traditionally
        reserved for EOF (-1). The idea is that you can use
        the result from fgetc() directly with ctype[]. As
        we have to be compatible with external ctype[] versions,
        it's better to do it the same way as they do.
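
The ctype[(char)+1] convention can be illustrated with a small sketch.
Everything named demo_* or DEMO_* below is hypothetical and is not taken
from m_ctype.h; the flag values are made up, and the sketch assumes that
EOF is -1 on the platform.

```c
#include <stdio.h>   /* EOF */

/* Hypothetical type flags, for illustration only */
#define DEMO_DIGIT 0x01
#define DEMO_ALPHA 0x02

/* 257 slots: slot 0 is reserved for EOF (-1), slots 1..256 hold bytes 0..255 */
static unsigned char demo_ctype[257];

/* Mark ASCII digits and letters; all other slots stay 0 */
static void demo_ctype_init(void)
{
  int c;
  for (c = '0'; c <= '9'; c++) demo_ctype[c + 1] |= DEMO_DIGIT;
  for (c = 'a'; c <= 'z'; c++) demo_ctype[c + 1] |= DEMO_ALPHA;
  for (c = 'A'; c <= 'Z'; c++) demo_ctype[c + 1] |= DEMO_ALPHA;
}

/* c is an unsigned char value or EOF, i.e. exactly what fgetc() returns */
static int demo_isdigit(int c) { return (demo_ctype[c + 1] & DEMO_DIGIT) != 0; }
static int demo_isalpha(int c) { return (demo_ctype[c + 1] & DEMO_ALPHA) != 0; }
```

Because of the +1 shift, demo_isdigit(EOF) reads slot 0 instead of
indexing before the start of the array.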

to_lower   - pointer to array[256] used in LCASE()
to_upper   - pointer to array[256] used in UCASE()
sort_order - pointer to array[256] used for string comparison
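
As a rough illustration of how a 256-entry sort_order table can drive
comparison in an 8bit character set, here is a minimal sketch. The table
contents (ASCII letters folded to upper case) and all demo_* names are
assumptions made for the example, not the actual data of any collation.

```c
#include <stddef.h>

/* Hypothetical weight table: one weight per byte value */
static unsigned char demo_sort_order[256];

/* Identity weights, except 'a'..'z' get the weights of 'A'..'Z' */
static void demo_sort_order_init(void)
{
  int c;
  for (c = 0; c < 256; c++) demo_sort_order[c] = (unsigned char) c;
  for (c = 'a'; c <= 'z'; c++)
    demo_sort_order[c] = (unsigned char) (c - 'a' + 'A');
}

/* Compare two length-delimited byte strings by weight, not by raw byte */
static int demo_compare(const unsigned char *a, size_t alen,
                        const unsigned char *b, size_t blen)
{
  size_t n = alen < blen ? alen : blen;
  size_t i;
  for (i = 0; i < n; i++)
  {
    int wa = demo_sort_order[a[i]];
    int wb = demo_sort_order[b[i]];
    if (wa != wb)
      return wa - wb;
  }
  /* Equal common prefix: the shorter string sorts first */
  return (alen > blen) - (alen < blen);
}
```

With this table "abc" and "ABC" compare as equal, which is the effect a
case-insensitive collation needs.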

Unicode conversion data
-----------------------
For 8bit character sets:

tab_to_uni  : array[256] of charset->Unicode translation
tab_from_uni: a structure for Unicode->charset translation

Non-8bit charsets have their own structures per charset
hidden in the corresponding ctype-xxx.c file and don't use
the tab_to_uni and tab_from_uni tables.
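
The two directions of translation for an 8bit character set might look
like the sketch below. The table is made up (Latin-1 identity with the
euro sign placed at 0xA4, as ISO-8859-15 does), the demo_* names are
hypothetical, and the reverse lookup uses a plain linear scan rather than
the indexed structure that tab_from_uni refers to.

```c
#include <stdint.h>

/* Hypothetical charset->Unicode table, one code point per byte value */
static uint16_t demo_tab_to_uni[256];

static void demo_tab_init(void)
{
  int c;
  for (c = 0; c < 256; c++) demo_tab_to_uni[c] = (uint16_t) c;
  demo_tab_to_uni[0xA4] = 0x20AC;   /* EURO SIGN */
}

/* charset->Unicode: a direct array lookup */
static uint16_t demo_to_uni(unsigned char c)
{
  return demo_tab_to_uni[c];
}

/* Unicode->charset: returns 1 and stores the byte, or 0 if not representable.
   A linear scan stands in for a real index structure here. */
static int demo_from_uni(uint16_t wc, unsigned char *out)
{
  int c;
  for (c = 0; c < 256; c++)
  {
    if (demo_tab_to_uni[c] == wc)
    {
      *out = (unsigned char) c;
      return 1;
    }
  }
  return 0;
}
```

The forward direction is a single array access, while the reverse
direction needs a search or an index, which is why the two fields have
different types.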

These maps are used to quickly identify whether a character is
part of an identifier, a digit, a special character,
or part of another SQL language lexical item.

They can probably be combined with the ctype array in the future.
But for some reason these two arrays are used in the parser,
while the separate ctype[] array is used in other parts of the
code, such as fulltext.

strxfrm_multiply - how many times longer a sort key (i.e. a string
                   which can be passed into memcmp() for comparison)
                   can be than the original string.
                   Usually it is 1. For some complex
                   collations it can be bigger. For example,
                   in latin1_german2_ci a sort key can be up to
                   twice as long as the original string,
                   e.g. the letter 'A' with two dots above is
                   substituted with 'AE'.
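
The role of strxfrm_multiply in sizing a sort key buffer can be sketched
as follows. This is a toy transformation, not the latin1_german2_ci
implementation: it only expands the Latin-1 byte 0xC4 ('A' with two dots
above) to "AE" and folds ASCII letters to upper case. All demo_* and
DEMO_* names are hypothetical.

```c
#include <stddef.h>

/* Assumed worst case: one source byte becomes two key bytes */
#define DEMO_STRXFRM_MULTIPLY 2

/* Buffer size that is always large enough for the sort key */
static size_t demo_sort_key_size(size_t src_len)
{
  return src_len * DEMO_STRXFRM_MULTIPLY;
}

/* Build a sort key from Latin-1 input and return its length.
   dst must hold at least demo_sort_key_size(len) bytes. */
static size_t demo_strnxfrm(unsigned char *dst,
                            const unsigned char *src, size_t len)
{
  size_t i, n = 0;
  for (i = 0; i < len; i++)
  {
    unsigned char c = src[i];
    if (c == 0xC4)                      /* 'A' with two dots above */
    {
      dst[n++] = 'A';
      dst[n++] = 'E';
    }
    else if (c >= 'a' && c <= 'z')      /* fold ASCII to upper case */
      dst[n++] = (unsigned char) (c - 'a' + 'A');
    else
      dst[n++] = c;
  }
  return n;
}
```

Two keys built this way can be compared with memcmp(); the caller only
needs the multiplier to allocate the destination buffer up front.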
mbminlen - minimum multibyte sequence length.
           Now always 1 except ucs2. For ucs2 it is 2.

mbmaxlen - maximum multibyte sequence length.
           1 for 8bit charsets. Can also be 2 or 3.

max_sort_char - for LIKE range:
                in case of 8bit character sets - native code
                of the maximum character (max_str pad byte);
                in case of UTF8 and UCS2 - Unicode code of the maximum
                possible character (usually U+FFFF). This code is
                converted to its multibyte representation (usually 0xEFBFBF)
                and then used as a pad sequence for max_str;
                in case of other multibyte character sets -
                max_str pad byte (usually 0xFF).
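
The conversion of U+FFFF into the 0xEFBFBF pad sequence mentioned above
can be checked with a short sketch. The encoder is generic UTF-8 for code
points up to U+FFFF and does no surrogate checks; the padding helper and
all demo_* names are hypothetical and only show the idea of filling
max_str with whole pad sequences.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode a code point up to U+FFFF as UTF-8; returns bytes written (1..3) */
static size_t demo_utf8_encode(uint16_t wc, unsigned char *out)
{
  if (wc < 0x80)
  {
    out[0] = (unsigned char) wc;
    return 1;
  }
  if (wc < 0x800)
  {
    out[0] = (unsigned char) (0xC0 | (wc >> 6));
    out[1] = (unsigned char) (0x80 | (wc & 0x3F));
    return 2;
  }
  out[0] = (unsigned char) (0xE0 | (wc >> 12));
  out[1] = (unsigned char) (0x80 | ((wc >> 6) & 0x3F));
  out[2] = (unsigned char) (0x80 | (wc & 0x3F));
  return 3;
}

/* Fill buf with as many whole pad sequences as fit; returns bytes written.
   Bytes that cannot hold a whole sequence are left untouched. */
static size_t demo_pad_max_str(unsigned char *buf, size_t buf_len,
                               uint16_t max_sort_char)
{
  unsigned char pad[3];
  size_t pad_len = demo_utf8_encode(max_sort_char, pad);
  size_t n = 0, i;
  while (n + pad_len <= buf_len)
  {
    for (i = 0; i < pad_len; i++)
      buf[n++] = pad[i];
  }
  return n;
}
```

For max_sort_char = 0xFFFF the encoder yields the three bytes 0xEF 0xBF
0xBF, so a 7-byte buffer receives two whole pad sequences.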

MY_CHARSET_HANDLER is a collection of character-set
related routines. It is defined in m_ctype.h and has the
following set of functions:

ismbchar()  - detects if the given string is a multibyte sequence
mbcharlen() - returns length of multibyte sequence starting with
numchars()  - returns number of characters in the given string, e.g.
              in SQL function CHAR_LENGTH().
charpos()   - calculates the offset of the given position in the string.
              Used in SQL functions LEFT(), RIGHT(), SUBSTRING(),
            - finds the length of correctly formed multibyte beginning.
              Used in INSERTs to cut a beginning of the given string
              a) "well formed" according to the given character set.
              b) can fit into the given data type
              Terminates the string in the good position, taking into account
              multibyte character boundaries.

lengthsp()  - returns the length of the given string without trailing spaces.
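
To make the difference between byte offsets and character positions
concrete, here is a minimal sketch of numchars(), charpos() and
lengthsp() style helpers for UTF-8. It assumes well-formed input, the
demo_* names and signatures are hypothetical, and the real handler
functions take different arguments.

```c
#include <stddef.h>

/* Byte length of a UTF-8 sequence judged by its lead byte.
   Assumes well-formed input; anything else counts as 1 byte. */
static size_t demo_seqlen(unsigned char c)
{
  if (c >= 0xF0) return 4;
  if (c >= 0xE0) return 3;
  if (c >= 0xC0) return 2;
  return 1;
}

/* Number of characters in a byte string of the given length */
static size_t demo_numchars(const unsigned char *s, size_t len)
{
  size_t pos = 0, n = 0;
  while (pos < len)
  {
    pos += demo_seqlen(s[pos]);
    n++;
  }
  return n;
}

/* Byte offset of character number char_pos (0-based),
   or len if the string has fewer characters */
static size_t demo_charpos(const unsigned char *s, size_t len, size_t char_pos)
{
  size_t pos = 0;
  while (char_pos > 0 && pos < len)
  {
    pos += demo_seqlen(s[pos]);
    char_pos--;
  }
  return pos < len ? pos : len;
}

/* Length of the string without trailing 0x20 spaces */
static size_t demo_lengthsp(const unsigned char *s, size_t len)
{
  while (len > 0 && s[len - 1] == ' ')
    len--;
  return len;
}
```

For the six bytes 'a', 0xC3, 0xA4, 'z', ' ', ' ' the character count is
5, character number 2 starts at byte offset 3, and the length without
trailing spaces is 4.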

Unicode conversion routines
---------------------------
mb_wc - converts the leftmost multibyte sequence into its Unicode code.
wc_mb - converts the given Unicode code into a multibyte sequence.
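
A pair of conversion routines of this kind could look as follows for a
fixed two-byte big-endian encoding such as UCS-2. The demo_* names, the
argument order and the return convention (bytes processed, or 0 when the
buffer is too short) are assumptions made for this sketch, not the
handler's actual interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the leftmost character of [s, end).
   Returns bytes consumed (2), or 0 if fewer than 2 bytes are available. */
static size_t demo_ucs2_mb_wc(uint16_t *wc,
                              const unsigned char *s,
                              const unsigned char *end)
{
  if (end - s < 2)
    return 0;
  *wc = (uint16_t) ((s[0] << 8) | s[1]);
  return 2;
}

/* Encode one code point into [s, end).
   Returns bytes written (2), or 0 if the buffer is too short. */
static size_t demo_ucs2_wc_mb(uint16_t wc,
                              unsigned char *s,
                              const unsigned char *end)
{
  if (end - s < 2)
    return 0;
  s[0] = (unsigned char) (wc >> 8);
  s[1] = (unsigned char) (wc & 0xFF);
  return 2;
}
```

Converting a string between two character sets then amounts to calling
the source charset's decoder and the target charset's encoder in a loop,
one character at a time.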

Case and sort conversion
------------------------
caseup_str - converts the given 0-terminated string into upper case
casedn_str - converts the given 0-terminated string into lower case
caseup     - converts the given string into upper case using length
casedn     - converts the given string into lower case using length
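
The practical difference between the 0-terminated and the length-based
variants shows up when the data contains an embedded 0 byte. The sketch
below uses a hypothetical ASCII-only upper-case table; the demo_* names
and signatures are illustrative and differ from the real handler
functions.

```c
#include <stddef.h>

/* Hypothetical upper-case mapping table, ASCII letters only */
static unsigned char demo_to_upper[256];

static void demo_to_upper_init(void)
{
  int c;
  for (c = 0; c < 256; c++) demo_to_upper[c] = (unsigned char) c;
  for (c = 'a'; c <= 'z'; c++)
    demo_to_upper[c] = (unsigned char) (c - 'a' + 'A');
}

/* 0-terminated variant: stops at the first 0 byte */
static void demo_caseup_str(char *s)
{
  for (; *s; s++)
    *s = (char) demo_to_upper[(unsigned char) *s];
}

/* Length variant: converts exactly len bytes, embedded 0 bytes included */
static void demo_caseup(char *s, size_t len)
{
  size_t i;
  for (i = 0; i < len; i++)
    s[i] = (char) demo_to_upper[(unsigned char) s[i]];
}
```

On the five bytes "ab\0cd" the 0-terminated variant only converts "ab",
while the length variant converts all five bytes.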

Number-to-string conversion routines
------------------------------------
The names are pretty self-descriptive.

String padding routines
-----------------------
fill() - writes the given Unicode value into the given string
         with the given length. Used to pad the string, usually
         with the space character, according to the given charset.

String-to-number conversion routines
------------------------------------
These functions do almost the same thing as their
STDLIB counterparts, but they also:
  - accept a length instead of a 0-terminator
  - are character set dependent
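
The "length instead of 0-terminator" point can be shown with a small
decimal parser. This is a sketch for ASCII-compatible 8bit data only,
without overflow checks; demo_strntol and its signature are hypothetical
and much simpler than the real routines.

```c
#include <stddef.h>

/* Parse a decimal integer from the first len bytes of s.
   *end_pos receives the index of the first byte that was not consumed.
   No overflow detection: this is an illustration, not a strtol() replacement. */
static long demo_strntol(const char *s, size_t len, size_t *end_pos)
{
  size_t i = 0;
  long value = 0;
  int negative = 0;

  while (i < len && s[i] == ' ')              /* skip leading spaces */
    i++;
  if (i < len && (s[i] == '-' || s[i] == '+'))
  {
    negative = (s[i] == '-');
    i++;
  }
  while (i < len && s[i] >= '0' && s[i] <= '9')
  {
    value = value * 10 + (s[i] - '0');
    i++;
  }
  *end_pos = i;
  return negative ? -value : value;
}
```

Parsing "12345" with a length of 3 yields 123, because the function never
looks past the given length and needs no terminating 0 byte.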

Simple scanner routines
-----------------------
scan() - to skip leading spaces in the given string.
         Used when a string value is inserted into a numeric field.

strnncoll()   - compares two strings according to the given collation
strnncollsp() - like the above but ignores trailing spaces
strnxfrm()    - makes a sort key suitable for memcmp() corresponding
like_range()  - creates a LIKE range, for optimizer
wildcmp()     - wildcard comparison, for LIKE
strcasecmp()  - 0-terminated string comparison
instr()       - finds the first substring occurrence in the string
hash_sort()   - calculates hash value taking into account
                the collation rules, e.g. case-insensitivity,
                accent sensitivity, etc.
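
The idea behind a collation-aware hash is that strings which compare as
equal must also hash to the same value. The sketch below hashes collation
weights instead of raw bytes, using FNV-1a as a stand-in hash function
(not necessarily what the server uses) and a toy case-insensitive ASCII
weight; all demo_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy collation weight: ASCII letters are folded to upper case */
static unsigned char demo_weight(unsigned char c)
{
  return (c >= 'a' && c <= 'z') ? (unsigned char) (c - 'a' + 'A') : c;
}

/* 32-bit FNV-1a computed over the weights rather than the raw bytes */
static uint32_t demo_hash_sort(const unsigned char *s, size_t len)
{
  uint32_t h = 2166136261u;        /* FNV offset basis */
  size_t i;
  for (i = 0; i < len; i++)
  {
    h ^= demo_weight(s[i]);
    h *= 16777619u;                /* FNV prime */
  }
  return h;
}
```

Under this weight function "abc" and "ABC" produce the same hash, so a
hash-based lookup agrees with a case-insensitive comparison.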