This example was written as a test case for a customer who needed
a storage engine without indexes that could compress data very well.
So, welcome to a completely compressed storage engine. This storage
engine only does inserts. No replace, deletes, or updates. All reads are
complete table scans. Compression is done through a combination of packing
and making use of the zlib library.

We keep a file pointer open for each instance of ha_archive for each read
but for writes we keep one open file handle just for that. We flush it
only if a read occurs. azip handles compressing lots of records
the same time since we would want to flush).

A "meta" file is kept alongside the data file. This file serves two purposes.
The first purpose is to track the number of rows in the table. The second
purpose is to determine if the table was closed properly or not. When the
meta file is first opened it is marked as dirty. It is opened when the table
itself is opened for writing. When the table is closed the new count for rows
is written to the meta file and the file is marked as clean. If the meta file
is opened and it is marked as dirty, it is assumed that a crash occurred. At
this point an error occurs and the user is told to rebuild the file.
A rebuild scans the rows and rewrites the meta file. If corruption is found
in the data file then the meta file is not repaired.

At some point a recovery method for such a drastic case needs to be devised.
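
A minimal sketch of that open/close protocol, using an illustrative
MetaFile struct and write_meta_file() helper rather than the engine's real
meta-file routines:

  // Open for writing: refuse a meta file that was never closed cleanly,
  // otherwise persist the dirty mark before any rows are written.
  int meta_open_for_write(MetaFile *meta)
  {
    if (meta->dirty)
      return HA_ERR_CRASHED_ON_USAGE;   /* user must rebuild the table */
    meta->dirty= true;
    write_meta_file(meta);
    return 0;
  }

  // Clean close: record the final row count and clear the dirty mark.
  void meta_close(MetaFile *meta, uint64_t row_count)
  {
    meta->rows= row_count;
    meta->dirty= false;
    write_meta_file(meta);
  }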

Locks are row level, and you will get a consistent read.

For performance, as far as table scans go, it is quite fast. I don't have
good numbers but locally it has outperformed both InnoDB and MyISAM. For
InnoDB the question will be whether the table can fit into the buffer
pool. For MyISAM it's a question of how much the file system caches the
MyISAM file. With enough free memory MyISAM is faster. It's only when the OS
doesn't have enough memory to cache the entire table that Archive turns out
to be faster.

Examples between MyISAM (packed) and Archive.

/* Static declarations for handlerton */
static handler *archive_create_handler(handlerton *hton,
                                       MEM_ROOT *mem_root);

int archive_discover(handlerton *hton, Session* session, const char *db,
                     const char *name,
                     unsigned char **frmblob,
                     size_t *frmlen);

static bool archive_use_aio= false;

We create the shared memory space that we will use for the open table.
No matter what, we try to get or create a share. This is so that a repair
table operation can occur.

See ha_example.cc for a longer description.
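
The overall get-or-create pattern is roughly this (a sketch; the
archive_open_tables hash lookup and ArchiveShare members shown here are
illustrative, not the exact code):

  pthread_mutex_lock(&archive_mutex);            // serialize share creation
  ArchiveShare *share=
    (ArchiveShare*) hash_search(&archive_open_tables,
                                (unsigned char*) table_name,
                                strlen(table_name));
  if (share == NULL)
  {
    share= new ArchiveShare(table_name);         // hypothetical constructor
    my_hash_insert(&archive_open_tables, (unsigned char*) share);
  }
  share->use_count++;                            // one reference per open handler
  pthread_mutex_unlock(&archive_mutex);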

We will use this lock for rows.
pthread_mutex_init(&share->mutex,MY_MUTEX_INIT_FAST);

We read the meta file, but do not mark it dirty. Since we are not
doing a write we won't mark it dirty (and we won't open it for

hash_delete(&archive_open_tables, (unsigned char*) share);
thr_lock_delete(&share->lock);
pthread_mutex_destroy(&share->mutex);

We need to make sure we don't reset the crashed state.
If we open a crashed file, we need to close it as crashed unless
it has been repaired.
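
A sketch of how the close path can preserve that flag (write_meta_file() is
an illustrative helper here, not the engine's actual call):

  if (!--share->use_count)             // last handler referencing this table
  {
    // Keep the crashed mark unless the table has been repaired, so the
    // next open still reports HA_ERR_CRASHED_ON_USAGE.
    write_meta_file(share, share->rows_recorded,
                    share->crashed ? true : false);
    hash_delete(&archive_open_tables, (unsigned char*) share);
    thr_lock_delete(&share->lock);
    pthread_mutex_destroy(&share->mutex);
    delete share;
  }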

int ha_archive::init_archive_writer()

It is expensive to open and close the data files and since you can't have
a gzip file that can be both read and written we keep a writer open
that is shared among all open tables.

if (!(azopen(&(share->archive_write), share->data_file_name,
             O_RDWR, AZ_METHOD_BLOCK)))
  share->crashed= true;
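
Filled out, the writer initialization is essentially a guarded open plus a
"writer already open" flag (archive_write_open is assumed here as the name
of that guard member):

  int ha_archive::init_archive_writer()
  {
    // One shared write handle per table; open it lazily on first use.
    if (!(azopen(&(share->archive_write), share->data_file_name,
                 O_RDWR, AZ_METHOD_BLOCK)))
    {
      share->crashed= true;      // could not open for append: flag for repair
      return 1;
    }
    share->archive_write_open= true;
    return 0;
  }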

No locks are required because it is associated with just one handler instance.

int ha_archive::init_archive_reader()

It is expensive to open and close the data files and since you can't have
a gzip file that can be both read and written we keep a reader open
for each handler instance.

method= AZ_METHOD_BLOCK;
if (!(azopen(&archive, share->data_file_name, O_RDONLY, method)))
  share->crashed= true;
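
The choice of method normally keys off the archive_use_aio flag declared
above (a sketch; the exact policy in the real reader may differ, and the
archive_reader_open guard member is assumed):

  az_method method= archive_use_aio ? AZ_METHOD_AIO : AZ_METHOD_BLOCK;
  if (!(azopen(&archive, share->data_file_name, O_RDONLY, method)))
  {
    share->crashed= true;        // unreadable data file: force a repair
    return 1;
  }
  archive_reader_open= true;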

We create our data file here. The format is pretty simple.
You can read about the format of the data file above.
Unlike other storage engines we do not "pack" our data. Since we
are about to do a general compression, packing would just be a waste of
CPU time. If the table has blobs they are written after the row in the order
of creation.

          MY_REPLACE_EXT | MY_UNPACK_FILENAME);

Here is where we open up the frm and pass it to archive to store.

if ((frm_file= my_open(name_buff, O_RDONLY, MYF(0))) > 0)

if (create_info->comment.str)
  azwrite_comment(&create_stream, create_info->comment.str,
                  (unsigned int)create_info->comment.length);

Yes, you need to do this because the starting value
for the autoincrement may not be zero.

create_stream.auto_increment= stats.auto_increment_value ?
                              stats.auto_increment_value - 1 : 0;
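
Circling back to the frm hand-off a few lines up, a sketch of what that
open-and-store looks like (abbreviated and illustrative, not a verbatim
copy of the engine's create() code):

  if ((frm_file= my_open(name_buff, O_RDONLY, MYF(0))) > 0)
  {
    struct stat file_stat;
    if (!fstat(frm_file, &file_stat))
    {
      unsigned char *frm_ptr= (unsigned char*) malloc((size_t)file_stat.st_size);
      if (frm_ptr)
      {
        my_read(frm_file, frm_ptr, (size_t)file_stat.st_size, MYF(0));
        // Stash the table definition inside the archive header so the table
        // can later be rediscovered from the data file alone.
        azwrite_frm(&create_stream, (char*) frm_ptr,
                    (unsigned int)file_stat.st_size);
        free(frm_ptr);
      }
    }
    my_close(frm_file, MYF(0));
  }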

Look at ha_archive::open() for an explanation of the row format.
Here we just write out the row.

Wondering about start_bulk_insert()? We don't implement it for
Archive since it already optimizes for lots of writes. The only saving
from implementing start_bulk_insert() is that we could skip
setting dirty to true each time.

int ha_archive::write_row(unsigned char *buf)

We don't support decrementing auto_increment. They make the performance
just cry.

if (temp_auto <= share->archive_write.auto_increment &&
    mkey->flags & HA_NOSAME)
  rc= HA_ERR_FOUND_DUPP_KEY;

Bad news, this will cause a search for the unique value which is very
expensive since we will have to do a table scan which will lock up
all other writers during this period. This could perhaps be optimized.
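
Spelled out a little, the duplicate check works roughly like this
(temp_auto is the auto_increment value pulled from the incoming row, mkey
the key definition; the real function does more bookkeeping):

  if (temp_auto <= share->archive_write.auto_increment &&
      (mkey->flags & HA_NOSAME))
  {
    // The value was already handed out on a unique key. With no index the
    // only alternative is a full table scan, so reject the row outright.
    rc= HA_ERR_FOUND_DUPP_KEY;
  }
  else if (temp_auto > share->archive_write.auto_increment)
  {
    // Advance the high-water mark so later generated values keep increasing.
    share->archive_write.auto_increment= temp_auto;
  }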

First we create a buffer that we can use for reading rows, and can pass

This is the method that is used to read a row. It assumes that the row is
positioned where you want it.

int ha_archive::get_row(azio_stream *file_to_read, unsigned char *buf)

if (length > record_buffer->length)
{
  unsigned char *newptr;
  if (!(newptr= (unsigned char*) my_realloc((unsigned char*) record_buffer->buffer,
                                            length,
                                            MYF(MY_ALLOW_ZERO_PTR))))
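
Completed, the buffer-growth helper that fragment belongs to looks roughly
like this (a fix_rec_buff()-style helper; the name and members are assumed):

  int ha_archive::fix_rec_buff(unsigned int length)
  {
    // Grow the scratch buffer only when the incoming row is larger than
    // anything read so far; my_realloc preserves the existing contents.
    if (length > record_buffer->length)
    {
      unsigned char *newptr;
      if (!(newptr= (unsigned char*) my_realloc((unsigned char*) record_buffer->buffer,
                                                length, MYF(MY_ALLOW_ZERO_PTR))))
        return 1;                       // out of memory
      record_buffer->buffer= newptr;
      record_buffer->length= length;
    }
    return 0;
  }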

The table can become fragmented if data was inserted, read, and then
inserted again. What we do is open up the file and recompress it completely.

int ha_archive::optimize(Session *, HA_CHECK_OPT *)

/* Let's create a file to contain the new data */
fn_format(writer_filename, share->table_name, "", ARN,
          MY_REPLACE_EXT | MY_UNPACK_FILENAME);

if (!(azopen(&writer, writer_filename, O_CREAT|O_RDWR, AZ_METHOD_BLOCK)))
  return(HA_ERR_CRASHED_ON_USAGE);

An extended rebuild is a lot more effort. We open up each row and re-record it.
Any dead rows are removed (aka rows that may have been partially recorded).

As of Archive format 3, this is the only type of rebuild that is performed;
before this version it was only done on T_EXTEND.

Now we will rewind the archive file so that we are positioned at the
start of the file.

azflush(&archive, Z_SYNC_FLUSH);
rc= read_data_header(&archive);

On success of writing out the new header, we now fetch each row and
insert it into the new archive file.
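
The copy loop that follows is roughly (a sketch; the real rebuild also
tracks auto_increment and the row count as it copies):

  // Re-read every live row from the old stream and append it to the new
  // writer; a partially written trailing row simply fails get_row() and
  // ends the copy.
  while (!(rc= get_row(&archive, table->record[0])))
  {
    real_write_row(table->record[0], &writer);
    share->rows_recorded++;
  }

  azclose(&writer);
  azclose(&archive);

  // Swap the recompressed file into place of the original data file.
  my_rename(writer_filename, share->data_file_name, MYF(0));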

delayed_insert= false;

if (lock_type != TL_IGNORE && lock.type == TL_UNLOCK)

Here is where we get into the guts of a row level lock.
If we are not doing a LOCK TABLE or DISCARD/IMPORT
TABLESPACE, then allow multiple writers.

if ((lock_type >= TL_WRITE_CONCURRENT_INSERT &&
     lock_type <= TL_WRITE) && !session_in_lock_tables(session)
    && !session_tablespace_op(session))
  lock_type = TL_WRITE_ALLOW_WRITE;

In queries of type INSERT INTO t1 SELECT ... FROM t2 ...
MySQL would use the lock TL_READ_NO_INSERT on t2, and that
would conflict with TL_WRITE_ALLOW_WRITE, blocking all inserts
to t2. Convert the lock to a normal read lock to allow
concurrent inserts to t2.

if (lock_type == TL_READ_NO_INSERT && !session_in_lock_tables(session))
  lock_type = TL_READ;

lock.type=lock_type;
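
Wrapped in its usual frame, the method finishes by handing the lock slot
back to the caller (a sketch of the standard store_lock() shape):

  THR_LOCK_DATA **ha_archive::store_lock(Session *session,
                                         THR_LOCK_DATA **to,
                                         enum thr_lock_type lock_type)
  {
    delayed_insert= false;

    if (lock_type != TL_IGNORE && lock.type == TL_UNLOCK)
    {
      /* the two conversions shown above */
      lock.type= lock_type;
    }

    *to++= &lock;
    return to;
  }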

We cancel a truncate command. The only way to delete an archive table is to drop it.
This is done for security reasons. In a later version we will enable this by
allowing the user to select a different row format.

int ha_archive::delete_all_rows()
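
Because truncate is refused outright, the body is just an error return
(HA_ERR_WRONG_COMMAND is the stock "operation not supported" handler error):

  int ha_archive::delete_all_rows()
  {
    return HA_ERR_WRONG_COMMAND;
  }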

set_session_proc_info(session, old_proc_info);

if ((rc && rc != HA_ERR_END_OF_FILE))
{
  share->crashed= false;
  return(HA_ADMIN_CORRUPT);
}

return(repair(session, &check_opt));

archive_record_buffer *ha_archive::create_record_buffer(unsigned int length)

archive_record_buffer *r;
if (!(r= (archive_record_buffer*) my_malloc(sizeof(archive_record_buffer),