NAME
DBIx::BatchChunker - Run large database changes safely
VERSION
version v1.0.0
SYNOPSIS
    use DBIx::BatchChunker;
    my $account_rs = $schema->resultset('Account')->search({
        account_type => 'deprecated',
    });
    my %params = (
        chunk_size  => 5000,
        target_time => 5,
        rs      => $account_rs,
        id_name => 'account_id',
        coderef => sub { $_[1]->delete },
        sleep   => 1,
        verbose => 1,
        progress_name    => 'Deleting deprecated accounts',
        process_past_max => 1,
    );
    # EITHER:
    # 1) Automatically construct and execute the changes:
    DBIx::BatchChunker->construct_and_execute(%params);
    # OR
    # 2) Manually construct and execute the changes:
    my $batch_chunker = DBIx::BatchChunker->new(%params);
    $batch_chunker->calculate_ranges;
    $batch_chunker->execute;
DESCRIPTION
This utility class is for running a large batch of DB changes in a
manner that doesn't cause huge locks, outages, or missed transactions.
It's highly flexible, to allow for many different kinds of change
operations, and it dynamically adjusts chunks to its workload.
It works by splitting up DB operations into smaller chunks within a
loop. These chunks are transactionalized, either naturally as
single-operation bulk work or by the loop itself. The full range is
calculated beforehand to get the right start/end points. A progress bar
will be created to let the deployer know the processing status.
There are two ways to use this class: call the automatic constructor
and executor ("construct_and_execute") or manually construct the object
and call its methods. See "SYNOPSIS" for examples of both.
DISCLAIMER: You should not rely on this class to magically fix any and
all locking problems the DB might experience just because it's being
used. Thorough testing and best practices are still required.
Processing Modes
This class has several different modes of operation, depending on what
was passed to the constructor:
DBIC Processing
If both "rs" and "coderef" are passed, a chunk ResultSet is built from
the base ResultSet, to add in a BETWEEN clause, and the new ResultSet
is passed into the coderef. The coderef should run some sort of active
ResultSet operation from there.
An "id_name" should be provided, but if it is missing it will be looked
up based on the primary key of the ResultSource.
If "single_rows" is also enabled, then each chunk is wrapped in a
transaction and the coderef is called for each row in the chunk. In
this case, the coderef is passed a Result object instead of the chunk
ResultSet.
Note that whether "single_rows" is enabled or not, the coderef
execution is encapsulated in DBIC's retry logic, so any failures will
re-connect and retry the coderef. Because of this, any changes you make
within the coderef should be idempotent, or should at least be able to
skip over any already-processed rows.
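For example, here is a minimal sketch of DBIC Processing (the $schema,
source, and column names are illustrative, not part of the API):
    my $deprecated_rs = $schema->resultset('Account')->search({
        account_type => 'deprecated',
    });
    DBIx::BatchChunker->construct_and_execute(
        rs      => $deprecated_rs,
        id_name => 'account_id',
        coderef => sub {
            my ($bc, $chunk_rs) = @_;
            # Bulk, idempotent change on the chunk ResultSet
            $chunk_rs->update({ account_type => 'archived' });
        },
    );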
Active DBI Processing
If an "stmt" (DBI statement handle args) is passed without a "coderef",
the statement handle is merely executed on each iteration with the
start and end IDs. It is assumed that the SQL for the statement handle
contains exactly two placeholders for a BETWEEN clause. For example:
    my $update_stmt = q{
        UPDATE
            accounts a
            JOIN account_updates au USING (account_id)
        SET
            a.time_stamp = au.time_stamp
        WHERE
            a.account_id BETWEEN ? AND ? AND
            a.time_stamp != au.time_stamp
    };
The BETWEEN clause should, of course, match the IDs being used in the
loop.
The statement is run with "dbi_connector" for retry protection.
Therefore, the statement should also be idempotent.
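As a rough sketch, the statement above could be wired up like this
(assuming $conn is an already-built DBIx::Connector::Retry object; the
min/max SQL is illustrative):
    DBIx::BatchChunker->construct_and_execute(
        dbi_connector => $conn,
        min_stmt      => 'SELECT MIN(account_id) FROM accounts',
        max_stmt      => 'SELECT MAX(account_id) FROM accounts',
        stmt          => $update_stmt,  # the UPDATE ... BETWEEN ? AND ? statement above
        progress_name => 'Copying timestamps',
    );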
Query DBI Processing
If both a "stmt" and a "coderef" are passed, the statement handle is
prepared and executed. Like the "Active DBI Processing" mode, the SQL
for the statement should contain exactly two placeholders for a BETWEEN
clause. Then the $sth is passed to the coderef. It's up to the coderef
to extract data from the executed statement handle, and do something
with it.
If single_rows is enabled, each chunk is wrapped in a transaction and
the coderef is called for each row in the chunk. In this case, the
coderef is passed a hashref of the row instead of the executed $sth,
with lowercase alias names used as keys.
Note that in both cases, the coderef execution is encapsulated in a
DBIx::Connector::Retry call to either run or txn (using
"dbi_connector"), so any failures will re-connect and retry the
coderef. Because of this, any changes you make within the coderef
should be idempotent, or should at least be able to skip over any
already-processed rows.
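For illustration, a sketch of Query DBI Processing (the table, columns,
and the send_to_audit_log() helper are hypothetical):
    DBIx::BatchChunker->construct_and_execute(
        dbi_connector => $conn,
        min_stmt      => 'SELECT MIN(account_id) FROM accounts',
        max_stmt      => 'SELECT MAX(account_id) FROM accounts',
        stmt          => 'SELECT account_id, email FROM accounts WHERE account_id BETWEEN ? AND ?',
        count_stmt    => 'SELECT COUNT(*) FROM accounts WHERE account_id BETWEEN ? AND ?',
        coderef       => sub {
            my ($bc, $sth) = @_;
            while (my $row = $sth->fetchrow_hashref) {
                send_to_audit_log($row);  # hypothetical; do something with each row
            }
        },
    );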
DIY Processing
If a "coderef" is passed but neither a stmt nor a rs are passed, then
the multiplier loop does not touch the database. The coderef is merely
passed the start and end IDs for each chunk. It is expected that the
coderef will run through all database operations using those start and
end points.
It's still valid to include "min_stmt", "max_stmt", and/or "count_stmt"
in the constructor to enable features like max ID recalculation or
chunk resizing.
If you're not going to include any min/max statements for
"calculate_ranges", you will need to set "min_id" and "max_id"
yourself, either in the constructor or before the "execute" call. Using
"construct_and_execute" is also not an option in this case, as it would
try to call "calculate_ranges" without any way to do so.
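A sketch of DIY Processing with manually supplied endpoints (the ID
range and the archive_id_range() helper are purely illustrative):
    my $batch_chunker = DBIx::BatchChunker->new(
        min_id  => 1,
        max_id  => 4_000_000,
        coderef => sub {
            my ($bc, $start, $end) = @_;
            archive_id_range($start, $end);  # hypothetical external work
        },
    );
    $batch_chunker->execute;  # no calculate_ranges needed, since min/max are set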
TL;DR Version
    $stmt                             = Active DBI Processing
    $stmt + $coderef                  = Query DBI Processing | $bc->$coderef($executed_sth)
    $stmt + $coderef + single_rows=>1 = Query DBI Processing | $bc->$coderef($row_hashref)
    $rs   + $coderef                  = DBIC Processing      | $bc->$coderef($chunk_rs)
    $rs   + $coderef + single_rows=>1 = DBIC Processing      | $bc->$coderef($result)
    $coderef                          = DIY Processing       | $bc->$coderef($start, $end)
ATTRIBUTES
See the "METHODS" section for more in-depth descriptions of these
attributes and their usage.
DBIC Processing Attributes
rs
A DBIx::Class::ResultSet. This is used by all methods as the base
ResultSet onto which the DB changes will be applied. Required for DBIC
processing.
rsc
A DBIx::Class::ResultSetColumn. This is only used to override "rs" for
min/max calculations. Optional.
dbic_retry_opts
A hashref of DBIC retry options. These options control how retry
protection works within DBIC. So far, there are two supported options:
    max_attempts  = Number of times to retry
    retry_handler = Coderef that returns true to continue to retry, or false to
                    re-throw the last exception
The default is to let the DBIC storage engine handle its own
protection, which will retry once if the DB connection was
disconnected. If you specify any options, even a blank hashref,
BatchChunker will fill in a default max_attempts of 10, and an
always-true retry_handler. This is similar to DBIx::Connector::Retry's
defaults.
Under the hood, these are options that are passed to the
as-yet-undocumented DBIx::Class::Storage::BlockRunner. The
retry_handler has access to the same BlockRunner object (passed as its
only argument) and its methods/accessors, such as storage,
failed_attempt_count, and last_exception.
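For example, a sketch that retries up to three times, but only on
deadlocks (the error-message match is an assumption; adjust it for your
RDBMS):
    my $batch_chunker = DBIx::BatchChunker->new(
        rs              => $account_rs,
        coderef         => sub { $_[1]->delete },
        dbic_retry_opts => {
            max_attempts  => 3,
            retry_handler => sub {
                my ($runner) = @_;  # a BlockRunner object
                return $runner->last_exception =~ /deadlock/i;
            },
        },
    );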
DBI Processing Attributes
dbi_connector
A DBIx::Connector::Retry object. Instead of DBI statement handles, this
is the recommended way for BatchChunker to interface with the DBI, as
it handles retries on failures. The connection mode used is whatever
default is set within the object.
Required for DBI Processing, unless "dbic_storage" is specified.
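A sketch of building such a connector (the DSN, credentials, and
attributes are placeholders):
    use DBIx::Connector::Retry;
    my $conn = DBIx::Connector::Retry->new(
        connect_info => [
            'dbi:mysql:database=mydb;host=localhost',
            $username, $password,
            { RaiseError => 1, AutoCommit => 1 },
        ],
    );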
dbic_storage
A DBIC storage object, as an alternative for "dbi_connector". There may
be times when you want to run plain DBI statements, but are still using
DBIC. In these cases, you don't have to create a DBIx::Connector::Retry
object to run those statements.
This uses a BlockRunner object for retry protection, so the options in
"dbic_retry_opts" would apply here.
Required for DBI Processing, unless "dbi_connector" is specified.
min_stmt
max_stmt
SQL statement strings or an arrayref of parameters for
"selectrow_array" in DBI.
When executed, these statements should each return a single value,
either the minimum or maximum ID that will be affected by the DB
changes. These are used by "calculate_ranges". Required if using either
type of DBI Processing.
stmt
A SQL statement string, or an arrayref of parameters for "prepare" in
DBI plus bind values.
If using "Active DBI Processing" (no coderef), this is a do-able
statement (usually DML like INSERT/UPDATE/DELETE). If using "Query DBI
Processing" (with coderef), this is a passive DQL (SELECT) statement.
In either case, the statement should contain BETWEEN placeholders,
which will be executed with the start/end ID points. If there are
already bind placeholders in the arrayref, then make sure the BETWEEN
bind points are last on the list.
Required for DBI Processing.
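For example, a sketch of the arrayref form with an extra bind value
ahead of the BETWEEN binds (the table, column, and value are
illustrative; the middle slot is the \%attr hashref that "prepare"
accepts):
    stmt => [
        'UPDATE accounts SET status = ? WHERE account_id BETWEEN ? AND ?',
        undef,       # \%attr for prepare; none needed here
        'inactive',  # existing bind; the BETWEEN start/end values come last
    ],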
count_stmt
A SELECT COUNT SQL statement string or an arrayref of parameters for
"selectrow_array" in DBI.
Like "stmt", it should contain BETWEEN placeholders. In fact, the SQL
should look exactly like the "stmt" query, except with COUNT(*) instead
of the column list.
Used only for "Query DBI Processing". Optional, but recommended for
chunk resizing.
Progress Bar Attributes
progress_bar
The progress bar used for all methods. This can be specified right
before the method call to override the default used for that method.
Unlike most attributes, this one is read-write, so it can be switched
on-the-fly.
Don't forget to remove or switch to a different progress bar if you
want to use a different one for another method:
    $batch_chunker->progress_bar( $calc_pb );
    $batch_chunker->calculate_ranges;
    $batch_chunker->progress_bar( $loop_pb );
    $batch_chunker->execute;
All of this is optional. If the progress bar isn't specified, the
method will create a default one. If the terminal isn't interactive,
the default Term::ProgressBar will be set to silent to naturally skip
the output.
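Building on the example above, a sketch of constructing those two bars
with Term::ProgressBar (the names and counts here are illustrative
guesses, not BatchChunker defaults):
    use Term::ProgressBar;
    my $calc_pb = Term::ProgressBar->new({ name => 'Calculating ranges', count => 2 });
    my $loop_pb = Term::ProgressBar->new({ name => 'Archiving accounts', count => 100, ETA => 'linear' });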
progress_name
A string used by "execute" to assist in creating a progress bar.
Ignored if "progress_bar" is already specified.
This is the preferred way of customizing the progress bar without
having to create one from scratch.
cldr
A CLDR::Number object. English speakers who use a typical 1,234.56
format would probably want to leave it at the default. Otherwise, you
should provide your own.
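For example, a sketch that swaps in a German-style formatter (the
locale choice is illustrative):
    use CLDR::Number;
    my $batch_chunker = DBIx::BatchChunker->new(
        rs      => $account_rs,
        coderef => sub { $_[1]->delete },
        cldr    => CLDR::Number->new( locale => 'de' ),  # e.g. 1.234,56 formatting
    );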
verbose
Boolean. By default, this is on, which displays timing stats on each
chunk, as well as total numbers. This is still subject to
non-interactivity checks from "progress_bar".
(This was previously defaulted to off, and called debug, prior to
v1.0.0.)
Common Attributes
id_name
The column name used as the iterator in the processing loops. This
should be a primary key or integer-based (indexed) key, tied to the
resultset.
Optional. Used mainly in DBIC processing. If not specified, it will
look up the first primary key column from "rs" and use that.
This can still be specified for other processing modes to use in
progress bars.
coderef
The coderef that will be called either on each chunk or each row,
depending on how "single_rows" is set. The first input is always the
BatchChunker object. The rest vary depending on the processing mode:
    $stmt + $coderef                  = Query DBI Processing | $bc->$coderef($executed_sth)
    $stmt + $coderef + single_rows=>1 = Query DBI Processing | $bc->$coderef($row_hashref)
    $rs   + $coderef                  = DBIC Processing      | $bc->$coderef($chunk_rs)
    $rs   + $coderef + single_rows=>1 = DBIC Processing      | $bc->$coderef($result)
    $coderef                          = DIY Processing       | $bc->$coderef($start, $end)
The loop does not monitor the return values from the coderef.
Required for all processing modes except "Active DBI Processing".
chunk_size
The number of rows to be processed in each loop.
This figure should be sized to keep per-chunk processing time at around
5 seconds. If this is too large, rows may lock for too long. If it's
too small, processing may be unnecessarily slow.
Default is 1 row, which is only appropriate if "target_time" (on by
default) is enabled. This will cause the processing to slowly ramp up
to the target time as BatchChunker gathers more data.
Otherwise, if you're using static chunk sizes with target_time turned
off, figure out the right chunk size with a few test runs and set it
here.
(This was previously defaulted to 1000 rows, prior to v1.0.0.)
target_time
The target runtime (in seconds) that chunk processing should strive to
achieve, not including "sleep". If the chunk processing times are too
high or too low, this will dynamically adjust "chunk_size" to try to
match the target.
BatchChunker will still use the initial chunk_size, and it will need at
least one chunk processed, before it makes adjustments. If the starting
chunk size is grossly inaccurate to the workload, you could end up with
several chunks in the beginning causing long-lasting locks before the
runtime targeting reduces them down to a reasonable size.
(Chunk size reductions are prioritized before increases, so it should
re-size as soon as it finds the problem. But, one bad chunk could be
all it takes to cause an outage.)
Default is 5 seconds. Set this to zero to turn off runtime targeting.
(This was previously defaulted to off prior to v0.92, and set to 15 in
v0.92.)
sleep
The number of seconds to sleep after each chunk. It uses Time::HiRes's
version, so fractional numbers are allowed.
Default is 0.5 seconds, which is fine for most operations. You can
likely get away with zero for smaller operations, but test it out
first. If processing is going to take up a lot of disk I/O, you may
want to consider a higher setting. If the database server spends too
much time on processing, the replicas may have a hard time keeping up
with standard load.
This will increase the overall processing time of the loop, so try to
find a balance between the two.
(This was previously defaulted to 0 seconds, prior to v1.0.0.)
process_past_max
Boolean that controls whether to check past the "max_id" during the
loop. If the loop hits the end point, it will run another maximum ID
check in the DB, and adjust max_id accordingly. If it somehow cannot
run a DB check (no "rs" or "max_stmt" available, for example), the last
chunk will simply end at max_id + chunk_size.
This is useful if the entire table is expected to be processed, and you
don't want to miss any new rows that come up between "calculate_ranges"
and the end of the loop.
Turned off by default.
single_rows
Boolean that controls whether single rows are passed to the "coderef"
or whether the chunk's ResultSet / statement handle is passed.
Since running single-row operations in a DB is painfully slow (compared
to bulk operations), this also controls whether the entire set of
coderef calls is wrapped in a single DB transaction. Transactionalizing
the entire chunk brings the speed, and atomicity, back to what a bulk
operation would be. (Bulk operations are still faster, but you can't do
anything you want in a single DML statement.)
Used only by "DBIC Processing" and "Query DBI Processing".
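For example, a DBIC Processing sketch with single_rows enabled (column
names are illustrative); each chunk still runs inside a single
transaction, while the coderef sees one Result at a time:
    my $batch_chunker = DBIx::BatchChunker->new(
        rs          => $account_rs,
        id_name     => 'account_id',
        single_rows => 1,
        coderef     => sub {
            my ($bc, $result) = @_;
            # Idempotent per-row change: skip rows that were already converted
            return if $result->account_type eq 'archived';
            $result->update({ account_type => 'archived' });
        },
    );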
min_chunk_percent
The minimum row count, as a percentage of "chunk_size". This value is
actually expressed in decimal form, i.e.: between 0 and 1.
This value will be used to determine when to process, skip, or expand a
block, based on a count query. The default is 0.5 or 50%, which means
that it will try to expand the block to a larger size if the row count
is less than 50% of the chunk size. Zero-sized blocks will be skipped
entirely.
This "chunk resizing" is useful for large regions of the table that
have been deleted, or when the incrementing ID has large gaps in it for
other reasons. Wasting time on numerical gaps that span millions can
slow down the processing considerably, especially if "sleep" is
enabled.
If this needs to be disabled, set this to 0. The maximum chunk
percentage does not have a setting and is hard-coded at 100% +
min_chunk_percent.
If DBIC processing isn't used, "count_stmt" is also required to enable
chunk resizing.
min_id
max_id
Used by "execute" to figure out the main start and end points.
Calculated by "calculate_ranges".
Manually setting these is not recommended, as each database is
different and the information may have changed between the DB change
development and deployment. Instead, use "calculate_ranges" to fill in
these values right before running the loop.
loop_state
A DBIx::BatchChunker::LoopState object designed to hold variables
during the processing loop. The object will be cleared out after use.
Most of the complexity is needed for chunk resizing.
CONSTRUCTORS
See "ATTRIBUTES" for information on what can be passed into these
constructors.
new
    my $batch_chunker = DBIx::BatchChunker->new(...);
A standard object constructor. If you use this constructor, you will
need to manually call "calculate_ranges" and "execute" to execute the
DB changes.
construct_and_execute
    my $batch_chunker = DBIx::BatchChunker->construct_and_execute(...);
Constructs a DBIx::BatchChunker object and automatically calls
"calculate_ranges" and "execute" on it. Anything passed to this method
will be passed through to the constructor.
Returns the constructed object, post-execution. This is typically only
useful if you want to inspect the attributes after the process has
finished. Otherwise, it's safe to just ignore the return and throw away
the object immediately.
METHODS
calculate_ranges
    my $batch_chunker = DBIx::BatchChunker->new(
        rsc => $account_rsc,  # a ResultSetColumn
        ### OR ###
        rs      => $account_rs,   # a ResultSet
        id_name => 'account_id',  # can be looked up if not provided
        ### OR ###
        dbi_connector => $conn,      # DBIx::Connector::Retry object
        min_stmt      => $min_stmt,  # a SQL statement or DBI $sth args
        max_stmt      => $max_stmt,  # ditto
        ### Optional but recommended ###
        id_name    => 'account_id',  # will also be added into the progress bar title
        chunk_size => 20_000,        # default is 1
        ### Optional ###
        progress_bar => $progress,   # defaults to a 2-count 'Calculating ranges' bar
        # ...other attributes for execute...
    );
    my $has_data_to_process = $batch_chunker->calculate_ranges;
Given a DBIx::Class::ResultSetColumn, DBIx::Class::ResultSet, or DBI
statement argument set, this method calculates the min/max IDs of those
objects. It fills in the "min_id" and "max_id" attributes, based on the
ID data, and then returns 1.
If either of the min/max statements doesn't return any ID data, this
method will return 0.
execute
    my $batch_chunker = DBIx::BatchChunker->new(
        # ...other attributes for calculate_ranges...
        dbi_connector => $conn,     # DBIx::Connector::Retry object
        stmt          => $do_stmt,  # INSERT/UPDATE/DELETE $stmt with BETWEEN placeholders
        ### OR ###
        dbi_connector => $conn,         # DBIx::Connector::Retry object
        stmt          => $select_stmt,  # SELECT $stmt with BETWEEN placeholders
        count_stmt    => $count_stmt,   # SELECT COUNT $stmt to be used for min_chunk_percent; optional
        coderef       => $coderef,      # called code that does the actual work
        ### OR ###
        rs      => $account_rs,   # base ResultSet, which gets filtered with -between later on
        id_name => 'account_id',  # can be looked up if not provided
        coderef => $coderef,      # called code that does the actual work
        ### OR ###
        coderef => $coderef,  # DIY database work; just pass the $start/$end IDs
        ### Optional but recommended ###
        sleep             => 0.25,  # number of seconds to sleep each chunk; defaults to 0.5
        process_past_max  => 1,     # use this if processing the whole table
        single_rows       => 1,     # does $coderef get a single $row or the whole $chunk_rs / $stmt
        min_chunk_percent => 0.25,  # minimum row count of chunk size percentage; defaults to 0.5 (or 50%)
        target_time       => 5,     # target runtime for dynamic chunk size scaling; default is 5 seconds
        progress_name     => 'Updating Accounts',  # easier than creating your own progress_bar
        ### Optional ###
        progress_bar => $progress,  # defaults to "Processing $source_name" bar
        verbose      => 1,          # displays timing stats on each chunk
    );
    $batch_chunker->execute if $batch_chunker->calculate_ranges;
Applies the configured DB changes in chunks. Runs through the loop,
processing a statement handle, ResultSet, and/or coderef as it goes.
Each loop iteration processes a chunk of work, determined by
"chunk_size".
The "calculate_ranges" method should be run first to fill in "min_id"
and "max_id". If either of these are missing, the function will assume
"calculate_ranges" couldn't find them and warn about it.
More details can be found in the "Processing Modes" and "ATTRIBUTES"
sections.
PRIVATE METHODS
_process_block
Runs the DB work and passes it to the coderef. Its return value
determines whether the block should be processed or not.
_process_past_max_checker
Checks to make sure the current endpoint is actually the end, by
checking the database. Its return value determines whether the block
should be processed or not.
See "process_past_max".
_chunk_count_checker
Checks the chunk count to make sure it's properly sized. If not, it
will try to shrink or expand the current chunk (in chunk_size
increments) as necessary. Its return value determines whether the block
should be processed or not.
See "min_chunk_percent".
This is not to be confused with the "_runtime_checker", which adjusts
chunk_size after processing, based on previous run times.
_runtime_checker
Stores the previously processed chunk's runtime, and then adjusts
chunk_size as necessary.
See "target_time".
_increment_progress
Increments the progress bar.
_print_chunk_status
Prints out a standard chunk status line, if "verbose" is enabled. What
it prints is generally uniform, but it depends on the processing
action. Most of the data is pulled from "loop_state".
CAVEATS
Big Number Support
If the module detects that the ID numbers are no longer safe for
standard Perl NV storage, it will automatically switch to using
Math::BigInt and Math::BigFloat for big number support. If any blessed
numbers are already being used to define the attributes, this will also
switch on the support.
String-based IDs
If you're working with VARCHAR types or other string-based IDs to
represent integers, these may be subject to whatever string-based
comparison rules your RDBMS uses when calculating MIN/MAX or evaluating
BETWEEN. Row counting and chunk size scaling will try to compensate,
but they will be mixing string-based comparisons from the RDBMS with
Perl-based integer math.
Using the CAST function may help, but it may also cause critical
indexes to be ignored, especially if the function is used on the
left-hand side against the column. Strings with the exact same length
may be safe from comparison weirdness, but YMMV.
Non-integer inputs from ID columns, such as GUIDs or other alphanumeric
strings, are not currently supported. They would have to be converted
to integers via SQL, and doing so may run into a similar risk of having
your RDBMS ignore indexes.
SEE ALSO
DBIx::BulkLoader::Mysql, DBIx::Class::BatchUpdate, DBIx::BulkUtil
AUTHOR
Grant Street Group <developers@grantstreet.com>
COPYRIGHT AND LICENSE
This software is Copyright (c) 2018 - 2022 by Grant Street Group.
This is free software, licensed under:
The Artistic License 2.0 (GPL Compatible)