After a MemSQL cluster or node restart, during the recovery stage in which transactions are replayed from the on-disk logs, recovery on one or more leaf nodes may fail, taking them offline with out-of-memory errors. Memory usage during recovery also appears significantly higher than expected given the size of the data.
In the tracelog from one of the failed leaf nodes, a progression of errors similar to the following is seen:
ERROR: Replaying logs/sample_0_log_0: Nonfatal buffer manager memory allocation failure. The maximum_memory parameter (30159 MB) has been reached.
ERROR: Replaying logs/sample_0_log_0: Replay log record: Failed to delete primary key (2)
ERROR: Replaying logs/sample_0_log_0: DELETE log record for table 'sample' failed to replay (offset 145076456, data [f726fa0100000000], alpha [˜&˙.....], version 700)
ERROR: Log file 0 of database sample_0 caused a replay failure at position 145076456. Unable to recover.
ERROR: Recovery of database [sample_0] failed.
During replay, large DELETE or UPDATE transactions in the transaction logs may be executed in parallel, resulting in significantly higher memory usage than the same data requires at steady state. If table memory usage is already close to the maximum_table_memory limit, this usage can spike to the point where the leaf runs out of memory, recovery fails, and the node goes offline.
To mitigate this issue, lower maximum_table_memory to 75% of maximum_memory. The lower maximum_table_memory is relative to maximum_memory (while still accommodating the required table size), the better: the extra headroom allows memory to be freed earlier during replay, preventing out-of-memory failures.
To make this change, run the following on the master aggregator:

memsql-ops memsql-update-config --key maximum_table_memory --value <new_value> --all

This updates the maximum_table_memory parameter to the specified value on every node in the cluster.
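As a worked example, taking the maximum_memory value of 30159 MB reported in the tracelog above, the 75% target can be computed as follows. The variable names are illustrative, and the memsql-ops invocation is shown commented out; check which unit suffix your deployment expects for the value before applying it.

```shell
# Compute 75% of maximum_memory as the new maximum_table_memory target.
# 30159 MB is the maximum_memory value reported in the tracelog above.
MAX_MEMORY_MB=30159
NEW_TABLE_MEMORY_MB=$(( MAX_MEMORY_MB * 75 / 100 ))
echo "$NEW_TABLE_MEMORY_MB"

# Then apply it on the master aggregator (commented out here; confirm units first):
# memsql-ops memsql-update-config --key maximum_table_memory --value "$NEW_TABLE_MEMORY_MB" --all
```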
Periodic snapshotting of the database (via the SNAPSHOT DATABASE command) will also prevent this issue from occurring: a snapshot captures the current state of the data, so on recovery the database is restored from the snapshot rather than by replaying each individual UPDATE and DELETE transaction from the logs.
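A snapshot can be taken manually from any SQL client connected to the master aggregator. A minimal sketch, assuming the database is named sample; the host, port, and credentials in the commented-out invocation are placeholders:

```shell
# Build the SNAPSHOT DATABASE statement for the target database.
DB_NAME="sample"
SQL="SNAPSHOT DATABASE ${DB_NAME};"
echo "$SQL"

# Run it against the master aggregator (commented out; adjust host, port, and user):
# mysql -h 127.0.0.1 -P 3306 -u root -p -e "$SQL"
```

Running this on a schedule (e.g. from cron) keeps the transaction logs short, so there is less to replay after a restart.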