There are two main data corruption cases that can occur in the master DB.
- Data
corruption due to the modification of the data stored in disk, not due
to SQL queries. For example, data mutations due to disk malfunctions,
filesystem errors, etc.
- Data corruption due to modifications
of the DB caused by an unwanted or incorrect SQL statements. For
example, a bug in our application might "destroy/corrupt" the contents
of our DB.
Verification: how do I know my system is in this situation?
If
you think that your DB is corrupted you probably have seen something
strange that you understand. If it's caused by disk or filesystem corruption,
you'll probably see error messages in the instance log...(i.e.,
syslogrelated or even mysql-server related). If the corruption is due
to case 2, you'll probably see that there is data missing, or perhaps
application errors complaining about missing relationships, records not
found or failed transactions.
Case B - Data on both master and slave DB is corrupted
It
is possible that the data corruption happens at the database contents level
rather than being due to errors at a lower level (i.e, disk,
filesystem...). This is commonly due to errors or bugs in the
application, which might cause things like inserting incorrect records
into the DB, deleting valuable ones or disconnecting table
relationships. Since these changes are valid from an SQL point of view,
they will be propagated to the slave instances immediately, therefore
both master and slave databases will most likely be corrupted in exactly the same
way.
The way to recover from this situation is to restore the
latest backup taken before such corruption took place. The recovery
mechanism for this case is somehow similar to the situation where there are
no running DB instances with the exception that we have to be more
careful about which backup file to restore from...since restoring from
the latest backup might not be a good solution if the backup was taken
after the corruption already existed.
Required actions: What recipes should I apply to solve the situation?
Overall Strategy: Existing database instances should be discarded and a new master-slave cluster should
be started. Basically we're disabling the current MySQL instances and starting from scratch.
- Disable automatic backups from existing databases. See Disable or enable continuous backups.
- Disable
existing databases to prevent additional application errors. Use an SSH console and
login to the instance(s) and execute the "service mysqld stop" action from the command
prompt.
- Launch a new master DB using the Bootstrap MySQL server template. See Launch a new master DB.
- Check out the restored database, and iterate (restoring older backups) until backup is not corrupted. See Restore a previous DB backup.
- If necessary, re-point or restart your application to use the new master (same dns name but different IP).
- Launch a brand new slave DB using the Additional MySQL server template. See Launch a new slave DB.
- Once the instance is operational, initialize the slave DB from master DB.
- Analyze old master/slave databases to determine the root cause of failure and then terminate the instance(s).
NOTE:
To locate the latest non-corrupted backup file, you must choose from
the available binary backup files in your S3 directory. The bucket and path
for these are specified by the BACKUPFILE_PREFIX and BACKUP_S3_BUCKET
input variables in our MySQL templates (viewable by locating either
master or slave MySQL templates, and then clicking on the 'show' link
next to input variables). It's important to know when the corruption occurred in order to ensure that you are restoring the most recent non-corrupted backup.