MySQL replication is great, and kind of reliable, but sometimes it could be messed up. The good news is that we can handle this.
Let's see how replication happens when everything is fine!
A transaction gets committed on the master database, and after it was written to the MySQL transaction log file, it will be written to the binary log too. The binary log contains two types of records: if the replication is statement based, it will contain the SQL statements; if the replication is row based, it will contain the changed rows. Both have pros and cons, so there is a mixed mode too. The replication behaviour can be switched at session level.
If the replication is statement based, then we don't have to deal with huge replication traffic, but we have to count with two factors: first, the the CPU load of the replica will be increased, because it has to run the queries one by one. Second, there could be differences in the query results. For example, when a table is getting heavily written, the replica database can be three seconds behind the master database. Now, we want to insert the result of a SYSDATE() function to the database. If we use statement-based replication, there will be two different timestamps on master and on replica! (Note: the NOW() function is aware for this, so that wouldn't be a good example). So, statement based replication is not really safe on data consistency.
Row based replication is a different thing. All of your writes appear as raw data, which is inserted to the same place on all the replicas. Row based replication also has its caveat: if you run a query that returns a lots of data, then this result set will be replicated to the replicas row-by-row. For example: you want to delete all the data before a given timestamp. This is one replication event in the statement based binary log, but it could be millions of events in a row based log.
In mixed mode, the database server tries to determine which method is the optimal for a given query.
There might be some cases where the data change on master isn't written to the binary log, for example if we switch the current statement as 'off binary log':
mysql> set sql_log_bin = 0; mysql> [... some commands ...] mysql> set sql_log_bin = 1;
And there is one last scenario when the data differs between replica and master: when we hit a bug. Recently we hit a MySQL bug which caused some records not appearing on the replicas. It turned out that when a binary log event's size was exactly 32kb, then this event was not written to the binary log. (This bug was fixed in 5.6.20.)
This behaviour can lead to some issues, like data not appearing on the site. Even if this is not too big, it could break replication: If you want to update a record which does not exist on the replica, the SQL thread will stop there, and throw an error - because there is no record to update. And broken replication means there will be no changes on the site, no new posts or comments appearing, etc.
As you see, we can't say for sure if there are no differences between the master database and its replicas, so we have to do something to minimize the impact of errors like this.
In the percona toolkit we have two utilities that can do this for us:
The pt-table-checksum does what its name suggests: it will do a table checksum on the database master and its replicas, and write the results to another database table. It splits the table to small chunks that can be checksummed in a short period of time (the default is 0.5 sec), and writes the checksum data to a table as master_crc and slave_crc. There comes the trick! On the replica, the data written to the slave_crc will be calculated on the slave itself (remember statement-based replication?), so if there are any differences between slave and master, they will appear here.
If there are any differences found, we can use the pt-table-sync utility to fix missing records on the slaves. It checks the pt-table-checksum result table to identify which rows need to be synced, and it fixes them by issuing the needed commands on master; or it can print the queries to the standard output which need to be run manually.
At Kinja, I've created a Jenkins job that triggers a script on the master database. This does the table checksum operation, and if it finds any differences, it will print out how they could be fixed. All the squirrels are happy.
There are some minor issues which we have to be aware of:
During the checksumming operation the replication lag is slightly increased: the normal 0.2 - 0.4 second lag is increased to 0.7 - 1,4 seconds. (This operation right now lasts for about 4,5 hours.)
The pt-table-checksum can only check the first level slaves, so if there are any differences found on db-master.xyz, then we have to checksum the secondary datacenter as well.
As I said, it checks the database in ~0.5 second chunks, so it has to estimate which chunk size will be optimal for this. When it reaches the kinja.post table it estimates the first few chunks wrong, and they need more time to finish. That means that in the 4,5-hour long operation we have one ~120 seconds replication lag spike.
(Eszpee was so kind and helped me fixing my errors in grammar.)