FriendlyFixer here. It seems we talk about databases a lot, right? I promise this one gets technical as well, but I will refrain from any outright jokes. As always, there is a MUST READ section followed by a MAYBE READ section.
- RFB uses two different backend storage technologies, MySQL and Redis
- For security reasons we were required to upgrade MySQL to version 8 (from 5.7)
- Extensive testing was performed during upgrade and before releasing live
- At this time RFB started experiencing major game issues
- Marches getting stuck
- Heroes stuck in deployment
- Players being annoyed
- Extreme lag during Events (such as BM)
- Players locked out of game
- Assuming it was related to the upgrade we checked several things
- ‘diffs’ (checking for differences using a tool) against the configurations for 5.7 and 8
- Documentation on 5.7 vs 8 changes
- Profiling various parts of the game, specifically the ‘slow queries’
- Surprise! None of these did anything
- The ‘diffs’ showed no configuration changes
- The documents suggested everything was better in MySQL 8 land
- Profiling showed that some queries were slow
- Like, REALLY slow. Tens of seconds up to a minute
- That is an eternity in computer land
- We did more profiling with log outputs that broke down the parts of the slow queries
- We made changes to the code to reduce the reliance on a form of ‘locking’ behaviour the game was doing
At this point we were getting some rather urgent messages from players asking if we knew the game was broken. Some of our engineering effort had to shift to creating a method to ‘unstick’ players who were unable to play. Thankfully, that didn’t take too long and we got back to the ‘adventures in DB land: MySQL 8 edition’.
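For the curious, the profiling-with-log-outputs step can be approximated with a simple timing wrapper around database calls. This is a hypothetical sketch of the idea, not RFB's actual code; the function names and the one-second threshold are my own:

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("profile")

SLOW_THRESHOLD_S = 1.0  # anything over a second is an eternity in computer land

def profiled(fn):
    """Log a warning whenever the wrapped call exceeds the slow threshold."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed = time.monotonic() - start
            if elapsed >= SLOW_THRESHOLD_S:
                log.warning("%s took %.2fs", fn.__name__, elapsed)
    return wrapper

@profiled
def fetch_player(player_id):
    # Stand-in for a real database query.
    time.sleep(0.01)
    return {"id": player_id}
```

Wrapping each suspect call like this is what lets you spot the tens-of-seconds outliers in the logs without changing the behaviour of the calls themselves.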
MUST READ (The SQL):
- We found a configuration difference
- A property had been renamed between MySQL 5.7 and 8
- Since our config still used the old name, MySQL 8 silently ignored it and the property fell back to its default value
- This default value was exactly the wrong one for RFB’s code
- Everyone rejoiced and the game was saved
- Then Blood Moon happened and the event was still broken
- More profiling
- More configuration checking
- More code changes
- More annoyed players
- Finally we discovered something useful in the profiling
- One of our engineers added the amount of time Redis was taking to the profile logs
- Redis was taking up to 100x as long to return as it did before
- This would only happen at random and only on the live servers during Blood Moon
- When it happened, all subsequent calls to Redis were delayed
- This delay would then cascade and cause more delays, until load lowered to a point the server could catch up
- We discovered a clue!
- A change had been made to read cached values for some points events instead of going to the database for them
- The mechanism for serving the cache… is Redis
- Redis servers had hit a network bandwidth limit
- The solution:
- Change the code to be less punishing on the cache and only read from it when it was needed
- Update the bandwidth limits on the Redis servers
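The "less punishing on the cache" fix boils down to not hitting Redis on every single read. A minimal sketch of the idea, with an in-memory dict standing in for Redis (all names here are hypothetical, not RFB's code):

```python
import time

class PointsCache:
    """Read-through cache that only refreshes from the backing store
    after the local copy goes stale, instead of on every single read."""

    def __init__(self, backend_read, ttl_s=5.0):
        self._backend_read = backend_read  # e.g. a Redis GET under the hood
        self._ttl_s = ttl_s
        self._local = {}  # key -> (value, fetched_at)

    def get(self, key):
        entry = self._local.get(key)
        if entry is not None:
            value, fetched_at = entry
            if time.monotonic() - fetched_at < self._ttl_s:
                return value  # fresh enough: no network round-trip at all
        value = self._backend_read(key)
        self._local[key] = (value, time.monotonic())
        return value

# A fake "Redis" that counts how often it is actually hit.
calls = {"n": 0}
def fake_redis_get(key):
    calls["n"] += 1
    return f"points-for-{key}"

cache = PointsCache(fake_redis_get, ttl_s=60.0)
for _ in range(1000):
    cache.get("player:42")
# 1000 reads, but only one trip across the network to the backend
```

A thousand reads become one network round-trip, which is exactly the kind of change that keeps a cache server under its bandwidth limit during an event spike.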
I know this has been a long one with many ups and downs to follow. I apologize for the disruption to the game and would like to emphasize how proud I am of the development team on RFB. They stuck with the problem right through to the end despite many obstacles. I’m also very happy to say that your tenacity as players has kept us caring about the game as well. Many times people worked well past reasonable hours investigating MANY dead ends trying to find a clue of where the actual problem was. They only did that because they wanted a game that worked and that players could enjoy.
Now for the rest of the technical details. Feel free to MAYBE READ this part:
I hit a lot of the major points in the above section and it reads like a timeline of the events and our discoveries. What I left out was some of the details of WHY it was such a big deal that one (or two) configuration values were different/default. That requires an understanding of some much more technical concepts.
First, RFB is what is known as a ‘highly opinionated tech stack’. That isn’t bad, just important to know. It means that it has VERY specific ways of doing things and they DO work. They just require that you don’t go around and break the WAY they work.
A real world example of this would be Formula One racing. This is HIGHLY opinionated on what makes a good race car. It needs to be light, super low profile with absurd down forces. It also needs to be able to be taken apart in seconds, not minutes or hours. If you started adding smoke machines and neon lights to this car you might make it look neat, but you would be breaking the fundamental rules that make it a good Formula One car.
Well, RFB uses a part of MySQL known as SELECT ... FOR UPDATE. As in, a select statement that grabs data from the database, but with an added ‘FOR UPDATE’ on the end. What this does is LOCK the row in the database so that if anyone else tries to grab that data they will be blocked until the first person releases the lock. This whole lock timing bit is done inside a transaction.
Now, you can lock data ‘for update’ so that your code can do things with it and then write out the new values. That ensures that anyone reading the values will always get the latest ones and you can ignore a lot of fiddly timing bits for concurrent code. This is quite important in RFB since it is a multithreaded application where many different execution contexts can be running at once and they all need to interact with the database in some form. That’s fine and all but why does it matter in this case?
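In SQL terms, the pattern looks roughly like this (the table and column names are invented for illustration, not RFB's schema):

```sql
START TRANSACTION;

-- Grab the row AND take a row lock on it; any other transaction that
-- tries to SELECT ... FOR UPDATE (or UPDATE) this row will block here.
SELECT gold, gems
  FROM player_resources
 WHERE player_id = 42
   FOR UPDATE;

-- ... game logic decides the new values ...

UPDATE player_resources
   SET gold = gold + 100
 WHERE player_id = 42;

COMMIT;  -- the row lock is released here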
Well, there are other reasons you might use a lock. Imagine for a minute that you have a thread of execution used to log a player in. This does a lot of work. There are updates to process, rewards to grant, mail to fetch… you get it. A lot. In some cases it is just reading data while in others it is writing out data. The issue is that login can happen at the same time as a few other operations. Like returning from an event kingdom, or being awarded a prize, or any number of other things. Some of these use the same data that might be updated.
Now you could spend a lot of time writing code that is thread safe and data aware and never gets the timing wrong. Or, you can lock the whole set of data while you do your work, because in computer land it will take less than a second and others can wait. This is where the opinionated nature of RFB comes in. It IS thread safe, because it locks access to the DB for a player using this FOR UPDATE behaviour. The game does this by locking a completely unrelated table whose only purpose is to be a way to identify a single player.
It’s a bit like having all your data in a warehouse and a single shared key to that warehouse. You can’t get in if someone else is holding the key. Now, this is where it gets important. The config I mentioned? It was tx_isolation, which was removed in MySQL 8 and replaced with transaction_isolation. Since our config still set the old name, the default of REPEATABLE-READ was used instead of READ-COMMITTED. RFB relies on this value being correct in much the same way a Formula One car needs downforce. It can work without it, but it is much more unstable. Add to that one more configuration value, table_open_cache, and now we had a reason why MySQL was running into so much trouble.
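In my.cnf terms, the fix is essentially renaming the key so the setting takes effect again. The values below illustrate the shape of the change only, they are not our exact config:

```ini
[mysqld]
# MySQL 5.7 spelling (silently ignored by MySQL 8):
# tx_isolation = READ-COMMITTED

# MySQL 8 spelling:
transaction_isolation = READ-COMMITTED
table_open_cache      = 4000
```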
Combined, these two changes made it much more likely that a waiting thread would stall and fail to acquire the lock. That would cause the transaction to fail and all sorts of parts of the game to behave unexpectedly.
This has already gotten really long so I’ll leave it there. The issues with Redis I can describe in a follow on post if there is interest.
I’d like to wrap this up with a simple statement. Thank you for sticking with us.