This blog post is intended to give all of you my take on the last few days. Since it's been such an exciting time for us here, I'd like to give my own perspective on it. This is a pretty tech-heavy post, so be warned.
Last Thursday I travelled to Paris to give a couple of interviews to the mainstream French gaming media about Perpetuum. All went well, and I'd like to seize this opportunity to send a big shout-out to all the participants for their kindness and professionalism.
Saturday morning, while roaming the streets of Paris, I started receiving messages and phone calls from the dev team at home about a nice amount of new registrations. As time went on the number just went up and up. Having no idea what was going on, we were simply happy about it and tried to craft theories around it. Later that day I met Guillaume - who is one of our players - and he explained everything about the current situation in the MMO landscape. (Thanks Guillaume for the hospitality, I owe you one!) In the meantime, I was being spammed with messages that something weird was going on with the relay server, causing insane lag. You can imagine the number of calls: even my mobile's battery ran out!
Sunday night I got back to Budapest, went home, took a shower and immediately headed to the office to check on the situation and fix the server problem. In the office I found out that Calvin had come to pick me up at the airport, but since my phone had died on me earlier, we managed to miss each other. From this point on I completely lost track of days and hours, barely sleeping and focusing only on the server issue.
We checked everything: we optimized the SQL, implemented new caches on the server and tried many other things, but these were minor problems compared to the evil seed that caused the real one. The main difficulty was that it's very hard to generate load on the dev server similar to the load on the live one. For the tech guys out there: the transaction coordinator (MSDTC) was NOT the problem. It causes an insignificant load, so we are fine with that.
We had to put in two nights in a row. After that we simply had to rest, but it was made hard by the stress and the guilt of leaving the gimped server alone, not knowing what would await us when we woke up. Things were looking grim.
When we realized the seriousness of the problem, I contacted one of the most knowledgeable people I know in this field and asked him to help out, as a fresh mind always has a better chance of finding hard bugs. Our deepest respect, Soci! (Shameless promotion: http://soci.hu/). Luckily, he had time to check out our architecture, and together we were able to start an investigation session on Wednesday night. He gave us several suggestions, and pointed out and helped optimize several things in the database layer. Then we moved on to inspect the relay server's code. Since the source is huge, the quickest and most realistic method was to attach an analyzer to the live server application. I must admit this was the last thing I wanted to do on my own, but at this point I was willing to sacrifice anything to find the root of the problem.
He instantly figured out that one innocent-looking little function, namely the one which returns who is online (in chat channels, for example), has an emergent behavior, resulting in an exponential load and suffocating the server. (Insert random sarcasm about glaring oversight and tech madness here.) We then ran the Visual Studio performance monitor on the live server to dig down to the heart of it. Soci said it might cause some load, so we sent a message to everyone on the server warning them about what we were doing. And then we accidentally the whole server. :) 0245 server time, bye-bye field containers!
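For the tech guys, here is a minimal Python sketch of how a harmless-looking "who is online" query can turn nasty. It is not the relay server's actual code: the class names, the single shared channel and the lookup counter are all made up for illustration. It assumes every member of a channel asks for the online list and that each answer checks every member one by one, which makes the work grow with the square of the channel size (already bad enough, even if not literally exponential). The second class shows one common alternative, also an assumption: keeping a per-channel set that is updated on login and logout.

```python
from collections import defaultdict


class NaivePresence:
    """Hypothetical: recomputes the online list from scratch on every request."""

    def __init__(self, channel_members, online_players):
        self.channel_members = channel_members  # channel name -> list of player ids
        self.online_players = online_players    # set of player ids currently online
        self.status_lookups = 0                 # counts simulated costly lookups

    def _is_online(self, player):
        # Stands in for an expensive check (database or session table hit).
        self.status_lookups += 1
        return player in self.online_players

    def who_is_online(self, channel):
        return [p for p in self.channel_members[channel] if self._is_online(p)]


class TrackedPresence:
    """Hypothetical: keeps a per-channel online set, updated on login/logout."""

    def __init__(self, channel_members):
        self.channels_of = defaultdict(set)  # player id -> channels joined
        for channel, members in channel_members.items():
            for player in members:
                self.channels_of[player].add(channel)
        self.online_in = defaultdict(set)    # channel name -> online player ids

    def login(self, player):
        for channel in self.channels_of[player]:
            self.online_in[channel].add(player)

    def logout(self, player):
        for channel in self.channels_of[player]:
            self.online_in[channel].discard(player)

    def who_is_online(self, channel):
        # No per-member status check: the set is already up to date.
        return sorted(self.online_in[channel])


def naive_lookups(player_count):
    """Every player in one shared channel asks for the online list once."""
    players = list(range(player_count))
    presence = NaivePresence({"general": players}, set(players))
    for _ in players:
        presence.who_is_online("general")
    return presence.status_lookups
```

In this toy model, going from 100 to 200 players takes the naive version from 10,000 to 40,000 status lookups, which is the kind of growth that only shows up under live-server load and is easy to miss on a quiet dev server.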
On Thursday we closed the session with Soci and went back to implement what we'd learned. Quick tests on the dev server showed pretty amazing results, which made us rather confident. With shaky hands we patched and let the first 50 players in. This was the moment of success! So we immediately let 100 more players in. During this period we constantly checked the load, which had become insignificant (2-3%), so we finally raised the cap to JUST OVER 9000!!!!! >:)
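The staged reopening described above can be sketched roughly like this in Python. Again, this is only an illustration and not the server's real login logic: the `AdmissionGate` class, its method names and the 50% load limit are assumptions; only the 50-player start, the 100-player step and the 2-3% load reading come from the story.

```python
class AdmissionGate:
    """Hypothetical login gate: admits players up to a cap that is raised in steps."""

    def __init__(self, cap):
        self.cap = cap      # maximum number of players allowed in
        self.online = 0     # players currently admitted

    def try_admit(self):
        # Refuse new logins once the cap is reached.
        if self.online >= self.cap:
            return False
        self.online += 1
        return True

    def raise_cap_if_healthy(self, load_percent, step, load_limit=50.0):
        # Only open the doors further while the measured load stays low.
        # load_limit is an arbitrary illustrative threshold.
        if load_percent < load_limit:
            self.cap += step
        return self.cap


gate = AdmissionGate(cap=50)                        # first wave: 50 players
admitted = sum(gate.try_admit() for _ in range(60))
new_cap = gate.raise_cap_if_healthy(load_percent=3.0, step=100)
```

The point of the design is simply that each increase is tied to a load reading, so a regression shows up with a handful of players online instead of everyone at once.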
The time to chill out had finally come, so we popped open one of Gargaj’s precious Norwegian treasures that had been sitting on our shelves for some years, a bottle of blackcurrant wine. I don’t know if it was the quality of the booze or the grace of that moment, but it was one of the sweetest sips we've ever had.
Well, that’s the end of that. Hopefully I managed to shed some light on the problem and on our day-to-day march until we reached victory. The code belongs to us, but the world of Nia is yours. Enjoy!