This blog post is my take on the last few days. It's been such an exciting time for us here that I'd like to share my own perspective on it. This is a pretty tech-heavy post, so be warned.
Last Thursday I travelled to Paris to give a couple of interviews to the mainstream French gaming media about Perpetuum. All went well, and I'd like to seize this opportunity to send a big shout-out to all the participants for their kindness and professionalism.
Saturday morning, while roaming the streets of Paris, I started receiving messages and phone calls from the dev team at home about a nice amount of new registrations. As time went on the number just went up and up. Having no idea what was going on, we were simply happy about it and tried to craft theories around it. Later that day I met Guillaume - who is one of our players - and he explained everything about the current situation in the MMO landscape. (Thanks Guillaume for the hospitality, I owe you one!) In the meantime, I was being spammed with messages that something weird was going on with the relay server, causing insane lags. You can imagine the number of calls, as even my mobile's battery ran out!
Sunday night I got back to Budapest, went home, took a shower and immediately headed to the office to check on the situation and fix the server problem. In the office I found out Calvin had come to pick me up at the airport, but since my phone had died on me earlier, we managed to miss each other. From this point on I completely lost track of days and hours, getting almost no sleep and focusing only on the server issue.
We checked everything: optimized the SQL, implemented new caches on the server, and tried many other things, but these were minor problems compared to the evil seed that caused the real one. The main difficulty was that it's very hard to generate load on the dev server similar to the live one. For the tech guys out there: the transaction coordinator (MSDTC) was NOT the problem. It causes an insignificant load, so we are fine with that.
We had to work through two nights in a row. After that we simply had to rest, but that was made hard by the stress and the guilt of leaving the gimped server alone, not knowing what would await us when we woke up. Things were looking grim.
When we realized the seriousness of the problem, I contacted one of the most knowledgeable people I know in this field and asked him to help out, as a fresh mind always has a better chance of finding hard bugs. Our deepest respect, Soci! (Shameless promotion: http://soci.hu/). Luckily, he had time to check out our architecture, and together we were able to start an investigation session on Wednesday night. He gave us several suggestions and pointed out and helped optimize several things in the database layer. Then we moved on to inspect the relay server's code. Since the source is huge, the quickest and most realistic method was to attach an analyzer to the live server application. I must admit this was the last thing I wanted to do on my own, but at this point I was willing to sacrifice anything to find the root of the problem.
He instantly figured out that one innocent-looking little function, namely the one which returns who is online (in chat channels, for example), had an emergent behavior, resulting in an exponential load and suffocating the server. (Insert random sarcasm about glaring oversight and tech madness here.) We then ran Visual Studio performance monitor on the live server to dig down to the heart of it. Soci said it might cause some load, so we messaged the server with a warning about what we were doing. And then we accidentally the whole server. :) 0245 server time, bye-bye field containers!
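For the curious, here is a toy Python sketch of how a harmless-looking "who is online" call can blow up as a channel fills. It is NOT our relay server code, and this post does not describe the actual mechanism. The function names and the "re-send the whole roster to everyone on every join" behavior are assumptions made up purely for illustration. Note that in this toy model the growth is cubic rather than literally exponential, which is already plenty to choke a server.

```python
def full_roster_cost(n_players: int) -> int:
    """Roster entries sent if every join re-sends the whole list to every member.

    Hypothetical model: when the k-th player joins, each of the k members
    receives all k entries, so that single join costs k * k entries.
    """
    total = 0
    members = 0
    for _ in range(n_players):
        members += 1
        total += members * members
    return total


def delta_cost(n_players: int) -> int:
    """Roster entries sent if only the change is broadcast (hypothetical fix).

    The newcomer receives the full list once (k entries); each of the
    other k - 1 members receives a single "X joined" entry.
    """
    total = 0
    members = 0
    for _ in range(n_players):
        members += 1
        total += members + (members - 1)
    return total


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, full_roster_cost(n), delta_cost(n))
```

With 100 players joining one channel, the naive version sends 338350 entries in total against 10000 for the delta version, and the gap keeps widening as the channel grows.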
Thursday we closed the session with Soci and went back to implement what we'd learned. Quick tests on the dev server showed pretty amazing results, which made us rather confident. With shaky hands we patched and let the first 50 players in. This was the moment of success! So we immediately let 100 more players in. During this period we constantly checked the load, which became insignificant (2-3%), so we finally raised the cap to JUST OVER 9000!!!!! >:)
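The staged reopening above boils down to a simple rule: only raise the player cap while the load stays healthy. Below is a minimal Python sketch of that idea. The function name, the 50% load threshold and the exact step values are assumptions for illustration only (the last step is just a stand-in for "just over 9000"), and in reality we did this by hand while watching the monitors.

```python
def next_cap(current_cap: int,
             load_percent: float,
             steps: tuple = (50, 150, 9001),
             max_load_percent: float = 50.0) -> int:
    """Return the next player cap, or keep the current one if load is too high.

    `steps` and `max_load_percent` are hypothetical values, not real settings.
    """
    if load_percent >= max_load_percent:
        # Server is struggling: do not let more players in yet.
        return current_cap
    for step in steps:
        if step > current_cap:
            return step
    # Already at the final step.
    return current_cap


if __name__ == "__main__":
    cap = 0
    for load in (2.0, 3.0, 2.5):
        cap = next_cap(cap, load)
        print(cap)
```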
The time to chill out has finally come, so we popped open one of Gargaj’s precious Norwegian treasures that had been sitting on our shelves for some years, a bottle of blackcurrant wine. I don’t know if it was the quality of the booze or the grace of that moment, but it was one of the sweetest sips we ever had.
Well, that’s the end of that. Hopefully I managed to shed some light on the problem and our day-to-day march until we reached victory. The code belongs to us, but the world of Nia is yours. Enjoy!
Comments for this post
1 Snowman
No doubt you will encounter future difficulties but for now, rest well o7
2 Arga
3 Dan
4 Wannes Jah
I hope you can get some well deserved rest to recuperate!
5 Kynes
6 DEV Crm
7 San Vigil
HAHAHAHAHA +3 internets to Dev Crm for proper use of an awesome internet meme!! LOL
Great job to the whole team on fixing the issue guys! Hats off to your hard work and bold moves to save my GoBots! :)
8 Antiquado deLune
9 Josefius
10 Andrevich Tolstoy
I loved the blog, too!
11 Sarah Haran
12 Eta Carinea
Soci sounds like a good guy to have around, and don't worry about getting too techy; I for one enjoy these types of blogs.
Eta
13 Ralph Law
14 Winter Solstice
15 Johnny EvilGuy
16 Gaulois
17 Marlona Sky
18 Owen
Keep it up, fellows!
19 Zap Kalan
Great work - hope you keep it like this ;)
PS: I bought my first 30 days *g*
20 Twiz
21 excession
22 Zex Maxwell
23 Dan
24 Lonwolf
Keep up the great work my friends!
25 Theo CN
It was a very nice read.