This blog post is intended to give all of you my take on the last few days. Since it's been such an exciting time for us here, I'd like to give my own perspective on it. This is a pretty tech-heavy post, so be warned.

Last Thursday I travelled to Paris to give a couple of interviews to the mainstream French gaming media about Perpetuum. All went well, and I'd like to seize this opportunity to send a big shout-out to all the participants for their kindness and professionalism.

Something you don't want to see.

Saturday morning, while roaming the streets of Paris, I started receiving messages and phone calls from the dev team at home about a nice number of new registrations. As time went on, the number just went up and up. Having no idea what was going on, we were simply happy about it and tried to craft theories around it. Later that day I met Guillaume, who is one of our players, and he explained everything about the current situation on the MMO landscape. (Thanks Guillaume for the hospitality, I owe you one!) In the meantime, I was being spammed with messages that something weird was going on with the relay server, causing insane lag. You can imagine the number of calls: even the battery in my mobile ran out!

Sunday night I got back to Budapest, went home, took a shower and immediately headed to the office to check on the situation and fix the server problem. In the office I found out that Calvin had come to pick me up at the airport, but since my phone had died on me earlier, we managed to miss each other. From this point on I completely lost track of days and hours, getting almost no sleep and focusing only on the server issue.

We checked everything: optimized the SQL, implemented new caches on the server, and tried many other things, but these were minor issues compared to the evil seed that caused the problem. The main difficulty was that it's very hard to generate load on the dev server similar to the live one. For the tech guys out there, the transaction coordinator (MSDTC) was NOT the problem; it causes an insignificant load, so we are fine with that.

We had to put in two nights in a row. After that we simply had to rest, but the stress and the guilt of leaving the gimped server alone made it hard, as we didn't know what would await us when we woke up. Things were looking grim.

When we realized the seriousness of the problem, I contacted one of the most knowledgeable people I know in this field and asked him to help out, as a fresh mind always has a better chance of finding hard bugs. Our deepest respect, Soci! (Shameless promotion: http://soci.hu/) Luckily, he had time to check out our architecture, and together we were able to start an investigation session on Wednesday night. He gave us several suggestions, and pointed out and helped optimize several things in the database layer. Then we moved on to inspect the relay server's code. Since the source is huge, the quickest and most realistic method was to attach an analyzer to the live server application. I must admit this was the last thing I wanted to do on my own, but at this point I was willing to sacrifice anything to find the root of the problem.

He instantly figured out that one innocent-looking little function, namely the one that returns who is online (in chat channels, for example), has an emergent behavior, resulting in an exponential load and suffocating the server. (Insert random sarcasm about glaring oversight and tech madness here.) We then ran the Visual Studio performance monitor on the live server to dig down to the heart of it. Soci said it might cause some load, so we messaged the server with a warning about what we were doing. And then we accidentally the whole server. :) 0245 server time, bye-bye field containers!
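For the tech-minded, here is a toy Python sketch of how a "who is online" lookup can snowball. It is not the relay server's code: the names (`naive_refresh_cost`, `PresenceCache`) and the assumption that every presence change makes every channel member re-scan the whole channel are made up purely for illustration. In this toy the cost grows quadratically with channel size; the real behavior may well have been different.

```python
from collections import defaultdict


def naive_refresh_cost(channel_size: int) -> int:
    """Checks performed if every member re-scans the channel on one change.

    Hypothetical model: each of the N members asks "who is online?" and
    each answer walks all N members, so one change costs N * N checks.
    """
    return channel_size * channel_size


class PresenceCache:
    """Hypothetical alternative: keep the online set up to date incrementally."""

    def __init__(self) -> None:
        # channel name -> set of player ids currently online in it
        self._online = defaultdict(set)

    def join(self, channel: str, player: int) -> None:
        self._online[channel].add(player)

    def leave(self, channel: str, player: int) -> None:
        self._online[channel].discard(player)

    def who_is_online(self, channel: str) -> frozenset:
        # Returns a snapshot of the maintained set; nothing is recomputed
        # from the full list of sessions on each call.
        return frozenset(self._online[channel])


# Rough feel for the growth at the player counts mentioned in this post.
for players in (50, 150, 9000):
    print(players, naive_refresh_cost(players))
```

The point of the cache is only that joins and leaves become cheap updates, and the lookup stops being the place where the work piles up.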

Thursday we closed the session with Soci and went back to implement what we'd learned. Quick tests on the dev server showed pretty amazing results, which made us rather confident. With shaky hands we patched and let the first 50 players in. This was the moment of success! So we immediately let 100 more players in. During this period we constantly checked the load, which became insignificant (2-3%), so we finally raised the cap to JUST OVER 9000!!!!! >:)

The time to chill out has finally come, so we popped open one of Gargaj’s precious Norwegian treasures that had been sitting on our shelves for some years, a bottle of blackcurrant wine. I don’t know if it was the quality of the booze or the grace of that moment, but it was one of the sweetest sips we ever had.

Well, that’s the end of that. Hopefully I managed to shed some light on the problem and our day-to-day march until we reached victory. The code belongs to us, but the world of Nia is yours. Enjoy!

This morning, while we were listening to the 6th episode of the Incoming Transmission Podcast, I realised I must clear something up about the market. In the beginning, the insurance prices were based on the actual trading price of the robots. (Plus a multiplier, to be correct.) This was a very nice dynamic system on paper, and a rather pivotal part of the game. At that time we were too busy to check the market rates graph frequently, and on top of that, whenever I did check it, I fed myself freshly crafted theories to explain why it looked odd.

Despite many warnings from many players I didn't do anything, because I was so convinced by my own little theories. This was a terrible mistake that resulted in insurance fraud. (... link to a very long story here ... ) I can talk about this for hours, but to keep it short: I was so sure it would all be fine. I told myself that the game was just starting, that this was why the graph was so empty/odd, and that it would boil itself down to a natural balanced level. But it didn't. As a result we changed the whole system to a more predictable one. Currently we are still monitoring the sales, but we are setting the insurance prices manually. This period is the left one on the image.
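To make the weakness concrete, here is a toy Python sketch of a market-driven insurance price. Everything in it is an assumption for illustration: the function name `insurance_payout`, the use of a plain average, the 0.5 multiplier and the prices are all invented, and the actual formula and the actual fraud were more involved. It only shows why a nearly empty market is easy to push around.

```python
from statistics import mean


def insurance_payout(recent_trade_prices: list, multiplier: float) -> float:
    """Hypothetical payout: average recent trade price times a multiplier."""
    if not recent_trade_prices:
        raise ValueError("no trades recorded, no price to derive")
    return mean(recent_trade_prices) * multiplier


# Assumed numbers: a robot that honestly trades at around 100.
honest_trades = [100, 100]
fair = insurance_payout(honest_trades, multiplier=0.5)

# One wildly overpriced trade between cooperating accounts is enough
# to drag the average up when there are only a few trades on record.
rigged_trades = honest_trades + [10_000]
rigged = insurance_payout(rigged_trades, multiplier=0.5)

print(fair, rigged)
```

With only a handful of trades on record, a single rigged sale dominates the average, and the payout ends up far above what the robot really costs. Setting the prices by hand removes that lever.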

After that things went well: the market collected the prices properly and we were happy campers. Later on, before the Terra Incognita patch, I managed to put a new bug in, which ruined the graph again. When I realised it there was no way back, as we weren't able to deploy a hotfix until the expansion. During this period the repair prices were based on the raw material sales, so the workaround was that I logged in periodically and made some fake sales. FAKE SALES????!!! WHAT?! Yes, I sold and bought one (1) unit of each raw material at random, so the underlying mechanism had some data to work with. This was enough to keep it going. See the second bugged period on the graph, when I had to cheat the price filter mechanism. After that came the expansion, and since then it's all good again. You can see these periods in every item's global rates graph.

This is how little bugs can ripple through the system, causing trouble.