Vessel Post Mortem: Part 2
In the last episode, our intrepid heroes had discovered that rather than having almost finished the PlayStation 3 port of Vessel, they were barely half way there. Read on to find out what happens next…
With the realisation that both the game and render threads needed to run a 60fps, we had to reassess where we focussed our optimisation efforts. The game thread was doing a lot of different jobs - Lua scripting, playing audio, AI, animating sprites - all sorts of things. There was a lot to understand there, and on top of that, both the physics and render threads still needed more work. But performance wasn't the only concern - there was TRC compliance, game play bugs, new bugs introduced by our changes and, as we eventually discovered, the need for a PC build (more on that later).
Now, Vessel was ‘mostly’ bug free on PC when we started - we did find a few bugs in the Lua and in game code and the few compiler and shader compiler bugs added to that, but there were subtle differences in the way the two platforms dealt with numbers which meant that weird things sometimes happened. For instance, a door on one of the levels would get stuck and your character would have to bump into it in order for it to open. This was due to a variation in position values about 5 decimal places in causing the jamming on PS3. Additionally, there was some code that was used in a number of places that did something like the following;
x = MIN( y/z, 1.0);
What this did on PC was catch the case where z tended to zero and clamped x to 1.0 ( MIN(infinity, 1.0) = 1.0). Unfortunately, on PS3 value/0.0 was QNAN and MIN(QNAN, 1.0) was QNAN. So the clamping never happened resulting in much weirdness in hard to reproduce places (it only occurred when z was zero), and was therefore a bugger to find.
This meant that we weren't just optimising and adding in new PS3 functionality, we were also debugging a large foreign codebase. I hadn't expected that we'd need to change Vessel much - it was a shipped game after all and so I had assumed that it was pretty much bug free.
I had also assumed that for the most part we wouldn't need to change any assets significantly. Things like platform specific images for controllers and buttons were just texture changes and so weren't a problem. Text was fine too. Unfortunately, problems like the sticking door mentioned above meant that we need to go into some of the levels and jiggle bits around. Later in the life of the port we had to tweak fluid flows, emission rates, drains and other fluid based effects to help with performance.
All of this meant that we needed to have the editor up and running, and the editor was built using the same engine as the game so we needed to maintain two builds of the game - one for PS3 and one for PC solely for the editor. This was done crudely by #defineing most of our PS3 changes out and leaving the PC code in there inside the #else conditional. This did slow us down a bit and also made our code that much uglier, but it had the side effect of allowing us to easily test if a bug was present in the PC build or was specific to the PS3.
By the end of September we had fixed many audio issues and managed to get LuaJIT (an optimised version of Lua) working again (it was mostly working but was causing some repeatable problems in some places). Ben also profiled the Lua memory usage (article here) and tweaked its garbage collection to be more CPU friendly (it was occasionally peaking at 13ms in a call). So, slowly, we were scraping back time spent on the game thread. It was still a serious problem, but it was getting better.
Back on the physics system, more and more the CPU code was being ported to the SPUs - in many ways, this was getting easier as I was building a lot of infrastructure which made it more convenient to port. Many of the patterns that I used for one system would translate well to others. I started reducing the amount of memory used by parts of the physics, like dropping the size of the WaterDrop struct to 128 bytes from 144 bytes meant better cache usage and simpler processing on SPU (nicely aligned structs meant that I could assume alignment for SIMD processing).
Some optimisations were straightforward. For example, I changed the way that we processed drops on fluros skeletons. Instead of iterating over the entire drop list and checking each drop to see if it was part of skeleton ‘N’ I changed it to pass over the drop list once, partitioning the drops into buckets of drops per skeleton. Then we could just process a given skeleton, touching only those drops on that skeleton. Obvious in hindsight, and unnecessary on the PC build, but it takes time to get to the point where you really start to understand the code and can begin to change the way it works (rather than optimise a function to perform the same algorithm, just faster).
At this time we also saw the transition of the render thread to the status of low hanging fruit. So we started working on that - unrolling loops, prefetching, minimising data copying. Tweaking, poking, prodding and breaking, testing, measuring, then swearing or cheering and checking in.
The end of September rolled around and we still had no real end in sight. I kept talking to Strange Loop, trying to keep them up to date while we furiously worked to finish this game. As we’d already blown the initial quote, our contract specified that we would have reduce our daily rate for the remainder of the project, to be recouped on sales. So not only did we have significant time pressure, we now had financial pressure on top of that.
October was very productive - it saw over 150 bug fixes and optimisations checked in for the month. As the game was improving in performance and becoming more playable James, our QA guy, was able to play more and more of it. We hadn't actually managed to do a complete play through of the game by this stage, so we focussed heavily on removing all blocking bugs and getting the game in a functionally shippable state (assuming that we would continue to make performance improvements).
Second week in, after discussions with Strange Loop Games we decided to postpone the release of the game to January. This meant submitting the game to Sony in Europe (SCEE) and in America (SCEA) by the end of the year, a mere 4 months after our initial estimate. There was no way I was going to let this date slip again.
Ben was still working on optimising the Lua execution and audio (which at this stage was still taking 4 or 5 ms per frame with peaks way higher than that). I’d thought that as Vessel used FMOD that it would all just work when I turned it on. Unfortunately, the audio was a complete mess on the PS3. While profiling the game thread in detail we discovered that it was taking up huge amounts of time, spiking and causing frame stutters as well as just plain behaving badly. It took weeks of intermittent work plus lots of discussions with the FMOD boys to figure out what the problems were. Turns out, many of the sounds were just set up incorrectly for the PS3 build - effects and layers that were disabled on PC were left on for PS3. Some sounds would loop infinitely and multiple sounds would play at the same time. All these things took time to discover, time to investigate a solution and time to fix. Something we had very little of.
To make things worse, Vessel performed no culling for the rendering, AI or audio. At all. The entire level was submitted to the renderer which culled it and then passed off the visible bits to the SPU renderer. Also, Lua scripts (which are slow at the best of times) were being called even when the things they were modifying weren't in view. Audio was also being triggered by objects that were a long way from the player, relying on distance attenuation to reduce volume. All of this meant that there was a lot of work going on under the hood for no visible or audible impact.
We were still stumbling across weird little issues. For example, the animated textures in the game were never animating. They were loading correctly, but never changing. Ben tracked this down to an endian swap function which was broken (the PC and PS3 have different byte ordering for their numbers and so the same data when loaded from file must be modified depending on platform). This endian swap was only ever called in one place and only by this particular animated material.
The fix is often easy. Finding what to fix is hard.
Even though we had a tight timeline, we also had other client obligations which saw us each lose 2 weeks in November. This helped financially (well, it would have if they had paid on time) but put even more time pressure on us. Regardless, we were confident that we’d make the end of December deadline. Plenty of time.
We finally managed to get a complete play though of the game in November. Only to realise that the endgame didn’t work. Fix, patch, hack, test.
It was during the last week of November that Ben had a breakthrough. He bit the bullet and implemented a simple container based culling system for the game thread. All objects in a loaded level were placed within a rectangular container and there were an arbitrary number of these per level. This meant that we could quickly check what should be rendered or processed (Lua or audio) by looking at the container in view and those next to it. Conceptually simple, but implementation required that we modify the editor (which we’d not done before) and renderer, then visit each level in the editor and manually add the new containers then re-export everything.
This fix made a massive difference to performance. Audio was faster as it wasn’t playing as many sounds. There was less Lua processing happening as there were fewer things active. And rendering was faster as there was less being sent to the render thread. So it worked, and it worked well. In fact, we had to limit the frame rate to 30fps as some places were now rendering at 60fps on all threads!
The container fix was instrumental in getting the game thread to a more playable speed, but it also introduced a lot of little bugs that took a month to fix. And still, it was too slow.
With only three weeks until the Christmas break we figured we needed to start cutting back on the content in some of the levels in order to get the frame rate up. We picked a few of the worst and closely examined them to try and figure out how we could speed them up. We found the following;
- Trees were very costly (as were their roots). So we cut some of them down.
- Drains were expensive, so we removed some of them.
- Areas with lots of water were obviously expensive, so we reduced the amount of water in some areas, cutting back on the flow or adding drains to reduce the amount of water in pools.
- We tweaked the object limiter code to ensure that there were always enough fluros to finish a given level, yet not too many to make it run too slow. Same with the number of drops and seeds.
All of the above helped, and the game was now running pretty well. There were still areas where it would slow down, and you could (and still can) slow down some areas intentionally by spraying lots of water and generating lots of fluros, but there was no time left for further optimisations. I'd already built 13 different SPU tasks to speed the physics up and one or two for the render thread - it was getting very hard to speed things up more, and at this stage it was risky to make any extra significant changes. This would just have to do.
James was now noticing some play specific bugs. Some fluros weren't moving correctly - jumping too high and destroying themselves on the scenery or just behaving badly. Which is fine in most cases, but this was happening with a key fluro that you needed to use to complete the level. We had to modify the level to make it work again.
In addition to the optimisations that we were still implementing, and the bug fixes for the odd things that were happening, we also had to make sure that the game was TRC compliant. James trawled through that massive missive, highlighting things that we needed to check - the ways that save/load games functioned, how the game shut down, what would happen in the case of no hard drive space left, the order that trophies unlock, so many things. And so little time.
And, on top of that, the financial pressure on the company due to the length of time that Vessel was taking to port, plus the reduced pay rate we were getting for the overage and the fact that there was very little work lined up for the new year meant that I had to notify Ben that we were going to have to let him go.
It was one of the hardest things I've ever had to do. I felt like I’d betrayed him - he is an awesome coder and he’d become a good friend. And just before Xmas FFS. To make things worse, the day I had to do it he was working from home and I had to let him know over Skype. It was just awful.
He called me back within a couple of hours to tell me he had just managed to land a new job. In two hours. I told you he was good.
From the left: Tony, Ben and James.
So, back to the code. With a week and a bit to go we mainly focussed on the remaining TRC issues. The game was running as good as we were going to get it to in the time we had and I was satisfied with that. TRC violations were disappearing and it looked like we might just make it.
In the last week (on the Wednesday) I realised that the file format that we’re using needed to be DRM enabled in order to be TRC compliant, so I spent 17 hours straight trying to fix it. Did my first all nighter in a long time and managed to solve the problem at 5am. I pushed through the next day, fixing, tweaking, hacking, testing while full of sugar and caffeine and then dragged myself home to sleep on the Thursday night.
The final Friday arrived. We performed some clean ups and readied the code for submission. We were going to make it! On one last pass over the TRC checklist I realised that we’d missed something. We needed to be able to report an error if any of the required startup files were missing, but the way that the engine was designed, there was no way to display anything until those critical files were loaded. We had a few hours so I quickly hacked together an emergency renderer that I could use in the case of a critical failure. I gave myself 2 hours to do it and it was working within 30 minutes. Awesome! But, upon further testing, the loading screens refused to work. I still have no idea why - the code I added was never executed and yet still affected the loading screens. The game itself played fine, but those bloody loading screens didn't bloody work. I wasn't willing to submit with what I knew would be a TRC failure so, once again, we’d missed our deadline.
We'd have to delay submission until next year.
Continued here: Vessel Post Mortem: Part 3