Overbyte - Latest blog entries http://overbyte.com.au/index.php/overbyte-blog/latest Tue, 20 Nov 2018 23:21:46 +1030 en-gb Vessel Post Mortem: Part 3 http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-post-mortem-part-3 http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-post-mortem-part-3 This is the final part of the Vessel Porting Trilogy. The previous parts can be found here

After the Christmas break (where I was forced into yet another family vacation) I dove straight back into the code, hoping to get the final issue fixed and the game submitted as soon as possible. There was still time to submit and if I passed Sony’s submission process first time then we could still have the game on the PSN store by the end of January. I fixed the remaining problem with the start up error reporting and added in a little more error checking just to be sure. I was pretty paranoid about the submission process as I didn’t want to delay the game anymore by failing - I carefully checked everything twice, crossed my fingers and hit the ‘submit’ button to Sony America (SCEA) on January 9.

In parallel to the code submission process there is a ‘Metadata’ submission process. This corresponds to the PlayStation Store assets and the game’s presence there. It consists of all the text, images, trailers, prices etc and they all have very specific requirements that must be met in order to pass submission. James (our QA guy) managed most of this and involved communicating with Strange Loop’s art guy Milenko (who was incredibly responsive - I’m not sure he ever sleeps) and consisted of asking for different resolution images and screenshots, as well as sourcing the different language versions of the game text. It took us a few submissions of the meta data to get it all right, but the turnaround was pretty quick and a continuous dialog with the Sony helped a lot.

The code submission process consists of uploading the game in a special package format plus some extra documents that describe the trophies and other bits and pieces. We had to submit to both SCEA and Sony Europe (SCEE) so we could have the game released in both those regions. We hadn’t submitted to SCEE at the same time as SCEA as we were still waiting on some publisher registration details to come through, so all I could do was wait for that and for the response from SCEA on our initial submission while I busied myself with some other work.

On January 18th I received the first fail from Sony. As fails go, it was a pretty good one. There were three “Must Fix” bugs: one was due to my entering the wrong data in a field when submitting the game package, and two were due to saving games failing when there was no disk space left. They’d also mentioned some slowdown in some of the levels - I’d expected that, and as it wasn’t a critical bug, I ignored it. The save game bugs proved troublesome - Ben had written all of the save game code and with him gone I now had to learn how Sony wanted you to do it, how Ben had done it and how I could make it work correctly. It took me a few days to find and fix the problems and by this time the details that SCEE required had come through so I resubmitted to SCEE and SCEA at the same time (January 24)

I was quietly confident that it’d all pass this time. I’d thoroughly tested the save game code now and it all worked. What could go wrong? I seriously considered heading out and buying a bottle of Chimay Blue as a pre-emptive celebratory reward.

Image from https://bonbeer.wordpress.com/category/places/mechelen/

The first fail I saw came back from SCEE on February 2nd. I’d messed up the submission data again (I hate filling out forms) plus there was a legitimate bug they’d found where once you’d finished the game you couldn’t continue on from that save game if you wanted to continue playing to complete all the trophies. When I had fixed that I re-submitted the new build to SCEE on February 4th and then began to wonder what had happened to the SCEA build - it should have come back at about the same time as the SCEE one. I also worried that they would pass it with the “continue when finished” bug in it. I needn’t have worried, as the SCEA team came up with a whole slew of new bugs that weren’t mentioned in any of the previous tests.

This new bug report that SCEA had sent back was very concerning. The crashes they were reporting sounded like bugs I’d previously fixed - I was also impressed with their thoroughness. One of the bugs was something like “Hang from the rope in level blah and throw seeds for at least 5 minutes into an area with no water. Crashes 70% of the time”.  While looking at this bug I received notice from SCEE that they had passed Vessel. Which was great, but it was also worrying as I now had numerous repeatable crashes in the build - so I failed the SCEE build myself in order to be able to submit a fixed build which would match the SCEA build.

By now I was getting a little frazzled. It was like one of those movies where the protagonist has slain the evil uber zombie, only to have it rise from the dead over and over again. Would this submission process never end?

I eventually tracked the crash bugs down to a simple stupid error. There was a section of code that was responsible for limiting the amount of fluid, fluros and seeds in a given section. When the frame rate dropped below 60fps, this code would kick in and drop the number of drops/fluros or seeds to the minimum value deemed suitable for that level. During the final stages of development we had created a FINAL_RELEASE mode which turned off all of the unnecessary debugging code. Unfortunately, this erroneously included some code which updated the time values that the limiting code used, so the fluid, fluros and seeds were never being reduced. Ever. This meant that the game would have been running much slower than it should have, and was prone to crashes whenever certain hard coded limits were exceeded. I've never been so relieved to fail anything.

Something I’d like to focus on here is the level of performance that Vessel was submitted with; Ben and I spent a huge amount of our time just trying to squeeze as much performance out of this game as possible on the platform we had. For the most part, I'm pretty happy with the results. Yes, you can abuse the game and make it slow down quite easily - if you try to fill a room with fluros and water and seeds then you’ll most likely see some slowdown - but during normal play, the game maintains 60fps simulation at 30fps frame rate (the graphics is 30fps, so there are two simulation passes per visible frame). However there are a few levels where the sheer number of fluoros and fluid required to solve the puzzles means that the frame rate will drop to 20fps. Given another month I reckon I could have fixed that too, but given how far over time and budget we were, that wasn't an option. So, while I’m happy with how much we improved the frame rate of the game, I'm still a little disappointed that we didn't hit a full 60fps simulation everywhere.

Sure, the PlayStation 3 hardware did slow down the porting of the game initially, but given the complexity of the fluid simulation required I doubt that there will ever be an Xbox 360 version released. The SPUs allowed us to perform at least 80% of the physics simulation completely in parallel - with no impact on the main processor at all. There is no way you’d get the same level of performance on the X360.

With the FINAL_RELEASE bug fixed and the game resubmitted to SCEE and SCEA (which passed on the 20th and 22nd respectively) I was finally free! The metadata was all sorted and we’re now expecting the release on March 11 in the SCEA territories and March 12 in the SCEE regions. We’re all hoping it does well enough to cover our costs (as all devs with newborn progeny are).

# What went right

Having a QA guy in house was invaluable. An impartial, non-technical third party to throw our fixes at kept us honest. As a developer you get very close to the game and like to assume that your fixes have made a positive difference and you’re moving forwards. That’s not always the case.

Honest communication with the Publisher/client. I tried to keep John at Strange Loop Games constantly up to date with the progress (or lack thereof) of the game. He was incredibly understanding with the continuous delays, and while it was hard to deliver bad news, at least it reduced the surprises.

Having good, experienced coders on the job. Bringing Ben onboard was a good choice, even though we had to let him go in the end. There is no way I could have done this without him. Working from the same physical office was also beneficial. I worked remotely for over 5 years leading up to this port, and I think we’d have delivered even later if we’d worked separately.

What went wrong

Work estimation. I fundamentally underestimated the amount of work required. I should have applied Yossarian's Circular Estimation Conjecture and multiplied it by PI - that would have been closer to the mark. The big killer was the misunderstanding with which threads needed to run at 60fps. If it was just the physics I'd underestimated, we probably would have been a shade over a month late (maybe two), but with both the physics and game threads needing to execute in under 16.6ms, we really had our work cut out for us. The amount of time taken for submission shouldn't be forgotten either; 4 to 8 weeks after the initial submission should see your game on the store.

I’d also recommend to anyone doing a console game that they get as much TRC compliance done as soon as possible in the development of the game. Get the load/save stuff working, trophies, error reporting, quitting, controller connection, file error reporting and the like in place and functioning sooner rather than later.

# In Closing

The porting of this game was a trial for me. There was a week or two around September/October where I was really doubting that we could deliver a game that ran at a playable frame rate. I'm proud of what we delivered in the end, and if you’d taken away the financial and time pressures I would have thoroughly enjoyed the challenge. I've learnt a lot from this experience, and I'm looking forwards to the next one.

Just not right now, OK? Maybe a bit later in the year.

]]>
tony@overbyte.com.au (Tony Albrecht) News Tue, 11 Mar 2014 09:00:05 +1030
Vessel Post Mortem: Part 2 http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-post-mortem-part-2 http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-post-mortem-part-2 In the last episode, our intrepid heroes had discovered that rather than having almost finished the PlayStation 3 port of Vessel, they were barely half way there. Read on to find out what happens next…

With the realisation that both the game and render threads needed to run a 60fps, we had to reassess where we focussed our optimisation efforts. The game thread was doing a lot of different jobs - Lua scripting, playing audio, AI, animating sprites - all sorts of things. There was a lot to understand there, and on top of that, both the physics and render threads still needed more work. But performance wasn't the only concern - there was TRC compliance, game play bugs, new bugs introduced by our changes and, as we eventually discovered, the need for a PC build (more on that later).

Now, Vessel  was ‘mostly’ bug free on PC when we started - we did find a few bugs in the Lua and in game code and the few compiler and shader compiler bugs added to that, but there were subtle differences in the way the two platforms dealt with numbers which meant that weird things sometimes happened. For instance, a door on one of the levels would get stuck and your character would have to bump into it in order for it to open. This was due to a variation in position values about 5 decimal places in causing the jamming on PS3. Additionally, there was some code that was used in a number of places that did something like the following;

x = MIN( y/z, 1.0);

What this did on PC was catch the case where z tended to zero and clamped x to 1.0 ( MIN(infinity, 1.0) = 1.0). Unfortunately, on PS3 value/0.0 was QNAN and MIN(QNAN, 1.0) was QNAN. So the clamping never happened resulting in much weirdness in hard to reproduce places (it only occurred when z was zero), and was therefore a bugger to find.

This meant that we weren't just optimising and adding in new PS3 functionality, we were also debugging a large foreign codebase. I hadn't expected that we'd need to change Vessel much - it was a shipped game after all and so I had assumed that it was pretty much bug free.

I had also assumed that for the most part we wouldn't need to change any assets significantly. Things like platform specific images for controllers and buttons were just texture changes and so weren't a problem. Text was fine too. Unfortunately, problems like the sticking door mentioned above meant that we need to go into some of the levels and jiggle bits around. Later in the life of the port we had to tweak fluid flows, emission rates, drains and other fluid based effects to help with performance.

All of this meant that we needed to have the editor up and running, and the editor was built using the same engine as the game so we needed to maintain two builds of the game - one for PS3 and one for PC solely for the editor. This was done crudely by #defineing most of our PS3 changes out and leaving the PC code in there inside the #else conditional. This did slow us down a bit and also made our code that much uglier, but it had the side effect of allowing us to easily test if a bug was present in the PC build or was specific to the PS3.

By the end of September we had fixed many audio issues and managed to get LuaJIT (an optimised version of Lua) working again (it was mostly working but was causing some repeatable problems in some places). Ben also profiled the Lua memory usage (article here) and tweaked its garbage collection to be more CPU friendly (it was occasionally peaking at 13ms in a call). So, slowly, we were scraping back time spent on the game thread. It was still a serious problem, but it was getting better.

Back on the physics system, more and more the CPU code was being ported to the SPUs - in many ways, this was getting easier as I was building a lot of infrastructure which made it more convenient to port. Many of the patterns that I used for one system would translate well to others. I started reducing the amount of memory used by parts of the physics, like dropping the size of the WaterDrop struct to 128 bytes from 144 bytes meant better cache usage and simpler processing on SPU (nicely aligned structs meant that I could assume alignment for SIMD processing).

Some optimisations were straightforward.  For example, I changed the way that we processed drops on fluros skeletons. Instead of iterating over the entire drop list and checking each drop to see if it was part of skeleton ‘N’ I changed it to pass over the drop list once, partitioning the drops into buckets of drops per skeleton. Then we could just process a given skeleton, touching only those drops on that skeleton. Obvious in hindsight, and unnecessary on the PC build, but it takes time to get to the point where you really start to understand the code and can begin to change the way it works (rather than optimise a function to perform the same algorithm, just faster).

At this time we also saw the transition of the render thread to the status of low hanging fruit. So we started working on that - unrolling loops, prefetching, minimising data copying. Tweaking, poking, prodding and breaking, testing, measuring, then swearing or cheering and checking in.

The end of September rolled around and we still had no real end in sight. I kept talking to Strange Loop, trying to keep them up to date while we furiously worked to finish this game. As we’d already blown the initial quote, our contract specified  that we would have reduce our daily rate for the remainder of the project, to be recouped on sales. So not only did we have significant time pressure, we now had financial pressure on top of that.

# October

October was very productive - it saw over 150 bug fixes and optimisations checked in for the month. As the game was improving in performance and becoming more playable James, our QA guy, was able to play more and more of it. We hadn't actually managed to do a complete play through of the game by this stage, so we focussed heavily on removing all blocking bugs and getting the game in a functionally shippable state (assuming that we would continue to make performance improvements).

Second week in, after discussions with Strange Loop Games we decided to postpone the release of the game to January. This meant submitting the game to Sony in Europe (SCEE) and in America (SCEA) by the end of the year, a mere 4 months after our initial estimate. There was no way I was going to let this date slip again.

Ben was still working on optimising the Lua execution and audio (which at this stage was still taking 4 or 5 ms per frame with peaks way higher than that). I’d thought that as Vessel used FMOD that it would all just work when I turned it on. Unfortunately, the audio was a complete mess on the PS3. While profiling the game thread in detail we discovered that it was taking up huge amounts of time, spiking and causing frame stutters as well as just plain behaving badly. It took weeks of intermittent work plus lots of discussions with the FMOD boys to figure out what the problems were. Turns out, many of the sounds were just set up incorrectly for the PS3 build - effects and layers that were disabled on PC were left on for PS3. Some sounds would loop infinitely and multiple sounds would play at the same time. All these things took time to discover, time to investigate a solution and time to fix. Something we had very little of.

To make things worse, Vessel performed no culling for the rendering, AI or audio. At all. The entire level was submitted to the renderer which culled it and then passed off the visible bits to the SPU renderer. Also, Lua scripts (which are slow at the best of times) were being called even when the things they were modifying weren't in view. Audio was also being triggered by objects that were a long way from the player, relying on distance attenuation to reduce volume. All of this meant that there was a lot of work going on under the hood for no visible or audible impact.

We were still stumbling across weird little issues. For example, the animated textures in the game were never animating. They were loading correctly, but never changing. Ben tracked this down to an endian swap function which was broken (the PC and PS3 have different byte ordering for their numbers and so the same data when loaded from file must be modified depending on platform). This endian swap was only ever called in one place and only by this particular animated material.

The fix is often easy. Finding what to fix is hard.

# November

Even though we had a tight timeline, we also had other client obligations which saw us each lose 2 weeks in November. This helped financially (well, it would have if they had paid on time) but put even more time pressure on us. Regardless, we were confident that we’d make the end of December deadline. Plenty of time.

We finally managed to get a complete play though of the game in November. Only to realise that the endgame didn’t work. Fix, patch, hack, test.

It was during the last week of November that Ben had a breakthrough. He bit the bullet and implemented a simple container based culling system for the game thread. All objects in a loaded level were placed within a rectangular container and there were an arbitrary number of these per level. This meant that we could quickly check what should be rendered or processed (Lua or audio) by looking at the container in view and those next to it. Conceptually simple, but implementation required that we modify the editor (which we’d not done before) and renderer, then visit each level in the editor and manually add the new containers then re-export everything.

This fix made a massive difference to performance. Audio was faster as it wasn’t playing as many sounds. There was less Lua processing happening as there were fewer things active. And rendering was faster as there was less being sent to the render thread. So it worked, and it worked well. In fact, we had to limit the frame rate to 30fps as some places were now rendering at 60fps on all threads!

The container fix was instrumental in getting the game thread to a more playable speed, but it also introduced a lot of little bugs that took a month to fix. And still, it was too slow.

# December

With only three weeks until the Christmas break we figured we needed to start cutting back on the content in some of the levels in order to get the frame rate up. We picked a few of the worst and closely examined them to try and figure out how we could speed them up. We found the following;

• Trees were very costly (as were their roots). So we cut some of them down.
• Drains were expensive, so we removed some of them.
• Areas with lots of water were obviously expensive, so we reduced the amount of water in some areas, cutting back on the flow or adding drains to reduce the amount of water in pools.
• We tweaked the object limiter code to ensure that there were always enough fluros to finish a given level, yet not too many to make it run too slow. Same with the number of drops and seeds.

All of the above helped, and the game was now running pretty well. There were still areas where it would slow down, and you could (and still can) slow down some areas intentionally by spraying lots of water and generating lots of fluros, but there was no time left for further optimisations. I'd already built 13 different SPU tasks to speed the physics up and one or two for the render thread - it was getting very hard to speed things up more, and at this stage it was risky to make any extra significant changes. This would just have to do.

James was now noticing some play specific bugs. Some fluros weren't moving correctly - jumping too high and destroying themselves on the scenery or just behaving badly. Which is fine in most cases, but this was happening with a key fluro that you needed to use to complete the level. We had to modify the level to make it work again.

In addition to the optimisations that we were still implementing, and the bug fixes for the odd things that were happening, we also had to make sure that the game was TRC compliant. James trawled through that massive missive, highlighting things that we needed to check - the ways that save/load games functioned, how the game shut down, what would happen in the case of no hard drive space left, the order that trophies unlock, so many things. And so little time.

And, on top of that, the financial pressure on the company due to the length of time that Vessel was taking to port, plus the reduced pay rate we were getting for the overage and the fact that there was very little work lined up for the new year meant that I had to notify Ben that we were going to have to let him go.

It was one of the hardest things I've ever had to do. I felt like I’d betrayed him - he is an awesome coder and he’d become a good friend. And just before Xmas FFS. To make things worse, the day I had to do it he was working from home and I had to let him know over Skype. It was just awful.

He called me back within a couple of hours to tell me he had just managed to land a new job. In two hours. I told you he was good.

From the left: Tony, Ben and James.

So, back to the code. With a week and a bit to go we mainly focussed on the remaining TRC issues. The game was running as good as we were going to get it to in the time we had and I was satisfied with that. TRC violations were disappearing and it looked like we might just make it.

In the last week (on the Wednesday)  I realised that the file format that we’re using needed to be DRM enabled in order to be TRC compliant, so I spent 17 hours straight trying to fix it. Did my first all nighter in a long time and managed to solve the problem at 5am. I pushed through the next day, fixing, tweaking, hacking, testing while full of sugar and caffeine and then dragged myself home to sleep on the Thursday night.

# Submission Time

We'd have to delay submission until next year.

Continued here: Vessel Post Mortem: Part 3

]]>
tony@overbyte.com.au (Tony Albrecht) News Fri, 07 Mar 2014 08:53:50 +1030
Vessel Post Mortem: Part 1 http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-post-mortem-part-1 http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-post-mortem-part-1 I started looking at Vessel in January 2013 - initially just in my evenings. In my ‘spare’ time. I was looking at it because John and Martin (the founders of Strange Loop Games and the creators of Vessel) had asked me to consider porting it to PS3.  We knew each other  from our time together at Pandemic Studios in Brisbane where I had worked closely with Martin in the Engine team while John worked as an AI programmer. I was keen to do the port but I wanted to be as sure as possible that I knew how much work was involved in it before committing.

I found the idea of porting such a game daunting - it had a custom rendering engine, a bespoke fluid physics system with major parts of the game code written in Lua - but I figured it was time to step up and take on such a challenge. I have an immense amount of respect for John and Martin - they are both incredibly smart coders and I relished the chance of working with them again. I wanted to do it but I didn't want to let them down. They were going to be my clients, but they were also friends.

I started by looking at their code - in order to give a quote on the porting process I needed to have the game running so that I could profile it. Vessel had already been partially ported to the PS3 by a third party, but wasn't compiling when I first got my hands on it. So I spent a few weeks of long evenings fixing the FileIO, the template compilation errors (GCC and MSVC disagree on some of the finer points of templates), and just commenting out anything I didn't think was relevant in order to get the game up and running. Once it was working, I ran my trusty profiler over the game and tried to guess how much time it would take to get this baby to a shippable state. What I found was a little scary.

Visualisation of the code changes made during the porting of Vessel to PS3.

Vessel runs predominantly on three threads. The game thread which runs all the Lua scripts which control the majority of the game logic and in game objects, the render thread which takes the output of the game thread and converts it to a format that can be fed to the SPUs which then build the graphics command buffer and feed it to the GPU in parallel, plus the physics thread which manages all the physics processing - fluid dynamics, rigid body collisions, everything. John had built the physics system himself and it was pretty impressive (and complicated). It was too much to grok initially, so all I could do was look at the game’s performance on some of the worst levels, profile it there and see how much I needed to speed it up.

I investigated a few of the levels on the PS3 build and picked two that I thought were the worst. What they told me was that the physics thread was the primary cause of slowdown and so I concentrated my analysis on that thread. The game was running at about 60ms per frame in the worst levels I was testing - about 15 to 20 frames per second. From discussions with Strange Loop, I knew that the physics thread had to run at 60fps so this meant that I needed a factor of 4 speed up. From experience, I knew that this was something that was achievable. The game thread was running at around 30ms per frame and the Render thread was clocking in at around 5ms per frame. SPU rendering was pretty fast, and I figured that wouldn't be much of a problem.

Months of fannying about took place and a contract was finally signed in June. So I did the only appropriate thing one could do in that situation - I did two weeks of work on Vessel and then went on a 6 week family holiday to the UK.

In anticipation of the Vessel contract I’d hired an experienced programmer, Ben Driehuis. Ben came highly recommended and had worked on the Bioshock series of games, plus, most importantly, he was living in the same city as me. Having spent the last 6 years working remotely I knew that my best bet of delivering this game on time was to be in the same physical space as my team. I threw Ben in the deep end. He was hired and then given work on platforms and engines that he’d never had to deal with before. Then when I buggered off to the UK he had to manage by himself, dealing with a performance contract on a big name title and then kicking on with Vessel in my absence. Poor bugger.

I also had a friend who was interested in the QA side of things, so James (an old Uni mate) was responsible for playing Vessel over and over and over again, reporting bugs to us and telling us that the performance improvements we’d just made hadn't made any difference. James also helped us out on the testing for the TRC (Technical Requirements Checklist - a massive list of tests that need to be checked off in order to get your game through submission to Sony) and the actual submission of the game.

Once I got back from my holiday we moved into the new office and started working in earnest on the Vessel port. Our goal was to finish Vessel by the end of August - we had about four man-months to do the work, and I had carefully calculated how much time each section of the port would take. Surely that was enough time?

Ben and I divided up the work between us: I took on the optimisation of the physics and Ben was to deal with pretty much everything else - audio was broken, some of the in game objects didn't animate, loading needed to be optimised, memory usage had to be reduced, we needed save games and trophies and a whole swathe of PS3 specific features implemented. Ben has written about some of the fun he had during the port here, here and here.

From my initial inspection of the code, I had noticed a lot of memory allocations taking place, so I was optimistic that I could reduce (or remove) these and scrape back a lot of frame time. I also figured that rearranging the memory usage to be more cache friendly would help the PlayStation 3’s main processor so I could leave the core of the physics code alone. The last thing I wanted to do was to port all of the physics to SPU.

Three weeks later I started porting the physics to SPU. The changes I was making were incrementally improving the performance, but I could see that we’d never hit the speed we needed without dramatic changes. Porting code to SPU isn’t like porting code from one platform to another. You can program SPUs in C++, but they have very limited memory available to them - only 256kb each. So, in order to port any significantly complex task, you need to understand how that task uses memory - where it is, what order it reads it, where it writes it - the things that most programmers of high level languages take for granted. Not only that, but you have to consider how the task will parallelise. Running on one SPU is good, but running on 6 is awesome.  (I’ll go into more detail on the SPU porting process in later articles.)

The first SPU task I built was one that looked for drops that were near a set of objects. My first shot at this shaved almost 2ms off the frame time - the function in question dropped from a peak of 2.4ms to 0.6ms! That meant that in order to hit 16ms per frame I just needed to do this 21 more times!

Still, Ben and I kept hammering away at the code. By the end of the first week in August I’d shaved 15ms off the Physics thread and yet that thread remained the bottleneck. Ben had freed up over 200MB of memory and we were close to running on a retail PS3. But we’d started to notice a few things - some of the levers and triggers in the game weren’t working, fluid shaders weren’t working properly, we were getting spikes of over 200ms in some cases and the game still wasn’t playable due to not just the levers not working but the AI of the fluros seemed to be broken (some would teleport away).

It was evident at this point that we weren’t going to hit the end of August deadline. Other projects had taken some time from both Ben and myself, so we still had billable time available going into September, so I let Strange Loop know that things were going to take a little longer than I had expected. They were very understanding and comfortable with the release date pushing out a little more.

Things were progressing well though - by the end of August the physics thread was running at better than 33ms per frame - we were over halfway there! Save games were in place, loading was optimised, and trophies were functioning. Unfortunately, the game thread was slowing down - as Ben was fixing problems with the code and data, more things were being processed. For example, audio wasn’t working before and now there was a lot of audio being triggered so this obviously took more processing time.

It was the second week in September when I realised that I’d made a horribly flawed assumption about the game thread in Vessel. I discovered that it had to function in lock step with the physics thread and so it too had to run at 16ms per frame. Reducing the framerate of the physics wasn’t an option as it was fixed to 16ms per frame - experimentation with varying that resulted in fluid collisions breaking horribly.

Then Ben discovered the reason for the teleporting fluros - a compiler bug triggered only in optimised builds was causing fluros to incorrectly calculate their trajectory during jumps, making them disappear. Turning off optimisation for that function fixed the problem. That was a tricky bug to find, so we were pretty stoked - until we realised that this meant that the number of fluros in view was now higher that it was before. Much higher. More fluros means more processing - more physics, more drops, more rendering, more audio. This one fix resulted in the Physics thread jumping back up to 50ms from around 24ms, the render thread jumping up to 12ms and the Game thread climbing to 40ms.

I was horrified. We were effectively back to where we had started. We now needed to save a total of 60ms over three threads. I wasn't sure we could do it. It was like that bit in horror movies where the protagonist has killed the evil monster, only to have it rise up, stronger than it was before.

The initial optimisations on a project are the easiest. As time goes on and all the low hanging fruit has been picked, the remaining optimisations are harder and harder to perform. We really had our work cut out for us - and we'd come too far to go back now.

Visualisation of the code and data changes made for the Vessel PS3 port.

]]>
tony@overbyte.com.au (Tony Albrecht) Programming Thu, 27 Feb 2014 23:11:19 +1030
Official Vessel Release Date! http://overbyte.com.au/index.php/overbyte-blog/entry/official-vessel-release-date http://overbyte.com.au/index.php/overbyte-blog/entry/official-vessel-release-date Overbyte and Strange Loop Games are proud to announce that the PlayStation 3 version of Vessel is due for release on 11 March in the US regions and 12 March for the European and Australasian regions. The official announcement can be found on the PlayStation Blog

## Minimising the memory bottleneck

So, what can we do about it? We need to use memory, so how do we minimise the impact of these long reads from main memory?

The simplest way is to just use less memory – minimise the size of your regularly used (hot) structures. Do you really need to an array of rarely used data (eg char name[32];)  in the struct or can you just maintain a pointer to it? If it is a large structure, can you break it up into hot and cold parts, or at least parts that are used in sequence?

The third optimisation is to use a special instruction that will start fetching some data for us, ideally well before we need it so that when we do need it, it will hopefully be in the L2 cache ready for us. This is called prefetching and, when used effectively, can have a dramatic effect on your code’s performance.

## Prefetching

On the PlayStation3, the prefetch instruction is __dcbt(loc), where loc is the location of the data to be prefetched. It triggers a load of a single cache line (128 bytes)  into L2 cache. Let’s add one into our loop and see what happens:

Well, our stalls seem to have vanished, but what about the performance? The original code was running at around 10.8ms – this code now executes in 4.5ms! Nice! We’ve sped up our code by a factor of two by adding an instruction.

I chose (p+10) as the location to prefetch as the inner loop is about 40 cycles and fetching 10 particles ahead corresponds to 10*40 = 400 cycles which is approximately the time taken to fetch data from main memory. The graph below shows the different  times taken for different prefetch distances:

The first column corresponds to the code without the prefetch instruction, while the second column prefetches the particle that we are about to use in that iteration – note this indicates the cost of the prefetch instruction is non-zero, but very soon the effect of prefetching mitigates that cost. At 8 to 10 particles look ahead (192 to 240 bytes) we start to plateau. Now it is worth noting that the __dcbt() instruction loads in 128 bytes at once, so when we’re calling __dcbt() within our loop, we’re actually prefetching the same cacheline 5 times. This is still cheaper than testing if the next particle to be prefetched is in the same cacheline or not (if we unrolled the loop to do 5 particles (5 * sizeof(MyParticle) = 120)  at once we’d get better prefetch distribution (as well as pipeline utilisation), but we’ll look at loop unrolling in a later lesson).

## The reason for the dual stalls

The reason for the dual stalls in the original code is due to the size of MyParticle relative to the size of the cache line. When a single particle (MyParticle is 24 bytes in size) is loaded, an entire 128 byte cache line is loaded. This means that 5 particles will be loaded in with that one, as well as part of the next one.  The following diagram shows the distribution of particles over 3 consecutive cache lines (the pattern repeats every 3 cache lines).

The first case where the struct aligns exactly with a cache line results in a hit on the first stall as the x velocity is used. The next 4 particles are all within that cache line and so do not incur another cache miss penalty. The first miss happens loading the 6th particle and contributes to the first stall once again – py is in cache, vx is the first miss.  The next miss contributes to the second stall as vx is already in the cache and the processor has to wait while vy is fetched from main memory. This behaviour matches our observations with approximately 2/3 of the hits being on the first stall and 1/3 on the second. When we added prefetching both of the stalls were removed (as we would expect).

## Summary

What we’ve seen here is how to identify and fix a simple system which is preforming sub-optimally due to cache misses. The memory used is already contiguous and homogeneous, so prefetch look aheads are easy to implement. More complex memory access patterns can be considerably more difficult to optimise, so I would suggest that one of your first passes at optimising a system (after algorithmic optimisation) is to simplify your memory access. Determining what data is used in a tight loop and then ensuring that data is stored in a cache friendly fashion will not only give you an instant speed boost, but will make it easier to prefetch and thereby optimise further.

When designing a system that needs to be fast, it is critical that you consider the layout of the memory being used. Optimal code is irrelevant if your data is badly organised.

TLDNR: put all your data in flat arrays.

# Coming up next

Now that we’ve removed the memory bottleneck, we’ll next have a look at further optimising this code by addressing the instructions used. In particular, we’ll start using some SIMD instructions and see how that impacts our performance.

Until then, remember, it’s all about the memory.

]]>
rtaubert@sunshinedc.com.au (Tony Albrecht) Programming Fri, 23 Mar 2012 15:46:00 +1030
Optimisation Lesson 2: The Pipeline http://overbyte.com.au/index.php/overbyte-blog/entry/optimisation-lesson-2-the-pipeline http://overbyte.com.au/index.php/overbyte-blog/entry/optimisation-lesson-2-the-pipeline

Before we start on the second in this series of optimisation lessons I’d like to make something clear. The type of optimisation that I am predominantly talking about in these lessons is low-level optimisation, or micro-optimisation.  This type of optimisation should only be applied once you are sure that it will definitely have a benefit, once you have exhausted all obvious higher level macro-optimisations. There have been (too many) times in my coding career when I have spotted some obviously inefficient code and spent too much time joyously optimising it only to end up with rigid, highly optimised, unmaintainable, very clever code that makes absolutely no difference to the overall performance. Or, even worse, it runs slower.

Your first task in optimisation is to figure out what is slow, then why that something is slow (and sometimes when) and then remedy that.

## Understanding the sample distribution

If you can cast your minds back to Lesson 1, we had profiled our particle system and had started examining the PC samples obtained from our profiler as depicted below.

I’ve highlighted the two lines with the highest hits and we’ll start by focusing on the second, biggest one first.

 blt 0x00011B30 

From the interactive C++ to ASM page from the last lesson we can see that this is a branch command and corresponds to the following C++ code:

 if(p.m_Position<0) 

To understand why this instruction is such a bottleneck we need to understand a little about how the underlying hardware functions – in particular the CPU pipeline. Modern CPUs can issue a command or two every cycle. These instructions take a variable number of cycles (depending on the instruction) to produce an output, but (and this is important), multiple instructions can be in flow at the same time – each one at a different stage in the pipeline.

For example, consider the following code which does 3 single precision floating point adds:

 fadds f4,f4,f0   fadds f13,f13,f30   fadds f0,f31,f29 

If each fadds instruction takes 10 cycles, how many cycles does it take to complete these three instructions?

The answer, as with most things when you get deep enough, is “It depends”. In order to simplify this problem let’s assume that the registers that are being used already have the relevant data in them – this way there is no stalling as the fadds instructions wait for the registers to fill from D cache or main memory (we’ll deal with that can of worms in the next lesson).

The following diagram shows some of the details of the PPU pipeline. It’s from this article and is definitely worth reading if you are at all interested in the Cell BE architecture.

Part of the PPU Pipeline

Each box in this diagram corresponds to a cycle – for example, the first 4 boxes, IC1 through IC4 correspond to the time taken for the processor to fetch instructions from the Instruction Cache –  4 cycles –  followed by a two cycle dispatch and then three cycles to decode and three more to issue the instruction to either the Branch Unit, Fixed Point Unit or the Load Store Unit. All instructions have to go through this 12 cycle front end, but if this pipeline is fully populated then a new instruction can be dispatched every cycle. The time taken for an instruction to enter, pass through and then leave a pipeline is called it’s latency. I won’t go into too much depth on the details of CPU pipelines here – check your hardware documentation for more information (for consoles this is generally under NDA anyway) and/or have a read of this Ars Technica article.

Back to the case in hand: The FPU pipeline is 10 cycles long which means it will take 10 cycles for a single instruction to pass through that pipeline (the time taken to get the instruction to the FPU can be ignored for this calculation). A new FPU instruction can be issued every cycle after that; this means that f0 from the third fadds will be ready after cycle 12.

It is worth noting at this point that some instructions (such as divide and square root on the PPU), do not pipeline at all. This means that all the instructions after that command will stall until it has finished, reducing throughput significantly (some non-pipelined instructions take more than 80 cycles).

The more attentive amongst you may have noticed the reuse of the register f0  in the third fadds – this is not an issue as once an instruction is committed into the pipeline the input registers are no longer relevant to that instruction. The compiler will take care of dependencies for you, so you shouldn’t need to worry about it, merely be aware of it.

## Branching

“So,” I hear you ask, “what does this have to do with the branch statement in the code we’re looking at?”

Well, I’m getting to that. Branching can be an expensive operation and the PPU tries to minimise its impact by trying to predict which way the code will branch. It does this either statically (the compiler or programmer specifies which code path is most likely taken) or dynamically (the PPU keeps a table of branches and which path the most recent branches have taken and uses that to infer the most likely branch).  Check out Mike Acton’s article on branching for more details.

A mispredicted branch will incur a complete pipeline flush which is quite expensive (I’ve not mentioned the exact value for the PPU as I’m not sure if it is covered by NDA but you should be able to infer it from the earlier pipeline diagram) and even an accurately predicted branch has a slight cost.

If we once again look at the performance graph from the first lesson we are now in a position to understand what is happening to produce this peak on the right:

Graph of execution time taken over frame number (from Lesson 1).

The flat part of the graph on the left corresponds to the expansion of the particle ball while all particles are above the collision plane. When the particles start to collide with that plane, the time taken increases dramatically, rising from around 15ms per update to nearly 60ms (when nearly all the particles are at rest on the collision plane). This increase is processing time is due to not just the extra processing when a collision occurs (three fp multiplies and a fp negate), but also due to the branch and (potential) associated pipeline flush.

## Removing Branches

Fortunately, there is an instruction that can help us here; fsel (floating point select). This instruction provides a way of effectively performing our own speculative execution and then choosing one of two operands based on a third (see below):

 d = __fsel(a,b,c); // is the same as the following code but with no branch instruction if(a>=0)      d = b else     d = c; 

The other benefit of this instruction is that it is pipeline friendly – it also takes 10 cycles and our pipeline is left intact once it has completed. And we can have multiple of these instructions in the pipeline at the same time.

So, let’s rewrite our particle code without any classes (to make it easier to see how it maps to asm), pre-emptively calculate the bounced position and velocity and pick the resultant position and velocity via fsel() commands based on the y value of the particle’s position.

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 void ParticleManager::Update(float fClock) { const MyVector dampVelocity(0.75f,-0.5f,0.75f); const MyVector v3Acceleration=MyVector(0.0f, -9.8f, 0.0f) * fClock; // a*t const MyVector at2 = v3Acceleration*fClock*0.5f; // ½a*t^2   MyParticle *p = &m_Particles[0]; MyParticle *pEnd = &m_Particles[m_NumParticles]; for (; p<pEnd; p++) { float newPosy = p->m_Position.y; float newPosx = p->m_Position.x; float newPosz = p->m_Position.z; float newVelx = p->m_Velocity.x; float newVely = p->m_Velocity.y; float newVelz = p->m_Velocity.z;   newPosy = newVely * fClock + newPosy + at2.y; newPosx = newVelx * fClock + newPosx + at2.x; newPosz = newVelz * fClock + newPosz + at2.z; newVelx = newVelx + v3Acceleration.x; newVely = newVely + v3Acceleration.y; newVelz = newVelz + v3Acceleration.z;   float bouncedPos_y = -1.0f*newPosy; float bouncedVelx = newVelx*dampVelocity.x; float bouncedVely = newVely*dampVelocity.y; float bouncedVelz = newVelz*dampVelocity.z;   // choose vel based on y pos newVelx = __fsels(newPosy,newVelx,bouncedVelx); newVely = __fsels(newPosy,newVely,bouncedVely); newVelz = __fsels(newPosy,newVelz,bouncedVelz); // set y based on y pos newPosy = __fsels(newPosy,newPosy,bouncedPos_y);   p->m_Position.x = newPosx; p->m_Position.z = newPosz; p->m_Position.y = newPosy;   p->m_Velocity.x = newVelx; p->m_Velocity.y = newVely; p->m_Velocity.z = newVelz; } } 

And here is the assembly generated by compiling the above code.

 0x000003D0 bge 0x00000458 0x000003D4 lfs f9,0x4(r4) 0x000003D8 lfs f10,0xC(r4) 0x000003DC lfs f11,0x10(r4) 0x000003E0 lfs f12,0x14(r4) 0x000003E4 fmadds f9,f1,f11,f9 0x000003E8 lfs f13,0x0(r4) 0x000003EC fadds f0,f10,f5 0x000003F0 lfs f31,0x8(r4) 0x000003F4 fadds f11,f11,f4 0x000003F8 fadds f30,f12,f5 0x000003FC fmadds f10,f1,f10,f13 0x00000400 addic r5,r4,0x18 0x00000404 fmadds f12,f1,f12,f31 0x00000408 fmadds f13,f6,f2,f9 0x0000040C fnmadds f9,f6,f2,f9 0x00000410 fmuls f31,f8,f0 0x00000414 fmuls f29,f7,f11 0x00000418 fmuls f28,f8,f30 0x0000041C fadds f10,f10,f3 0x00000420 fadds f12,f12,f3 0x00000424 fsel f9,f13,f13,f9 0x00000428 fsel f0,f13,f0,f31 0x0000042C fsel f11,f13,f11,f29 0x00000430 fsel f13,f13,f30,f28 0x00000434 stfs f10,0x0(r4) 0x00000438 stfs f12,0x8(r4) 0x0000043C stfs f9,0x4(r4) 0x00000440 stfs f0,0xC(r4) 0x00000444 stfs f11,0x10(r4) 0x00000448 stfs f13,0x14(r4) 0x0000044C clrldi r4,r5,32 0x00000450 cmplw r4,r3 0x00000454 blt 0x000003D4 

If you have a HTML5 compliant browser, then you can interactively walk through the ASM and C++ code in another window by clicking on the following image:

As you can see from this assembly (particularly via the interactive walkthrough), the only branches are associated with the for loop which iterates through all of the particles.

We would expect this code to run faster as there are no more data accesses and there are less stalls in the worst case scenario (all particles mis-predicting a branch), but how much faster? The following graph shows the performance of the code in lesson 1 against this lesson’s performance.

As you would expect, the performance is constant as there is no branching based on the particle’s position, but what you might not have expected is that the baseline performance is better than the original solution. Even though we are calculating the bounced position and velocity in addition to the non-bounced ones for every particle, this code is still significantly faster than the branching one. Basically, we are doing more work in fewer cycles. The removal of the branch removes the need for a compare (which stalled waiting for the result of the increment of the y position) as well as the branch and, thereby allows better pipelining of the inner loop.

This is a pretty good performance improvement for a fairly small change, but finding wins like this in production code is rare. This type of optimisation really only benefits very tight inner loops; using branchless logic in higher level code will usually have no measurable performance improvement at the cost of obfuscating your code – so choose wisely.

The important things to take away from this lesson are:

1. Instructions pipeline. Taking advantage of that can give you big performance wins
2. Branching can be bad on certain architectures (primarily because it trashes the pipeline)

## Coming up next

So where to next? Let’s have a look at the PC samples of this new code and see where the bottleneck has moved to:

Samples attributed to C++ code

Most of the samples in this code seem to have coalesced around the calculation of the y position of the particles. The reason for this is very interesting and we’ll address this in the next lesson.

So endeth the second lesson. Please leave a comment if you’ve spotted an error, need something clarified or would just like to leave some feedback.

 I’ve just been pointed at an excellent article on pipelines and branchless maths: “Down With fcmp: Conditional Moves For Branchless Math”

]]>
rtaubert@sunshinedc.com.au (Tony Albrecht) Programming Thu, 10 Nov 2011 15:43:00 +1030
Optimisation Lesson 1: Profiling http://overbyte.com.au/index.php/overbyte-blog/entry/optimisation-lesson-1-profiling http://overbyte.com.au/index.php/overbyte-blog/entry/optimisation-lesson-1-profiling

The performance of a game is critical – it defines the scope of a game. It specifies how many, how far, how much detail, the level of freedom, even the length of the project. Whether you are an artist, designer, producer or programmer, the performance of a game has an impact on you. As Jeff Atwood recently stated, “Performance is a feature” and as with all features, they need to be scheduled, researched, implemented and maintained.

This article is the first in a series based on the Masterclasses which I’ve recently presented for Game Connection, and will walk you through several phases of optimisation performed on a simple piece of code. It is primarily aimed intermediate programmers  or experienced programmers who haven’t had much to do with performance and low level programming. The aim is to give a detailed example of finding performance bottlenecks and then addressing them. It should give you and idea of how code can be optimised and the magnitude of performance improvements you can expect from different changes. The series of lessons will cover profiling, branch removal, cache aware programming, loop unrolling and SIMD programming.

## Introducing A Simple Particle System

The first piece of code we’re going to look at is a very simple particle system. We will look at the update phase only, ignoring the rendering and GPU performance for this example considering only the CPU for now. The CPU in this case is the PlayStation®3′s PPU and external profiling is done with SN System’s excellent Tuner.

There are a couple of simple classes, the vector class  MyVector and the particle class, MyParticle.

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 class MyVector { public: MyVector(){}; MyVector(float X, float Y, float Z) {x=X;y=Y;z=Z;}   inline const MyVector& operator+=( const MyVector& rhs ); float x,y,z; };   inline const MyVector &MyVector::operator+=( const MyVector& rhs ) { x += rhs.x; y += rhs.y; z += rhs.z; return *this; }   inline MyVector operator+( const MyVector& lhs, const MyVector& rhs ) { return MyVector( lhs.x + rhs.x, lhs.y + rhs.y, lhs.z + rhs.z ); }   inline MyVector operator*( const MyVector& lhs, const MyVector& rhs ) { return MyVector( lhs.x * rhs.x, lhs.y * rhs.y, lhs.z * rhs.z ); }   inline MyVector operator*( const MyVector& lhs, float rhs ) { return MyVector( lhs.x * rhs, lhs.y * rhs, lhs.z * rhs); } 

and the particle class;

 1 2 3 4 5 struct MyParticle { MyVector m_Position; MyVector m_Velocity; }; 

The particles in this system will accelerate under gravity and will bounce on the y equals zero plane with some damping applied to approximate energy lost in collision. If we can remember back to our high school maths, the equations of motion are;

${v}' = v + at$

${p}' = p + vt + \frac{1}{2} at^{2}$

So, the update loop for the code is:

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 void ParticleManager::Update(float fClock) { int iParticle; const MyVector dampVelocity(0.75f,-0.5f,0.75f); const MyVector v3Acceleration=MyVector(0, -9.8, 0) * fClock; // a*dt   for (iParticle=0; iParticle<m_NumParticles; iParticle++) { MyParticle &p=m_Particles[iParticle]; // v' = v + a*dt p.m_Velocity += v3Acceleration;   // Update the particle position // p'=p + v * t + 0.5 * a * dt^2 p.m_Position += p.m_Velocity*fClock + v3Acceleration*fClock*0.5f;   // if Particle is below the ground plane, then damp and reverse the velocity if (p.m_Position.y<0) { p.m_Position.y*= -1; // reflect position along y p.m_Velocity= p.m_Velocity*dampVelocity; } } } 

This code gives us the following output (for 256k particles)

A simple particle system in action.

## Step 1: Profile

The first thing we need to do is figure how fast this code is running, and for that we need a useful metric. Frames per second is borderline useless and a percentage of a frame is just as bad. Use a sensible metric like milliseconds or microseconds – then you can then accurately measure the length of a function independent of the frame rate. Also, if possible, capture profiling information over a number of frames. This will allow you to find spikes and patterns in the execution of your code. When optimising code, the spikes are generally what need to be addressed first (and are often the easiest to fix too).

Consider the following graph of this particle system. The vertical axis is milliseconds taken to process all of the particles and the horizontal axis is time (# frames)

Graph of execution time taken over frame number.

Note the shape of the graph – it starts at around 17ms per frame and then dramatically sweeps up to sitting at about 60ms per frame. There is obviously something interesting happening here that requires further examination. Lets run a profiler over this code and see if we can get a better idea of what is happening with this example.

There are two main types of profiling that are generally used; instrumented and sampling.

### Instrumented Profiling

Instrumented profiling relies on either manual or automatic insertion of code that stores a time stamp at set points for later analysis. Manual instrumentation relies on the programmer inserting code which wraps the code to be timed, for example, at the start and ends of a function.

 73 74 75 76 77 78 79 80 81 void FunctionThatIsTakingALongTime() { PROFILE_START("Function Name"); for(int i=0; i<BigNumber;i++) { Do_Something_Expensive(slowly); } PROFILE_STOP(); } 

Manual instrumentation allows a programmer to store and use the timing information in any way that they see fit - on-screen profile bars, graphs or possibly profile data output to file for analysis in a separate application. It is the most versatile method of profiling, but care must be taken that the actual implementation of the profiling stubs aren’t too intrusive, slowing down the execution of the code being profiled. The graph above was produced using manual instrumentation.

Automatic instrumentation is performed by most modern profilers, and will allow runtime insertion of start/stop profiling stubs into code. This method is very useful, allowing a programmer to change which functions to instrument at run time, but relies on the profiling data being collected and displayed by an external application.

Another form of automatic instrumentation is call profiling which relies on the compiler inserting pre- and post-amble profiling code to every function. This method of profiling gives 100% coverage but is very intrusive (it slows down the running of the profiled code significantly).

With both types of instrumented profiling, the more instrumentation that occurs, the more intrusive it is and the slower the code will run. Instrumenting small, high frequency functions can artificially bloat the time taken in those functions and their parents, so care must be taken not to instrument too much.

Example of high frequency instrumentation slowing performance.

In the above diagram, each coloured bar corresponds to a function call, with the top ones calling the ones directly below them and so on, down to the fine grain leaf functions at the bottom. While this visual representation of the code path is useful for understanding the flow of code, it does slow down parts of the code non-uniformly and isn’t necessarily an accurate depiction of the relative performance of the profiled functions.

### Program Counter Sampling

This form of profiling requires an external profiler which samples the executing code’s program counter (PC) at set intervals over many frames, collecting these values for later analysis. The Program Counter is a CPU register which stores the current address of execution. Each PC value and the time that it was taken at can be reconstructed to determine which functions were being executed at which time. From this data we can see the overall execution footprint of the executing code – slow code will have more samples attributed to it while the fast code will have less.

PC Sampling Example in SN Tuner

In the above example, we can see that memory allocation is taking a long time and should be investigated further. The annotation has been added manually.

PC sampling, as with instrumentation, can be intrusive and affect the execution speed of the code. The higher the frequency of sampling, the more intrusive it is (although, in this case it has a uniform impact across the entire frame) , so it is advisable to minimise the sampling frequency and collect over many frames instead to get a more indicative performance measurement unless you explicitly need high frequency information for some reason.

### Profiling the Particle System

So, back to the problem at hand; we profiled the particle system using PC sampling and the profiler  attributed the samples to the following lines of code:

ParticleManager::Update() samples

The first column is the total number of samples at that PC, while the second column is the number of samples for that frame. Basically, the larger the number attributed to a line, the slower that line is to execute. The problem is that the PC samples correspond to particular assembly instructions and the application is responsible for determining which lines of C++ source those instructions are part of. This is a non-trivial problem (See the interactive example at the end of this lesson), as optimised code will generally move instructions around into an order which provides the best performance, so what we are left with is an approximation of where the slowest lines of code are. At the function granularity these samples are accurate, just not at the level of C++ instructions.

So, in order to accurately see what is taking the most time within a function, we must look at the disassembled instructions for this code fragment. I know what you’re thinking. You’re thinking

“Assembly!? I can’t read that. Only really old guys and freaks can read that!”

But don’t worry, it isn’t necessary to be fluent in the asm of your platform of choice, but it is important that you are able to understand the gist of the disassembled code. It’s like being in a foreign country – you don’t need be conversational (although that is ideal) but it will help immensely of you know some of the basics; the “please”, “thank you” and “Excuse me, where’s the bathroom?” – the phrases you use all the time.

For this lesson, I’ll not attempt to teach any assembly. I’ll leave that as an exercise for the reader  - it should suffice for this PPU based example to know that instructions starting with an ‘l‘ are load instructions, and those starting with ‘s‘ are stores. Instructions starting with an  ’f‘ are floating point operations – the full mnemonic is generally enough to guess the rest. If you want excruciating detail, the PPC books can be found here.

The next image shows the PC samples attributed to the individual assembly instructions. This tells us accurately which instructions are taking the longest to execute.

The numbers on the left are the samples attributed to that line of assembly, and this tells exactly what code we should be looking at. So, how do these instructions relate to the C++ code?  Well, I’ve coded up a simple interactive page which will allow you to step through the C++ of the inner loop and its generated assembly instructions interactively (as long as you have a HTML5 compliant browser that is – otherwise you’ll just get a blank page. In which case you need to go here). The C++ code being executed is highlighted on the right and the corresponding ASM on the left is shown so you can see what the code compiles into. Take note of the fact that there isn’t a direct linear relationship between the C++ and the ASM, the generated ASM can be quite out of order (even in this simple example). The blue box in the bottom right give a description of what is happening for that line.