Overbyte - Blogs from Ben Driehuis http://overbyte.com.au/index.php/overbyte-blog/blogger/listings/ben-driehuis Fri, 22 Sep 2017 14:11:53 +0930 en-gb Vessel: Common Performance Issues http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-common-performance-issues http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-common-performance-issues It is time for me to write my final blog on the adventure that was, ‘Vessel PS3’, since my time at Overbyte is coming to an end. This project was tough, however, it is one I am proud we have conquered. There are a lot of interesting and cool optimisations that I myself have done whilst on this project, but since I won’t have the time to write a few blogs about them all, I thought I would highlight a few of the common things I found with some examples of things we did. I will leave some of the more interesting ones for Tony to cover later.

Using containers badly

There were quite a few places in Vessel in which we had to fix up the use of vectors, lists and other similar containers. Obviously you need to use the right ones, to do the right job by understanding what your code needs your container to do and then picking the right one.

But what if you choose the right container or a near enough one? There are still many performance traps you can quickly run into.

Let’s do a quick example. What is wrong with the following code?

vector<PipeEffectDefinitionItem> defs = m_EffectDefinition.GetRef()->m_Definitions;

for(int i=0; i<m_vfxInstances.size(); ++i)


   PipeEffectVfxReference& vfx = m_vfxInstances[i];



Did anyone notice the problem? We are grabbing a vector of these PipeEffectDefinitionItems from another object, but instead of using a reference, we are copying them. This happens in an update code of a particular class, which can be called a few times a frame due to other interactions. In each of them it was doing the same thing. Sometimes it is valid to copy a container, if you know the data might change or it is temporary. However in this case, we are referencing static data, so changing the first line to vector<PipeEffectDefinitionItem>& means we don’t need to allocate a whole lot of memory.

Memory allocation and de-allocation was a huge problem for us, with all the threads fighting each other. I have seen many code bases suffer from the same problems in my career. The things we found ourselves doing often in Vessel, was to attempt to work out the size of containers outside a loop, instead of just adding items to them inside it and causing lots of resizing of the containers.

Caching Data

Another common way to solve performance issues is to cache data. One of Vessel’s biggest function costs on the game thread, was a lookup which searched all persistable items in our game to see if it matched a HashID. The idea behind it being, that since some things are spawned at run time and some during load, they could safely reference each other and this system would look it up and let you access it, if it existed. This would be done throughout the code base for each frame and whilst the lookup itself was quite efficient, it was still expensive.

When dealing with a code base you don’t really know, the hardest thing is learning why things are done the way they are. Tony’s blog a few weeks ago talked about this. You cannot assume that the programmer before was stupid. They wrote the code the way it was for a reason. Sometimes they just wrote it in the most simple way and didn’t have the same needs/performance considerations you do. Sometimes the game/systems around it changed which means either a system is adjusted or the system is used in a way that it was not intended.

In this case it was clear that on PC the performance cost of looking it up each time was minimal to nothing, whereas the effort required to track object lifetimes, caching and removing cache, handling level transition and everything else would result in such an insignificant gain it was never looked at. For PS3 however we needed to cache these lookups! Here is the original call to GetRef.


And the cost of the function over a typical frame:


Caching this value required a few changes and a few assumptions. We knew that all things referenced by this system were all persistable, so we could add our required call back functions for both destruction of the persistable, as well as the removal of the reference to it on that class. We also assumed that the constructor would only be used to create new items, not used to re-assign items. This meant that we knew that we never had to check the cache in constructors. For all of our assumptions we had to write code to catch when we were wrong and adjust lots of other code around these. But in the end, what we were able to do is change GetRef to:


And now looking at its performance, we see a 0.5ms improvement.


Inefficient Data Searching

The first thing you should probably do in a code base when looking to optimise is to search for something like TODO: Make faster. Chances are it is probably slow. Whilst I am kidding, I did find one such comment after noticing a function using a lot of time via hierarchy view. This function was used to find both functions and properties so that lua and the game code could talk to one another. Here are the pieces of code in question.


There are two inefficient things in this code. Firstly we are using a vector to find the functors and properties, where we never need to iterate through the list, well at least not in any time critical, runtime code. Secondly we are doing a string comparison, ignoring case. These aren’t cheap either.  Here is a summary of a profile of this code running in one of our levels.


The solution here it two fold. Firstly we should be using an associated container as we know that

a)      The container is only added to as a start up cost, which sets up the links between the strings and the actual functors/properties.

b)      The time critical part of this function is a lookup based on a key

c)       We almost never do iteration through all elements (except to find the key as above)

Secondly we know that the string compare is not fast. In a case like this you want to convert it to a hash which will be a lot faster. The Vessel code base already uses hashes instead of strings in most places, but this one was never changed. So we now have the follow code:


The difference we can see below, a small saving, but the key thing in the max is reduced, so less peak slow downs. It is always good to review any functions which are called a lot, especially if they seem to be taking a lot of time in comparison to the work they should be doing.


Doing things unnecessarily

If you don’t understand a codebase, like in our case when we started Vessel, knowing what each function and system is trying to do and stopping it from doing things it doesn’t need to do can be extremely difficult. Now that we are at the end of the project, we are pretty familiar with the code and thus have found many areas where the code was doing things it didn’t have to, or being overly complicated.

For instance in Vessel there are objects called Water Instances. There is one water instance for each specification of liquid in the game. There could be several specifications of Water or Lava. The problem was the water instance code also managed sounds and lights for the liquids based on flow, collisions and other information. This code would find all liquid of a liquid type (Water, Lava, Red Goo) and play/stop sounds and generate lights for them. As we could have up to 9 water instances of a single type, this meant the same liquid was processed 9 times and 9 x lights and sounds were being played.

Obviously the issue here was that originally water instances and water types were a 1 to 1 relationship, but as the game grew, this moved apart and as there was no performance impact on PC, it wasn’t revisited. But on PS3 this was causing us huge issues, going through all the liquid in the game is slow, so to do it 9x more then necessary was bad. Also fmod was not appreciating the heavy load it was being put under and the GPU was struggling drawing 9 lights in the same location. Here is the profile:


We simply just cached if we had done the lighting and sounds checks for a liquid type each frame. If we had, we didn’t do it again. This did have some repercussions, as the content as all tweaked around this. So we had to increase sound volumes and some other settings, and the intensity of lights.

Further inside this class it was handling collisions and keeping a list of data about the interactions between drops of the fluid. This list was reusing a pool of data, to stop memory allocations and de-allocations. The list was checked each frame to remove any which were now obsolete and then checked for room to add new records later.

This code again was quite slow, as it was going through 100 records per 35 water instances. And most of the time, as most water instances do not have liquid on screen, there was no records to clean up. However the code did not know this and continued to check each empty record. Thus we did a minor optimisation to this, making the object know if it had any records being used. This way the code did nothing for the majority of the water instances.

This was a much smaller saving, but an indication of some of the real simple changes you can make to stop code doing things when it doesn’t need to. Let’s have a final look at the water collisions:



Anyway I hope there was some interesting information in there for you. Thankyou to everyone who took the time to read one of my blogs. Tony has heaps more in store for you, so watch this space!

ben@overbyte.com.au (Ben Driehuis) Programming Tue, 31 Dec 2013 09:33:04 +1030
Vessel: Where is the memory going? http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-where-is-the-memory-going http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-where-is-the-memory-going Going into Vessel we (wrongly) assumed that memory usage would be not much of an issue, at least in terms of peak memory usage and getting it running on a console. However we soon learnt that the initial build of Vessel we had could not even load the main menu or any of our smallest levels without running out of memory on a retail PS3.

The first step is to find out what memory we are actually using and where is it going to. This might sound obvious and simple, but depending on your codebase and what engine and middleware you are using, there may not be any easy ways to find this out. With Vessel using its own custom engine written for a platform where memory isn’t a great concern, we had no tools to start with.

The PS3 has two areas of memory to use, 256mb of main system XDR memory and 256mb of RSX memory. This memory is completely separate from each other, with the Cell CPU being able to access the system memory fast, whilst the GPU accesses the RSX fast. Both processors could access the other area of memory but at a performance cost. Some good information on this can be found here, a few paragraphs down: http://www.playstationlifestyle.net/2013/02/27/ps4-memory-analyzed/


So we started trying to track down our usage. We used our RSX allocator to track current and peak memory usage and used sony api calls to track our system memory usage. What we soon learnt got us a little worried. We were way over memory on system (383mb used on our smallest level!) and running out of memory on RSX for most levels.

Middleware wise, we use FMOD, Lua, Boost and TinyXML. FMOD has some easy built in API to get memory usage and surprisingly I learnt that it was using 180mb just a minute into our gameplay… Thankfully this was a case of bad data, as the PS3 FMOD project had not been set to stream the music! This reduced it to a more reasonable 80mb, still high, but since all sounds which aren’t music are being loaded into memory, it was at least what we expected. So where is the other 200mb of system memory going?

Whilst we had no internal memory tracking tools, thankfully the PS3 does have a third party solution to tracking down memory called Heap Inspector. This solution developed by Jelle van der Beek and is free for PC and PS3 developers. Setting this up is quite straight forward, it just requires you to wrap your memory allocation and deallocation calls. An easy way to do this, if possible, is to use the link-time wrapping feature (--wrap). You can find the website for Heap Inspector here http://www.heapinspector.com/.


This gave us a very good rundown of exactly where the system memory was going. A big chunk of our memory was being used in the tinyxml system, which seemed strange. Since we were on quite an old version, we decided to upgrade (as there were both significant performance and memory improvements). As this involved a big api change I had to rewrite most of the xml interface code, which actually lead me straight to the major memory problem, after loading we were not clearing out the document handle properly, which meant the file buffer was being kept in memory. Since this was all raw xml, this was between 15-20mb per level, and our global data was another 30-40mb.

So now we can load most of our levels, but we couldn’t transfer between most of them. Further analysis with heap inspector showed us that fmod memory was increasing as the game went on and that a big chunk of our system memory was there to catch the rsx memory overflow. Again the fmod one wasn’t too hard to track down, originally Vessel didn’t have to worry about memory, so it never unloaded any sound data. Even if that sound was never played again. Whilst doing this after it finished playing would be bad and cost us performance (as when the sound is played again it would have to be loaded), on level transfer it was safe to clear our all the fmod event data. This meant it wouldn’t grow as the game went on.


Now we needed to solve the rsx overflow and looking at the tracking we had added ourselves, all of that overflow was due to texture data. Most games have texture streaming and a texture pool, but Vessel does not and just loads all level data. The levels are smallish though so we shouldn’t be using so much RSX! I soon discovered we were only overflowing on level transition and that’s when I worked out what was happening. On transition, we first load in the new level and data and then went through the old data and only unloaded anything that wasn’t reused. This is a nice optimisation for load times when you have memory to spare, but we didn’t have that luxury. Was there a way I could keep this in and not overrun our memory?

The good news is that the actual binary data and the data object were separated and thus what I did is change the system so it queued up the loading of the actual textures, but still loaded in the normal data objects and thus could still compare if we needed any of the texture and other asset data. Once we had unloaded all the not needed data, we then loaded in the new textures. Now we never went over RSX memory and we could get rid of our overflow!


The final memory issue we had was that sometimes loading our largest level we would run out of memory. This was because again it was using XML and the way the serialization system worked is that it read it exactly as an xml file. Thus we needed to allocate the file size buffer. Not always possible! So we had a few choices, change the serialization system so it could stream it in or find a way to reduce the file size.

Since TinyXML supported both compressed and binary xml data, which greatly reduces the size (as well as splitting the data over two files, instead of one large one), we decided to do that first. This reduced our largest level from needing a 21mb buffer to needed a 7mb and 3mb buffer. That was a lot more manageable and turned out to be more than enough for our needs. The nicer way to go would be to stream the data in and not need the buffer at all, however in our case we needed the fastest and least intrusive (since we need the PC version to work so we could access the editor).

So those were the major issues we had with memory usage. A lot of them are very specific to the issues we ran into, but the key things to remember are:

  1. You need to find exactly where the memory is going
  2. Use tools where possible, you don’t always have to re-invent the wheel
  3. For middleware, check for API calls, check updates and talk to their support
  4. Never ever trust the content!


ben@overbyte.com.au (Ben Driehuis) Programming Fri, 13 Dec 2013 10:18:00 +1030
Vessel: Lua Optimization and Memory Usage http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-lua-optimization-and-memory-usage http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-lua-optimization-and-memory-usage Hi everyone, my name is Ben Driehuis and I am a senior programmer at Overbyte.

Vessel uses Lua to do it's glue gameplay code, allowing it to use most of the systems and tech built up throughout to form specific gameplay related classes. When we first looked into the performance of Lua and Vessel in general, we noticed very quickly that all of the threads were doing a lot of memory allocation and frees each frame which meant there was actually a lot of contention between the threads.

Lua has its own incremental garbage collector and is performed automatically. The collector runs in small steps and run time is more or less based on the amount of memory allocated. Basically the more memory you are assigning in Lua, the more the garbage collector will run. You can also manually run the garbage collector in many ways, forcing a step or doing a full run (which is not cheap). The full run through is very handy for the cases where the user won’t notice an extra 10-30ms in a frame - a good example of this are at the start/end of a level load or when showing a pause menu or some other game pausing interface.

On console memory is always a huge factor, with most optimizations involving some trade-off between memory and performance. However in a lot of cases the trade-off is heavily skewed one way or the other. In vessel our garbage collection time was quite busy. We were forcing one step of the GC per frame, after our gameplay thread had done the work needed for the physics thread to do its thing.  The result of that in this particular scene was:


An average time of 2.6ms is pretty good, but the maximum of 7.4ms is a huge concern! As we learnt above, the step time is very dependent on the amount of new memory being used each frame in Lua. In addition to the manual GC we have Lua doing its normal background incremental garbage collection during execution. From the below you can see it working away during the game thread, indicated by the calls to propagatemark (See Tony's last article, A Profiling Primer, for more info on Tuner)


Looking at our markers we quickly noticed that our processor Lua scripts were also quite expensive in this scene (the two pink bars in the above image). This marker is the main place where our gameplay world objects which change each frame exist. Thus it is also the part where the most Lua is called. Let’s have a look at the script processor update performance currently. Looking at the image below, we can see it has an average of 7.1 ms. That is not good at all!


We then started investigating what short term allocations we were making inside of Lua to try and reduce our need to do the garbage collection. Our aim was to turn off the automatic background garbage collection and just run our step at the end of each frame, with that step not taking too much time. We used third party code from Matt Ellison, who is Technical Director at Big Ant Studios to track down and display the memory we were allocating and freeing each frame. A big thankyou to Matt for allowing us to use this.

So, in this scene we determined we had about 30kb per frame in temporary Lua allocations. That’s a lot and will quickly add up! Looking at the output from Matt's tool, we saw that 20kb of that allocation came from just one Lua script, Bellows.lua,  and all of that inside its SetPercent function. The output showed a lot of temporary Vector2 construction, including a few kb from our Utils.lua VectorRotate function called inside it.

Let’s take a quick look at those two functions.



Well I think we can now see where all that Vector2 construction is coming from! Inside that for loop, we are creating three Vector2’s per iteration and then the call to VectorRotate in the loop creates another one. So that’s five per loop through on each bellow.

We can also see here there are some unnecessary calculations. Bellow.GetRotation + this.OffsetRotation is done twice without either of those value changing in each loop. We also know that GetRotation is a call into native code which is more expensive then accessing a Lua member. Material Scale is also calculated in each loop, even though it never changes value. So instead of creating all of these Vector2’s inside the for loop, lets use local variables and remove the call to the VectorRotate. This gives us:



A good performance improvement on average and looking at the temporary memory use of Lua we now saw a maximum of 23kb a frame, a 7kb saving and an average of 20kb. Our output was now not showing any Vector2 allocations, we had reduced our memory waste by just under 25% and, as we hoped, our Lua was running faster.

We still have some temp memory each frame created for each bellow and ring which we can make into a local for the function. We can also store the reference to the maths functions as locals. This will give us a small speed up in performance.

Why is this? Because Lua allows each function to use up to 250 of its own “registers”, which means Lua stores all local variables into these registers. For non-locals, it first has to get the data/function and put it into a local and then use it. If you are calling or creating something once in a function, obviously making it a local will not help as you have the one time lookup cost. But in a loop, you are saving each subsequent call! So let’s run with the following code:



A huge performance win! Memory wise though we aren’t really seeing any big improvement, still using a maximum of 22kb of temporary memory a frame and an average of 19kb. At this point I decided to find other areas heavy in bellow usage to see if this would assist me in narrowing down where all the waste is.

This brought me to the start of one our levels, where the camera zooms out over a largish area. This was showing an average of 20kb a frame with a maximum of a massive 38kb! And 17kb of that was again the Bellows. The data showed lots of memory use by calls to GetMaterial and GetLocalRect. Performance wise for our ScriptOnProcessorUpdate, we were pretty close to the above image in terms of average, at 4.33ms, but with a higher maximum of a massive 8.6ms and lower minimum at 2ms.

Looking at the code we now have we can see two interesting facts regarding the use of GetMaterial and GetLocalRect. Firstly both are called on unique items once in the loop, so we can’t cache it. However both are used to then call a function on them. That’s two calls into our native code and a temporary allocation of an object pointer and a Rect. So what happens if we make two new functions in native code to let us do it in one call, without the temporary assignment?



Another good win due to less calls between Lua and the native code and less temporary allocation. We now have an average of 14kb of temporary memory usage per frame and a maximum of 20kb for this new scene. A whole lot better! Our bellows here are now allocating a maximum of 9kb.

We can also remove the offsetVector, as we know the X component is always 0 and so we can cut down the position calculations, as we know doing mathcos(bellowOffSetRotation) * offsetVector.x will give us a result of 0, so the mathcos is a waste of time. Thus those position calculations can be simplified to:

position.x = posBase.x - (mathsin(bellowOffSetRotation) * posOffset)
position.y = posBase.y + (mathcos(bellowOffSetRotation) * posOffset)

This doesn’t really save us much temporary memory as we are just saving on one float, but it is still a slight help and the performance is slightly better. At this stage we are getting pretty close to an optimised function without changing things significantly. So let’s take a relook at the performance of the garbage collection step at the end of each frame now.


So our average is down by 0.5ms and the maximum is lower too after dealing with just one of the lua classes. Add that to the 3.5ms saving we got in the processor update itself, which some of it will also be due to garbage collection savings as we are still running automatic background GC, that’s a huge saving on the game thread.

There are a lot more areas like this in the code base where we should get similar wins and will greatly reduce our lua memory churn. The golden rules to remember are;

  • use locals when you use something more than once a function
  • try to cache of data from native code if you use it more than once
  • make condensed functions to again reduce the calls to native code. Calling one native function which does all the work is much better then calling several native functions to do the same thing.

The next step is to turn off the background garbage collection and only run it at the end of the frame for a certain amount of ms. This will require us to keep an eye on the memory usage of lua and ensure we are doing enough cleaning up to mean we can get away with limiting the garbage collection time.

There is more information about Lua performance in the article titled Lua Performance Tips by Roberto Lerusalimschy (http://www.lua.org/gems/sample.pdf).

Once again, a huge thanks to Matt Ellison and Big Ant Studios for letting us use their Lua memory tracer.

ben@overbyte.com.au (Ben Driehuis) Programming Wed, 02 Oct 2013 08:55:00 +0930