Overbyte – Blogs from Tony Albrecht
http://overbyte.com.au/index.php/overbyte-blog/blogger/listings/talbrecht

# Optimisation Lesson 3: The Memory Bottleneck
http://overbyte.com.au/index.php/overbyte-blog/entry/optimisation-lesson-3-the-memory-bottleneck

Welcome to Lesson 3 of the Overbyte Optimisation series.

If you cast your mind back to the end of lesson 2, we were left with the following code and corresponding samples:

(Note that the samples here are not exactly the same as the ones in the previous lesson, even though the code is. This is due to a change in the version of the compiler being used – we now have faster, more interesting code)

From looking at the source, there is nothing obviously expensive in the calculations at the highlighted points – no branches, square roots, divides or other non-pipelined instructions. So, there must be something other than instruction latency causing the performance problem.

Let's have a look at the disassembled code and see if that sheds any light:

I’ve highlighted the two biggest samples – almost an order of magnitude larger than the other samples (note that the order of the samples is swapped – this is a good example of the compiler reordering instructions into a more optimal format). From the last lesson, we know that fadds and fmadds are only 10 cycles long, so that shouldn’t cause the large number of hits. Let’s look at the C code in relation to the generated assembly:

In the optimised code, the y position is loaded first, followed by the x and y velocities; then the x velocity is added to the already-loaded x component of the acceleration. It's this addition that seems to be taking most of the time (we'll look at the reason for the second stall later in this lesson). What is actually happening is that the fadds instruction is stalling, waiting for the x velocity to be loaded into the f10 register, as the addition can't take place until all the appropriate registers hold the correct data. The large number of hits on this instruction implies that this memory load is taking a very long time indeed.

## Memory access

In order to load data into a register on the CPU, it must be transferred from the memory subsystem. Back in the good old days, memory access times were comparable to instruction times – each in the order of a few cycles. Over the following decades, improvements in CPU speeds followed Moore's law, increasing performance exponentially, while memory access speeds crawled along, improving by only around 10% per year.

Comparison of relative speeds of processors and memory.

In order to reduce this memory bottleneck, CPU designers added a small amount of very fast, very expensive memory called a cache. These caches come in different flavours: Level 1 cache is the fastest and smallest and is generally found on chip. It can only pull data from Level 2, which is larger and slower, while Level 3 cache (if your CPU has one) is larger and slower again. The highest order cache in your system is the one that retrieves data from main memory. Each element in a cache contains not only a copy of the data at some location in main memory, but also information on where that data came from. Each cache can only access data from the level below (or main memory, if it is the highest order one). A great explanation of how the different types of caches work can be found here.

In the case we are looking at, an instruction requests some data to be loaded into a floating point register. The CPU will first query the L1 cache for that data and, if it is already resident there, will return it within a matter of a few cycles. If it is not there, then the L2 cache is queried, and if the data is present there it will be returned as soon as possible – in this case (on the PS3) around 30 cycles. If, however, the required data isn't in the L2 cache either, then main memory must be accessed, which takes hundreds of cycles (somewhere between 400 and 600 on PS3 and X360). This is an eternity as far as the CPU is concerned – hundreds of instructions can be executed in the time it takes to pull data from main memory.

If you had a single instruction that could take 100s of cycles, how often would you use it?

The following interactive page illustrates the relative times for accessing data from L1, L2 and main memory. Click on the different memory subsystems to see how slow main memory is. (Page requires HTML5.)

Interactive demo of relative cache/memory access times

As you've experienced above, retrieving data from main memory is very slow – painfully slow. (In this demo, the latencies for the different gaps are: L1$ -> CPU: 4 cycles; L2$ -> L1$: 30 cycles; main memory -> L2$: 600 cycles.) An L2 cache miss is in the order of 50 times slower than the average instruction latency.
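To get a feel for what those latencies do to the average cost of a load, here is a minimal sketch computing the average memory access time from the demo's numbers (4, 30 and 600 cycles). The miss rates passed in are hypothetical, purely for illustration – they are not measurements from the particle system.

```cpp
#include <cassert>
#include <cmath>

// Average memory access time (AMAT) using the demo's latencies:
// L1 hit = 4 cycles, L2 = 30 cycles, main memory = 600 cycles.
// The miss rates are hypothetical examples, not measured values.
double AverageAccessCycles(double l1MissRate, double l2MissRate)
{
    const double kL1 = 4.0, kL2 = 30.0, kMainMem = 600.0;
    return kL1 + l1MissRate * (kL2 + l2MissRate * kMainMem);
}
```

Even with a 95% L1 hit rate and 80% of the remainder caught by L2, the average access costs 4 + 0.05 × (30 + 0.2 × 600) = 11.5 cycles – nearly three times the L1 latency. A handful of trips to main memory dominates everything else.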

## Minimising the memory bottleneck

So, what can we do about it? We need to use memory, so how do we minimise the impact of these long reads from main memory?

The simplest way is to just use less memory – minimise the size of your regularly used (hot) structures. Do you really need an array of rarely used data (e.g. char name[32];) in the struct, or can you just maintain a pointer to it? If it is a large structure, can you break it up into hot and cold parts, or at least into parts that are used in sequence?
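As an illustration of that hot/cold split (the names and layouts below are hypothetical, not from the original article), moving the rarely used name field behind a pointer shrinks the hot struct, so more particles fit per 128-byte cache line:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical "fat" particle: the rarely read name is dragged
// through the cache on every update.
struct FatParticle
{
    float px, py, pz;   // hot: position
    float vx, vy, vz;   // hot: velocity
    char  name[32];     // cold: almost never touched
};

// Hot/cold split: the update loop now streams only the hot bytes;
// the cold payload lives elsewhere, reached via a pointer when needed.
struct ColdData { char name[32]; };
struct HotParticle
{
    float px, py, pz;
    float vx, vy, vz;
    ColdData* cold;     // indirection to the rarely used data
};

const std::size_t kCacheLine = 128; // PPU cache line size
```

On a typical 64-bit target FatParticle is 56 bytes (2 per cache line) while HotParticle is 32 bytes (4 per line) – each line fetched now delivers twice as many hot particles.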

The third optimisation is to use a special instruction that will start fetching some data for us, ideally well before we need it so that when we do need it, it will hopefully be in the L2 cache ready for us. This is called prefetching and, when used effectively, can have a dramatic effect on your code’s performance.

## Prefetching

On the PlayStation 3, the prefetch instruction is __dcbt(loc), where loc is the location of the data to be prefetched. It triggers the load of a single cache line (128 bytes) into the L2 cache. Let's add one into our loop and see what happens:
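The original listing isn't reproduced here, so the following is a minimal sketch of what such a loop might look like. __dcbt is PS3-specific; this sketch uses GCC/Clang's __builtin_prefetch as a portable stand-in, and reduces the physics to the y components only to keep it short.

```cpp
#include <cassert>
#include <cmath>

struct MyVector   { float x, y, z; };
struct MyParticle { MyVector m_Position; MyVector m_Velocity; };

// Update with a prefetch 10 particles ahead. On the PPU this would be
// __dcbt(p + 10); __builtin_prefetch is a portable stand-in here.
// Prefetching past the end of the array is harmless - the hint is
// simply dropped by the hardware.
void Update(MyParticle* p, MyParticle* pEnd, float fClock)
{
    const float ay = -9.8f * fClock; // a*dt
    for (; p < pEnd; p++)
    {
        __builtin_prefetch(p + 10);  // start the load well before we need it
        p->m_Velocity.y += ay;                        // v' = v + a*dt
        p->m_Position.y += p->m_Velocity.y * fClock;  // p' = p + v'*dt
    }
}
```

The prefetch costs one instruction per iteration; the win is that by the time the loop reaches particle p+10, its cache line is (hopefully) already in L2.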

Well, our stalls seem to have vanished, but what about the performance? The original code was running at around 10.8ms – this code now executes in 4.5ms! Nice! We've more than halved our execution time by adding a single instruction.

I chose (p+10) as the location to prefetch as the inner loop is about 40 cycles and fetching 10 particles ahead corresponds to 10*40 = 400 cycles which is approximately the time taken to fetch data from main memory. The graph below shows the different  times taken for different prefetch distances:

The first column corresponds to the code without the prefetch instruction, while the second column prefetches the particle that we are about to use in that iteration – note that this shows the cost of the prefetch instruction is non-zero, but the benefit of prefetching very quickly outweighs that cost. At a look-ahead of 8 to 10 particles (192 to 240 bytes) the times start to plateau. It is worth noting that __dcbt() loads 128 bytes at once, so when we call __dcbt() every iteration we are actually prefetching the same cache line around 5 times. This is still cheaper than testing whether the next particle to be prefetched is in the same cache line or not. (If we unrolled the loop to process 5 particles at once (5 * sizeof(MyParticle) = 120 bytes), we'd get a better prefetch distribution, as well as better pipeline utilisation, but we'll look at loop unrolling in a later lesson.)
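That "same cache line around 5 times" claim can be checked with a few lines of arithmetic. This is a sketch, assuming one prefetch per 24-byte particle and 128-byte lines: each line turns out to be the target of 5 or 6 consecutive prefetches, averaging 128/24 ≈ 5.3.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kParticleSize = 24;  // sizeof(MyParticle)
const std::size_t kCacheLine    = 128; // PPU cache line

// Count how many particles' prefetches target a given cache line,
// assuming the prefetch address is the start byte of the particle.
std::size_t PrefetchesOfLine(std::size_t line)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < 1000; i++)
        if ((i * kParticleSize) / kCacheLine == line)
            count++;
    return count;
}
```

So four or five of every six prefetches are redundant, but since the redundant ones hit a line already in flight they are cheap – cheaper than a branch to test for them.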

## The reason for the dual stalls

The dual stalls in the original code are due to the size of MyParticle relative to the size of the cache line. When a single particle (MyParticle is 24 bytes in size) is loaded, an entire 128 byte cache line is loaded with it – five whole particles plus part of the next one. The following diagram shows the distribution of particles over 3 consecutive cache lines (the pattern repeats every 3 cache lines).

The first case, where the struct aligns exactly with a cache line, results in a hit on the first stall as the x velocity is used. The next 4 particles are all within that cache line and so do not incur another cache miss penalty. The first miss happens loading the 6th particle and contributes to the first stall once again – py is in cache, vx is the first miss. The next miss contributes to the second stall, as vx is already in the cache and the processor has to wait while vy is fetched from main memory. This behaviour matches our observations, with approximately 2/3 of the hits on the first stall and 1/3 on the second. When we added prefetching, both of the stalls were removed (as we would expect).
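The distribution described above is easy to verify numerically. With 24-byte particles and 128-byte lines the layout repeats every lcm(24, 128) = 384 bytes, i.e. every 16 particles / 3 cache lines; a sketch:

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kParticleSize = 24;  // sizeof(MyParticle)
const std::size_t kCacheLine    = 128; // PPU cache line

// Does particle i span a cache line boundary? If so, its later fields
// (the velocity) sit in a different line from its position - which is
// what produced the second stall.
bool Straddles(std::size_t i)
{
    std::size_t first = (i * kParticleSize) / kCacheLine;
    std::size_t last  = (i * kParticleSize + kParticleSize - 1) / kCacheLine;
    return first != last;
}
```

Within each group of 16 particles, only particles 5 and 10 straddle a boundary; the pattern then repeats, which is consistent with the 2/3 versus 1/3 split of hits between the two stalls.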

## Summary

What we've seen here is how to identify and fix a simple system which is performing sub-optimally due to cache misses. The memory used is already contiguous and homogeneous, so prefetch look-aheads are easy to implement. More complex memory access patterns can be considerably more difficult to optimise, so I would suggest that one of your first passes at optimising a system (after algorithmic optimisation) is to simplify your memory access. Determining what data is used in a tight loop and then ensuring that data is stored in a cache friendly fashion will not only give you an instant speed boost, but will make it easier to prefetch and thereby optimise further.

When designing a system that needs to be fast, it is critical that you consider the layout of the memory being used. Optimal code is irrelevant if your data is badly organised.

TL;DR: put all your data in flat arrays.

## Coming up next

Now that we’ve removed the memory bottleneck, we’ll next have a look at further optimising this code by addressing the instructions used. In particular, we’ll start using some SIMD instructions and see how that impacts our performance.

Until then, remember, it’s all about the memory.

Tony Albrecht | Programming | Fri, 23 Mar 2012
# Optimisation Lesson 2: The Pipeline
http://overbyte.com.au/index.php/overbyte-blog/entry/optimisation-lesson-2-the-pipeline

Before we start on the second in this series of optimisation lessons I’d like to make something clear. The type of optimisation that I am predominantly talking about in these lessons is low-level optimisation, or micro-optimisation.  This type of optimisation should only be applied once you are sure that it will definitely have a benefit, once you have exhausted all obvious higher level macro-optimisations. There have been (too many) times in my coding career when I have spotted some obviously inefficient code and spent too much time joyously optimising it only to end up with rigid, highly optimised, unmaintainable, very clever code that makes absolutely no difference to the overall performance. Or, even worse, it runs slower.

Your first task in optimisation is to figure out what is slow, then why that something is slow (and sometimes when) and then remedy that.

## Understanding the sample distribution

If you can cast your minds back to Lesson 1, we had profiled our particle system and had started examining the PC samples obtained from our profiler as depicted below.

I've highlighted the two lines with the highest hit counts, and we'll start with the second one – the bigger of the two.

```
blt 0x00011B30
```

From the interactive C++ to ASM page from the last lesson we can see that this is a branch command and corresponds to the following C++ code:

```cpp
if (p.m_Position.y < 0)
```

To understand why this instruction is such a bottleneck we need to understand a little about how the underlying hardware functions – in particular the CPU pipeline. Modern CPUs can issue a command or two every cycle. These instructions take a variable number of cycles (depending on the instruction) to produce an output, but (and this is important), multiple instructions can be in flow at the same time – each one at a different stage in the pipeline.

For example, consider the following code which does 3 single precision floating point adds:

```
fadds f4,f4,f0
fadds f13,f13,f30
fadds f0,f31,f29
```

If each fadds instruction takes 10 cycles, how many cycles does it take to complete these three instructions?

The answer, as with most things when you get deep enough, is “It depends”. In order to simplify this problem let’s assume that the registers that are being used already have the relevant data in them – this way there is no stalling as the fadds instructions wait for the registers to fill from D cache or main memory (we’ll deal with that can of worms in the next lesson).

The following diagram shows some of the details of the PPU pipeline. It’s from this article and is definitely worth reading if you are at all interested in the Cell BE architecture.

Part of the PPU Pipeline

Each box in this diagram corresponds to a cycle – for example, the first 4 boxes, IC1 through IC4, correspond to the time taken for the processor to fetch instructions from the Instruction Cache (4 cycles), followed by a two cycle dispatch, then three cycles to decode and three more to issue the instruction to the Branch Unit, Fixed Point Unit or Load Store Unit. All instructions have to go through this 12 cycle front end, but if the pipeline is fully populated then a new instruction can be dispatched every cycle. The time taken for an instruction to enter, pass through and then leave a pipeline is called its latency. I won't go into too much depth on the details of CPU pipelines here – check your hardware documentation for more information (for consoles this is generally under NDA anyway) and/or have a read of this Ars Technica article.

Back to the case in hand: The FPU pipeline is 10 cycles long which means it will take 10 cycles for a single instruction to pass through that pipeline (the time taken to get the instruction to the FPU can be ignored for this calculation). A new FPU instruction can be issued every cycle after that; this means that f0 from the third fadds will be ready after cycle 12.

It is worth noting at this point that some instructions (such as divide and square root on the PPU), do not pipeline at all. This means that all the instructions after that command will stall until it has finished, reducing throughput significantly (some non-pipelined instructions take more than 80 cycles).
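The arithmetic above can be captured in two one-liners (a sketch: issue-width and front-end overheads are ignored, as in the text, and "cycle" counts start at the first issue):

```cpp
#include <cassert>

// n independent instructions on a fully pipelined unit: one issues per
// cycle, so the last result is ready at cycle n + latency - 1.
int PipelinedCycles(int n, int latency)    { return n + latency - 1; }

// Non-pipelined unit: each instruction must drain completely before
// the next one can start.
int NonPipelinedCycles(int n, int latency) { return n * latency; }
```

Three 10-cycle fadds complete at cycle 12 when pipelined; if the unit could not pipeline them they would take 30 cycles – which is why a single non-pipelined 80-cycle instruction hurts so much.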

The more attentive amongst you may have noticed the reuse of the register f0  in the third fadds – this is not an issue as once an instruction is committed into the pipeline the input registers are no longer relevant to that instruction. The compiler will take care of dependencies for you, so you shouldn’t need to worry about it, merely be aware of it.

## Branching

“So,” I hear you ask, “what does this have to do with the branch statement in the code we’re looking at?”

Well, I’m getting to that. Branching can be an expensive operation and the PPU tries to minimise its impact by trying to predict which way the code will branch. It does this either statically (the compiler or programmer specifies which code path is most likely taken) or dynamically (the PPU keeps a table of branches and which path the most recent branches have taken and uses that to infer the most likely branch).  Check out Mike Acton’s article on branching for more details.

A mispredicted branch will incur a complete pipeline flush which is quite expensive (I’ve not mentioned the exact value for the PPU as I’m not sure if it is covered by NDA but you should be able to infer it from the earlier pipeline diagram) and even an accurately predicted branch has a slight cost.

If we once again look at the performance graph from the first lesson we are now in a position to understand what is happening to produce this peak on the right:

Graph of execution time taken over frame number (from Lesson 1).

The flat part of the graph on the left corresponds to the expansion of the particle ball while all particles are above the collision plane. When the particles start to collide with that plane, the time taken increases dramatically, rising from around 15ms per update to nearly 60ms (when nearly all the particles are at rest on the collision plane). This increase in processing time is due not just to the extra processing when a collision occurs (three floating point multiplies and a negate), but also to the branch and its (potential) associated pipeline flush.

## Removing Branches

Fortunately, there is an instruction that can help us here; fsel (floating point select). This instruction provides a way of effectively performing our own speculative execution and then choosing one of two operands based on a third (see below):

```cpp
d = __fsel(a, b, c);
// the same as the following code, but with no branch instruction:
if (a >= 0)
    d = b;
else
    d = c;
```

The other benefit of this instruction is that it is pipeline friendly – it also takes 10 cycles and our pipeline is left intact once it has completed. And we can have multiple of these instructions in the pipeline at the same time.
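On other architectures you can sketch the same semantics in plain C++ (this is an illustration, not the PPU intrinsic). Compilers will usually turn the pattern below into a conditional move or select rather than a branch, though that's worth verifying in the generated assembly:

```cpp
#include <cassert>

// Branch-free select semantics of fsel: returns b when a >= 0,
// otherwise c. Note that a >= 0 includes negative zero and excludes
// NaN, matching the usual select-on-sign behaviour.
inline float fsel_sketch(float a, float b, float c)
{
    return (a >= 0.0f) ? b : c;
}
```

The name fsel_sketch is mine; the point is only that both sides (b and c) are computed up front and the comparison merely picks one, exactly as the particle code below does.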

So, let’s rewrite our particle code without any classes (to make it easier to see how it maps to asm), pre-emptively calculate the bounced position and velocity and pick the resultant position and velocity via fsel() commands based on the y value of the particle’s position.

```cpp
void ParticleManager::Update(float fClock)
{
    const MyVector dampVelocity(0.75f, -0.5f, 0.75f);
    const MyVector v3Acceleration = MyVector(0.0f, -9.8f, 0.0f) * fClock; // a*t
    const MyVector at2 = v3Acceleration * fClock * 0.5f;                  // ½a*t^2

    MyParticle *p = &m_Particles[0];
    MyParticle *pEnd = &m_Particles[m_NumParticles];
    for (; p < pEnd; p++)
    {
        float newPosy = p->m_Position.y;
        float newPosx = p->m_Position.x;
        float newPosz = p->m_Position.z;
        float newVelx = p->m_Velocity.x;
        float newVely = p->m_Velocity.y;
        float newVelz = p->m_Velocity.z;

        newPosy = newVely * fClock + newPosy + at2.y;
        newPosx = newVelx * fClock + newPosx + at2.x;
        newPosz = newVelz * fClock + newPosz + at2.z;
        newVelx = newVelx + v3Acceleration.x;
        newVely = newVely + v3Acceleration.y;
        newVelz = newVelz + v3Acceleration.z;

        float bouncedPos_y = -1.0f * newPosy;
        float bouncedVelx = newVelx * dampVelocity.x;
        float bouncedVely = newVely * dampVelocity.y;
        float bouncedVelz = newVelz * dampVelocity.z;

        // choose vel based on y pos
        newVelx = __fsels(newPosy, newVelx, bouncedVelx);
        newVely = __fsels(newPosy, newVely, bouncedVely);
        newVelz = __fsels(newPosy, newVelz, bouncedVelz);
        // set y based on y pos
        newPosy = __fsels(newPosy, newPosy, bouncedPos_y);

        p->m_Position.x = newPosx;
        p->m_Position.z = newPosz;
        p->m_Position.y = newPosy;

        p->m_Velocity.x = newVelx;
        p->m_Velocity.y = newVely;
        p->m_Velocity.z = newVelz;
    }
}
```

And here is the assembly generated by compiling the above code.

```
0x000003D0 bge    0x00000458
0x000003D4 lfs    f9,0x4(r4)
0x000003D8 lfs    f10,0xC(r4)
0x000003DC lfs    f11,0x10(r4)
0x000003E0 lfs    f12,0x14(r4)
0x000003E4 fmadds f9,f1,f11,f9
0x000003E8 lfs    f13,0x0(r4)
0x000003EC fadds  f0,f10,f5
0x000003F0 lfs    f31,0x8(r4)
0x000003F4 fadds  f11,f11,f4
0x000003F8 fadds  f30,f12,f5
0x000003FC fmadds f10,f1,f10,f13
0x00000400 addic  r5,r4,0x18
0x00000404 fmadds f12,f1,f12,f31
0x00000408 fmadds f13,f6,f2,f9
0x0000040C fnmadds f9,f6,f2,f9
0x00000410 fmuls  f31,f8,f0
0x00000414 fmuls  f29,f7,f11
0x00000418 fmuls  f28,f8,f30
0x0000041C fadds  f10,f10,f3
0x00000420 fadds  f12,f12,f3
0x00000424 fsel   f9,f13,f13,f9
0x00000428 fsel   f0,f13,f0,f31
0x0000042C fsel   f11,f13,f11,f29
0x00000430 fsel   f13,f13,f30,f28
0x00000434 stfs   f10,0x0(r4)
0x00000438 stfs   f12,0x8(r4)
0x0000043C stfs   f9,0x4(r4)
0x00000440 stfs   f0,0xC(r4)
0x00000444 stfs   f11,0x10(r4)
0x00000448 stfs   f13,0x14(r4)
0x0000044C clrldi r4,r5,32
0x00000450 cmplw  r4,r3
0x00000454 blt    0x000003D4
```

If you have a HTML5 compliant browser, then you can interactively walk through the ASM and C++ code in another window by clicking on the following image:

As you can see from this assembly (particularly via the interactive walkthrough), the only branches are associated with the for loop which iterates through all of the particles.

We would expect this code to run faster, as there are no additional data accesses and there are fewer stalls in the worst case scenario (all particles mispredicting the branch) – but how much faster? The following graph shows the performance of the code from lesson 1 against this lesson's version.

As you would expect, the performance is constant, as there is no branching based on the particle's position. What you might not have expected is that the baseline performance is better than the original solution: even though we are calculating the bounced position and velocity in addition to the non-bounced ones for every particle, this code is still significantly faster than the branching version. Basically, we are doing more work in fewer cycles. Removing the branch eliminates the compare (which stalled waiting for the result of the update of the y position) as well as the branch itself, and thereby allows better pipelining of the inner loop.

This is a pretty good performance improvement for a fairly small change, but finding wins like this in production code is rare. This type of optimisation really only benefits very tight inner loops; using branchless logic in higher level code will usually have no measurable performance improvement at the cost of obfuscating your code – so choose wisely.

The important things to take away from this lesson are:

1. Instructions pipeline. Taking advantage of that can give you big performance wins
2. Branching can be bad on certain architectures (primarily because it trashes the pipeline)

## Coming up next

So where to next? Let’s have a look at the PC samples of this new code and see where the bottleneck has moved to:

Samples attributed to C++ code

Most of the samples in this code seem to have coalesced around the calculation of the y position of the particles. The reason for this is very interesting and we’ll address this in the next lesson.

So endeth the second lesson. Please leave a comment if you’ve spotted an error, need something clarified or would just like to leave some feedback.

 I’ve just been pointed at an excellent article on pipelines and branchless maths: “Down With fcmp: Conditional Moves For Branchless Math”

Tony Albrecht | Programming | Thu, 10 Nov 2011
# Optimisation Lesson 1: Profiling
http://overbyte.com.au/index.php/overbyte-blog/entry/optimisation-lesson-1-profiling

The performance of a game is critical – it defines the scope of a game. It specifies how many, how far, how much detail, the level of freedom, even the length of the project. Whether you are an artist, designer, producer or programmer, the performance of a game has an impact on you. As Jeff Atwood recently stated, "Performance is a feature", and as with all features, it needs to be scheduled, researched, implemented and maintained.

This article is the first in a series based on the Masterclasses which I've recently presented for Game Connection, and will walk you through several phases of optimisation performed on a simple piece of code. It is primarily aimed at intermediate programmers, or experienced programmers who haven't had much to do with performance and low-level programming. The aim is to give a detailed example of finding performance bottlenecks and then addressing them. It should give you an idea of how code can be optimised and the magnitude of the performance improvements you can expect from different changes. The series of lessons will cover profiling, branch removal, cache aware programming, loop unrolling and SIMD programming.

## Introducing A Simple Particle System

The first piece of code we're going to look at is a very simple particle system. We will look at the update phase only, ignoring the rendering and GPU performance and considering only the CPU for now. The CPU in this case is the PlayStation 3's PPU, and external profiling is done with SN Systems' excellent Tuner.

There are a couple of simple classes: the vector class MyVector and the particle class MyParticle.

```cpp
class MyVector
{
public:
    MyVector(){};
    MyVector(float X, float Y, float Z) { x = X; y = Y; z = Z; }

    inline const MyVector& operator+=(const MyVector& rhs);
    float x, y, z;
};

inline const MyVector& MyVector::operator+=(const MyVector& rhs)
{
    x += rhs.x;
    y += rhs.y;
    z += rhs.z;
    return *this;
}

inline MyVector operator+(const MyVector& lhs, const MyVector& rhs)
{
    return MyVector(lhs.x + rhs.x, lhs.y + rhs.y, lhs.z + rhs.z);
}

inline MyVector operator*(const MyVector& lhs, const MyVector& rhs)
{
    return MyVector(lhs.x * rhs.x, lhs.y * rhs.y, lhs.z * rhs.z);
}

inline MyVector operator*(const MyVector& lhs, float rhs)
{
    return MyVector(lhs.x * rhs, lhs.y * rhs, lhs.z * rhs);
}
```

and the particle class;

```cpp
struct MyParticle
{
    MyVector m_Position;
    MyVector m_Velocity;
};
```

The particles in this system will accelerate under gravity and will bounce on the y equals zero plane, with some damping applied to approximate the energy lost in collision. If we can remember back to our high school maths, the equations of motion are:

$v' = v + at$

$p' = p + vt + \frac{1}{2}at^{2}$

So, the update loop for the code is:

```cpp
void ParticleManager::Update(float fClock)
{
    int iParticle;
    const MyVector dampVelocity(0.75f, -0.5f, 0.75f);
    const MyVector v3Acceleration = MyVector(0, -9.8, 0) * fClock; // a*dt

    for (iParticle = 0; iParticle < m_NumParticles; iParticle++)
    {
        MyParticle &p = m_Particles[iParticle];
        // v' = v + a*dt
        p.m_Velocity += v3Acceleration;

        // Update the particle position
        // p' = p + v*t + 0.5*a*dt^2
        p.m_Position += p.m_Velocity * fClock + v3Acceleration * fClock * 0.5f;

        // if the particle is below the ground plane, then damp and reverse the velocity
        if (p.m_Position.y < 0)
        {
            p.m_Position.y *= -1; // reflect position along y
            p.m_Velocity = p.m_Velocity * dampVelocity;
        }
    }
}
```

This code gives us the following output (for 256k particles)

A simple particle system in action.

## Step 1: Profile

The first thing we need to do is figure out how fast this code is running, and for that we need a useful metric. Frames per second is borderline useless, and a percentage of a frame is just as bad. Use a sensible metric like milliseconds or microseconds – then you can accurately measure the length of a function independent of the frame rate. Also, if possible, capture profiling information over a number of frames. This will allow you to find spikes and patterns in the execution of your code. When optimising code, the spikes are generally what need to be addressed first (and are often the easiest to fix too).
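The milliseconds-versus-fps point is worth a quick worked example: the same 1 ms of extra work produces very different-looking fps drops depending on the starting frame rate, which is exactly why fps deltas are a poor way to measure cost.

```cpp
#include <cassert>
#include <cmath>

// Convert a frame rate into a per-frame time budget.
double MsPerFrame(double fps) { return 1000.0 / fps; }

// The frame rate you end up with after adding a fixed cost in ms.
double FpsAfterExtraMs(double fps, double extraMs)
{
    return 1000.0 / (MsPerFrame(fps) + extraMs);
}
```

Adding 1 ms at 60 fps (16.7 ms frames) drops you to about 56.6 fps, while the same 1 ms at 30 fps (33.3 ms frames) only drops you to about 29.1 fps – identical cost, misleadingly different fps deltas.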

Consider the following graph of this particle system. The vertical axis is milliseconds taken to process all of the particles and the horizontal axis is time (# frames)

Graph of execution time taken over frame number.

Note the shape of the graph – it starts at around 17ms per frame and then dramatically sweeps up to sit at about 60ms per frame. There is obviously something interesting happening here that requires further examination. Let's run a profiler over this code and see if we can get a better idea of what is happening with this example.

There are two main types of profiling that are generally used; instrumented and sampling.

### Instrumented Profiling

Instrumented profiling relies on either manual or automatic insertion of code that stores a time stamp at set points for later analysis. Manual instrumentation relies on the programmer inserting code which wraps the code to be timed, for example, at the start and ends of a function.

```cpp
void FunctionThatIsTakingALongTime()
{
    PROFILE_START("Function Name");
    for (int i = 0; i < BigNumber; i++)
    {
        Do_Something_Expensive(slowly);
    }
    PROFILE_STOP();
}
```

Manual instrumentation allows a programmer to store and use the timing information in any way that they see fit - on-screen profile bars, graphs or possibly profile data output to file for analysis in a separate application. It is the most versatile method of profiling, but care must be taken that the actual implementation of the profiling stubs aren’t too intrusive, slowing down the execution of the code being profiled. The graph above was produced using manual instrumentation.
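A minimal sketch of what such profiling stubs might wrap (the macro name and implementation here are illustrative, not the article's): an RAII timer that measures a scope and reports on exit, so a single line at the top of a function times the whole call.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>

// Times the lifetime of a scope and prints the result on destruction.
struct ScopedTimer
{
    const char* m_Name;
    std::chrono::steady_clock::time_point m_Start;

    explicit ScopedTimer(const char* name)
        : m_Name(name), m_Start(std::chrono::steady_clock::now()) {}

    ~ScopedTimer()
    {
        using namespace std::chrono;
        double ms = duration<double, std::milli>(steady_clock::now() - m_Start).count();
        std::printf("%s: %.3f ms\n", m_Name, ms);
    }
};

// PROFILE_SCOPE("Update"); at the top of a function times the whole call.
#define PROFILE_SCOPE(name) ScopedTimer profileScope_(name)
```

A real implementation would record into a buffer rather than printf (printing from inside a hot loop is precisely the kind of intrusiveness the text warns about), but the RAII shape is the important part: the stop can never be forgotten.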

Automatic instrumentation is performed by most modern profilers, and will allow runtime insertion of start/stop profiling stubs into code. This method is very useful, allowing a programmer to change which functions to instrument at run time, but relies on the profiling data being collected and displayed by an external application.

Another form of automatic instrumentation is call profiling which relies on the compiler inserting pre- and post-amble profiling code to every function. This method of profiling gives 100% coverage but is very intrusive (it slows down the running of the profiled code significantly).

With both types of instrumented profiling, the more instrumentation that occurs, the more intrusive it is and the slower the code will run. Instrumenting small, high frequency functions can artificially bloat the time taken in those functions and their parents, so care must be taken not to instrument too much.

Example of high frequency instrumentation slowing performance.

In the above diagram, each coloured bar corresponds to a function call, with the top ones calling the ones directly below them and so on, down to the fine grain leaf functions at the bottom. While this visual representation of the code path is useful for understanding the flow of code, it does slow down parts of the code non-uniformly and isn’t necessarily an accurate depiction of the relative performance of the profiled functions.

### Program Counter Sampling

This form of profiling requires an external profiler which samples the executing code's program counter (PC) at set intervals over many frames, collecting these values for later analysis. The Program Counter is a CPU register which stores the address of the instruction currently being executed. Each PC value, together with the time at which it was sampled, can be used to reconstruct which functions were being executed at which time. From this data we can see the overall execution footprint of the code – slow code will have more samples attributed to it, while fast code will have fewer.

PC Sampling Example in SN Tuner

In the above example, we can see that memory allocation is taking a long time and should be investigated further. The annotation has been added manually.

PC sampling, as with instrumentation, can be intrusive and affect the execution speed of the code. The higher the frequency of sampling, the more intrusive it is (although, in this case, it has a uniform impact across the entire frame), so it is advisable to minimise the sampling frequency and collect over many frames instead to get a more indicative performance measurement, unless you explicitly need high frequency information for some reason.

### Profiling the Particle System

So, back to the problem at hand; we profiled the particle system using PC sampling and the profiler  attributed the samples to the following lines of code:

ParticleManager::Update() samples

The first column is the total number of samples at that PC, while the second column is the number of samples for that frame. Basically, the larger the number attributed to a line, the slower that line is to execute. The problem is that the PC samples correspond to particular assembly instructions, and the application is responsible for determining which lines of C++ source those instructions belong to. This is a non-trivial problem (see the interactive example at the end of this lesson), as optimised code will generally move instructions around into an order which provides the best performance, so what we are left with is an approximation of where the slowest lines of code are. At function granularity these samples are accurate, just not at the level of individual C++ statements.

So, in order to accurately see what is taking the most time within a function, we must look at the disassembled instructions for this code fragment. I know what you’re thinking. You’re thinking

“Assembly!? I can’t read that. Only really old guys and freaks can read that!”

But don't worry, it isn't necessary to be fluent in the asm of your platform of choice, but it is important that you are able to understand the gist of the disassembled code. It's like being in a foreign country – you don't need to be conversational (although that is ideal), but it will help immensely if you know some of the basics; the "please", "thank you" and "Excuse me, where's the bathroom?" – the phrases you use all the time.

For this lesson, I'll not attempt to teach any assembly – I'll leave that as an exercise for the reader. It should suffice for this PPU based example to know that instructions starting with an 'l' are loads, those starting with an 's' are stores, and those starting with an 'f' are floating point operations – the full mnemonic is generally enough to guess the rest. If you want excruciating detail, the PPC books can be found here.

The next image shows the PC samples attributed to the individual assembly instructions. This tells us accurately which instructions are taking the longest to execute.

The numbers on the left are the samples attributed to that line of assembly, and this tells us exactly what code we should be looking at. So, how do these instructions relate to the C++ code? Well, I've coded up a simple interactive page which will allow you to step through the C++ of the inner loop and its generated assembly instructions interactively (as long as you have a HTML5 compliant browser, that is – otherwise you'll just get a blank page, in which case you need to go here). The C++ code being executed is highlighted on the right and the corresponding ASM on the left is shown, so you can see what the code compiles into. Take note of the fact that there isn't a direct linear relationship between the C++ and the ASM – the generated ASM can be quite out of order (even in this simple example). The blue box in the bottom right gives a description of what is happening for that line.