How Memory Access Patterns Can Make Or Break Performance

Part 1: Cache And Why It Matters

“You can run from the truth. You can run and hide from the truth.
You can deny and avoid the truth. But you cannot destroy the truth. Nor can you make the lie true. You must know that love will always uncover the truth.”


― Delano Johnson

Okay, except for that last bit about love, the rest of the quote is as true in programming as it is in life. And 'the truth' in the context of programming is the hardware. As programmers, we should keep in mind that our code is not going to run in some virtual world with infinite resources that works according to our model of the world. Especially with the rise of the Object Oriented paradigm, many people tend to design code around their model of the world, not the actual data they are dealing with.

But the code runs on the underlying hardware, which has physical constraints. And if we care about the physical constraints of the hardware, the hardware will care about the performance of our program.

This style of programming, where we design our code keeping in mind how our data is accessed and transformed by the hardware, is commonly known as Data Oriented Design (DOD). It means that the programmer should think about how the input data will be transformed into output data, rather than focusing on the code design. After all, that's the job of a programmer.

DOD is not a programming paradigm like OOP or procedural programming, because it's not a fundamental style of programming. It's just a new name given to a style that has been around for a long time. And that's fine, because the name will help mainstream adoption of this design style, which is really a first-principles approach.

But why is memory access such a big deal?

Okay, let me explain why memory access matters so much for the performance of software. As I mentioned, there are physical constraints on memory. Let's understand these constraints in more detail.

In order to fetch data faster, a CPU uses cache memory. A cache is typically a small memory located close to the CPU so that data can be accessed with low latency. CPUs have several levels of cache (L1, L2, etc.); each level gets bigger in size and slower in latency as we move down the hierarchy. But even the bigger caches are tiny compared to main memory. For example, the L2 cache is only about 2MB per core in the architecture shown below.

Below is a cache memory hierarchy diagram with access latencies for a modern processor architecture, the AMD Jaguar (used in the PS4 and Xbox One):

Memory latency on consoles – Jason Gregory

The latency differences in the figure above are astonishing, especially the gap between L2 and RAM. If there is a cache miss at L2, it takes roughly 180 additional CPU cycles to get the same data from main memory! That sounds like nothing in our world, but it is a LOT of time in CPU terms. Here is an animation which shows the relative difference in access latency between the caches and main memory.
Memory access latency – Andreas Fredriksson

Looking at this, one would hope that their program never has to touch main memory at all. But a program has to work on data. And even if we ignore the initial memory fetch when the program starts, there are plenty of things that can cause cache misses: branching, runtime polymorphism, function calls, and so on. But we can be smart about the data structures we use and the order in which we access our data, to make our program cache friendly.
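
To get a rough feel for how much access order matters, here is a small sketch (the array size, names and loop structure are my own, purely for illustration) that sums the same 4096x4096 grid twice: once walking memory in the order it is laid out, and once striding across it. Both loops do identical work, but on most machines the second one is several times slower purely because of cache misses.

#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    const int N = 4096;                      // arbitrary size, large enough not to fit in cache
    std::vector<int> grid(N * N, 1);         // laid out row by row (row-major)

    auto measure = [&](const char* label, auto&& body)
    {
        auto start = std::chrono::steady_clock::now();
        long long sum = body();
        auto end = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        std::printf("%s: sum=%lld, %lld ms\n", label, sum, ms);
    };

    // Cache friendly: consecutive iterations touch consecutive addresses,
    // so each fetched cache line is fully used before we move on.
    measure("row-major", [&]
    {
        long long s = 0;
        for (int row = 0; row < N; row++)
            for (int col = 0; col < N; col++)
                s += grid[row * N + col];
        return s;
    });

    // Cache hostile: consecutive iterations jump N * sizeof(int) bytes ahead,
    // so nearly every access lands on a cold cache line.
    measure("column-major", [&]
    {
        long long s = 0;
        for (int col = 0; col < N; col++)
            for (int row = 0; row < N; row++)
                s += grid[row * N + col];
        return s;
    });
}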

Ask them memory guys to make RAM faster!

This seems like a reasonable response. If CPU speed has been increasing at an exponential rate over the past few decades, why isn't memory access speed following the same trend? Why aren't those well-paid hardware engineers doing their job properly? I blame Obama.

Well, actually the main reason for the slow access time is wires. It's the speed of light (inside a copper wire) that is the bottleneck! Main memory is physically further from the CPU than the cache is, and part of the latency is simply the round trip time it takes a signal to travel between the processor and main memory. As a rough back-of-envelope figure: at 3 GHz a clock cycle lasts about a third of a nanosecond, in which a signal travelling at roughly two-thirds the speed of light covers only around 6-7 cm.
And although we cannot do much about the speed of light, we can certainly design our code with data locality in mind in order to minimise cache misses.

Guidance for organising your data:

Use Contiguous Data

Use arrays whenever possible. Arrays are stored as contiguous chunks of memory, so neighbouring elements are brought into the cache together. Arrays even beat data structures that are theoretically faster when the data set is small. For instance, a linear search through an array of 100 integers will probably beat a search through a binary tree of 100 integers because of memory accesses, even though the binary tree almost certainly requires fewer comparisons. Each tree node can result in a cache miss, whereas the linear search mostly hits the cache on every lookup.
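
As a rough illustration (the helper functions below are my own sketch, not code from this post), compare a linear scan over a contiguous std::vector with a lookup in a node-based std::set. For small collections it is worth measuring both on your own hardware before assuming the tree wins:

#include <algorithm>
#include <set>
#include <vector>

// Linear scan: O(n) comparisons, but the 100 integers sit on just a few
// consecutive cache lines that the hardware prefetcher streams in.
bool containsLinear(const std::vector<int>& values, int key)
{
    return std::find(values.begin(), values.end(), key) != values.end();
}

// Balanced-tree lookup (std::set): O(log n) comparisons, but every node is
// a separate heap allocation, so each step down the tree can be a cache miss.
bool containsTree(const std::set<int>& values, int key)
{
    return values.find(key) != values.end();
}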

Use As Much Of The Cache Line As Possible

Memory is read and written in units of a cache line. If a cache line is brought from main memory into the cache, make sure you use it well. Think about all the data you need for an operation and pack it densely.
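
One common way to apply this is to split a "fat" object into parallel arrays, so that a loop only pulls in the fields it actually touches. The sketch below is my own illustration (the struct and field names are made up, not from any particular engine):

#include <cstddef>
#include <vector>

// Array-of-structs alternative: a struct holding position, velocity, name,
// colour, etc. would drag all of those fields through the cache even when
// the loop only reads positions and velocities.

// Struct-of-arrays: fields that are used together are stored together, so
// every byte of every fetched cache line is doing useful work.
struct Particles
{
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
};

void integrate(Particles& p, float dt)
{
    for (std::size_t i = 0; i < p.x.size(); i++)
    {
        p.x[i] += p.vx[i] * dt;
        p.y[i] += p.vy[i] * dt;
        p.z[i] += p.vz[i] * dt;
    }
}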

Say we have some particle objects in a game and we want to update the ones which are alive. The most straightforward approach is something like this:

for (int i = 0; i < numParticles_; i++)
{
    if (particles_[i].isActive())
    {
        particles_[i].update();
    }
}

The particle object has a flag which tracks whether it is alive or not. In the update loop, we check this flag for each particle. So the particle object, with its flag and all its other data, is brought into the cache. If the flag is off, we skip to the next object, and all the other data that was loaded for it was wasted cache space. If the array is large and has a lot of inactive particles, we are basically thrashing our cache.

We can do better. One easy solution is to keep the array sorted by activity, with all the inactive particles at the back of the array, and keep track of the number of active particles. When a particle becomes active, we swap it with the first inactive particle and increment the active count. This makes the update loop really efficient:

for (int i = 0; i < numActive_; i++)
{
    particles_[i].update();
}
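
The bookkeeping when a particle changes state might look roughly like this. This is only a sketch: the post itself shows just the update loop, the function names are mine, and the deactivation case is my own addition for symmetry.

#include <utility>
#include <vector>

// Move a particle into the active range at the front of the array
// by swapping it with the first inactive particle.
template <typename Particle>
void activateParticle(std::vector<Particle>& particles, int& numActive, int index)
{
    if (index < numActive)
        return;                                          // already active
    std::swap(particles[index], particles[numActive]);   // first inactive slot
    numActive++;
}

// Mirror image: move a particle out of the active range by swapping it
// with the last active particle and shrinking the active count.
template <typename Particle>
void deactivateParticle(std::vector<Particle>& particles, int& numActive, int index)
{
    if (index >= numActive)
        return;                                          // already inactive
    numActive--;
    std::swap(particles[index], particles[numActive]);
}

Note that the swap changes which slot a given particle lives in, so any code holding on to raw indices into the array needs to account for that.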


So these were some of the common techniques for making your programs crazy fast.

In the next post, we will see how we can create cache friendly multithreaded programs and the things we should be careful about. See you in the next post!


If you want a more detailed explanation of Data Oriented Design, check out Mike Acton's talk at CppCon on the topic.

You can also read this blog, which I found really helpful in understanding these concepts.


