How Memory Access Patterns Can Make Or Break Performance (Part 2)
Part 2: Effects of the Cache on Multi-Threading
In the last part, we discussed how we can achieve better performance by accessing contiguous chunks of data, taking advantage of low-latency caches and reducing the time the processor spends waiting on memory.
In this part, we will see why accessing contiguous memory needs extra care when your program runs in parallel on multiple cores.
Example: A Simple Parallel Counter
Let's look at an example of a scalability issue arising due to unoptimised cache operations. You can find this example in detail on Herb Sutter's website.
Herb Sutter wanted to count the number of odd elements in a matrix. This is a trivially parallel problem: multiple threads each count the odd elements in a different part of the matrix, and the per-thread counts are then summed.
So this is what he did:
    // Example 1: Simple parallel version (flawed)
    //
    int result[P];

    // Each of P parallel workers processes 1/P-th
    // of the data; the p-th worker records its
    // partial count in result[p]
    for( int p = 0; p < P; ++p )
      pool.run( [&,p] {
        result[p] = 0;
        int chunkSize = DIM/P + 1;
        int myStart = p * chunkSize;
        int myEnd = min( myStart+chunkSize, DIM );
        for( int i = myStart; i < myEnd; ++i )
          for( int j = 0; j < DIM; ++j )
            if( matrix[i*DIM + j] % 2 != 0 )
              ++result[p];
      } );

    // Wait for the parallel work to complete…
    pool.join();

    // Finally, do the sequential "reduction" step
    // to combine the results
    odds = 0;
    for( int p = 0; p < P; ++p )
      odds += result[p];
Eliminate False Sharing - Herb Sutter
In the above code, P threads are created and each is assigned 1/P-th of the matrix. The threads only read the matrix, and each one writes to its own element of 'result', so there are no race conditions and the program should scale roughly linearly with the number of threads.
But it doesn't. This is the result Herb got on a 24-core machine:
[Figure: scaling of Example 1 on a 24-core machine - from "Eliminate False Sharing", Herb Sutter]
As you can see from this result, as soon as we use 2 threads the speed drops to roughly half, and we only start to see a (minor) speedup once the number of threads exceeds 14.
Now, let's make a minor change to this code:
    // Example 2: Simple parallel version
    // (de-flawed using a local variable)
    //
    int result[P];

    // Each of P parallel workers processes 1/P-th
    // of the data; the p-th worker records its
    // partial count in result[p]
    for( int p = 0; p < P; ++p )
      pool.run( [&,p] {
        int count = 0;
        int chunkSize = DIM/P + 1;
        int myStart = p * chunkSize;
        int myEnd = min( myStart+chunkSize, DIM );
        for( int i = myStart; i < myEnd; ++i )
          for( int j = 0; j < DIM; ++j )
            if( matrix[i*DIM + j] % 2 != 0 )
              ++count;
        result[p] = count;
      } );

    // … etc. as before …
In the above code, we introduce a local variable 'count'. Each thread now accumulates its odd count in this local variable instead of writing directly to the shared 'result' array, and only copies 'count' into 'result[p]' once at the end. Everything else stays the same.
But this time, Herb got a very different result:
[Figure: scaling of Example 2 - from "Eliminate False Sharing", Herb Sutter]
False Sharing
Main memory is read and written in units of cache lines. If the byte the CPU needs is not in the cache, the entire cache line containing it is brought from main memory into the cache. When a byte on a cache line is written, the whole line is (eventually) written back to memory. Since each core has its own cache, any copies of that cache line held by other cores are marked invalid, because their data is now stale; those cores then have to fetch the updated cache line again before they can use it.
So if Core 1 modifies byte A and Core 2 tries to access byte A+1, which has not been modified at all, Core 2 still has to re-fetch the whole cache line, simply because byte A+1 lives on the same line as byte A. This is called false sharing.
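To make this concrete, here is a minimal sketch of false sharing (my own illustrative example, not code from Herb's article): each thread increments only its own counter, yet the two counters are adjacent in memory and therefore almost certainly share a cache line.

    // Illustrative sketch of false sharing (not from Herb's article).
    // Each thread increments only its own counter, but the counters are
    // adjacent in memory and so, on typical hardware, share one cache line.
    #include <atomic>
    #include <thread>

    std::atomic<long> counters[2];   // adjacent -> very likely one cache line

    void hammer( int idx )
    {
        for( long i = 0; i < 100000000; ++i )
            counters[idx].fetch_add( 1, std::memory_order_relaxed );
    }

    int main()
    {
        std::thread t0( hammer, 0 );
        std::thread t1( hammer, 1 );
        t0.join();
        t1.join();
    }

Timing this against a variant where each counter is padded onto its own cache line (for example with alignas(64)) typically shows a several-fold difference on a multi-core machine.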
This explains the problem in Herb's multi-threaded program. Even though each thread was updating its own, independent element of the 'result' array, those elements most likely sat on the same cache line, so every update by one core invalidated the line in the other cores and forced them to fetch it again.
The problem was solved by introducing the local variable 'count', which lives on each thread's stack and therefore almost certainly on a different cache line, so the threads no longer interfere with each other and real concurrency is achieved.
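If the per-thread results really do need to live in a shared array, another common remedy is to pad each slot so that no two slots can share a cache line. The sketch below reuses the names from Herb's examples (P, DIM, pool, matrix, min); the hard-coded 64-byte line size is my assumption, and C++17's std::hardware_destructive_interference_size can be used instead where available.

    // Sketch of an alternative fix: give each worker's slot its own cache line.
    struct alignas(64) PaddedCount {   // 64 bytes = typical cache-line size
        int value = 0;
    };

    PaddedCount result[P];

    for( int p = 0; p < P; ++p )
        pool.run( [&,p] {
            int chunkSize = DIM/P + 1;
            int myStart = p * chunkSize;
            int myEnd = min( myStart+chunkSize, DIM );
            for( int i = myStart; i < myEnd; ++i )
                for( int j = 0; j < DIM; ++j )
                    if( matrix[i*DIM + j] % 2 != 0 )
                        ++result[p].value;   // each slot now owns a whole line
        } );

The local-variable version above is still the simpler fix, since it keeps the hot counter in a register or on the thread's own stack; padding is mainly useful when intermediate results genuinely have to be visible in a shared array.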
So this is how you can achieve effective concurrency on multi-threaded systems and avoid one of the silent killers of performance. Thanks for reading, and now go build some crazy fast programs!