In a Harvard architecture, instructions and data are kept in separate memories, so the caches are separate as well:
- Instruction cache (i-cache)
- Data cache (d-cache)
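The caches are organized as a hierarchy (L1, L2, L3 and, on some processors, L4). In most modern processors only L1 is split into an i-cache and a d-cache; L2 and the levels below it usually hold both instructions and data.
In a multicore processor, the L1 and L2 caches are usually private to each core, while L3 (and L4, where present) is typically shared between the cores.
L1 and L2 are not shared because doing so would require much more wiring on the silicon and, eventually, a larger chip. Sharing them would also slow processing down: the cores would compete for the same lines and evict each other's data, lowering the hit rate, and a larger shared cache takes longer to access.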
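As a rough illustration of private and shared levels, here is a minimal Python sketch of a toy model. It assumes that each core has a private L1 data cache and L2 and that all cores share one L3. Lines are assumed to be 64 bytes, and every level is modeled as fully associative with LRU replacement. A line is also copied into every level it passes through, which makes the hierarchy inclusive. The capacities and the names `LRUCache`, `Core`, `read` and `fill` are hypothetical. The model ignores the i-cache, cache coherence and access latency.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line (assumed)


class LRUCache:
    """Toy fully associative cache that tracks whole lines with LRU replacement."""

    def __init__(self, name, capacity_lines):
        self.name = name
        self.capacity = capacity_lines
        self.lines = OrderedDict()  # line number -> None, least recently used first

    def lookup(self, line):
        """Return True on a hit and mark the line as most recently used."""
        if line in self.lines:
            self.lines.move_to_end(line)
            return True
        return False

    def fill(self, line):
        """Insert a line, evicting the least recently used one if the cache is full."""
        self.lines[line] = None
        self.lines.move_to_end(line)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)


class Core:
    """One core with a private L1 data cache and L2, plus a reference to the shared L3."""

    def __init__(self, shared_l3):
        self.l1d = LRUCache("L1", 8)   # hypothetical toy capacities, in lines
        self.l2 = LRUCache("L2", 32)
        self.l3 = shared_l3

    def read(self, address):
        """Return the level that served the read and copy the line into the levels above it."""
        line = address // LINE_SIZE
        levels = [self.l1d, self.l2, self.l3]
        for depth, cache in enumerate(levels):
            if cache.lookup(line):
                for upper in levels[:depth]:
                    upper.fill(line)
                return cache.name
        for cache in levels:  # missed everywhere: fetch from main memory
            cache.fill(line)
        return "memory"


l3 = LRUCache("L3", 128)  # a single L3 shared by all cores
core0, core1 = Core(l3), Core(l3)
print(core0.read(0x1000))  # memory: first access to this line
print(core0.read(0x1008))  # L1: same 64-byte line as 0x1000
print(core1.read(0x1000))  # L3: core 1's private caches miss, the shared L3 hits
```

The last read shows why a shared level helps. Core 1 has never touched the line, but it finds it in L3 instead of going to main memory. Real caches are set-associative, and whether a hierarchy is inclusive, exclusive or neither varies between designs.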
EXAMPLE:
--------
One classic example is iterating over a multidimensional array "inside out":
pseudocode
for (i = 0 to size - 1)
    for (j = 0 to size - 1)
        do something with ary[j][i]
This is cache-inefficient because modern CPUs do not fetch a single address from main memory. When one address is accessed, they load a whole cache line of neighbouring addresses, typically 64 bytes. Assuming a row-major layout, as in C, ary[j][i] and ary[j+1][i] are a full row apart. Every step of the inner loop advances the row index j, so each step touches a different cache line. For a large array, the line loaded for ary[j][i] is usually evicted before the outer loop moves on to its neighbour ary[j][i+1], and nearly every access misses the cache. The loop can be rewritten with the indices swapped, so that the inner loop walks along a row instead:
for (i = 0 to size - 1)
    for (j = 0 to size - 1)
        do something with ary[i][j]
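This version usually runs much faster. Consecutive iterations of the inner loop now touch consecutive addresses, so each cache line loaded from memory is fully used before it is evicted. The two loops are only equivalent when the loop body does not depend on the order in which the elements are visited.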
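The effect can be measured without timing anything, simply by counting cache misses. The following Python sketch is a simplified model, not a real CPU. It assumes a row-major size x size array of 8-byte elements starting at address 0, 64-byte lines, and a single fully associative LRU cache of 64 lines (4 KiB). The function name `count_misses` and its parameters are hypothetical.

```python
from collections import OrderedDict


def count_misses(order, size, elem_size=8, line_size=64, cache_lines=64):
    """Count misses while visiting every element of a row-major size x size array.

    order="row" visits ary[i][j] (inner loop along a row);
    order="column" visits ary[j][i] (inner loop down a column, "inside out").
    """
    if order not in ("row", "column"):
        raise ValueError("order must be 'row' or 'column'")
    cache = OrderedDict()  # cached line numbers, least recently used first
    misses = 0
    for i in range(size):
        for j in range(size):
            row, col = (i, j) if order == "row" else (j, i)
            # Row-major layout: element [row][col] starts at byte (row * size + col) * elem_size.
            line = (row * size + col) * elem_size // line_size
            if line in cache:
                cache.move_to_end(line)        # hit: mark as most recently used
            else:
                misses += 1                    # miss: load the line
                cache[line] = None
                if len(cache) > cache_lines:
                    cache.popitem(last=False)  # evict the least recently used line
    return misses


print(count_misses("row", 256))     # 8192: one miss per 64-byte line
print(count_misses("column", 256))  # 65536: every access misses
print(count_misses("column", 16))   # 32: a 2 KiB array fits in the cache
```

Take a 256 x 256 array of 8-byte elements (512 KiB). Each line holds eight elements, but in the inside-out order each inner loop cycles through 256 different lines. Every line is therefore evicted before the loop returns to it, and this order misses eight times as often as the row-wise one. For an array small enough to fit in the cache, both orders miss only once per line.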