Dissecting GPU Memory Hierarchy through Microbenchmarking

The supplementary document of

Dissecting GPU Memory Hierarchy through Microbenchmarking

Texture L1 cache
Read-only data cache
TLB
L1 data cache
L2 cache
Global BW
Global latency
Shared BW
Shared latency

Section 4: Cache Structures

Section 4.3 Texture L1 cache and read-only data cache

Texture L1 cache of the Fermi and Kepler device (Fig. 7)

Source Code: fine_grain_texture_L1.cu
Some results on the Fermi device: fermi-3073element-stride1.txt fermi-3080element-8stride.txt fermi-3112element-8stride.txt

Note that the basic unit of the N and s is the size of an integer, namely 4 bytes. For example, N=3072 means that the accessed array size is 3012*4/1024 = 12 KB. The output array length is 4096 (iterations=4096), i.e., we load 4096 array elements no matter how long the visited array is. Because iterations>N, we load the array multiple times.

We explain the process with the result on the Fermi device in the following. At first, we set N to a small value, and gradually increase it. At this time, there is no cache miss. The output memory latency is about 250 cycles, indicating the cache hit latency. When N=3073, some of the data access latencies become around 480, which indicate the cache miss latency. Therefore, the cache size is 12 KB (i.e., N=3072).

Stage 1

Setting: N=3073, s = 1

Analysis: Because we overflow the cache with only 1 element, the cache misses are caused by loading the data mapped to the first element of the cache lines mapped to the first cache set.
Part of the output fermi-3073element-stride1.txt is as followed, where we load the 1024th data to 1040th data one by one.

Next_id	Latency
1025	488.000000
1026	246.000000
1027	250.000000
1028	246.000000
1029	250.000000
1030	246.000000
1031	250.000000
1032	244.000000
1033	488.000000
1034	246.000000
1035	250.000000
1036	246.000000
1037	250.000000
1038	246.000000
1039	250.000000
1040	244.000000
1041	488.000000

There is 1 longer latency every 8 data accesses in the above table, indicating for the cache miss. At this time, the cache miss is caused by loading a new cache line. The cache line size equals: 8 * 4 = 32 bytes.

The replacement policy is LRU

The cache miss pattern deducing from the output file is periodic. For example, in the above table, the access latency of visiting the 1024th data (Next_id=1025) would always be around 480, and cannot be around 250.

Stage 2-1

Setting: N=3080:3104, s = 8

Analysis: We overflow the cache with 1 cache line. At each time, we load the data at the beginning position of a cache line. The cache misses are caused by visiting every data mapped to the first cache set.
Part of the output fermi-3080element-8stride.txt is as followed, where we load the 1024th data to 1159th data with a stride of every 8 elements.

Next_id	Latency
1032	488.000000
1040	472.000000
1048	478.000000
1056	472.000000
1064	250.000000
1072	244.000000
1080	250.000000
1088	246.000000
1096	250.000000
1104	246.000000
1112	250.000000
1120	246.000000
1128	250.000000
1136	246.000000
1144	250.000000
1152	246.000000
1160	486.000000

From the above table, we can find the 2D spatial locality optimized cache structure: four consecutive cache lines are mapped to one cache set. For the traditional cache, we would find a pattern like: each 1 cache miss followed by (number of cache sets-1) cache hits. In the above table, we find a pattern like: every 4 cache misses followed by 12 cache hits.
We increase N from 3080 to 3104, the total number of cache misses are close and the cache miss patterns are similar. This coincides with our deduction that four successive cache lines are mapped to one cache set.

Stage 2-2

Setting: N=3112:3200, s = 8

Analysis: We overflow the cache with 5 cache lines. There are much more cache misses compared with that of N=3104, since data mapped to the second cache are also missed.
Part of the output fermi-3112element-8stride.txt is as followed, where we also load the 1024th data to 1159th data with a stride of every 8 elements. Please compare with the left side table (also represents the memory access process of N=3104, s=8) for the difference.

Next_id	Latency
1032	488.000000
1040	472.000000
1048	476.000000
1056	472.000000
1064	478.000000
1072	472.000000
1080	476.000000
1088	474.000000
1096	250.000000
1104	244.000000
1112	250.000000
1120	246.000000
1128	250.000000
1136	246.000000
1144	250.000000
1152	246.000000
1160	484.000000

In the above table, loading the 1056th-1087th data also results in cache misses, despite the fact that we only increase the array size by one cache line sizes. Accordingly, the 1056th-1087th data are mapped to the 2nd cache set.
Similarly, we can find that 1088th-1112nd data are mapped to the 3rd cache set (N=3144), and 1120th-1144th are mapped to the 4th cache set (N=3176). When N is equal to or larger than 3176 and s is 8, all the memory loading are cache misses. Thus there are totally 4 cache sets.

To summarise the above, we find that the texture L1 cache of Fermi device is 12 KB, 4-set with 8-byte cache line, and the replacement policy is LRU. The number of cache lines in each cache set equals: 12*1024/(32*4)=96. It memory addressing is special, different from that of typical CPU cache.
The structure of the texture L1 cache of Kepler device is the same with that of Fermi, so that the (N,s) configuration at each stage is also the same.

Texture L1 cache of the Maxwell Device

Source Code: fine_grain_Maxwell_texture_L1.cu
The texture L1 cache of the Maxwell device is 24 KB, since the cache miss occurs until N=6145. In stage 1, we set N=6145, s=1, and find the cache line size is 32 bytes and the replacement policy is LRU. In stage 2-1, we set N=6152, s=8, and find the 2D spatial locality optimized cache addressing. In stage2-2, we find that there are still four cache sets.

The read-only data cache of the Kepler device

Source Code: fine_grain_Kepler_readOnly.cu
Some results on the Kepler device: Kepler-3080element-8stride.xlsx Kepler-LineMapping.xlsx

We can follow the same stage as described in the texture L1 cache part, and find the cache size and the cache line size are the same as those of the Kepler texture L1 cache. Four successive cache lines are mapped to one cache set, too. In the following, we only introduce how we get the cache replacement policy and the memory addressing.

The replacement policy is LRU. The memory access process is periodic. We can find this from Kepler-3080element-8stride.xlsx, where we set N=3080, s=8 and iterations=1024. We load the array multiple rounds. At each round, memory access to the ith data is fixed as a cache miss or cache hit.

The memory addressing is rather random. We analyse the memory access patterns with N as 3104, 3136, 3168 and 3200, respectively. Kepler-LineMapping.xlsx plots the memory addressing according to the analysis. Each row represents a cache line, and the four columns represent the memory access pattern. The cells are backgrounded in four colours, representing the 4 cache sets. Specifically, the pink colour stands for the 1st cache set, the blue colour stands for the 2nd, the purple colour stands for the 3rd, and the green colour stands for the 4th cache set. The memory addressing is rather random, not bits-defined.

The read-only data cache of the Maxwell device

Source Code: fine_grain_Maxwell_readOnly.cu

Section 4.4 The translation look-aside buffer (TLB)

The page entry size is 2 MB. The page table miss occurs when both N and s are very large. We deduce the page table miss from the memory access time, which is very long if the page table is missed, much longer than that of L2 cache miss. Based on our brute-force stride size testing, when s=2 MB, the memory access causes the page table miss.

There are two levels of TLB for al the three generations of GPUs: L1 TLB and L2 TLB. We set s=2 MB and increase N. We find a small memory access latency increase at N=34 MB and a bigger increase at N=132 MB, indicating the L1 TLB is 32 MB and L2 TLB is 130 MB, respectively. So there are 16 entries in the L1 TLB and 65 entries in the L2 TLB. The L1 TLB is fully associative, which is explained in Wong2010 paper. We mainly introduce how we find the unequal cache sets of the L2 TLB in the following.

Unequal cache sets in the L2 TLB (Fig. 8-9)

Source Code: fine_grain_TLB.cu
Results based on the output: Fermi_L2TLB_mapping.xlsx Kepler_L2TLB_mapping.xlsx Maxwell_L2TLB_mapping.xlsx

The replacement policy is LRU. The memory access process is periodic, i.e., we load the array multiple rounds and at each round, memory access to the ith data is fixed as a cache miss or cache hit.

The L2 TLB has a large set and six normal cache sets. We increase N from 130 MB to 144MB with s=2 MB, and count the number of missed page entries. Because the cache is LRU replaced, the missed entries caused by each increment belong to a new cache set. We plot the page entry mapping of the Fermi, Kepler and Maxwell device in Fermi_L2TLB_mapping.xlsx, Kepler_L2TLB_mapping.xlsx and Maxwell_L2TLB_mapping.xlsx, respectively. The memory mapping is not bits-defined and the first cache set is larger than the other six sets.

Section 4.5 The L1 data cache

We only discuss the L1 data cache of the Fermi device. The L1 data cache is turned off on the Kepler device by default, and the L1 data cache is unified with the texture L1 cache on the Maxwell device.

Source Code: fine_grain_fermi_L1.cu
Results based on the output: Fermi_L1datacache_15KB-24KB-128bytestride.txt Fermi_L1datacache_Mapping.xlsx

L1 data cache structure of the Fermi device (16 KB) (Fig. 10)

We apply the introduced 2-stage methodology. In the first stage, we set s=1 element and get the cache size, cache line size with N=4097 elements. In the second stage, we vary N from 16 KB to 24 KB with stride s=128 bytes, the output is given in Fermi_L1datacache_15KB-24KB-128bytestride.txt. Based on the output, we draw the memory addressing of the L1 data cache shown in Fermi_L1datacache_Mapping.xlsx, which is not bits-defined, either.

We find that the L1 data cache replacement policy is not LRU because the memory access is aperiodic, of which we illustrate in Fig. 11.

Section 4.6 The L2 cache

In previous benchmark functions, we load some array elements before timing the memory access latency, to avoid cold cache misses. In this benchmark kernel function, we also time the cold cache misses (if any) to observe the pre-fetch mechanism.

Proof for the L2 cache line size is 32 bytes:
Source Code: fine_grain_L2-cold_1GB.cu Result: Kepler_L2_cacheline_stride4byte.xlsx (s = 4 bytes, iterations =4096)
In the output file, every 1 of 8 data is missed, thus the cache line size is 8*4 = 32 bytes.

Proof for the hardware-level pre-fetch:
Source Code: fine_grain_L2-cold_4KB.cu Result: Kepler_L2_prefetch_4KB.txt
In the output file, except the first data loading, all the data are cache hit. There is no cold cache miss so that an hardware-level pre-fetch exists.

Texture L1 cache
Read-only data cache
TLB
L1 data cache
L2 cache
Global BW
Global latency
Shared BW
Shared latency

Section 5: Global Memory

Section 5.1 Global memory throughput

Please refer to the second paragraph of Section 5.1 for the experiment design.

Source Code: global_BW.cu
This is based on the source code provided by the CUDA handbook, available at https://github.com/ArchaeaSoftware/cudahandbook/tree/master/microbench.
Result: Fermi.xlsx Kepler.xlsx Maxwell.xlsx

Section 5.2 Global memory latency

Design for the Kepler and Maxwell device: Pattern_design_kepler_Maxwell.txt
Source code: pattern_kepler.cu
Result: pattern_demo_Kepler.xlsx pattern_demo_Maxwell.xlsx

The design of the Fermi device is almost the same as those of the Kepler and Maxwell device, except the stride size and the array size.
Source code: pattern_fermi.cu
Result: pattern_demo_Fermi.xlsx

Section 6: Shared Memory

Section 6.1 Shared memory throughput

Source code: shared_bandwidth.cu
Each thread copies a number of integers from one shared memory space to another shared memory space and returns the on-device latency cycles. The bandwidth is calculated per SM. We find the smallest launch time and the largest complete time of the warps executed on the same SM. For the bandwidth measured on different SMs under the same experiment setting, we keep the largest bandwidth only. We get the bandwidth data with respect to ILP and active warps per SM on three platforms, as following.
Fermi result: Fermi.txt
Kepler result: Kepler.txt
Maxwell result: Maxwell.txt

Section 6.2 Shared memory latency

Source code: shared_bankconflict.cu repeat.h
repeat.h is provided by the early GPU microbenchmark work on the GTX 280 GPU, which is available at http://www.stuffedcow.net/research/cudabmk.
Result: latency.txt

Texture L1 cache
Read-only data cache
TLB
L1 data cache
L2 cache
Global BW
Global latency
Shared BW
Shared latency