- Processor registers – fastest possible access (usually 1 CPU cycle), only hundreds of bytes in size
- Level 1 (L1) cache – often accessed in just a few cycles, usually tens of kilobytes
- Level 2 (L2) cache – 2× to 10× higher latency than L1, often 512 KB or more
- Level 3 (L3) cache – higher latency than L2, often 2048 KB or more
- Main memory – may take hundreds of cycles, but can be multiple gigabytes; access times may not be uniform in the case of a NUMA machine
- Disk storage – millions of cycles of latency if not cached, but very large
- Tertiary storage – several seconds of latency, but can be huge [2]
The basic idea underlying the design is to fill each of these memory hierarchy levels with a specified amount of data, as shown in Figure 2. Sequential and random accesses to these data are then performed, and the time taken is recorded to calculate the bandwidth of each level. The following two equations are used to calculate the cache bandwidth and the main memory bandwidth [3]:
- Bandwidth for cache = processor speed × bus width
- Bandwidth for main memory = memory speed × bus width × 2 (for double data rate)
Figure 2 Design Strategy
Testing Platform
- CPUs being tested (Table 1):
  - Intel Core i7 975 Extreme Edition [4]:
    - Quad core (4 cores)
    - 45 nm minimum feature size
    - 8 MB L3 cache
    - CPU clock of 3.33 GHz
    - External clock of 133 MHz with a 25× multiplier
    - Nehalem microarchitecture
  - Intel Core i5 750:
    - Quad core (4 cores)
    - 45 nm minimum feature size
    - 8 MB L3 cache
    - CPU clock of 2.66 GHz
    - External clock of 133 MHz with a 20× multiplier
    - Nehalem (Lynnfield) microarchitecture
Table 1 CPU Comparison
* Values are estimated from research; no official documentation was available.
- Motherboard: ASUSTeK Rampage II Extreme (Intel X58)
- RAM: Corsair 6 GB triple-channel DDR3-1333
- Hard disk: Hitachi 1 TB SATA2 3 Gb/s
- Graphics: ATI Radeon HD 4870 X2, 2 GB GDDR5
- Case: Antec Nine Hundred Two
- Power supply: Hyena True-Power 500 W, 12 V dual rail
- OS: Windows Vista Ultimate
Theoretical Prediction
Hypothesis
For sequential access to the memory, we expect to observe four distinct levels on the graph, corresponding to accesses to the L1 cache, the L2 cache, the L3 cache and the main memory. We expect this because the access time increases significantly at each transition between levels of the memory hierarchy [6]. For random access to the memory, the graph should show an exponential decrease in the data transfer rate as the current level of the memory hierarchy saturates, then level off at a particular value once the memory size of the current level is exceeded.
Estimation Calculation
- Intel Core i7 975 Extreme Edition:
  Bandwidth for L1 cache = processor speed × bus width = 3300 MHz × 32 bytes = 106,000* MB/s ± 10 MB/s
  Bandwidth for L2 cache = processor speed × bus width = 3300 MHz × 16 bytes = 52,800* MB/s ± 10 MB/s
  Bandwidth for main memory = memory speed × bus width × 2 = 133 MHz × 8 bytes × 2 = 2,130* MB/s ± 10 MB/s
- Intel Core i5 750:
  Bandwidth for L1 cache = processor speed × bus width = 2660 MHz × 16 bytes = 42,600* MB/s ± 10 MB/s
  Bandwidth for L2 cache = processor speed × bus width = 2660 MHz × 8 bytes = 21,300* MB/s ± 10 MB/s
  Bandwidth for main memory = memory speed × bus width × 2 = 133 MHz × 8 bytes × 2 = 2,130* MB/s ± 10 MB/s
*Results are rounded to 3 significant figures.
Design Methodology
Experimental Achievement
The aim of this experiment is to measure the data transfer rate at each level of the memory hierarchy. In our design, both sequential access and random access were tested. Fetching a large block of data sequentially was expected to be the most efficient pattern, yielding the maximum possible bandwidth; testing random access gave the worst-case bandwidth.
Experimental Implementation
- Create an array large enough to fill the L1 cache, the L2 cache, the L3 cache and the main memory. The size of the array can be set by calling malloc().
- Loop through the whole array and access each element. To prevent the compiler optimizer from removing the loop, each element is stored into a variable called sum.
- Set the initial region of the array to be accessed.
- Record the initial system time by calling clock() and store it in a variable t.
- A. Perform a nested loop in which data in the array is read and stored into the variable sum, overwriting its previous value; this gives the sequential access test.
  B. Similar to A, but the function rand() is called to generate a random index into the array; this gives the random access test.
- Record the new reading of the timer by calling clock() again.
- Subtract the two recorded times, then divide the amount of data accessed by the elapsed time to obtain the data transfer rate.
- Print the output, which is the data transferred per second.
- Increase the size of the array and repeat the whole process to calculate the data transfer rate for every level of the hierarchy.
Experimental Errors
To reduce the random error in the output, the access loop is run 100,000 times, because the time taken by a single pass is too small to register in a C program. By repeating the memory access many times, the access time can be averaged over a relatively large number of "ticks" (the number of ticks per second can be displayed by printing CLOCKS_PER_SEC) [7].
Result
Intel Core i7 975 Extreme Edition:
Figure 3 Graph of Data Transfer Rate versus Array Size for Sequential Access for Core i7 975 EE
Intel Core i5 750:
Figure 4 Graph of Data Transfer Rate versus Array Size for Sequential Access for Core i5 750
Figure 5 Average Data Transfer Rate Comparison between Core i7 EE and Core i5 750
For the sequential access test, the final results (Table 2) are consistent with our original hypothesis and clearly show four drops in the graphs, corresponding to accesses to the L1 cache, the L2 cache, the L3 cache and the main memory. As shown in Figure 3, the Core i7 975 EE processor gives an approximate average bandwidth of 76,000 MB/s for array sizes smaller than or equal to 256 KB (L1 cache), 39,000 MB/s for sizes between 256 KB and 1024 KB (L2 cache), 17,000 MB/s for sizes between 1024 KB and 8192 KB (L3 cache) and 5,400 MB/s for sizes greater than 8192 KB (main memory). The Core i5 750 processor gives an approximate average bandwidth of 68,000 MB/s for array sizes smaller than or equal to 64 KB (L1 cache), 35,000 MB/s for sizes between 64 KB and 1024 KB (L2 cache), 22,000 MB/s for sizes between 1024 KB and 8192 KB (L3 cache) and 10,000 MB/s for sizes greater than 8192 KB (main memory), as shown in Figure 4. The average bandwidths of the three cache levels for both processors are shown and compared in Figure 5.
Table 2 Different Level Bandwidth Comparison
Discussion
The results presented in Section 4 for the L1, L2 and L3 caches are very close to our initial hypothesis. However, the measured bandwidth for the main memory differs considerably from the estimated value. The reason for this is unclear; possible causes include processor latency, interference from other processes and the operating system, and too few repetitions of the memory access [8].
Table 3 Comparison of Bandwidth between Core i5 750 and Core i7 975 EE
Figure 6 Comparison of Bandwidth between Core i5 750 and Core i7 975 EE
As shown above, Table 3 summarizes the bandwidth of every memory hierarchy level for both processors. It can be clearly seen that the Core i7 975 EE has better overall performance than the Core i5 750; however, the Core i5 750 has a much faster L3 cache. Figure 6 shows a radar graph comparing the two processors.
Conclusion and Recommendation
- There are four distinct levels of data transfer rate, corresponding to accesses to the L1 cache, the L2 cache, the L3 cache and the main memory.
- All cache results for the Core i7 975 EE are consistent with the hypothesis:
  - L1 cache has a data transfer rate of 77,000 MB/s
  - L2 cache has a data transfer rate of 39,000 MB/s
  - L3 cache has a data transfer rate of 17,000 MB/s
- All cache results for the Core i5 750 are consistent with the hypothesis:
  - L1 cache has a data transfer rate of 68,000 MB/s
  - L2 cache has a data transfer rate of 35,000 MB/s
  - L3 cache has a data transfer rate of 22,000 MB/s
- The main memory results for both processors differ from the estimated values:
  - Bandwidth of Core i7 975 EE: 5,400 MB/s
  - Bandwidth of Core i5 750: 10,000 MB/s
Given the considerable difference between the measured main memory bandwidth and the estimated value, the following recommendations are made:
- The testing code should be run in "Safe Mode with Command Prompt" on Windows Vista Ultimate, in order to reduce interference and overhead from the operating system and other processes.
- Alternatively, the test could be run on Linux using gcc at its maximum optimization level (-O3) [9].
- The latency of the processor should be measured using third-party software and taken into account during the testing stage [10].
- Enough repetitions of the memory access should be performed to obtain more reliable results.
List of References
[1] Comer, D. E. (2005) Essentials of Computer Architecture. Pearson/Prentice Hall, New Jersey, USA.
[2] Wikipedia (2009). Retrieved 29 September 2009.
[3] Corei7ee (2009). Retrieved 29 September 2009.
[4] Corei750 (2009). Retrieved 29 September 2009.
[5] Swan, A. (2008) Memory Hierarchy Design.
[6] Guistin, V. (2004) "Fast Data Dependence Analysis in a Multimedia Vectorizing Compiler", Proceedings of the 12th Euromicro Symposium on Parallel and Distributed Computing (PDP 2004), 11–13 February, La Coruña, Spain, pp. 176–183.
[7] Morris, J. and Biglari-Abhari, M. (2009) "Computer Architecture", lecture notes, University of Auckland, New Zealand. Unpublished.
[8] Patterson, D. A. and Hennessy, J. L. (2009) Computer Organization and Design: The Hardware/Software Interface, 4th ed. Morgan Kaufmann/Elsevier, Amsterdam; Boston.
[9] Nilsson, N. J. (2000) Computer Architecture: A New Synthesis. Morgan Kaufmann Publishers.
[10] Joshon, H. (2006) Software Architecture and Design. Retrieved 29 September 2009.