- Processor registers – fastest possible access (usually 1 CPU cycle), only hundreds of bytes in size
- Level 1 (L1) cache – often accessed in just a few cycles, usually tens of kilobytes
- Level 2 (L2) cache – 2× to 10× higher latency than L1, often 512 KB or more
- Level 3 (L3) cache – higher latency than L2, often 2048 KB or more
- Main memory – may take hundreds of cycles, but can be multiple gigabytes; access times may not be uniform in the case of a NUMA machine
- Disk storage – millions of cycles of latency if not cached, but very large
- Tertiary storage – several seconds of latency, but can be huge [2]
The basic idea underlying the design is to fill each of these memory hierarchy levels with a specified amount of data, as shown in Figure 2. Sequential and random accesses to these data are then performed, and the time taken is recorded to calculate the bandwidth of each level. The following two equations are used to calculate the cache bandwidth and the main memory bandwidth [3]:
- Bandwidth for cache = processor speed × bus width
- Bandwidth for main memory = memory speed × bus width × 2 (for double data rate)
Figure 2 Design Strategy
Testing Platform
- CPUs being tested (Table 1):
  - Intel Core i7 975 Extreme Edition [4]:
    - Quad core (4 cores)
    - 45 nm minimum feature size
    - 8 MB L3 cache
    - CPU clock of 3.33 GHz
    - External clock of 133 MHz with a 25× multiplier
    - Nehalem microarchitecture
  - Intel Core i5 750:
    - Quad core (4 cores)
    - 45 nm minimum feature size
    - 8 MB L3 cache
    - CPU clock of 2.66 GHz
    - External clock of 133 MHz with a 20× multiplier
    - Nehalem (Lynnfield) microarchitecture
Table 1 CPU Comparison
* Values are estimated from research; no official documentation was available.
- Motherboard: ASUSTeK Rampage II Extreme (Intel X58)
- RAM: Corsair 6 GB triple-channel DDR3-1333
- Hard disk: Hitachi 1 TB SATA2 3 Gb/s
- Graphics: ATI Radeon HD 4870 X2, 2 GB GDDR5
- Case: Antec Nine Hundred Two
- Power supply: Hyena True-Power 500 W, 12 V dual rail
- OS: Windows Vista Ultimate
Theoretical Prediction
Hypothesis
For sequential access to the memory, we expect to observe four distinct levels on the graph, corresponding to accesses to the L1 cache, the L2 cache, the L3 cache and the main memory. We expect this because the access time increases significantly at each transition between levels of the memory hierarchy [6]. For random access to the memory, the graph should show an exponential decrease in the data transfer rate as the current level of the memory hierarchy saturates, then level off at a particular value once the memory size of the current level is exceeded.
Estimation Calculation
- Intel Core i7 975 Extreme Edition:
  Bandwidth for L1 cache = processor speed × bus width = 3300 MHz × 32 bytes = 106,000* MB/s ± 10 MB/s
  Bandwidth for L2 cache = processor speed × bus width = 3300 MHz × 16 bytes = 52,800* MB/s ± 10 MB/s
  Bandwidth for main memory = memory speed × bus width × 2 = 133 MHz × 8 bytes × 2 = 2,130* MB/s ± 10 MB/s
- Intel Core i5 750:
  Bandwidth for L1 cache = processor speed × bus width = 2660 MHz × 16 bytes = 42,600* MB/s ± 10 MB/s
  Bandwidth for L2 cache = processor speed × bus width = 2660 MHz × 8 bytes = 21,300* MB/s ± 10 MB/s
  Bandwidth for main memory = memory speed × bus width × 2 = 133 MHz × 8 bytes × 2 = 2,130* MB/s ± 10 MB/s
*Results are rounded to 3 significant figures.
Design Methodology
Experimental Achievement
The aim of this experiment is to measure the data transfer rate at each level of the memory hierarchy. In our design, both sequential access and random access were tested. Fetching a large block of data sequentially was expected to be the most efficient pattern, yielding the maximum possible bandwidth; testing random access gave the worst-case bandwidth.
Experimental Implementation
- Create an array large enough to fill the L1 cache, the L2 cache, the L3 cache and the main memory. The size of the array can be set by calling malloc().
- Loop through the whole array and access each element. To prevent the compiler optimizer from removing the loop, each element is stored into a variable called sum.
- Set the initial region of the array to be accessed.
- Record the initial system time by calling clock() and store it in a variable t.
- A. Perform a nested loop in which data in the array is read and stored into the variable sum, overwriting its previous value; this gives the sequential access test.
  B. Similar to A, but the function rand() is called to generate a random index into the array; this gives the random access test.
- Record the new reading of the timer by calling clock() again.
- Subtract the two recorded times, then divide the amount of data accessed by the elapsed time to obtain the data transfer rate.
- Print the output, which is the data transferred per second.
- Increase the size of the array and repeat the whole process to calculate the data transfer rate for every level of the hierarchy.
Experimental Errors
To reduce the random error in the output, the access loop is run 100,000 times, because the time taken by a single pass is too small to register in a C program. By repeating the memory access many times, the access time can be averaged over a relatively large number of "ticks" (the number of ticks per second can be displayed by printing CLOCKS_PER_SEC) [7].
Result
Intel Core i7 975 Extreme Edition:
Figure 3 Graph of Data Transfer Rate versus Array Size for Sequential Access for Core i7 975 EE
Intel Core i5 750:
Figure 4 Graph of Data Transfer Rate versus Array Size for Sequential Access for Core i5 750
Figure 5 Average Data Transfer Rate Comparison between Core i7 EE and Core i5 750
For the sequential access test, the final results (Table 2) are consistent with our original hypothesis and clearly show four drops in the graphs, corresponding to accesses to the L1 cache, the L2 cache, the L3 cache and the main memory. As shown in Figure 3, the Core i7 975 EE processor gives an approximate average bandwidth of 76,000 MB/s for array sizes smaller than or equal to 256 KB (L1 cache), 39,000 MB/s for sizes between 256 KB and 1024 KB (L2 cache), 17,000 MB/s for sizes between 1024 KB and 8192 KB (L3 cache) and 5,400 MB/s for sizes greater than 8192 KB (main memory). The Core i5 750 processor gives an approximate average bandwidth of 68,000 MB/s for array sizes smaller than or equal to 64 KB (L1 cache), 35,000 MB/s for sizes between 64 KB and 1024 KB (L2 cache), 22,000 MB/s for sizes between 1024 KB and 8192 KB (L3 cache) and 10,000 MB/s for sizes greater than 8192 KB (main memory), as shown in Figure 4. The average bandwidths of the three cache levels for both processors are shown and compared in Figure 5.
Table 2 Different Level Bandwidth Comparison
Discussion
The results presented in Section 4 for the L1, L2 and L3 caches are very close to our initial hypothesis. However, the measured bandwidth for the main memory differs considerably from the estimated value. The reason for this is unclear; possible causes include processor latency, interference from other processes and the operating system, and too few repetitions of the memory access [8].
Table 3 Comparison of Bandwidth between Core i5 750 and Core i7 975 EE
Figure 6 Comparison of Bandwidth between Core i5 750 and Core i7 975 EE
As shown above, Table 3 summarizes the bandwidth of every memory hierarchy level for both processors. It can be clearly seen that the Core i7 975 EE has better overall performance than the Core i5 750; however, the Core i5 750 has a much faster L3 cache. Figure 6 shows a radar graph comparing the two processors.
Conclusion and Recommendation
- There are four distinct levels of data transfer rate, corresponding to accesses to the L1 cache, the L2 cache, the L3 cache and the main memory.
- All cache results for the Core i7 975 EE are consistent with the hypothesis:
  - L1 cache has a data transfer rate of 77,000 MB/s
  - L2 cache has a data transfer rate of 39,000 MB/s
  - L3 cache has a data transfer rate of 17,000 MB/s
- All cache results for the Core i5 750 are consistent with the hypothesis:
  - L1 cache has a data transfer rate of 68,000 MB/s
  - L2 cache has a data transfer rate of 35,000 MB/s
  - L3 cache has a data transfer rate of 22,000 MB/s
- The main memory results for both processors differ from the estimated values:
  - Bandwidth of Core i7 975 EE: 5,400 MB/s
  - Bandwidth of Core i5 750: 10,000 MB/s
Given the considerable difference between the measured main memory bandwidth and the estimated value, the following recommendations are made:
- The testing code should be run in "Safe Mode with Command Prompt" on Windows Vista Ultimate, in order to reduce interference and overhead from the operating system and other processes.
- Alternatively, the test could be run on Linux using gcc at its maximum optimization level (-O3) [9].
- The latency of the processor should be measured using third-party software and taken into account during the testing stage [10].
- Enough repetitions of the memory access should be performed to obtain more reliable results.
List of References
[1] Comer, D. E. (2005) Essentials of Computer Architecture. Pearson/Prentice Hall, New Jersey, USA.
[2] Wikipedia (2009). Retrieved 29 September 2009.
[3] Corei7ee (2009). Retrieved 29 September 2009.
[4] Corei750 (2009). Retrieved 29 September 2009.
[5] Swan, A. (2008) Memory Hierarchy Design.
[6] Guistin, V. (2004) "Fast Data Dependence Analysis in a Multimedia Vectorizing Compiler", Proceedings of the 12th Euromicro Symposium on Parallel and Distributed Computing (PDP 2004), 11–13 February, La Coruña, Spain, pp. 176–183.
[7] Morris, J. and Biglari-Abhari, M. (2009) "Computer Architecture", lecture notes, University of Auckland, New Zealand. Unpublished.
[8] Patterson, D. A. and Hennessy, J. L. (2009) Computer Organization and Design: The Hardware/Software Interface, 4th ed. Morgan Kaufmann/Elsevier, Amsterdam; Boston.
[9] Nilsson, N. J. (2000) Computer Architecture: A New Synthesis. Morgan Kaufmann Publishers.
[10] Joshon, H. (2006) Software Architecture and Design. Retrieved 29 September 2009.