In this paper we will talk about multi-core processors, specifically the Intel Core i7 processor. We will discuss what a processor is and what it is used for in computer architectures, look at several types of Intel Core i7 processors, and see how the cache is implemented within them.
First, let’s discuss what a multi-core processor is and what its job is in a computer. A multi-core processor is a single processor with multiple cores, each of which reads and executes program instructions. Because every core can execute its own instructions at the same time as the others, a multi-core processor offers enhanced performance, reduced power consumption and concurrent processing of multiple tasks.
In November 2008, Intel released the first generation of the Core i7 family. The first generation included the very popular Nehalem processors as well as Westmere.
Since then, Intel has added seven more generations to the Core i7 family. The Core i7-8700K, revealed in September 2017, belongs to the Coffee Lake (8th generation) family and has been described as the best gaming processor Intel has ever produced. TechRadar, an online publication that focuses on technology, also named the Intel Core i7-7820X, which belongs to the Skylake (6th generation) family, the best video editing processor on the market in 2018.
The most significant change introduced with the Core i7 was the elimination of the north bridge, or memory controller hub. Intel accomplished this by moving the memory and PCI-e controllers onto the processor itself. The memory controller performs DRAM refresh as well as reads and writes to memory, while the PCI-e controller interfaces with the GPU. In all previous Intel processors, the memory and graphics controllers were left to the motherboard manufacturer to implement. With this change, memory and graphics latency have been both reduced and standardized.
Hyper-Threading Technology is Intel's implementation of simultaneous multithreading. This technology presents two virtual processing cores for each physical core in the processor. The physical core powers both virtual cores, which allows them to share the processing tasks. Hyper-Threaded processors have the same advantages as a multi-core processor, but a multi-core Hyper-Threaded processor performs better because each physical core has its own pair of virtual cores, and two virtual cores together can keep a physical core's execution resources busier than a single thread would.
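As a small illustration of the relationship between physical and virtual cores, the following Python sketch computes how many logical processors an operating system would see. The helper name logical_core_count is hypothetical and introduced only for this example; os.cpu_count() is the standard-library call that reports the number of logical CPUs on the machine running the script.

```python
import os


def logical_core_count(physical_cores, threads_per_core=2):
    """Return the number of logical (virtual) cores exposed to the OS.

    threads_per_core=2 reflects the two virtual cores per physical core
    described above; the function name is an assumption for this sketch.
    """
    if physical_cores < 1 or threads_per_core < 1:
        raise ValueError("core and thread counts must be positive")
    return physical_cores * threads_per_core


# A quad-core processor with Hyper-Threading appears as 8 logical cores.
quad_core_logical = logical_core_count(4)
print("Logical cores on a quad-core with Hyper-Threading:", quad_core_logical)

# Logical CPU count of the machine running this script (may be None).
print("Logical cores on this machine:", os.cpu_count())
```

On a Hyper-Threaded machine, the second number is typically twice the number of physical cores.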
Core i7 (Nehalem) Cache and Memory System
Nehalem features a 3-level cache hierarchy. There is a 64KB Level 1 cache and a 256KB Level 2 cache for each core (not shared), and an 8MB Level 3 cache (shared between cores).
The Level 1 cache is the same size as in Penryn, but it is slower (4 cycles vs. 3). The L1 latency was relaxed because the faster cache was gating clock speed as the chip grew in complexity and size. It is estimated that the higher L1 latency costs Nehalem 2-3% in performance.
The Level 2 cache is designed to be a buffer in front of the L3 cache, so that the cores do not all drain bandwidth directly from the L3. The memory hierarchy of Conroe was comparatively simple, so Intel focused on the performance of the shared L2 cache. Intel determined this was the best solution for that architecture because it was meant mostly for dual-core implementations. With the i7 (Nehalem), however, the engineers decided to start from scratch and concluded that an L2 cache shared among all cores was not suited to a native quad-core architecture. With four cores sharing one cache, a core could frequently flush data needed by another core, and providing all cores with enough bandwidth at sufficiently low latency would create too many problems in implementing internal buses and memory arbitration. To solve this problem, Intel gave each core its own Level 2 cache. These L2 caches are relatively small, at 256KB each, since they are dedicated to a single core, and that small size allowed Intel to improve their performance greatly. L2 latency is reportedly significantly better than in Penryn: approximately 10 cycles, down from 15.
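To make the effect of these latencies concrete, here is a minimal Python sketch of an expected load latency calculation for a three-level hierarchy. Only the 4-cycle L1 and 10-cycle L2 figures come from the text above. The L3 latency, the main-memory latency, and all hit rates are assumed values chosen purely for illustration, and each latency is treated as the total load-to-use time when a request hits at that level.

```python
def expected_latency(levels, memory_latency):
    """Expected load latency in cycles for a multi-level cache.

    levels: list of (latency_cycles, hit_rate) pairs, ordered L1 first.
            hit_rate is the fraction of requests reaching that level
            that hit there.
    memory_latency: cycles for a request that misses every cache level.
    """
    reach = 1.0   # probability that a request reaches the current level
    total = 0.0
    for latency, hit_rate in levels:
        total += reach * hit_rate * latency
        reach *= 1.0 - hit_rate
    return total + reach * memory_latency


nehalem_like = [
    (4, 0.90),    # L1: 4 cycles (from the text); hit rate is assumed
    (10, 0.50),   # L2: ~10 cycles (from the text); hit rate is assumed
    (40, 0.50),   # L3: latency and hit rate are both assumed
]
ASSUMED_MEMORY_LATENCY = 200  # cycles, assumed for illustration

print(round(expected_latency(nehalem_like, ASSUMED_MEMORY_LATENCY), 2))
```

Under these assumed numbers, most of the expected latency comes from the small fraction of requests that miss every cache level, which is why a fast private L2 acting as a buffer in front of the L3 matters.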
The L3 cache is designed to be shared by all the cores and is 8MB in the initial Core i7, although its size varies depending on the number of cores. Its large capacity is there to manage communication between cores, so multi-threaded applications that use all cores benefit greatly from the shared L3 cache. The L3 cache is inclusive, meaning that it contains all the data stored in the L1 and L2 caches. This improves performance and reduces power consumption: if a core tries to access a data item and it is not present in the L3, then the data cannot be in any of the other cores’ private caches either. Conversely, if the data is present, 4 bits associated with each cache line show whether the data may also be held in the lower-level cache of another core (and which core). Therefore, if the CPU doesn’t find the required data in the L3 cache, it doesn’t have to look for it in the other cores’ L1 or L2 caches, because it won’t be there.
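The filtering role of the inclusive L3 can be sketched in Python as follows. This is a simplified model and not Intel's implementation: the class and method names are hypothetical, eviction and coherence states are omitted, and the per-line bits are modeled as an integer bitmask with one bit per core.

```python
class InclusiveL3:
    """Toy model of an inclusive shared L3 with per-line core bits."""

    def __init__(self, num_cores=4):
        self.num_cores = num_cores
        self.lines = {}  # line address -> bitmask of cores that may hold it

    def fill(self, addr, core):
        """Record that `core` loaded the line into its private caches.

        Because the L3 is inclusive, the line is also present in the L3.
        """
        if not 0 <= core < self.num_cores:
            raise ValueError("invalid core index")
        self.lines[addr] = self.lines.get(addr, 0) | (1 << core)

    def lookup(self, addr, core):
        """Return (hit, cores_to_snoop) for a request from `core`.

        An L3 miss means no private cache can hold the line, so the
        list of cores to snoop is empty.
        """
        mask = self.lines.get(addr)
        if mask is None:
            return False, []
        others = [c for c in range(self.num_cores)
                  if c != core and mask & (1 << c)]
        return True, others


l3 = InclusiveL3()
l3.fill(0x1000, core=2)
print(l3.lookup(0x1000, core=0))  # hit; only core 2 may need a snoop
print(l3.lookup(0x2000, core=0))  # miss; no private cache is checked
```

In this model, a miss never triggers a search of the other cores' private caches, and a hit narrows the search to the cores whose bits are set.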
To gain more perspective on the effectiveness of the Intel Core i7 (Nehalem), we will look at previous research that compares its performance to that of other quad-core processors. The paper Performance Evaluation of the Nehalem Quad-core Processor for Scientific Computing compares the performance of Nehalem to two first-generation quad-core processors, the AMD Opteron 8350 (Barcelona) and the Intel Xeon X7350 (Tigerton). In AMD’s Barcelona, each core has a private 64KB L1 cache and a private 512KB L2 cache, and each processor has a shared 2MB L3 cache. In Intel’s Tigerton, each core has a private 64KB L1 cache (32KB data + 32KB instruction), while each pair of cores shares a 4MB L2 cache, for a total of 8MB of L2 cache.
The following table lists the characteristics of each quad core processor.
The authors of this paper compare the processors using different types of memory-intensive applications. The following shows the results of comparing single-core performance across these architectures.
The graph on the left compares the time it took for one iteration of the test loop to run and shows the results for each processor on each specific application. The graph on the right shows the advantage of Nehalem, where a value of 1.0 means the two processors have the same performance and a value of 2.0 means Nehalem is 2x faster. From the graph we can conclude that Nehalem is about 1.1x to 1.8x faster than Tigerton and 1.6x to 2.9x faster than Barcelona. It is important to note that even though Nehalem's clock speed of 2.8GHz is slower than Tigerton's 2.93GHz, Nehalem still performs better in all of the applications. We can conclude that Nehalem achieves better performance because of improvements in the design of its memory and cache organization, and that in this case clock speed is neither the sole predictor of performance nor the most definitive one.
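The advantage values described above are ratios of run times, and they can be adjusted for clock speed to show how much of the gain comes from the architecture rather than frequency. The Python sketch below illustrates both calculations. The function names and the iteration times are hypothetical and are not taken from the cited paper; only the 2.8GHz and 2.93GHz clock speeds come from the text.

```python
def speedup(time_other, time_nehalem):
    """Advantage of Nehalem: 1.0 means equal, 2.0 means twice as fast."""
    return time_other / time_nehalem


def per_clock_speedup(time_other, freq_other_ghz,
                      time_nehalem, freq_nehalem_ghz):
    """Ratio of clock cycles needed for the same work.

    cycles = time * frequency, so this removes the effect of clock speed.
    """
    return (time_other * freq_other_ghz) / (time_nehalem * freq_nehalem_ghz)


# Hypothetical iteration times in seconds, for illustration only.
t_tigerton, t_nehalem = 1.1, 1.0

print(round(speedup(t_tigerton, t_nehalem), 2))
print(round(per_clock_speedup(t_tigerton, 2.93, t_nehalem, 2.8), 2))
```

Because Nehalem runs at the lower clock speed, its per-clock advantage over Tigerton is slightly larger than the raw time ratio suggests.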