
Essay: Exploring IBM’s Next-Generation Mainframe Processor: The z10 & Its 4.4 GHz Performance

  • Published: 1 April 2019
  • Words: 2,514 (approx)




IBM z10: The Next-Generation Mainframe Microprocessor

Dhruvil Mehta, CIN-305093313

Department of Electrical and Electronics Engineering

California State University-Los Angeles

5151 State University Drive, Los Angeles, CA-90032

Email- dhruvilmehta94@gmail.com

Nikunj Chauhan, CIN-305079845

Department of Electrical and Electronics Engineering

California State University-Los Angeles

5151 State University Drive, Los Angeles, CA-90032

Email- nikunjchauhan88@gmail.com

Abstract— The IBM System z10 processor includes four cores, each with a private 3-Mbyte cache and integrated accelerators for decimal floating-point computation, cryptography, and data compression. A separate SMP hub chip provides a shared third-level cache and the interconnect fabric for multiprocessor scaling. This article focuses on the high-frequency design methods used to achieve a 4.4-GHz system and on the pipeline design that underlies the z10's CPU performance.

I. INTRODUCTION

The IBM System z10 processor is the engine for the next generation of IBM System z mainframe servers. Its new microprocessor core adopts leading-edge high-frequency technology while maintaining full compatibility with the existing z/Architecture instruction set architecture. The processor chip incorporates several special-function units to accelerate particular operations and includes robust hardware fault detection and recovery mechanisms. The chip also integrates a memory controller, an I/O bus controller, and a switch that connects all four cores to a shared interface with the SMP hub chip. The z10 processor is implemented in IBM's 65-nm silicon-on-insulator (SOI) process, has a die size of 454 mm2, contains 994 million transistors, and operates at 4.4 GHz in a multiprocessor system.

II. ARCHITECTURE

The z10 processor fully implements the IBM z/Architecture, the current generation of the mainframe instruction set architecture, which traces back to the S/360 and celebrated its 40th anniversary in 2004. Although the architecture has undergone many significant extensions during this time, IBM has rigorously maintained upward binary compatibility at the application-program level to preserve customer investment and guarantee a smooth transition from one generation to the next. For z10, more than 50 instructions were added to z/Architecture, mostly to improve the efficiency of compiled applications. The z10 processor also adds support for 1-Mbyte page frames and software-hardware interfaces to improve cache efficiency.

Figure 1. The z10 processor chip includes four microprocessor cores, each surrounded on three sides by its 3-Mbyte second-level cache. Two coprocessor units, each shared by two cores, provide cryptography and data compression.

III. CORE SUMMARY

The z10 microprocessor core is organized into eight units, shown in Figure 2. The instruction fetch unit (IFU) contains a 64-Kbyte instruction cache, branch prediction logic, instruction-fetching controls, and instruction buffers. This unit's size is a consequence of the elaborate branch prediction design required to minimize branch mispredictions in a high-frequency pipeline across a broad range of programs.

The instruction buffers in the IFU feed the instruction decode unit (IDU) in the center of the core. This logic parses and decodes nearly 900 distinct opcodes defined in z/Architecture (668 of which are implemented entirely in hardware on z10), identifies dependencies among instructions, forms instruction pairs for superscalar execution when possible, and issues instructions to the operand access and execution logic.

Figure 2. The z10 microprocessor core, organized into eight units. (IFU: Instruction Fetch Unit; IDU: Instruction Decode Unit; LSU: Load-Store Unit; XU: Translation Unit; FXU: Fixed-Point Unit; BFU: Binary Floating-Point Unit; DFU: Decimal Floating-Point Unit; RU: Recovery Unit.)

The load-store unit (LSU) incorporates a 128-Kbyte data cache and handles operand accesses across the wide range of lengths, modes, and formats included in z/Architecture; it supports two quadword fetches per cycle. It also buffers operand store results between instruction execution and completion, and it interfaces with the multiprocessor fabric (through the second-level cache) to maintain the strongly ordered cache coherence that the architecture requires. The LSU is paired with a translation unit (XU), which comprises a large second-level translation look-aside buffer (TLB) and a hardware translation engine. The latter handles access-register translation and dynamic address translation (DAT) to convert logical addresses to real addresses, including the nested DAT required for operating systems running as guests under the z/VM (virtual machine) hypervisor.
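The role a TLB plays in DAT can be sketched in a few lines. The following is a deliberately minimal, hypothetical model (a flat one-level page table and a fully associative TLB dictionary); real z/Architecture DAT is a multi-level table walk with region, segment, and page tables, and the addresses here are invented for illustration:

```python
PAGE_SHIFT = 12               # 4-Kbyte base pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1

# Hypothetical page table: virtual page number -> real page frame number.
page_table = {0x40000: 0x1F2, 0x40001: 0x2A7}

tlb = {}                      # toy fully associative TLB

def translate(vaddr):
    """Translate a logical address to a real address, caching in the TLB."""
    vpn, offset = vaddr >> PAGE_SHIFT, vaddr & PAGE_MASK
    if vpn not in tlb:                  # TLB miss: fall back to a table walk
        tlb[vpn] = page_table[vpn]      # (real DAT walks multiple levels)
    return (tlb[vpn] << PAGE_SHIFT) | offset

print(hex(translate(0x40000123)))       # -> 0x1f2123
```

A hit in the TLB skips the walk entirely, which is why the z10 dedicates a large second-level TLB to keeping translations resident.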

Three units handle the actual instruction execution. The fixed-point unit (FXU) performs fixed-point arithmetic, logical, and branch instructions. The FXU executes most of these in a single cycle, and in pairs, with a full forwarding network to permit back-to-back execution of dependent operations. The binary floating-point unit (BFU) is a multistage pipeline that handles all binary (IEEE 754 compliant) and hexadecimal (S/360 legacy) floating-point operations. This unit can begin one operation per cycle and uses intra-pipeline forwarding of results to minimize pipeline delays between dependent instructions. The BFU also executes fixed-point multiplication and division instructions. The decimal floating-point unit (DFU) executes both floating-point (IEEE 754R compliant) and fixed-point (S/360 legacy) decimal operations. Decimal floating-point functionality first appeared in z/Architecture on the IBM System z9, which implemented these instructions with a combination of hardware and internal code (millicode); on the z10 processor, they are executed entirely in hardware, which improves performance for business applications requiring decimal computation.
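Why business applications want a DFU at all can be shown in software. The sketch below uses Python's decimal module purely as an illustration of decimal arithmetic of the IEEE 754R flavor; it is a stand-in, not the z10 hardware:

```python
from decimal import Decimal

# Binary floating point cannot represent 0.10 exactly, so repeated
# addition drifts away from the true sum -- a problem for monetary totals.
binary_total = sum(0.10 for _ in range(1000))
print(binary_total)                 # close to, but not exactly, 100.0

# Decimal arithmetic (the kind the DFU executes in hardware) keeps the
# same sum exact.
decimal_total = sum(Decimal("0.10") for _ in range(1000))
print(decimal_total)                # exactly 100.00
```

On z9 such operations trapped to millicode; executing them entirely in z10 hardware removes that overhead.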

Finally, the recovery unit (RU) maintains a complete copy of the processor's architected state, protected by error-correcting code (ECC). This state includes all z/Architecture registers as well as various mode and state registers used by hardware and millicode to implement z/Architecture functions. The RU gathers all hardware fault detection signals and supervises hardware recovery actions if these signals indicate a fault.

IV. HIGH-FREQUENCY DESIGN

The most distinctive element of the z10 core design relative to its predecessors is the large jump in operating frequency, from 1.7 GHz on System z9 to 4.4 GHz on z10. Beginning with the S/390 CMOS G4 processor in 1997, IBM mainframe CPU cores have had a cycle time of approximately 27 to 29 FO4 (fan-out-of-4 inverter delays) and a six-cycle pipeline (counting from instruction decode through register write-back). Through six generations of systems and silicon technology, the design team maintained that cycle time and pipeline depth while adding substantial function (for example, IEEE-compliant floating-point capability, branch target prediction, full 64-bit architecture extensions, superscalar operation, and cryptography). The design of the z10 processor, however, started from a clean sheet, aiming for a far shorter 15-FO4 cycle time while balancing performance, power, area, and design complexity considerations. This change required innovation in the design process, pipeline structure, and architecture implementation.
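The cycle-time arithmetic is easy to check with rough numbers. Assuming an illustrative FO4 inverter delay of about 15 ps for a 65-nm process (an assumed figure, not stated in this essay; only the FO4-per-cycle counts come from the text), a 15-FO4 cycle lands close to the z10's 4.4 GHz:

```python
# Back-of-envelope cycle-time arithmetic.  fo4_delay_ps is an assumed,
# illustrative figure for a 65-nm process; cycle time = FO4 per cycle
# multiplied by the FO4 delay, and frequency is its reciprocal.
fo4_delay_ps = 15.0

def freq_ghz(fo4_per_cycle):
    cycle_ps = fo4_per_cycle * fo4_delay_ps   # cycle time in picoseconds
    return 1000.0 / cycle_ps                  # 1000 ps per ns -> GHz

print(round(freq_ghz(15), 2))   # ~4.44 GHz at the z10's 15-FO4 design point
print(round(freq_ghz(29), 2))   # ~2.3 GHz at the predecessors' 29-FO4 cycle
```

Under this assumed FO4 delay, halving the FO4 count roughly doubles frequency, which matches the scale of the z9-to-z10 jump.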

A complete high-frequency design infrastructure was key to the success of the z10 project. Because the Power6 and z10 designs faced the same challenges and used the same process technology, the two projects could share this foundation, including low-level building blocks such as latches and data-flow components.

V. PIPELINE

Figure 3 shows the pipeline for the z10 core, from instruction decode through result write-back, for a typical FXU instruction. The IFU feeds instructions into this pipeline, where they proceed in program order.

The initial three cycles, D1, D2, and D3, parse and decode the instructions (up to two per cycle), identify inter-instruction dependencies, and deliver the results to the instruction queue (IQ) for execution and to the address generation queue (AQ) for storage operand access.

Figure 3. The z10 core pipeline for a typical FXU instruction. (D1-D3: parse and decode instructions, identify inter-instruction dependencies, and deliver results to the instruction queue and the address generation queue; G1-G3: determine stalls and group instructions for superscalar execution; AG, C1, C2, and XF: deliver operands for the instruction (or instruction pair) to the FXU; E1 and P1: generate results and condition code; P2 and P3: resolve conditional branches, write results to register files, check results for hardware faults, and forward results to the RU.)

The following three cycles, G1, G2, and G3, determine the required stalls between instructions and group pairs of instructions for superscalar execution when possible. When the instruction dependencies and the downstream pipeline allow, instructions are issued from the IQ and AQ to the FXU and LSU, respectively. In conjunction with the issue of an instruction from the AQ, the necessary access registers and general registers are read from the register files and supplied to the operand access controls.
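The pairing decision the grouping stages make can be sketched as a dependence check between adjacent instructions. This is a toy model and an assumption on my part; the real G1-G3 rules are far richer (they also weigh resource conflicts and instruction classes):

```python
# Toy sketch of superscalar grouping: pair two adjacent instructions for
# dual issue unless the second reads a register the first one writes.
def group(insns):
    """insns: list of (name, reads, writes) tuples; returns issue groups."""
    groups, i = [], 0
    while i < len(insns):
        if i + 1 < len(insns):
            _, _, writes = insns[i]
            _, reads, _ = insns[i + 1]
            if not (set(writes) & set(reads)):      # independent: pair them
                groups.append([insns[i][0], insns[i + 1][0]])
                i += 2
                continue
        groups.append([insns[i][0]])                # dependent: issue alone
        i += 1
    return groups

prog = [("LOAD", [], ["r1"]),
        ("ADD",  ["r1"], ["r2"]),    # reads r1, which LOAD writes
        ("SUB",  ["r3"], ["r4"]),
        ("MUL",  ["r5"], ["r6"])]
print(group(prog))    # [['LOAD'], ['ADD', 'SUB'], ['MUL']]
```

Pushing this grouping work into dedicated front-end stages is what lets the later stages of the z10 pipeline run without complex issue logic.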

Once an instruction (or instruction pair) has been issued from the IQ and AQ, it proceeds through the remainder of the pipeline without further stalls. If some condition, such as a cache or TLB miss, is detected that inhibits the instruction's completion, the instruction is not stalled; instead it is recycled to the grouping stages (G1 through G3) and reissued. This most commonly happens for TLB or cache directory misses on operand access, or for inter-instruction hazards that could not be detected at instruction decode time, such as address-based dependencies. This non-stalling design of the later portion of the pipeline avoids the need for a global stall signal and enables high-bandwidth instruction flow when the instruction sequence allows. In general, an instruction can be recycled and be ready for reissue by the time the recycling condition is resolved, so the effective performance is the same as it would be if the hazard had been precisely predicted upstream.
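The recycle-instead-of-stall idea can be modeled with a queue. This is an abstract sketch, not the hardware mechanism: a "hazard" here simply sends the instruction back for another pass rather than freezing everything behind it:

```python
from collections import deque

# Toy model of recycling: once issued, an instruction is never stalled
# in place.  If a hazard (e.g. a cache miss) is detected downstream, the
# instruction is sent back and reissued, so no global stall signal exists.
def run(instructions, hazards):
    """hazards: set of instruction names that miss on their first issue."""
    queue, completed, cycles = deque(instructions), [], 0
    while queue:
        insn = queue.popleft()
        cycles += 1
        if insn in hazards:         # hazard detected after issue:
            hazards.discard(insn)   # ...resolve it (e.g. the cache fill)
            queue.append(insn)      # ...and recycle the instruction
        else:
            completed.append(insn)
    return completed, cycles

done, cycles = run(["LOAD", "ADD", "STORE"], {"ADD"})
print(done, cycles)    # ADD is recycled once, so it completes last
```

Note that STORE completes while ADD is being recycled; in a stalling design, everything behind ADD would have waited.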

The following four cycles of the pipeline (AG, C1, C2, and XF) are responsible for delivering the operands for the instruction (or instruction pair) to the FXU. As with prior mainframe CPU designs, the pipeline is optimized for instructions that get one or both of their operands from the cache, and for the flow of operands from load instructions to subsequent register-operand instructions. This gives register-register, register-storage, and storage-storage instruction types the same pipeline timing, and it eliminates any pipeline latency for forwarding operands from load instructions. Performance studies have shown that this structure delivers robust performance across a broad range of programs, especially those optimized for the z/Architecture instruction set.

For instructions that require one or more operands from the data cache, or that will store a result to the cache, the AG, C1, and C2 cycles produce the operand address (or addresses) and access the TLB, cache directory, and cache array. The XF cycle is used to align, format, and transfer the operand data from the LSU to the FXU.

During the AG, C1, and C2 cycles, the FXU prepares for execution by performing additional FXU-specific decoding of the instruction, reading any register operands, and setting up controls for the FXU data flow. The actual execution happens in the E1 cycle, and results can be forwarded immediately to a dependent FXU instruction or to a dependent address generation. Some execution functions extend into the P1 cycle, including condition code generation. Finally, the P2 and P3 cycles resolve conditional branches, write results to register files, check results for hardware faults, and forward results to the RU.

The relatively large number of cycles required for decoding, grouping, and issuing instructions reflects both the complexity of the z/Architecture instruction set and a drive to push as much as possible of the associated hardware complexity into the front end of the pipeline. This minimizes the amount of work required in the subsequent stages, which are much more critical to the performance of a pipelined processor. This becomes more apparent through a comparison, shown in Figure 4, of the z10 pipeline with that used in its predecessors, from S/390 CMOS G4 through System z9. The total number of cycles from decode through write-back grew from 6 to 14, a ratio slightly larger than the FO4 ratio, reflecting the cost of extra latch levels in the z10 pipeline. All the growth, however, was in the front end of the pipeline and in the write-back cycles. In the pipeline's performance-critical core, which includes address generation, operand access, and FXU execution, the z10 pipeline is only one cycle longer than its predecessors. This is crucial to processor performance because most inter-instruction dependencies occur within this region, and the latencies to resolve these dependencies play a large part in determining the pipeline's efficiency.

VI. SPECIAL-FUNCTION COPROCESSOR

External to the microprocessor cores, the z10 processor includes a pair of coprocessor units. These units execute z/Architecture's data-compression and cryptographic acceleration functions, which are presented to software as conventional synchronous instructions. Each coprocessor serves two of the microprocessor cores and contains two compression engines (each with a 16-Kbyte local cache), a cryptographic cipher engine, and a cryptographic hash engine. This arrangement balances the area and power efficiency of a shared coprocessor against the need to minimize the performance impact of sharing in high-utilization workloads. Working together with millicode running in the microprocessor core, the z10 coprocessor can sustain throughput rates of 290 to 960 Mbytes/s for encryption (depending on the cryptographic protocol in use), up to 240 Mbytes/s for data compression, and up to 8.8 Gbytes/s for expansion of compressed data.
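What these coprocessor functions do can be mimicked in software, though nowhere near hardware rates. The sketch below uses Python's zlib and hashlib as illustrative stand-ins for the compression and hash engines; these libraries are substitutes chosen for this example, not the z10's actual engines:

```python
import hashlib
import zlib

# Software stand-ins for the coprocessor engines: zlib plays the role of
# a compression engine and hashlib a cryptographic hash engine.  On z10
# the equivalent work is performed by hardware as synchronous instructions.
data = b"mainframe workload record " * 1000

compressed = zlib.compress(data)
restored = zlib.decompress(compressed)     # "expansion" of compressed data
digest = hashlib.sha256(data).hexdigest()  # fixed-length integrity digest

print(len(data), "->", len(compressed))    # repetitive data shrinks sharply
print(digest[:16])                         # leading bytes of the hash
```

The asymmetry the essay quotes (240 Mbytes/s compressing versus 8.8 Gbytes/s expanding) mirrors the general rule visible even in software: decompression is far cheaper than compression.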

VII. SUMMARY

The z10 processor shares high-frequency design techniques and building blocks with Power6, but it has unique core, chip, and multiprocessor designs developed specifically for the System z enterprise data server. A z10 system scales to as many as 64 of these 4.4-GHz cores and supports up to 1.5 Tbytes of memory. The z10 is 50 percent faster than the z9, and up to 100 percent faster for CPU-intensive work; it is as powerful as roughly 1,500 x86 servers while using 85 percent less power and floor space. The z10 processor implements the CISC z/Architecture, and each core has a 64-Kbyte L1 instruction cache, a 128-Kbyte L1 data cache, a 3-Mbyte L2 cache, and accelerators for cryptography, data compression, and decimal floating-point arithmetic. The z10 supports 894 unique instructions, 75 percent of which are implemented entirely in hardware. This design provides a basis for extending the IBM mainframe platform through the next several generations of silicon technology. Future designs will build on this high-frequency foundation to improve pipeline efficiency, enhance hardware-software synergy and scalability, and further optimize performance within increasingly critical power-density constraints.

• Rating Table:

Attribute                                                          Rating
What is your confidence in your review?                                 7
What is the overall merit of the paper (should it be published)?        7
How novel is this research?                                             8
How sound is the science or technical merit?                            8
How interesting is this to the computer architecture community?         9
How important will this paper be over time?                             9
Is the writing acceptable?                                              9
How well does this paper work for a presentation?                       8

The table above gives my ratings for these attributes; in my view, they fairly reflect my assessment of this research paper. The paper presents the architecture of the z10 microprocessor, its core design, and its pipeline. As mentioned in the summary, this product and its techniques introduce many new functions, so they will surely be helpful in the future.


