

In document NATIONAL OPEN UNIVERSITY OF NIGERIA (Page 118-123)

UNIT 4: MIPS R4000

3.2 INSTRUCTION PIPELINE

With its simplified instruction architecture, the MIPS can achieve very efficient pipelining. It is instructive to look at the evolution of the MIPS pipeline, as it illustrates the evolution of RISC pipelining in general.

The initial experimental RISC systems and the first generation of commercial RISC processors achieved execution speeds that approach one instruction per system clock cycle. To improve on this, two alternative approaches have evolved. A superscalar architecture is one in which multiple independent instruction pipelines are used, so that several instructions can be executed simultaneously. A superpipelined architecture is one that makes use of more, and more fine-grained, pipeline stages. With more stages, more instructions can be in the pipeline at the same time, increasing parallelism.

Both approaches have limitations. With superscalar pipelining, dependencies between instructions in different pipelines can slow down the system. Also, overhead logic is required to coordinate these dependencies. With superpipelining, there is overhead associated with transferring instructions from one stage to the next.
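The effect of such dependencies can be illustrated with a small sketch. The Python fragment below (the instruction tuples and register names are invented for illustration) greedily packs instructions into superscalar issue groups and shows how a read-after-write dependency forces an extra group:

```python
# Sketch: how a read-after-write (RAW) dependency limits superscalar issue.
# Instructions are (dest, src1, src2) tuples; register names are invented.

def issue_groups(instructions, width=2):
    """Greedily pack instructions into issue groups of at most `width`,
    starting a new group whenever an instruction reads a register written
    earlier in the same group (a RAW dependency)."""
    groups, current, written = [], [], set()
    for dest, *srcs in instructions:
        if current and (len(current) == width or any(s in written for s in srcs)):
            groups.append(current)
            current, written = [], set()
        current.append((dest, *srcs))
        written.add(dest)
    if current:
        groups.append(current)
    return groups

independent = [("r1", "r2", "r3"), ("r4", "r5", "r6")]
dependent   = [("r1", "r2", "r3"), ("r4", "r1", "r5")]  # second reads r1

print(len(issue_groups(independent)))  # 1 group: both can issue together
print(len(issue_groups(dependent)))    # 2 groups: the dependency forces a split
```

In real hardware this check is done by interlock and scoreboard logic; the coordination overhead mentioned above is precisely this dependency checking.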

Chapter 14 is devoted to a study of superscalar architecture. The MIPS R4000 is a good example of a RISC-based superpipelined architecture.


Figure 3.1.5a shows the instruction pipeline of the R3000. In the R3000, the pipeline advances once per clock cycle. The MIPS compiler is able to reorder instructions to fill delay slots with useful code 70 to 90% of the time. All instructions follow the same sequence of five pipeline stages:

• Instruction fetch

• Source operand fetch from register file

• ALU operation or data operand address generation

• Data memory reference

• Write back into register file
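The overlap of these stages can be visualized with a simple sketch. The Python fragment below (stage names abbreviated; this is an idealized model, not the R3000's exact timing) prints which stage each instruction occupies in each clock cycle:

```python
# Idealized five-stage pipeline occupancy. "--" marks cycles in which an
# instruction is not in the pipeline; stage abbreviations are assumed.

STAGES = ["IF", "RF", "ALU", "MEM", "WB"]

def pipeline_diagram(n_instructions):
    """One row per instruction: the stage it occupies in each clock cycle.
    Instruction i enters at cycle i and advances one stage per cycle."""
    rows = []
    for i in range(n_instructions):
        rows.append(["--"] * i + STAGES + ["--"] * (n_instructions - 1 - i))
    return rows

for row in pipeline_diagram(4):
    print(" ".join(f"{s:>3}" for s in row))
```

With the pipeline full, one instruction completes per cycle: four instructions finish in 4 + 5 − 1 = 8 cycles rather than 20.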

As illustrated in Figure 3.1.5a, there is not only parallelism due to pipelining but also parallelism within the execution of a single instruction. The 60-ns clock cycle is divided into two 30-ns stages. The external instruction and data access operations to the cache each require 60 ns, as do the major internal operations (OP, DA, IA). Instruction decode is a simpler operation, requiring only a single 30-ns stage, overlapped with register fetch in the same instruction. Calculation of an address for a branch instruction also overlaps instruction decode and register fetch, so that a branch at instruction i can affect the ICACHE access of instruction i + 2. Similarly, a load at instruction i fetches data that are immediately used by the OP of instruction i + 1, while an ALU/shift result is passed directly into instruction i + 1 with no delay. This tight coupling between instructions makes for a highly efficient pipeline.
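The direct hand-off of an ALU result into the next instruction can be modelled as operand forwarding. The sketch below (instruction format and register values are invented for illustration) shows that, with a bypass of the not-yet-written-back result, instruction i + 1 sees the value produced by instruction i:

```python
# Sketch of operand forwarding: the result of instruction i is bypassed to
# instruction i + 1 before it has been written back to the register file.
# Instructions here are (dest, src1, src2) adds with invented register names.

def run(instructions):
    regs = {f"r{i}": i for i in range(8)}   # initial values: r0=0, r1=1, ...
    pending = None                          # result computed, not yet written back
    for dest, src1, src2 in instructions:
        def read(r):
            if pending and pending[0] == r: # forwarding path (bypass)
                return pending[1]
            return regs[r]                  # normal register file read
        value = read(src1) + read(src2)
        if pending:                         # write back of the previous result
            regs[pending[0]] = pending[1]
        pending = (dest, value)
    if pending:
        regs[pending[0]] = pending[1]
    return regs

regs = run([("r1", "r2", "r3"),   # r1 = 2 + 3 = 5
            ("r4", "r1", "r1")])  # reads r1 immediately: gets forwarded 5
print(regs["r1"], regs["r4"])     # 5 10
```

Without the bypass, the second instruction would read the stale value of r1 and compute 2 instead of 10.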

In detail, then, each clock cycle is divided into two separate stages, denoted φ1 and φ2. The functions performed in each stage are summarized in Table 3.1.5a.

The R4000 incorporates a number of technical advances over the R3000. The use of more advanced technology allows the clock cycle time to be cut in half, to 30 ns, and the access time to the register file to be cut in half. In addition, greater density on the chip enables the instruction and data caches to be incorporated on the chip. Before looking at the final R4000 pipeline, let us consider how the R3000 pipeline can be modified to improve performance using R4000 technology.

Figure 3.1.5b shows a first step. Remember that the cycles in this figure are half as long as those in Figure 3.1.5a. Because they are on the same chip, the instruction and data cache stages take only half as long, so they still occupy one clock cycle each. Again, because of the speedup of the register file access, register read and write still occupy only half of a clock cycle.

Because the R4000 caches are on-chip, the virtual-to-physical address translation can delay the cache access. This delay is reduced by implementing virtually indexed caches and going to a parallel cache access and address translation. Figure 3.1.5c shows the optimized R3000 pipeline with this improvement. Because of the compression of events, the data cache tag check is performed separately on the next cycle after cache access. This check determines whether the data item is in the cache.
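The reason a virtually indexed cache permits parallel access and translation can be shown with a small address calculation. In the sketch below the page size, line size, and number of sets are assumptions chosen so that the cache index bits fall entirely within the page offset, which translation leaves unchanged:

```python
# Sketch: a virtually indexed cache can be accessed in parallel with TLB
# translation when the index bits lie inside the page offset, because those
# bits are identical in the virtual and physical address. Sizes are assumed.

PAGE_BITS = 12   # 4 KB pages: the low 12 bits are unchanged by translation
LINE_BITS = 5    # 32-byte cache lines
SETS = 128       # 128 sets -> 7 index bits, occupying address bits 5..11

def cache_index(addr):
    return (addr >> LINE_BITS) & (SETS - 1)

# the highest index bit must fall below the page-offset boundary
INDEX_TOP_BIT = LINE_BITS + (SETS - 1).bit_length() - 1   # bit 11
assert INDEX_TOP_BIT < PAGE_BITS

va = 0x00403A64                                    # some virtual address
pa = (0x7FF << PAGE_BITS) | (va & ((1 << PAGE_BITS) - 1))  # same page offset
print(cache_index(va) == cache_index(pa))          # True: same set either way
```

The cache set can therefore be selected from the virtual address while the TLB produces the physical address; only the tag comparison must wait for translation, which is why the tag check appears as a separate step.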

In a superpipelined system, existing hardware is used several times per cycle by inserting pipeline registers to split up each pipe stage. Essentially, each superpipeline stage operates at a multiple of the base clock frequency, the multiple depending on the degree of superpipelining. The R4000 technology has the speed and density to permit superpipelining of degree 2. Figure 13.11a shows the optimized R3000 pipeline using this superpipelining. Note that this is essentially the same dynamic structure as Figure 3.1.5c.
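A rough throughput comparison makes the appeal of degree-2 superpipelining concrete. The sketch below uses the cycle times quoted in the text (60 ns halved to 30 ns) and ignores pipeline fill time and stalls, so the figures are upper bounds:

```python
# Steady-state throughput with the cycle times quoted in the text:
# 60 ns for the R3000-style pipeline, 30 ns after degree-2 superpipelining.
# One instruction completes per cycle once the pipeline is full.

def steady_state_mips(cycle_ns):
    """Millions of instructions per second at one instruction per cycle."""
    return 1e9 / cycle_ns / 1e6

base = steady_state_mips(60)            # ~16.7 MIPS
superpipelined = steady_state_mips(30)  # ~33.3 MIPS: twice the throughput
print(round(base, 1), round(superpipelined, 1))
```

Doubling the number of stages doubles peak throughput, at the cost of the stage-transfer overhead and the deeper-pipeline hazards discussed earlier.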

Further improvements can be made. For the R4000, a much larger and specialized adder was designed. This makes it possible to execute ALU operations at twice the rate.

Figure 13.11: Theoretical R3000 and Actual R4000 Superpipelines

The R4000 pipeline proceeds through the following stages:

Instruction fetch first half: Virtual address is presented to the instruction cache and the translation lookaside buffer (TLB).

Instruction fetch second half: Instruction cache outputs the instruction and the TLB generates the physical address.

Register file: Three activities occur in parallel:

• Instruction is decoded and a check is made for interlock conditions (i.e., whether this instruction depends on the result of a preceding instruction).

• Instruction cache tag check is made.

• Operands are fetched from the register file.

Instruction execute: One of three activities can occur:

• If the instruction is a register-to-register operation, the ALU performs the arithmetic or logical operation.

• If the instruction is a load or store, the data virtual address is calculated.

• If the instruction is a branch, the branch target virtual address is calculated and branch conditions are checked.

Data cache first half: Virtual address is presented to the data cache and the TLB.

Data cache second half: The TLB generates the physical address, and the data cache outputs the data.

Tag check: Cache tag checks are performed for loads and stores to determine whether there is a hit.

Stage abbreviations used in the figures: IF = instruction fetch first half; IS = instruction fetch second half; RF = fetch operands from register file; EX = instruction execute; DF = data cache first half; DS = data cache second half; TC = tag check; IC = instruction cache; DC = data cache.

Write back: Instruction result is written back to register file.
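The one-of-three choice made in the instruction execute stage can be sketched as a dispatch on the instruction class. The instruction records, register names, and the word-aligned branch-offset encoding below are invented for illustration; the real hardware makes this selection with control logic, not software:

```python
# Dispatch on instruction class in the execute stage. The three branches
# mirror the three activities listed above.

def execute(instr, regs, pc):
    kind = instr["kind"]
    if kind == "alu":                    # register-to-register operation
        return {"alu_result": instr["op"](regs[instr["rs"]], regs[instr["rt"]])}
    if kind in ("load", "store"):        # data virtual address calculation
        return {"vaddr": regs[instr["base"]] + instr["offset"]}
    if kind == "branch":                 # target address + condition check
        return {"target": pc + 4 + (instr["offset"] << 2),
                "taken": regs[instr["rs"]] == regs[instr["rt"]]}
    raise ValueError(f"unknown instruction kind: {kind}")

regs = {"r1": 8, "r2": 8, "r3": 100}
print(execute({"kind": "load", "base": "r3", "offset": 12}, regs, 0))
print(execute({"kind": "branch", "rs": "r1", "rt": "r2", "offset": 3}, regs, 0))
```

Only one of the three paths is exercised for any given instruction, which is why a single ALU can serve arithmetic, address generation, and branch-target calculation.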

4.0 CONCLUSION

This unit has motivated the key characteristics of RISC machines:

1. A limited instruction set with a fixed format

2. A large number of registers, or the use of a compiler that optimizes register usage

3. An emphasis on optimizing the instruction pipeline

5.0 SUMMARY

A RISC instruction set architecture also lends itself to the delayed branch technique, in which branch instructions are rearranged with other instructions to improve pipeline efficiency.
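The delay-slot filling mentioned above can be sketched as a simple reordering. The fragment below (instruction strings are purely illustrative, and no real dependence analysis is done) moves the instruction preceding a branch into the branch's delay slot:

```python
# Naive delay-slot filling: hoist the instruction just before a trailing
# branch into the branch's delay slot, so useful work executes during the
# cycle the branch needs to resolve. No dependence checking is performed.

def fill_delay_slot(block):
    """`block` is a list of instruction strings ending in a branch."""
    *body, branch = block
    assert branch.startswith("beq"), "sketch only handles blocks ending in beq"
    if body:
        slot = body.pop()                 # moved past the branch...
        return body + [branch, slot]      # ...into the delay slot
    return [branch, "nop"]                # nothing to move: pad with a nop

print(fill_delay_slot(["add r1, r2, r3",
                       "sub r4, r5, r6",
                       "beq r7, r8, target"]))
```

Because the delay-slot instruction executes whether or not the branch is taken, the compiler must choose one whose effect is wanted on both paths; when none exists, a nop fills the slot, which is why the 70 to 90% fill rate quoted earlier matters.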
