Superscalar Processors

This blog was written as part of a Computer Architecture course. The content is based on the following paper:

The Microarchitecture of Superscalar Processors, by Smith and Sohi. DOI: https://doi.org/10.1109/5.476078

Superscalar processors can process multiple instructions during the same clock cycle, effectively converting a program that appears sequential into a parallel one. The processing can be divided into the following major stages:

  • Several instructions are fetched and decoded at the same time. To keep the instruction stream uninterrupted, branch predictors are used to predict branch outcomes during the fetch stage.
  • The instruction stream is analyzed for true data dependences, and instructions are then distributed across functional units according to their type.
  • Instructions are executed in parallel as their operand data becomes available; the ordering imposed by the original program sequence is ignored. This requires adequate hardware resources.
  • Data is communicated through memory via load and store instructions, which need special handling.
  • Finally, instruction results are committed in the order of the original program sequence so that precise interrupts can be supported.

Pipelining vs. Superscalar

Pipelining, used to provide instruction-level parallelism (ILP), has an initiation rate of one instruction per cycle. Although improvements such as parallel functional units and dynamic instruction scheduling (using Tomasulo's algorithm) can be employed, such an architecture can still sustain at most one instruction per cycle. In contrast, a superscalar processor issues more than one instruction per cycle into multiple pipelines, breaking the single-instruction-per-cycle bottleneck.
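As a rough illustration of the gap, here is a back-of-the-envelope sketch in Python comparing an idealized scalar pipeline with an idealized 4-wide superscalar. The pipeline depth, instruction count, and the no-stall assumption are illustrative choices of mine, not figures from the paper.

```python
import math

def pipeline_cycles(n_instr, depth):
    # Scalar pipeline: `depth` cycles to fill, then one instruction per cycle.
    return depth + (n_instr - 1)

def superscalar_cycles(n_instr, depth, width):
    # Same fill latency, but up to `width` instructions complete per cycle,
    # assuming no stalls and no dependences.
    return depth + math.ceil(n_instr / width) - 1

n, depth = 1000, 5
print(pipeline_cycles(n, depth))        # 1004 cycles -> ~1.0 instructions/cycle
print(superscalar_cycles(n, depth, 4))  #  254 cycles -> ~3.9 instructions/cycle
```

In practice, dependences, cache misses, and branch mispredictions keep the sustained rate well below the issue width, which is what the mechanisms described below try to address.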

Micro-architecture

At a very high level, the micro-architecture/organization of a superscalar processor looks like this:

Figure 1: High-level organization of a superscalar processor

The micro-architecture of each of the major phases is discussed below:

  • Instruction Fetch and Branch Prediction. An instruction cache is used to reduce the latency of the instruction fetch process. Because branch instructions would otherwise interrupt the fetch stream, their outcomes need to be predicted. Branch prediction involves the following steps (a small sketch of a predictor appears after this list):
    • Recognizing that the instruction is a branch
    • Determining the branch outcome
    • Computing the branch target
    • Fetching from the branch target and redirecting fetch if the branch is taken

    Instructions executed as a result of branch prediction are marked as speculative. Once the actual outcome of the branch is known, either the speculative status is removed and the instructions are committed, or recovery actions are taken so that the process state is not incorrectly modified.

  • Instruction Decoding, Renaming and Dispatching. During this phase, true data dependences (RAW hazards) are detected, and false register hazards (WAW and WAR) are eliminated via register renaming. Register renaming maps the logical registers (those specified by the program) to physical registers drawn from a free list.

    One implementation of register renaming maintains a physical register file equal in size to the logical register file, plus a reorder buffer that stores the results of instructions that have completed but have not yet committed. Using the free list, a renamed destination is allocated new storage, so at any point a logical register may be mapped either to a physical register or to a location in the reorder buffer. (A minimal renaming sketch appears after this list; for simplicity it uses a single larger physical register file instead of the reorder-buffer variant described here.)

  • Instruction Issuing and Parallel Execution. This step determines which instructions can be issued and schedules them onto the corresponding functional units in parallel. Modern processors use reservation stations to support out-of-order execution and maximize performance. A reservation station stores the instruction to be executed along with a valid bit for each source operand indicating whether its data is available. When an instruction completes, its destination register is compared against the source registers of all reservation stations to resolve dependences.

  • Handling Memory Operations. Various optimization techniques are used to speed up memory accesses, including a hierarchy of caches, write-back buffers, miss status handling registers (MSHRs), and so on. TLBs and MMU caches are used to speed up virtual-to-physical address translation. A load/store queue is maintained to resolve memory dependences early and avoid hazards and processor stalls.

  • Committing State. Once the result of an instruction is known, it must eventually be made visible to the programmer. In out-of-order processors, results need to be committed in the original program order to allow precise interrupt handling; reorder buffers make this possible. An instruction is committed only when all levels of speculation affecting it have been resolved and all previous instructions have committed.
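To make the branch-prediction step above concrete, here is a minimal sketch of a classic 2-bit saturating-counter predictor indexed by the branch PC. The table size, indexing, and interface are simplifications of my own, not the specific design discussed in the paper.

```python
class TwoBitPredictor:
    """2-bit saturating counters: 0,1 predict not-taken; 2,3 predict taken."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [1] * entries          # start weakly not-taken

    def _index(self, pc):
        return (pc >> 2) % self.entries        # drop byte offset, fold into table

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.counters[i] = min(3, self.counters[i] + 1) if taken else max(0, self.counters[i] - 1)

# A mostly-taken loop branch: the predictor quickly learns "taken" and
# absorbs the single not-taken outcome without flipping its prediction.
bp = TwoBitPredictor()
for outcome in [True, True, True, False, True]:
    print(bp.predict(0x4000), outcome)
    bp.update(0x4000, outcome)
```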
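Similarly, the renaming step can be sketched in a few lines. The version below uses a single unified physical register file with a free list (rather than the reorder-buffer variant described above); the register counts and names are illustrative.

```python
class RenameTable:
    def __init__(self, num_logical, num_physical):
        # Initially, logical register r_i maps to physical register p_i.
        self.map = {f"r{i}": f"p{i}" for i in range(num_logical)}
        self.free_list = [f"p{i}" for i in range(num_logical, num_physical)]

    def rename(self, dst, srcs):
        # Sources read the *current* mapping, so true (RAW) dependences survive.
        renamed_srcs = [self.map[s] for s in srcs]
        # The destination gets a fresh physical register, which removes
        # WAW and WAR hazards on the logical destination.
        new_phys = self.free_list.pop(0)
        self.map[dst] = new_phys
        return new_phys, renamed_srcs

rt = RenameTable(num_logical=4, num_physical=8)
print(rt.rename("r1", ["r2", "r3"]))   # ('p4', ['p2', 'p3'])
print(rt.rename("r1", ["r1", "r2"]))   # ('p5', ['p4', 'p2'])  -- WAW on r1 removed
```

A physical register is returned to the free list once the processor is sure no in-flight instruction still reads it, a detail omitted here for brevity.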

The compiler can also be made aware of the superscalar architecture so that it can schedule code to maximize the performance gain. The following is the superscalar organization of the DEC Alpha 21164 (try to relate the stages discussed above to this organization):

Figure 2: Superscalar organization of the DEC Alpha 21164

RISC vs. CISC Power Wars

This blog was written as part of a Computer Architecture course. The content is based on the following paper:

Power Struggles: Revisiting the RISC vs. CISC Debate on Contemporary ARM and x86 Architectures, by Blem et al. DOI: http://dx.doi.org/10.1109/HPCA.2013.6522302

Motivation

The computational landscape is being significantly shaped by smartphones and tablets; it is no longer exclusive to desktops and servers. Moreover, area and chip design complexity are no longer the primary constraints; instead, energy and power consumption have started to dominate. Most tablets and smartphones run on ARM (a RISC ISA), while desktops and servers run on x86 (a CISC ISA), owing to the lower power consumption of ARM and the higher performance of x86. However, the two markets are starting to merge, which leads to the question: does the ISA play a role in performance, power, or energy efficiency?

RISC vs CISC

A RISC ISA consists of small, fixed-length instructions with simple encodings; they perform simple operations and provide only a few addressing modes. In contrast, a CISC ISA contains variable-length, complex instructions that can take multiple cycles to execute, and it tends to provide instructions for operations such as encryption and string manipulation. The variable-length instructions make the decode stage complex and can stall the pipeline. A benefit of CISC, however, is reduced code size.

To overcome the decoder latency, micro-op caches were introduced, which cache commonly decoded instructions; large instruction caches likewise mitigate the impact of differences in code density. Micro-ops are RISC-like instructions, so once CISC instructions are decomposed into micro-ops the effective code size of RISC and CISC becomes similar, and the latency per operation becomes similar as well.
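As a toy illustration of why a micro-op cache helps, the sketch below caches decoded micro-op sequences by instruction address so that repeated (e.g. loop) instructions skip the expensive variable-length decode. The decode costs and the three-micro-op split are made-up assumptions, not measurements from the paper.

```python
DECODE_COST = 3   # hypothetical cycles to fully decode a complex instruction
HIT_COST = 1      # hypothetical cycles to read the cached micro-ops instead

uop_cache = {}    # instruction address -> list of micro-ops

def fetch_decode(addr, decode_fn):
    """Return (micro_ops, cycles) for the instruction at addr."""
    if addr in uop_cache:
        return uop_cache[addr], HIT_COST
    uops = decode_fn(addr)
    uop_cache[addr] = uops
    return uops, DECODE_COST

# Pretend every complex instruction splits into load / operate / store micro-ops.
decoder = lambda addr: [("load", addr), ("op", addr), ("store", addr)]

# A tight loop re-executes the same three instructions many times.
trace = [0x10, 0x14, 0x18] * 100
total = sum(fetch_decode(a, decoder)[1] for a in trace)
print(f"decode cycles with micro-op cache:    {total}")                     # 306
print(f"decode cycles without micro-op cache: {len(trace) * DECODE_COST}")  # 900
```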

Experimental Results

Given the above comparison of RISC and CISC architectures, the performance, power, and energy impacts should be dominated by micro-architectural enhancements. The following are some of the findings of the paper:

  • Large performance gaps exist in terms of cycle counts between in-order and out-of-order processors. These gaps are micro-architectural and not due to the ISA.
  • The combination of instruction-count and instruction-mix findings leads to the conclusion that ISA effects are indistinguishable between x86 and ARM.
  • Micro-architecture has a significant impact on performance. Highly accurate branch predictors and large caches allow x86 implementations to perform better, but these are ISA-independent micro-architectural improvements.
  • The choice between power- and performance-optimized core designs impacts core power use more than the ISA does.
  • Because power consumption and performance are similar across RISC and CISC, energy efficiency is also similar.
  • Regardless of ISA or energy efficiency, high-performance processors require more power than lower-performance processors.

Conclusion

In conclusion, both ARM and x86 designs are driven by performance and energy-consumption requirements; there is nothing fundamentally more energy efficient about one ISA class or the other. It is the microarchitecture and the design methodology that really determine power consumption and performance, so the choice of ISA no longer matters much. Nowadays, ISAs evolve to better expose workload-specific semantic information to the execution substrate.

Simultaneous Multithreading

This blog was written as part of a Computer Architecture course. The content is based on the following paper:

Simultaneous Multithreading: Maximizing On-Chip Parallelism, by Tullsen et al. DOI: http://doi.acm.org/10.1145/225830.224449

Motivation

Simultaneous MultiThreading (SMT) is a technique that permits several independent hardware threads/contexts to issue instructions to multiple functional units each cycle. It aims to substantially increase processor utilization in the face of both long memory latencies and the limited parallelism available within a single thread on modern superscalar architectures. Because multiple-instruction issue is limited by dependences and long-latency operations within a single execution thread, the result is horizontal waste (issue slots left unused within a cycle) and vertical waste (cycles in which nothing issues at all). Experiments show that there is no single dominating cause of these wasted cycles, which leads to the conclusion that there can be no single dominant solution. Given the limitations of specific latency-hiding techniques, any dramatic increase in parallelism has to come from a general latency-hiding solution such as multithreading. However, existing multithreading techniques such as fine-grained multithreading fail to eliminate horizontal waste, and mostly convert vertical waste into horizontal waste. Hence, the goal of SMT is to eliminate both kinds of waste and thereby outperform existing multithreading techniques.
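To make the horizontal/vertical distinction concrete, the sketch below tallies both kinds of waste from a per-cycle issue trace; the issue width and the trace itself are made-up numbers, not data from the paper.

```python
ISSUE_WIDTH = 4
issued_per_cycle = [4, 2, 0, 1, 0, 3, 4, 0, 2, 1]   # hypothetical trace

# Vertical waste: cycles in which nothing issues at all.
vertical = sum(ISSUE_WIDTH for n in issued_per_cycle if n == 0)
# Horizontal waste: unused slots in cycles that issue at least one instruction.
horizontal = sum(ISSUE_WIDTH - n for n in issued_per_cycle if n > 0)
total_slots = ISSUE_WIDTH * len(issued_per_cycle)

print(f"vertical waste:   {vertical}/{total_slots} slots")
print(f"horizontal waste: {horizontal}/{total_slots} slots")
print(f"utilization:      {sum(issued_per_cycle) / total_slots:.0%}")
```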

Approaches to SMT

SMT requires scheduling instructions from multiple threads in each cycle. A fully dynamic approach appears too complex to implement (and would also increase scheduling time). As a result, limited dynamic execution is used: dependence-free instructions are issued in order to a scheduling window, from which they can be scheduled onto functional units out of order. Instructions are prioritized by how long they have been waiting to enter a functional unit. The following are several machine models of multithreading (a toy sketch of the issue policies appears after the list):

  • Fine-Grained Multithreading. Only one thread is allowed to issue instructions in a given cycle. Although that thread can use the entire issue width, there may be horizontal waste due to dependences. It does, however, eliminate all vertical waste.
  • Full Simultaneous Issue. All threads are allowed to compete for every issue slot in every cycle. This model represents the full potential of SMT.
  • N-Issue Simultaneous Multithreading. Here, each thread is limited to issuing at most N instructions per cycle. Filling the issue width therefore requires a certain minimum number of threads, and horizontal waste can result when too few threads are available.
  • Limited Connection Simultaneous Multithreading. Each hardware thread/context is connected to exactly one of each type of functional unit. The partitioning of functional units is therefore essentially static, and resources are under-utilized when the thread attached to a unit has no work for it.
  • Single-Chip Multiprocessing. The multiprocessor statically partitions the resources, devoting a fixed number of functional units to each thread. This contrasts with SMT, where the partitioning is dynamic and changes every cycle, which is why SMT outperforms single-chip multiprocessing.
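The toy sketch below contrasts how three of these models fill the issue slots of a single cycle, given how many ready, independent instructions each thread happens to have; the issue width, thread count, and readiness numbers are illustrative assumptions rather than configurations from the paper.

```python
ISSUE_WIDTH = 8
ready = [3, 1, 4, 0]   # ready instructions per hardware thread this cycle

def fine_grained(ready, width):
    # A single thread owns the whole issue width this cycle (round-robin in
    # practice; here we simply pick the first thread that has work).
    for n in ready:
        if n > 0:
            return min(n, width)
    return 0

def n_issue_smt(ready, width, n):
    # Every thread may issue at most n instructions, until the slots run out.
    issued = 0
    for r in ready:
        issued += min(r, n, width - issued)
    return issued

def full_simultaneous(ready, width):
    # All threads compete for every issue slot.
    return min(sum(ready), width)

print("fine-grained:      ", fine_grained(ready, ISSUE_WIDTH))        # 3 of 8 slots
print("2-issue SMT:       ", n_issue_smt(ready, ISSUE_WIDTH, 2))      # 5 of 8 slots
print("full simultaneous: ", full_simultaneous(ready, ISSUE_WIDTH))   # 8 of 8 slots
```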

Experiments show that full simultaneous issue performs best, and that performance decreases as N decreases in N-issue SMT. As the number of threads increases, fine-grained multithreading saturates in performance; the detailed results are presented in the paper.

Limitations and Improvements

Note that increasing the number of threads places a large strain on the caches and TLBs, and scheduling instructions also takes more time. The latter can affect the branch misprediction penalty, but the impact is insignificant compared with the benefits gained. The former can be addressed with private per-thread caches. Experiments show that a private instruction cache combined with a shared data cache gives the best performance for a large number of threads: private instruction caches make each thread less sensitive to the presence of other threads, while a shared data cache takes advantage of flexible cache partitioning. Moreover, a shared data cache can be beneficial in data-sharing environments.

Remarks and Conclusion

A Simultaneous MultiThreading architecture can achieve four times the instruction throughput of a single-threaded wide superscalar with the same issue width, and it outperforms fine-grained multithreading by a factor of two.

Although the paper gives extensive results across different architectural configurations, the scheduling logic and its impact on performance are not evaluated. Another idea to explore would be to dynamically partition the caches among hardware threads, which would prevent both under-utilization and thrashing of the caches. Finally, the impact of other hardware structures, such as branch predictors, on SMT could be explored.