The original motivation for WaveScalar was to build a decentralized superscalar processor core. Initially, we examined each piece of a superscalar and tried to design a new, decentralized hardware algorithm for it. By decentralizing everything, we hoped we could design a truly scalable superscalar. It soon became apparent that instruction fetch is difficult to decentralize, because, by its very nature, a single program counter controls it. Our response was to make the processor fetch in data-driven rather than program counter-driven order. From there, our superscalar'' processor quickly became a small dataflow machine. The problem then became how to build a fully decentralized dataflow machine, WaveScalar and the WaveCache are the creative extension of this line of inquiry.

# Dataflow

Dataflow machines are perhaps the best studied alternative to von Neumann processors. The first dataflow architectures appeared in the mid to late 70's, and in the late 80's and early 90's there was a notable revival. Dataflow computers execute programs according to the dataflow firing rule, which stipulates that an instruction may execute at any time after its operands are available. Values in a dataflow machine generally carry a tag to distinguish them from other dynamic instances of the same variable. Tagged values usually reside in a specialized memory (the token store) while waiting for an instruction to consume them. There are, of course, many variations on this basic dataflow idea.

There have been two significant problems in the development of dataflow as a general purpose computing technology. The first is that the dataflow work of the late 80's and early 90's made it clear that high performance dataflow machines were difficult to build. Culler et. al. articulated this difficulty as a cost/benefit problem and argued that dataflow machines cannot keep sufficient data near the processor to support substantial parallelism. While the arguments were sound, they were based on the assumption that processing elements are expensive and that getting fast memory near the processing elements is difficult. Current technology allows us to build thousands of processing elements on a die and surround each with a small amount of fast storage. As a result, these arguments are no longer applicable.

The second stumbling block was dataflow's inability to efficiently provide total load/store ordering, the memory model assumed by most programming languages. To avoid this problem dataflow researchers resorted to special dataflow languages. While these languages excelled at expressing parallelism that dataflow machines could exploit, they were impractical, because they also disallowed side effects, mutable data structures, and many other programming constructs that are ubiquitous in languages like C, C++, and Java.

# The WaveScalar ISA

We propose a new approach to dataflow computing, WaveScalar, that provides load/store ordering and addresses the problems discussed in Section~\ref{sec:motivation}. Conceptually, a WaveScalar binary is the dataflow graph of an executable and resides in memory as a collection of \emph{intelligent} instruction words. Each instruction word is intelligent, because it has a dedicated functional unit. In practice, since placing a functional unit at each word of instruction memory is impractical, an intelligent instruction cache, a WaveCache, holds the current working set of instructions and executes them in place.

A WaveScalar executable contains an encoding of the program dataflow graph. In addition to normal RISC-like instructions, WaveScalar provides special instructions for managing control flow. In this respect, WaveScalar is similar to previous dataflow assembly languages.

## Dataflow instructions

Dataflow machines must convert control dependencies into data dependencies. To accomplish this, they explicitly send data values to the instructions that need them instead of broadcasting them via the register file. The potential consumers are known at compile time, but depending on control flow, only a subset of them should receive the values at run-time. There are two solutions to this problem, and different dataflow ISAs have used one or both.

The first solution is a conditional selector, or phi, instruction. These instructions take two input values and a boolean selector input and, depending on the selector, produce one of the inputs on their output. phi instructions are analogous to conditional moves and provide a form of predication. They are desirable because they remove the selector input from the critical path of some computations and therefore increase parallelism. They are also wasteful because they discard the unselected input.

The alternative is a conditional split, or rho instruction. The rho instruction takes an input value and a boolean output selector. It directs the input to one of two possible outputs depending on the selector value, effectively steering data values to the instructions that should receive them. These instructions correspond most directly with traditional branch instructions, and they are required for implementing loops.

The WaveScalar ISA supports both types of instructions, but our toolchain currently only supports rho instructions.

### Waves

The WaveScalar compiler breaks the control flow graph of an application into pieces called waves. Conceptually, a WaveScalar processor executes a wave at a time. The key properties of a wave are that (1) each time it executes, each instruction in the wave executes at most once, (2) the instructions in the wave are partially ordered (there are no loops), and (3) control can only enter at a single point. These properties allow the compiler to reason about memory ordering within a wave.

Formally, a wave is a connected, directed acyclic portion of the control flow graph with a single entrance. The WaveScalar compiler (or binary translator in our case) partitions an application into maximal waves and adds several wave management instructions (see below).

Waves are similar to hyper-blocks, but can be larger, because they can contain control flow joins. This reduces the amount of overhead due to wave management and makes parallelism easier to extract. In addition, simple loop unrolling is sufficient for generating large waves, whereas hyper-block generation requires heuristics for basic block selection and extensive code replication.

### Wave numbers

A significant source of complexity in WaveScalar is that instructions can operate on several instances of data simultaneously. For example, consider a loop. A traditional out-of-order machine can execute multiple iterations simultaneously, because instruction fetch creates a copy of each instruction for each iteration. In WaveScalar, the same processing element handles the instruction for all iterations. Therefore, some disambiguation must occur to ensure that the instruction operates on values from one iteration at a time.

Traditional dataflow machines use tags to identify different dynamic instances. In WaveScalar every data value carries a tag. We aggregate tag management across waves and use wave numbers to differentiate between dynamic waves. A special instruction, waveadvance, manages wave numbers. The waveadvance instruction takes a data value as input, increments the wave number and outputs the original data value with the updated wave number. Because waveadvance is such a simple operation, it can be combined with other instructions. For instance, WaveScalar has an Add-With-Wave-Advance instruction.

At the top of each wave there is a waveadvance node for each of the wave's live input values. These nodes reside at the entrance to the wave and increment the wave numbers for each input value. As they leave the waveadvance instructions, all values have the same wave number, since they all came from the same previous wave. In the case of a loop, the values of one iteration percolate through the loop body, and the back-edges to the top of the loop lead to the waveadvance instructions. These compute the wave number for the next iteration and ensure that each iteration has a different wave number.

A key feature of WaveScalar is that the waveadvance instructions allow wave-number management to be entirely distributed and under software control. This is in contrast to traditional dataflow machines in which tag creation is either partially distributed or completely centralized. In the future, we intend to exploit this fact to optimize WaveScalar binaries by creating application-specific tagging schemes.

### Indirect jumps

Modern systems rely upon object linking and shared libraries, and many programs rely upon indirect function calls. Supporting these constructs requires an additional instruction, indirectsend, with three inputs: a data value (i.e., a function argument), an address, and an offset (which is statically encoded into the instruction). It sends the value to the consumer instruction located at the address plus the offset.

Using this instruction, we can both call a function and return values. Each argument to the function is passed through its own indirectsend instruction. At the start of a function, a set of instructions receives these operands and starts function execution. The caller sends the return address to provide the target address for an indirectsend that returns the function's result. The address of the called function need not be known at compile time. A very similar mechanism allows for indirect jumps.

## Memory ordering

Traditional imperative languages provide the programmer with a model of memory known as total load-store ordering.'' WaveScalar brings load-store ordering to dataflow computing using wave-ordered memory. Wave-ordered memory annotates each memory operation with its location in its wave and its ordering relationships (defined by the control flow graph) with other memory operations in the same wave. As the memory operations execute, these annotations travel with the memory requests and allow the memory system to apply the memory operations in the correct order.

To annotate memory instructions, the WaveScalar compiler statically assigns a unique (within a wave) sequence number to each memory operation by traversing the wave's control flow graph in breadth-first order. Within a basic block, memory operations receive consecutive sequence numbers. By assigning sequence numbers in this way, the compiler ensures that sequence numbers increase along any path through the wave. Next, the compiler labels each memory operation with the sequence numbers of its predecessor and successor memory operations, if they can be uniquely determined. Branches and joins may make this impossible, because they can create multiple successors or predecessors for a single memory operation. In these cases, the compiler uses a special wild-card value, ?', instead. The combination of an instruction's sequence number and the predecessor and successor sequence numbers form a link, which we denote .

The links encode the structure of the wave's control flow graph, and the memory system uses this information to apply operations in the correct order and enforce the load-store ordering the programmer expects. When a memory instruction executes, it sends its link, its wave number (taken from an input value), an address, and data (for a store) to the memory. The memory system uses this information to assemble the loads and stores in the correct order to ensure that there are no gaps in the sequence. This is possible because the current wave number in combination with the memory instruction's sequence number totally orders the memory operations through any traversal of a wave, and, by extension, an application. The memory system uses the predecessor and successor information to detect gaps in the sequence. The memory system can be sure that a gap does not exist, if, for each memory operation, M, in the sequence, either M's succ number matches the sequence number of its next operation, or M's succ is ?' and the next operation's pred field matches M's sequence number.

To ensure that gap detection is always possible, we must enforce the no gap rule: No path through the program may contain a pair of memory operations in which the first operation's succ value and the second operation's pred value are both `?'. If a violation of the rule occurs, the compiler adds a memnoop instructions to remove the ambiguity. These instructions participate in memory ordering but otherwise have no effect. In our experiments, memnoop's are rare (fewer than 3\% of instructions).

Wave-ordered memory is the key to efficiently executing programs written in conventional languages. It allows WaveScalar to separate memory ordering from control flow by succinctly annotating the memory operations with information about their location in the control flow graph. The processing elements are freed from managing implicit dependencies through memory and can treat memory operations just like other instructions. The sequencing information included with memory requests provides a concise summary of the path taken through the program. The memory system can use this summary in a variety of ways. Currently, we assume a wave-ordered store buffer. Alternatively, a speculative memory system could use the ordering to detect misspeculations.

The most exciting aspect of wave-ordered memory is the incorporation of instruction order into the instruction set as a first-class entity. However, the above scheme for assigning sequence information only encodes information about dependencies between memory operations. It is possible to devise a more general scheme that also expresses independence among memory operations (i.e., that two memory operation can proceed in any order). An ISA that incorporates such a scheme could use memory aliasing information from the compiler to expose large amount of memory parallelism to the hardware. Designing such an ISA and building a suitable compiler are the subject of ongoing work.