A complete single-cycle RISC processor implementation in Verilog featuring an optimized ALU with advanced arithmetic algorithms
This project implements a fully functional 32-bit RISC processor with a custom instruction set (SimpleRISC ISA). The processor features a high-performance ALU utilizing Brent-Kung parallel-prefix adders, Radix-4 Booth multiplication with CSA trees, and non-restoring division, along with a complete Python-based assembler toolchain for program development and verification.
- ποΈ Complete Processor Core: Full single-cycle RISC implementation (Fetch β Decode β Execute β Memory β Writeback)
- β‘ Optimized ALU: Advanced arithmetic algorithms (Brent-Kung, Booth, Non-Restoring Division)
- π 21 Instructions: Arithmetic, logic, shifts, memory operations, branches, function calls
- π§ 16 Registers: General purpose (r0-r13) + Stack Pointer (SP/r14) + Return Address (RA/r15)
- π οΈ Python Toolchain: Custom assembler for ASM β HEX conversion
- β Fully Verified: Comprehensive testbenches and assembly test programs
π‘ Key Innovation: High-performance arithmetic unit balancing speed, area, and complexity while maintaining full ISA compatibility
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ-ββββ
β SimpleRISC Processor Core β
β β
β ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ β
β β β β β β β β β β
β β FETCH ββββ>β DECODE ββββ>β EXECUTE ββββ>β MEMORY β β
β β β β β β β β β β
β ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ β
β β β β β β
β β β β β β
β v v v v β
β βββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ β
β β Instr. β β Control β β ALU β β Data β β
β β Memory β β Unit β β Optimizedβ β Memory β β
β β (IMEM) β β β β β β (DMEM) β β
β βββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ β
β β β β β
β v v v β
β βββββββββββββββββββββββββββββββββββ β
β β Register File (16 regs) β β
β β r0-r13, SP(r14), RA(r15) β β
β βββββββββββββββββββββββββββββββββββ β
β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Program Counter (PC) & Control Logic β β
β β Branch Target Calculation | Return Address β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ-β
Each instruction completes in one clock cycle through five conceptual stages:
- Fetch (IF): Read instruction from instruction memory using PC
- Decode (ID): Extract opcode, registers, immediate values; read register file
- Execute (EX): Perform operation in ALU (arithmetic, logic, address calculation)
- Memory (MEM): Access data memory for load/store operations
- Writeback (WB): Write result back to register file
PC β IMEM β Instruction
β
[Decode Fields]
β
Control Unit β Control Signals
β
Register File β RS1, RS2
β
[Operand Selection: Reg vs Imm]
β
ALU β Result
β
[Memory Access if LD/ST]
β
[WB Mux: ALU vs MEM]
β
Register File β Write Data
Branch Logic β PC Next (PC+4 vs Branch Target)
β 32-bit Architecture
- 32-bit data path and address space
- Word-aligned memory access
- 4GB addressable memory
β 16 General-Purpose Registers
- r0-r13: General purpose
- r14 (SP): Stack pointer
- r15 (RA): Return address (link register)
β 21 Instructions
- 9 Arithmetic/Logic operations
- 3 Shift operations
- 2 Data movement operations
- 2 Memory operations
- 5 Control flow operations
β Single-Cycle Execution
- All instructions complete in one cycle
- Simplified control logic
- Predictable timing
β Harvard Architecture
- Separate instruction and data memories
- Simultaneous fetch and memory access
- No structural hazards
π§ 14 Operations (4-bit opcode):
- Addition, Subtraction (Brent-Kung adder)
- Multiplication (Radix-4 Booth + CSA tree)
- Division, Modulus (Non-restoring algorithm)
- Logical: AND, OR, XOR, NOT
- Shifts: SLL, SRL, SRA (Barrel shifters)
- Comparison: SLT (Set if Less Than)
- Data movement: PASS
module imem (
input [29:0] addr, // Word-aligned address (PC[31:2])
output [31:0] instr // 32-bit instruction
);Features:
- Read-only memory for program storage
- Initialized from hex file during simulation
- Word-addressed (bottom 2 bits of PC ignored)
- Synchronous/asynchronous read (implementation dependent)
module control_unit (
input [4:0] op, // 5-bit opcode
input Ibit, // Immediate flag
output [3:0] alu_op, // ALU operation
output use_imm, // Use immediate vs register
output mem_read, // Memory read enable
output mem_write, // Memory write enable
output wb_en, // Register writeback enable
output wb_from_mem, // Writeback from memory
// Branch control signals...
);Functionality:
- Decodes 5-bit instruction opcode
- Generates control signals for all processor components
- Handles special cases (MOV, NOT, CMP, branches)
- Maps instruction opcodes to ALU operations
Instruction β ALU Mapping:
OP_ADD β ALU_ADD
OP_MUL β ALU_MUL
OP_LD β ALU_ADD (address calculation)
OP_BEQ β Set branch flags (no ALU operation)module regfile (
input clk,
input we, // Write enable
input [3:0] rs1, rs2, // Source registers
input [3:0] rd, // Destination register
input [31:0] wdata, // Write data
output [31:0] rdata1, // RS1 data
output [31:0] rdata2 // RS2 data
);Features:
- 16 Γ 32-bit registers
- Dual read ports (RS1, RS2)
- Single write port (RD)
- Synchronous write, asynchronous read
- Special registers:
- r14 (SP): Stack operations
- r15 (RA): Function return addresses
The heart of the processor - See ALU Design Deep Dive for details.
module alu (
input [31:0] a, b, // Operands
input [3:0] op, // Operation selector
output [31:0] y, // Result
output zero // Zero flag
);Sub-components:
bk_adder32: Brent-Kung parallel-prefix addermul_tree32: Radix-4 Booth multiplier with CSA treediv_nonrestoring32_bk: Non-restoring dividersll32,srl32,sra32: Barrel shifters- Logic gates for AND, OR, XOR, NOT
module dmem (
input clk,
input re, // Read enable
input we, // Write enable
input [31:0] addr, // Address
input [31:0] wdata, // Write data
output [31:0] rdata // Read data
);Features:
- Read/write data memory
- Synchronous access (on clock edge)
- Word-addressed
- Stores program variables and stack
module immu (
input [17:0] imm18, // 18-bit immediate from instruction
output [31:0] immx // 32-bit sign-extended immediate
);Functionality:
- Sign-extends 18-bit immediate to 32 bits
- Used for:
- Immediate arithmetic (ADDI, SUBI, etc.)
- Memory offsets (LD, ST)
- Small constants
module branch_target (
input [26:0] off27, // 27-bit offset
input [31:0] pc, // Current PC
output [31:0] target // Branch target address
);Functionality:
- Calculates:
target = PC + (sign_extend(off27) << 2) - Supports PC-relative addressing
- Used for: BEQ, BGT, B, CALL
All instructions are 32 bits:
βββββββββββ¬ββββ¬βββββββ¬βββββββ¬βββββββ¬βββββββββββββββββββββ
β OP[4:0] β I β RD β RS1 β RS2/ β IMM/OFFSET β
β β β [3:0]β [3:0]β [3:0]β β
βββββββββββ΄ββββ΄βββββββ΄βββββββ΄βββββββ΄βββββββββββββββββββββ
31 27 26 25 22 21 18 17 14 13 0
Fields:
- OP[4:0]: 5-bit operation code
- I: Immediate flag (1 = immediate, 0 = register)
- RD[3:0]: Destination register
- RS1[3:0]: Source register 1
- RS2[3:0]: Source register 2 (if I=0)
- IMM[17:0]: 18-bit immediate (if I=1), sign-extended
- OFFSET[26:0]: Branch offset (for branches)
| Instruction | Opcode | Format | Description | Example |
|---|---|---|---|---|
ADD |
00000 |
R/I | Add | add r1, r2, r3 β r1 = r2 + r3 |
SUB |
00001 |
R/I | Subtract | sub r1, r2, r3 β r1 = r2 - r3 |
MUL |
00010 |
R/I | Multiply | mul r1, r2, r3 β r1 = r2 Γ r3 |
DIV |
00011 |
R/I | Divide | div r1, r2, r3 β r1 = r2 Γ· r3 |
MOD |
00100 |
R/I | Modulus | mod r1, r2, r3 β r1 = r2 % r3 |
| Instruction | Opcode | Format | Description | Example |
|---|---|---|---|---|
AND |
00110 |
R/I | Bitwise AND | and r1, r2, r3 β r1 = r2 & r3 |
OR |
00111 |
R/I | Bitwise OR | or r1, r2, r3 β r1 = r2 | r3 |
NOT |
01000 |
R/I | Bitwise NOT | not r1, r2 β r1 = ~r2 |
| Instruction | Opcode | Format | Description | Example |
|---|---|---|---|---|
LSL |
01010 |
R/I | Logical Shift Left | lsl r1, r2, 4 β r1 = r2 << 4 |
LSR |
01011 |
R/I | Logical Shift Right | lsr r1, r2, 4 β r1 = r2 >> 4 |
ASR |
01100 |
R/I | Arithmetic Shift Right | asr r1, r2, 4 β r1 = r2 >>> 4 |
| Instruction | Opcode | Format | Description | Example |
|---|---|---|---|---|
MOV |
01001 |
R/I | Move | mov r1, r2 β r1 = r2 |
CMP |
00101 |
R/I | Compare (sets flags) | cmp r1, r2 β flags = (r1 vs r2) |
| Instruction | Opcode | Format | Description | Example |
|---|---|---|---|---|
LD |
01110 |
I | Load | ld r1, r2, 100 β r1 = mem[r2+100] |
ST |
01111 |
I | Store | st r1, r2, 100 β mem[r2+100] = r1 |
Note: Store encoding is special - the value to store is in RD field, not RS2.
| Instruction | Opcode | Format | Description | Example |
|---|---|---|---|---|
BEQ |
10000 |
Branch | Branch if Equal | beq r1, r2, label β if (r1 == r2) goto label |
BGT |
10001 |
Branch | Branch if Greater | bgt r1, r2, label β if (r1 > r2) goto label |
B |
10010 |
Branch | Unconditional Branch | b label β goto label |
CALL |
10011 |
Branch | Function Call | call func β RA=PC+4; goto func |
RET |
10100 |
Branch | Return | ret β goto RA |
NOP |
01101 |
- | No Operation | nop β do nothing |
Register-Register (R-type):
add r1, r2, r3 # r1 = r2 + r3Register-Immediate (I-type):
addi r1, r2, 100 # r1 = r2 + 100Memory Access:
ld r1, r2, 100 # r1 = mem[r2 + 100] (base + offset)
st r1, r2, 100 # mem[r2 + 100] = r1PC-Relative Branch:
beq r1, r2, label # if (r1 == r2) PC = PC + offsetThe ALU is the performance-critical component of the processor, implementing all arithmetic and logical operations. It utilizes advanced algorithms to optimize speed, area, and power.
βββββββββββββββββββββββββββββββ
a[31:0] βββββββ>β β
β ALU Core β
b[31:0] βββββββ>β ββββ> y[31:0]
β - Brent-Kung Adder β
op[3:0] βββββββ>β - Booth Multiplier ββββ> zero
β - Non-Restoring Divider β
β - Barrel Shifters β
β - Logic Gates β
β β
βββββββββββββββββββββββββββββββ
Used for: ADD, SUB (via 2's complement)
Algorithm:
Stage 1: Pre-processing (Generate & Propagate)
P[i] = a[i] XOR b[i]
G[i] = a[i] AND b[i]
Stage 2: Prefix Tree (logβ(32) = 5 levels)
Level 1: Combine adjacent pairs
Level 2: Combine groups of 4
Level 3: Combine groups of 8
Level 4: Combine groups of 16
Level 5: Combine full width
Stage 3: Post-processing (Final Sum)
Sum[i] = P[i] XOR Carry[i-1]
Complexity:
- Time: O(log n) = 5 stages
- Area: O(n log n) gates
- Critical path: ~1.2 ns
Advantages:
- Balanced fan-out
- Low wire complexity
- Scalable to larger widths
- Faster than ripple-carry by ~6Γ
Subtraction Implementation:
wire [31:0] b_neg = ~b + 1; // Two's complement
// Use same adder: a + b_negUsed for: MUL
Phase 1: Sign Handling
sign_a = a[31]
sign_b = b[31]
mag_a = sign_a ? (~a + 1) : a // Absolute value
mag_b = sign_b ? (~b + 1) : b
result_sign = sign_a XOR sign_bPhase 2: Booth Encoding
Scans multiplier in overlapping 3-bit windows, reducing 32 partial products to 16:
| Window[2:0] | Partial Product |
|---|---|
| 000, 111 | 0 |
| 001, 010 | +multiplicand |
| 011 | +2Γmultiplicand |
| 100 | -2Γmultiplicand |
| 101, 110 | -multiplicand |
Phase 3: CSA Tree Reduction
16 partial products
β (3:2 CSA)
11 operands
β (3:2 CSA)
8 operands
β (3:2 CSA)
6 operands
β (3:2 CSA)
4 operands
β (3:2 CSA)
3 operands
β (3:2 CSA)
2 operands (sum + carry)
β (CPA - final addition)
64-bit product (take lower 32 bits)
Phase 4: Sign Restoration
product = result_sign ? (~intermediate + 1) : intermediateComplexity:
- Time: ~12-15 logic levels (CSA tree + final CPA)
- Area: Moderate (16 partial products + tree)
- Critical path: ~8-10 ns
Advantages:
- 50% reduction in partial products
- Parallel CSA tree
- Regular structure
- Good synthesis results
Used for: DIV, MOD
Algorithm: Fully unrolled 32-stage division
Initialization:
sign_dividend = dividend[31]
sign_divisor = divisor[31]
mag_dividend = sign_dividend ? (~dividend + 1) : dividend
mag_divisor = sign_divisor ? (~divisor + 1) : divisor
quotient_sign = sign_dividend XOR sign_divisor
remainder_sign = sign_dividendIteration (32 stages, fully combinational):
For i = 31 downto 0:
partial_remainder = (remainder << 1) | dividend[i]
if (partial_remainder >= 0):
remainder = partial_remainder - divisor (Brent-Kung adder)
quotient[i] = 1
else:
remainder = partial_remainder + divisor (Brent-Kung adder)
quotient[i] = 0
Correction & Sign Restoration:
if (remainder < 0):
remainder = remainder + divisor
quotient = quotient - 1
quotient = quotient_sign ? (~quotient + 1) : quotient
remainder = remainder_sign ? (~remainder + 1) : remainderDivision by Zero Handling:
if (divisor == 0):
quotient = 32'hFFFFFFFF
remainder = dividend
zero_flag = 1 (for DIV), 0 (for MOD)Complexity:
- Time: 32 stages Γ ~5 levels per adder = ~160 logic levels
- Area: 32 Γ 32-bit adder = Very large
- Critical path: ~38 ns (major bottleneck!)
Characteristics:
- β Single-cycle operation
- β No state machine required
- β Deterministic timing
- β Extremely long critical path
- β High area consumption
Used for: SLL (Shift Left Logical), SRL (Shift Right Logical), SRA (Shift Right Arithmetic)
Algorithm: Logarithmic shifter with 5 stages
Stage 0: shift by 0 or 1 (controlled by shift_amount[0])
Stage 1: shift by 0 or 2 (controlled by shift_amount[1])
Stage 2: shift by 0 or 4 (controlled by shift_amount[2])
Stage 3: shift by 0 or 8 (controlled by shift_amount[3])
Stage 4: shift by 0 or 16 (controlled by shift_amount[4])
Arithmetic Shift Right:
- Fill with sign bit:
{sign_bit, data[31:1]} - Preserves sign for negative numbers
Complexity:
- Time: O(log n) = 5 mux stages
- Area: O(n log n) multiplexers
- Critical path: ~1-2 ns
Implemented directly:
AND: y = a & b
OR: y = a | b
XOR: y = a ^ b
NOT: y = ~b
PASS: y = bComplexity: 1 logic level, negligible delay
// Perform signed subtraction
diff = a - b (using Brent-Kung subtractor)
// Check MSB (sign bit)
slt_result = diff[31] ? 1 : 0always @(*) begin
case (op)
`ALU_ADD: y = bk_adder_result;
`ALU_SUB: y = bk_subtractor_result;
`ALU_MUL: y = booth_multiplier_result;
`ALU_DIV: y = nonrestoring_quotient;
`ALU_MOD: y = nonrestoring_remainder;
`ALU_AND: y = a & b;
`ALU_OR: y = a | b;
`ALU_XOR: y = a ^ b;
`ALU_SLL: y = barrel_shift_left;
`ALU_SRL: y = barrel_shift_right_logical;
`ALU_SRA: y = barrel_shift_right_arithmetic;
`ALU_SLT: y = {31'b0, set_if_less_than};
`ALU_NOT: y = ~b;
`ALU_PASS: y = b;
default: y = 32'h0;
endcase
// Zero flag
zero = (y == 32'd0);
endThe SimpleRISC toolchain includes a complete Python-based assembler for converting assembly language to machine code.
Features:
- β Full ISA support (21 instructions)
- β Label resolution for branches
- β Immediate value encoding
- β Error detection and reporting
- β Hex file generation
- β Listing file generation
- β Disassembler support
Basic Assembly:
# Convert assembly to hex
make assemble
## if make is not working
python tools/asm.py program.asm program.hexComments:
# This is a comment
add r1, r2, r3 # Inline commentLabels:
main:
addi r1, r0, 10
loop:
subi r1, r1, 1
bgt r1, r0, loop # Branch to labelDirectives:
.text # Code section (default)
.data # Data sectionExample : Fibonacci Sequence
# fibonacci.asm - Calculate first 10 Fibonacci numbers
.text
movi r1, 0 # fib(0) = 0
movi r2, 1 # fib(1) = 1
movi r3, 10 # count = 10
movi r4, 0 # index = 0
loop:
beq r4, r3, done # if index == count, exit
# Store current number
st r1, r4, 0x1000 # mem[0x1000 + index*4] = fib(n)
# Calculate next
add r5, r1, r2 # next = fib(n) + fib(n-1)
mov r1, r2 # fib(n) = fib(n-1)
mov r2, r5 # fib(n-1) = next
addi r4, r4, 1 # index++
b loop
done:
halt
# Results in memory:
# mem[0x1000] = 0
# mem[0x1001] = 1
# mem[0x1002] = 1
# mem[0x1003] = 2
# mem[0x1004] = 3
# mem[0x1005] = 5
# mem[0x1006] = 8
# mem[0x1007] = 13
# mem[0x1008] = 21
# mem[0x1009] = 34R-Type (Register-Register):
Assembly: add r3, r1, r2
Binary: 00000 0 0011 0001 0010 00000000000000
Hex: 0x00C60000
Breakdown:
OP=00000 (ADD)
I=0 (use register)
RD=0011 (r3)
RS1=0001 (r1)
RS2=0010 (r2)
I-Type (Immediate):
Assembly: addi r1, r0, 42
Binary: 00000 1 0001 0000 000000000000101010
Hex: 0x4884002A
Breakdown:
OP=00000 (ADD)
I=1 (use immediate)
RD=0001 (r1)
RS1=0000 (r0)
IMM=42 (0x2A)
- Write Assembly Code
- Convert Assembly to Hex by assembler
- Run tb_simplerisc.v against your hex codes by:
# make command to run simplerisc
make test_simplerisc
## if make is not working
iverilog -g2012 -I rtl -o build/tb_simplerisc.vvp rtl/*.v tb/tb_simplerisc.v
vvp build/tb_simplerisc.vvp- Analyze the wave dumps by GTKWave / Verilog
| Component | Logic Levels | Estimated Delay | Max Frequency |
|---|---|---|---|
| Program Counter | 1 | ~0.5 ns | N/A |
| Instruction Fetch | 5 | ~2 ns | N/A |
| Decode | 2 | ~1 ns | N/A |
| ALU - ADD/SUB | 5 | ~1.2 ns | ~800 MHz |
| ALU - Logic | 1 | ~0.3 ns | ~3 GHz |
| ALU - Shifts | 5 | ~1.5 ns | ~650 MHz |
| ALU - MUL | 30 | ~10 ns | ~100 MHz |
| ALU - DIV | 160+ | ~38 ns | ~26 MHz |
| Memory Access | 10 | ~3 ns | N/A |
| Writeback | 1 | ~0.5 ns | N/A |
Worst Case (Division):
PC β IMEM β Decode β RegFile β ALU (DIV) β WB
0.5ns + 2ns + 1ns + 1ns + 38ns + 0.5ns = ~43 ns
Maximum Frequency: ~23 MHz
Without Division (Multiplication):
PC β IMEM β Decode β RegFile β ALU (MUL) β WB
0.5ns + 2ns + 1ns + 1ns + 10ns + 0.5ns = ~15 ns
Maximum Frequency: ~66 MHz
Simple Operations (ADD/Logic):
PC β IMEM β Decode β RegFile β ALU (ADD) β WB
0.5ns + 2ns + 1ns + 1ns + 1.2ns + 0.5ns = ~6.2 ns
Maximum Frequency: ~160 MHz
| Metric | Target | Achieved | Status |
|---|---|---|---|
| Full ALU | 250 MHz | ~23 MHz | β 10Γ too slow |
| Without DIV | 250 MHz | ~66 MHz | β 3.8Γ too slow |
| Basic Ops | 250 MHz | ~160 MHz |
The 250 MHz target (4 ns period) is difficult for several fundamental reasons:
-
Division Bottleneck: 38 ns critical path (9.5Γ target)
- 32 cascaded Brent-Kung adders
- Fully combinational implementation
- Unavoidable in single-cycle architecture
-
Multiplication Path: 10 ns (2.5Γ target)
- Could reach ~200 MHz with aggressive optimization
- Still short of 250 MHz
-
Single-Cycle Constraint:
- All operations must complete in one clock
- Cannot pipeline or use multi-cycle execution
- Limits optimization options
Solutions that would work:
- β Multi-cycle execution (DIV = 32 cycles, MUL = 3 cycles)
- β Pipelined ALU (3-5 stage pipeline)
- β Separate fast/slow execution paths
What doesn't work:
- β Single-cycle combinational DIV/MUL at 250 MHz
- β Faster algorithms (limited by physics)
simplerisc-processor/
βββ rtl/ # RTL source files
β βββ simplerisc_top.v # Top-level processor
β βββ alu.v # ALU (optimized)
β β βββ bk_adder32.v # Brent-Kung adder
β β βββ bk_subtractor32.v # Brent-Kung subtractor
β β βββ mul_tree32.v # Booth multiplier
β β βββ div_nonrestoring32_bk.v
β β βββ sll32.v # Barrel shifter (left)
β β βββ srl32.v # Barrel shifter (right logical)
β β βββ sra32.v # Barrel shifter (right arithmetic)
β β βββ slt.v # comparison of 2 var (smaller than)
β βββ control_unit.v # Instruction decoder
β βββ regfile.v # Register file
β βββ imem.v # Instruction memory
β βββ dmem.v # Data memory
β βββ immu.v # Immediate unit
β βββ decode.vh # ISA definitions
β βββ branch_target.v # Branch target calculator
β
βββ testbench/
β βββ alu_tb.v # ALU unit tests
β βββ tb_simplerisc.v # complete simplerisc test
β
βββ tools/
β βββ asm.py # ASM β HEX assembler
β
βββ programs/
β βββ test_arithmetic.asm # Arithmetic tests
β βββ test_memory.asm # Load/store tests
β βββ test_branches.asm # Control flow tests
β βββ fibonacci.asm # Fibonacci sequence
β βββ factorial.asm # Factorial calculation
β βββ bubble_sort.asm # Sorting algorithm
β βββ gcd.asm # GCD algorithm
|
βββ program.asm
βββ program.hex
βββ Makefile
βββ README.md
Choice: Single-Cycle
| Aspect | Single-Cycle | Multi-Cycle |
|---|---|---|
| CPI | Always 1 | 1-32 (variable) |
| Clock Frequency | ~23 MHz | 250+ MHz |
| Control Logic | Simple | Complex (FSM) |
| Throughput | 23 MIPS | 50-250 MIPS |
| Latency | Predictable | Variable |
Justification: Educational clarity, simple control, predictable timing
Choice: Brent-Kung
| Adder Type | Delay | Wiring | Area |
|---|---|---|---|
| Ripple-Carry | O(n) | Simple | Tiny |
| Brent-Kung | O(log n) | Moderate | Medium |
| Kogge-Stone | O(log n) | Complex | Large |
Justification: Best balance of speed, area, and routability
Choice: Radix-4 Booth + CSA Tree
Alternatives:
- Array multiplier: Simple but slow (O(nΒ²))
- Wallace tree: Slightly faster, more irregular
- Dadda tree: Similar to Wallace
- Sequential: Much smaller area, much slower
Justification: Good speed/area balance, regular structure
Choice: Fully Unrolled Non-Restoring
Alternatives:
- Sequential restoring: 32 cycles, tiny area
- SRT division: Complex, needs lookup tables
- Convergence division: Iterative, variable latency
Justification: Single-cycle requirement forced fully combinational design
The Problem: Creates massive critical path (~38 ns)
Better Alternatives (require architectural changes):
- Multi-cycle sequential divider (32 cycles, ~1 MHz throughput)
- Pipelined SRT divider (complex but high throughput)
-
Architecture Constrains Performance
- Single-cycle requirement fundamentally limits frequency
- Division/multiplication need pipelining for high performance
- Trade-off between CPI and clock frequency
-
Algorithm Selection Matters
- Brent-Kung: excellent adder choice
- Booth encoding: essential for multiplication
- Barrel shifters: simple and effective
- Non-restoring division: functional but too slow
-
Area-Speed Trade-offs
- Can't optimize all three: area, speed, power
- Division uses massive area for marginal performance gain
- Simple operations easily meet timing
-
Verification is Critical
- Assembly-level testing catches integration bugs
- Unit tests verify component correctness
- Edge cases reveal design issues
For Educational Projects:
- β Single-cycle: Simple, easy to understand
- β Basic operations: Meet timing easily
β οΈ Complex arithmetic: Use multi-cycle or omit
For Performance:
- β Multi-cycle execution for DIV (32 cycles)
- β 3-stage pipeline for MUL
- β Single-cycle for ADD/SUB/Logic
- β Target: 250+ MHz for simple operations
For Area Efficiency:
- β Sequential divider (tiny area)
- β Iterative multiplier
- β Ripple-carry adder (if speed not critical)
- Verilog (IEEE 1364-2005) - Hardware description
- Python 3.x - Toolchain development
- Assembly - SimpleRISC ISA
- Icarus Verilog - Open-source simulation
- ModelSim/VCS - Commercial simulation (optional)
- GTKWave - Waveform viewing
- Make - Build automation
| Metric | Value | Notes |
|---|---|---|
| Instruction Set | 21 instructions | Full RISC ISA |
| Data Path | 32-bit | Industry standard |
| Registers | 16 Γ 32-bit | Including SP, RA |
| Max Frequency (Full) | ~23 MHz | Limited by division |
| Max Frequency (No DIV) | ~66 MHz | Limited by multiplication |
| Max Frequency (Basic) | ~160 MHz | ADD/SUB/Logic only |
| CPI | 1.0 | Single-cycle execution |
| Functional Correctness | β 100% | All tests pass |