🚀 SimpleRISC Processor Core

A complete single-cycle RISC processor implementation in Verilog featuring an optimized ALU with advanced arithmetic algorithms

This project implements a fully functional 32-bit RISC processor with a custom instruction set (SimpleRISC ISA). The processor features a high-performance ALU utilizing Brent-Kung parallel-prefix adders, Radix-4 Booth multiplication with CSA trees, and non-restoring division, along with a complete Python-based assembler toolchain for program development and verification.

🎯 Project Highlights

🏗️ Complete Processor Core: Full single-cycle RISC implementation (Fetch → Decode → Execute → Memory → Writeback)
⚡ Optimized ALU: Advanced arithmetic algorithms (Brent-Kung, Booth, Non-Restoring Division)
📝 21 Instructions: Arithmetic, logic, shifts, memory operations, branches, function calls
🔧 16 Registers: General purpose (r0-r13) + Stack Pointer (SP/r14) + Return Address (RA/r15)
🛠️ Python Toolchain: Custom assembler for ASM → HEX conversion
✅ Fully Verified: Comprehensive testbenches and assembly test programs

💡 Key Innovation: High-performance arithmetic unit balancing speed, area, and complexity while maintaining full ISA compatibility

🏗️ Architecture Overview

High-Level Block Diagram

┌───────────────────────────────────────────────────────────────-───┐
│                    SimpleRISC Processor Core                      │
│                                                                   │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐     │
│  │          │    │          │    │          │    │          │     │
│  │  FETCH   │───>│  DECODE  │───>│ EXECUTE  │───>│  MEMORY  │     │
│  │          │    │          │    │          │    │          │     │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘     │
│       │               │                │               │          │
│       │               │                │               │          │
│       v               v                v               v          │
│  ┌─────────┐    ┌──────────┐    ┌──────────┐   ┌──────────┐       │
│  │ Instr.  │    │ Control  │    │   ALU    │   │   Data   │       │
│  │ Memory  │    │   Unit   │    │ Optimized│   │  Memory  │       │
│  │ (IMEM)  │    │          │    │          │   │  (DMEM)  │       │
│  └─────────┘    └──────────┘    └──────────┘   └──────────┘       │
│                       │                │               │          │
│                       v                v               v          │
│                  ┌─────────────────────────────────┐              │
│                  │      Register File (16 regs)    │              │
│                  │  r0-r13, SP(r14), RA(r15)       │              │
│                  └─────────────────────────────────┘              │
│                                                                   │
│  ┌────────────────────────────────────────────────────────────┐   │
│  │              Program Counter (PC) & Control Logic          │   │
│  │         Branch Target Calculation | Return Address         │   │
│  └────────────────────────────────────────────────────────────┘   │
│                                                                   │
└──────────────────────────────────────────────────────────────────-┘

Single-Cycle Execution Model

Each instruction completes in one clock cycle through five conceptual stages:

Fetch (IF): Read instruction from instruction memory using PC
Decode (ID): Extract opcode, registers, immediate values; read register file
Execute (EX): Perform operation in ALU (arithmetic, logic, address calculation)
Memory (MEM): Access data memory for load/store operations
Writeback (WB): Write result back to register file

Data Path Flow

PC → IMEM → Instruction
           ↓
     [Decode Fields]
           ↓
   Control Unit → Control Signals
           ↓
    Register File → RS1, RS2
           ↓
    [Operand Selection: Reg vs Imm]
           ↓
         ALU → Result
           ↓
    [Memory Access if LD/ST]
           ↓
    [WB Mux: ALU vs MEM]
           ↓
    Register File ← Write Data

Branch Logic → PC Next (PC+4 vs Branch Target)

✨ Features

Processor Capabilities

✅ 32-bit Architecture

32-bit data path and address space
Word-aligned memory access
4GB addressable memory

✅ 16 General-Purpose Registers

r0-r13: General purpose
r14 (SP): Stack pointer
r15 (RA): Return address (link register)

✅ 21 Instructions

9 Arithmetic/Logic operations
3 Shift operations
2 Data movement operations
2 Memory operations
5 Control flow operations

✅ Single-Cycle Execution

All instructions complete in one cycle
Simplified control logic
Predictable timing

✅ Harvard Architecture

Separate instruction and data memories
Simultaneous fetch and memory access
No structural hazards

ALU Capabilities

🔧 14 Operations (4-bit opcode):

Addition, Subtraction (Brent-Kung adder)
Multiplication (Radix-4 Booth + CSA tree)
Division, Modulus (Non-restoring algorithm)
Logical: AND, OR, XOR, NOT
Shifts: SLL, SRL, SRA (Barrel shifters)
Comparison: SLT (Set if Less Than)
Data movement: PASS

🧩 Processor Components

1. Instruction Memory (IMEM)

module imem (
    input  [29:0] addr,     // Word-aligned address (PC[31:2])
    output [31:0] instr     // 32-bit instruction
);

Features:

Read-only memory for program storage
Initialized from hex file during simulation
Word-addressed (bottom 2 bits of PC ignored)
Synchronous/asynchronous read (implementation dependent)

2. Control Unit

module control_unit (
    input  [4:0] op,        // 5-bit opcode
    input        Ibit,      // Immediate flag
    output [3:0] alu_op,    // ALU operation
    output       use_imm,   // Use immediate vs register
    output       mem_read,  // Memory read enable
    output       mem_write, // Memory write enable
    output       wb_en,     // Register writeback enable
    output       wb_from_mem, // Writeback from memory
    // Branch control signals...
);

Functionality:

Decodes 5-bit instruction opcode
Generates control signals for all processor components
Handles special cases (MOV, NOT, CMP, branches)
Maps instruction opcodes to ALU operations

Instruction → ALU Mapping:

OP_ADD  → ALU_ADD
OP_MUL  → ALU_MUL
OP_LD   → ALU_ADD  (address calculation)
OP_BEQ  → Set branch flags (no ALU operation)

3. Register File

module regfile (
    input         clk,
    input         we,          // Write enable
    input  [3:0]  rs1, rs2,    // Source registers
    input  [3:0]  rd,          // Destination register
    input  [31:0] wdata,       // Write data
    output [31:0] rdata1,      // RS1 data
    output [31:0] rdata2       // RS2 data
);

Features:

16 × 32-bit registers
Dual read ports (RS1, RS2)
Single write port (RD)
Synchronous write, asynchronous read
Special registers:
- r14 (SP): Stack operations
- r15 (RA): Function return addresses

4. Arithmetic Logic Unit (ALU) ⭐

The heart of the processor - See ALU Design Deep Dive for details.

module alu (
    input  [31:0] a, b,     // Operands
    input  [3:0]  op,       // Operation selector
    output [31:0] y,        // Result
    output        zero      // Zero flag
);

Sub-components:

bk_adder32: Brent-Kung parallel-prefix adder
mul_tree32: Radix-4 Booth multiplier with CSA tree
div_nonrestoring32_bk: Non-restoring divider
sll32, srl32, sra32: Barrel shifters
Logic gates for AND, OR, XOR, NOT

5. Data Memory (DMEM)

module dmem (
    input         clk,
    input         re,          // Read enable
    input         we,          // Write enable
    input  [31:0] addr,        // Address
    input  [31:0] wdata,       // Write data
    output [31:0] rdata        // Read data
);

Features:

Read/write data memory
Synchronous access (on clock edge)
Word-addressed
Stores program variables and stack

6. Immediate Unit (IMMU)

module immu (
    input  [17:0] imm18,    // 18-bit immediate from instruction
    output [31:0] immx      // 32-bit sign-extended immediate
);

Functionality:

Sign-extends 18-bit immediate to 32 bits
Used for:
- Immediate arithmetic (ADDI, SUBI, etc.)
- Memory offsets (LD, ST)
- Small constants

7. Branch Target Calculator

module branch_target (
    input  [26:0] off27,    // 27-bit offset
    input  [31:0] pc,       // Current PC
    output [31:0] target    // Branch target address
);

Functionality:

Calculates: target = PC + (sign_extend(off27) << 2)
Supports PC-relative addressing
Used for: BEQ, BGT, B, CALL

📋 Instruction Set Architecture

Instruction Format

All instructions are 32 bits:

┌─────────┬───┬──────┬──────┬──────┬────────────────────┐
│ OP[4:0] │ I │ RD   │ RS1  │ RS2/ │   IMM/OFFSET       │
│         │   │ [3:0]│ [3:0]│ [3:0]│                    │
└─────────┴───┴──────┴──────┴──────┴────────────────────┘
  31   27  26  25 22  21 18  17 14  13                 0

Fields:
- OP[4:0]:    5-bit operation code
- I:          Immediate flag (1 = immediate, 0 = register)
- RD[3:0]:    Destination register
- RS1[3:0]:   Source register 1
- RS2[3:0]:   Source register 2 (if I=0)
- IMM[17:0]:  18-bit immediate (if I=1), sign-extended
- OFFSET[26:0]: Branch offset (for branches)

Complete Instruction Set

Arithmetic Operations

Instruction	Opcode	Format	Description	Example
`ADD`	`00000`	R/I	Add	`add r1, r2, r3` → r1 = r2 + r3
`SUB`	`00001`	R/I	Subtract	`sub r1, r2, r3` → r1 = r2 - r3
`MUL`	`00010`	R/I	Multiply	`mul r1, r2, r3` → r1 = r2 × r3
`DIV`	`00011`	R/I	Divide	`div r1, r2, r3` → r1 = r2 ÷ r3
`MOD`	`00100`	R/I	Modulus	`mod r1, r2, r3` → r1 = r2 % r3

Logical Operations

Instruction	Opcode	Format	Description	Example
`AND`	`00110`	R/I	Bitwise AND	`and r1, r2, r3` → r1 = r2 & r3
`OR`	`00111`	R/I	Bitwise OR	`or r1, r2, r3` → r1 = r2 \| r3
`NOT`	`01000`	R/I	Bitwise NOT	`not r1, r2` → r1 = ~r2

Shift Operations

Instruction	Opcode	Format	Description	Example
`LSL`	`01010`	R/I	Logical Shift Left	`lsl r1, r2, 4` → r1 = r2 << 4
`LSR`	`01011`	R/I	Logical Shift Right	`lsr r1, r2, 4` → r1 = r2 >> 4
`ASR`	`01100`	R/I	Arithmetic Shift Right	`asr r1, r2, 4` → r1 = r2 >>> 4

Data Movement

Instruction	Opcode	Format	Description	Example
`MOV`	`01001`	R/I	Move	`mov r1, r2` → r1 = r2
`CMP`	`00101`	R/I	Compare (sets flags)	`cmp r1, r2` → flags = (r1 vs r2)

Memory Operations

Instruction	Opcode	Format	Description	Example
`LD`	`01110`	I	Load	`ld r1, r2, 100` → r1 = mem[r2+100]
`ST`	`01111`	I	Store	`st r1, r2, 100` → mem[r2+100] = r1

Note: Store encoding is special - the value to store is in RD field, not RS2.

Control Flow

Instruction	Opcode	Format	Description	Example
`BEQ`	`10000`	Branch	Branch if Equal	`beq r1, r2, label` → if (r1 == r2) goto label
`BGT`	`10001`	Branch	Branch if Greater	`bgt r1, r2, label` → if (r1 > r2) goto label
`B`	`10010`	Branch	Unconditional Branch	`b label` → goto label
`CALL`	`10011`	Branch	Function Call	`call func` → RA=PC+4; goto func
`RET`	`10100`	Branch	Return	`ret` → goto RA
`NOP`	`01101`	-	No Operation	`nop` → do nothing

Addressing Modes

Register-Register (R-type):

add r1, r2, r3    # r1 = r2 + r3

Register-Immediate (I-type):

addi r1, r2, 100  # r1 = r2 + 100

Memory Access:

ld  r1, r2, 100   # r1 = mem[r2 + 100]  (base + offset)
st  r1, r2, 100   # mem[r2 + 100] = r1

PC-Relative Branch:

beq r1, r2, label # if (r1 == r2) PC = PC + offset

⚡ ALU Design Deep Dive

The ALU is the performance-critical component of the processor, implementing all arithmetic and logical operations. It utilizes advanced algorithms to optimize speed, area, and power.

Architecture

                    ┌─────────────────────────────┐
    a[31:0] ───────>│                             │
                    │         ALU Core            │
    b[31:0] ───────>│                             │───> y[31:0]
                    │  - Brent-Kung Adder         │
    op[3:0] ───────>│  - Booth Multiplier         │───> zero
                    │  - Non-Restoring Divider    │
                    │  - Barrel Shifters          │
                    │  - Logic Gates              │
                    │                             │
                    └─────────────────────────────┘

Component Breakdown

1. Brent-Kung Parallel-Prefix Adder

Used for: ADD, SUB (via 2's complement)

Algorithm:

Stage 1: Pre-processing (Generate & Propagate)
  P[i] = a[i] XOR b[i]
  G[i] = a[i] AND b[i]

Stage 2: Prefix Tree (log₂(32) = 5 levels)
  Level 1: Combine adjacent pairs
  Level 2: Combine groups of 4
  Level 3: Combine groups of 8
  Level 4: Combine groups of 16
  Level 5: Combine full width

Stage 3: Post-processing (Final Sum)
  Sum[i] = P[i] XOR Carry[i-1]

Complexity:

Time: O(log n) = 5 stages
Area: O(n log n) gates
Critical path: ~1.2 ns

Advantages:

Balanced fan-out
Low wire complexity
Scalable to larger widths
Faster than ripple-carry by ~6×

Subtraction Implementation:

wire [31:0] b_neg = ~b + 1;  // Two's complement
// Use same adder: a + b_neg

2. Radix-4 Booth Multiplier with CSA Tree

Used for: MUL

Phase 1: Sign Handling

sign_a = a[31]
sign_b = b[31]
mag_a = sign_a ? (~a + 1) : a    // Absolute value
mag_b = sign_b ? (~b + 1) : b
result_sign = sign_a XOR sign_b

Phase 2: Booth Encoding

Scans multiplier in overlapping 3-bit windows, reducing 32 partial products to 16:

Window[2:0]	Partial Product
000, 111	0
001, 010	+multiplicand
011	+2×multiplicand
100	-2×multiplicand
101, 110	-multiplicand

Phase 3: CSA Tree Reduction

16 partial products
   ↓ (3:2 CSA)
11 operands
   ↓ (3:2 CSA)
 8 operands
   ↓ (3:2 CSA)
 6 operands
   ↓ (3:2 CSA)
 4 operands
   ↓ (3:2 CSA)
 3 operands
   ↓ (3:2 CSA)
 2 operands (sum + carry)
   ↓ (CPA - final addition)
64-bit product (take lower 32 bits)

Phase 4: Sign Restoration

product = result_sign ? (~intermediate + 1) : intermediate

Complexity:

Time: ~12-15 logic levels (CSA tree + final CPA)
Area: Moderate (16 partial products + tree)
Critical path: ~8-10 ns

Advantages:

50% reduction in partial products
Parallel CSA tree
Regular structure
Good synthesis results

3. Non-Restoring Divider

Used for: DIV, MOD

Algorithm: Fully unrolled 32-stage division

Initialization:

sign_dividend = dividend[31]
sign_divisor = divisor[31]
mag_dividend = sign_dividend ? (~dividend + 1) : dividend
mag_divisor = sign_divisor ? (~divisor + 1) : divisor
quotient_sign = sign_dividend XOR sign_divisor
remainder_sign = sign_dividend

Iteration (32 stages, fully combinational):

For i = 31 downto 0:
    partial_remainder = (remainder << 1) | dividend[i]
    
    if (partial_remainder >= 0):
        remainder = partial_remainder - divisor    (Brent-Kung adder)
        quotient[i] = 1
    else:
        remainder = partial_remainder + divisor    (Brent-Kung adder)
        quotient[i] = 0

Correction & Sign Restoration:

if (remainder < 0):
    remainder = remainder + divisor
    quotient = quotient - 1

quotient = quotient_sign ? (~quotient + 1) : quotient
remainder = remainder_sign ? (~remainder + 1) : remainder

Division by Zero Handling:

if (divisor == 0):
    quotient = 32'hFFFFFFFF
    remainder = dividend
    zero_flag = 1 (for DIV), 0 (for MOD)

Complexity:

Time: 32 stages × ~5 levels per adder = ~160 logic levels
Area: 32 × 32-bit adder = Very large
Critical path: ~38 ns (major bottleneck!)

Characteristics:

✅ Single-cycle operation
✅ No state machine required
✅ Deterministic timing
❌ Extremely long critical path
❌ High area consumption

4. Barrel Shifters

Used for: SLL (Shift Left Logical), SRL (Shift Right Logical), SRA (Shift Right Arithmetic)

Algorithm: Logarithmic shifter with 5 stages

Stage 0: shift by 0 or 1   (controlled by shift_amount[0])
Stage 1: shift by 0 or 2   (controlled by shift_amount[1])
Stage 2: shift by 0 or 4   (controlled by shift_amount[2])
Stage 3: shift by 0 or 8   (controlled by shift_amount[3])
Stage 4: shift by 0 or 16  (controlled by shift_amount[4])

Arithmetic Shift Right:

Fill with sign bit: {sign_bit, data[31:1]}
Preserves sign for negative numbers

Complexity:

Time: O(log n) = 5 mux stages
Area: O(n log n) multiplexers
Critical path: ~1-2 ns

5. Logical Operations

Implemented directly:

AND:  y = a & b
OR:   y = a | b
XOR:  y = a ^ b
NOT:  y = ~b
PASS: y = b

Complexity: 1 logic level, negligible delay

6. Set-on-Less-Than (SLT)

// Perform signed subtraction
diff = a - b  (using Brent-Kung subtractor)

// Check MSB (sign bit)
slt_result = diff[31] ? 1 : 0

ALU Operation Selection

always @(*) begin
    case (op)
        `ALU_ADD:  y = bk_adder_result;
        `ALU_SUB:  y = bk_subtractor_result;
        `ALU_MUL:  y = booth_multiplier_result;
        `ALU_DIV:  y = nonrestoring_quotient;
        `ALU_MOD:  y = nonrestoring_remainder;
        `ALU_AND:  y = a & b;
        `ALU_OR:   y = a | b;
        `ALU_XOR:  y = a ^ b;
        `ALU_SLL:  y = barrel_shift_left;
        `ALU_SRL:  y = barrel_shift_right_logical;
        `ALU_SRA:  y = barrel_shift_right_arithmetic;
        `ALU_SLT:  y = {31'b0, set_if_less_than};
        `ALU_NOT:  y = ~b;
        `ALU_PASS: y = b;
        default:   y = 32'h0;
    endcase
    
    // Zero flag
    zero = (y == 32'd0);
end

🔧 Assembler & Toolchain

Python Assembler Overview

The SimpleRISC toolchain includes a complete Python-based assembler for converting assembly language to machine code.

Features:

✅ Full ISA support (21 instructions)
✅ Label resolution for branches
✅ Immediate value encoding
✅ Error detection and reporting
✅ Hex file generation
✅ Listing file generation
✅ Disassembler support

Usage

Basic Assembly:

# Convert assembly to hex
make assemble

## if make is not working
python tools/asm.py program.asm program.hex

Assembly Syntax

Comments:

# This is a comment
add r1, r2, r3    # Inline comment

Labels:

main:
    addi r1, r0, 10
loop:
    subi r1, r1, 1
    bgt  r1, r0, loop    # Branch to label

Directives:

.text               # Code section (default)
.data               # Data section

Example : Fibonacci Sequence

# fibonacci.asm - Calculate first 10 Fibonacci numbers
.text
    movi r1, 0           # fib(0) = 0
    movi r2, 1           # fib(1) = 1
    movi r3, 10          # count = 10
    movi r4, 0           # index = 0
    
loop:
    beq  r4, r3, done    # if index == count, exit
    
    # Store current number
    st   r1, r4, 0x1000  # mem[0x1000 + index*4] = fib(n)
    
    # Calculate next
    add  r5, r1, r2      # next = fib(n) + fib(n-1)
    mov  r1, r2          # fib(n) = fib(n-1)
    mov  r2, r5          # fib(n-1) = next
    
    addi r4, r4, 1       # index++
    b    loop
    
done:
    halt

# Results in memory:
# mem[0x1000] = 0
# mem[0x1001] = 1
# mem[0x1002] = 1
# mem[0x1003] = 2
# mem[0x1004] = 3
# mem[0x1005] = 5
# mem[0x1006] = 8
# mem[0x1007] = 13
# mem[0x1008] = 21
# mem[0x1009] = 34

Instruction Encoding Examples

R-Type (Register-Register):

Assembly: add r3, r1, r2
Binary:   00000 0 0011 0001 0010 00000000000000
Hex:      0x00C60000

Breakdown:
  OP=00000 (ADD)
  I=0 (use register)
  RD=0011 (r3)
  RS1=0001 (r1)
  RS2=0010 (r2)

I-Type (Immediate):

Assembly: addi r1, r0, 42
Binary:   00000 1 0001 0000 000000000000101010
Hex:      0x4884002A

Breakdown:
  OP=00000 (ADD)
  I=1 (use immediate)
  RD=0001 (r1)
  RS1=0000 (r0)
  IMM=42 (0x2A)

Testing Workflow

Write Assembly Code
Convert Assembly to Hex by assembler
Run tb_simplerisc.v against your hex codes by:

# make command to run simplerisc
make test_simplerisc

## if make is not working
iverilog -g2012 -I rtl -o build/tb_simplerisc.vvp rtl/*.v  tb/tb_simplerisc.v
vvp build/tb_simplerisc.vvp

Analyze the wave dumps by GTKWave / Verilog

📊 Performance Analysis

Timing Summary

Component	Logic Levels	Estimated Delay	Max Frequency
Program Counter	1	~0.5 ns	N/A
Instruction Fetch	5	~2 ns	N/A
Decode	2	~1 ns	N/A
ALU - ADD/SUB	5	~1.2 ns	~800 MHz
ALU - Logic	1	~0.3 ns	~3 GHz
ALU - Shifts	5	~1.5 ns	~650 MHz
ALU - MUL	30	~10 ns	~100 MHz
ALU - DIV	160+	~38 ns	~26 MHz
Memory Access	10	~3 ns	N/A
Writeback	1	~0.5 ns	N/A

Critical Path Analysis

Worst Case (Division):

PC → IMEM → Decode → RegFile → ALU (DIV) → WB
  0.5ns + 2ns + 1ns + 1ns + 38ns + 0.5ns = ~43 ns

Maximum Frequency: ~23 MHz

Without Division (Multiplication):

PC → IMEM → Decode → RegFile → ALU (MUL) → WB
  0.5ns + 2ns + 1ns + 1ns + 10ns + 0.5ns = ~15 ns

Maximum Frequency: ~66 MHz

Simple Operations (ADD/Logic):

PC → IMEM → Decode → RegFile → ALU (ADD) → WB
  0.5ns + 2ns + 1ns + 1ns + 1.2ns + 0.5ns = ~6.2 ns

Maximum Frequency: ~160 MHz

Performance vs Target

Metric	Target	Achieved	Status
Full ALU	250 MHz	~23 MHz	❌ 10× too slow
Without DIV	250 MHz	~66 MHz	❌ 3.8× too slow
Basic Ops	250 MHz	~160 MHz	⚠️ 1.5× too slow

Why 250 MHz is Challenging

The 250 MHz target (4 ns period) is difficult for several fundamental reasons:

Division Bottleneck: 38 ns critical path (9.5× target)
- 32 cascaded Brent-Kung adders
- Fully combinational implementation
- Unavoidable in single-cycle architecture
Multiplication Path: 10 ns (2.5× target)
- Could reach ~200 MHz with aggressive optimization
- Still short of 250 MHz
Single-Cycle Constraint:
- All operations must complete in one clock
- Cannot pipeline or use multi-cycle execution
- Limits optimization options

Solutions that would work:

✅ Multi-cycle execution (DIV = 32 cycles, MUL = 3 cycles)
✅ Pipelined ALU (3-5 stage pipeline)
✅ Separate fast/slow execution paths

What doesn't work:

❌ Single-cycle combinational DIV/MUL at 250 MHz
❌ Faster algorithms (limited by physics)

📁 Project Structure

simplerisc-processor/
├── rtl/                          # RTL source files
│   ├── simplerisc_top.v         # Top-level processor
│   ├── alu.v                    # ALU (optimized)
│   │   ├── bk_adder32.v        # Brent-Kung adder
│   │   ├── bk_subtractor32.v   # Brent-Kung subtractor
│   │   ├── mul_tree32.v        # Booth multiplier
│   │   ├── div_nonrestoring32_bk.v
│   │   ├── sll32.v             # Barrel shifter (left)
│   │   ├── srl32.v             # Barrel shifter (right logical)
│   │   ├── sra32.v             # Barrel shifter (right arithmetic)
│   │   └── slt.v               # comparison of 2 var (smaller than)
│   ├── control_unit.v           # Instruction decoder
│   ├── regfile.v                # Register file
│   ├── imem.v                   # Instruction memory
│   ├── dmem.v                   # Data memory
│   ├── immu.v                   # Immediate unit
│   ├── decode.vh                # ISA definitions
│   └── branch_target.v          # Branch target calculator
│
├── testbench/
│   ├── alu_tb.v                 # ALU unit tests
│   └── tb_simplerisc.v          # complete simplerisc test
│
├── tools/
│   └──  asm.py                  # ASM → HEX assembler
│
├── programs/
│   ├── test_arithmetic.asm      # Arithmetic tests
│   ├── test_memory.asm          # Load/store tests
│   ├── test_branches.asm        # Control flow tests
│   ├── fibonacci.asm            # Fibonacci sequence
│   ├── factorial.asm            # Factorial calculation
│   ├── bubble_sort.asm          # Sorting algorithm
│   └── gcd.asm                  # GCD algorithm
|
├── program.asm
├── program.hex
├── Makefile
└── README.md

⚖️ Design Trade-offs

1. Single-Cycle vs Multi-Cycle

Choice: Single-Cycle

Aspect	Single-Cycle	Multi-Cycle
CPI	Always 1	1-32 (variable)
Clock Frequency	~23 MHz	250+ MHz
Control Logic	Simple	Complex (FSM)
Throughput	23 MIPS	50-250 MIPS
Latency	Predictable	Variable

Justification: Educational clarity, simple control, predictable timing

2. Adder Selection

Choice: Brent-Kung

Adder Type	Delay	Wiring	Area
Ripple-Carry	O(n)	Simple	Tiny
Brent-Kung	O(log n)	Moderate	Medium
Kogge-Stone	O(log n)	Complex	Large

Justification: Best balance of speed, area, and routability

3. Multiplier Architecture

Choice: Radix-4 Booth + CSA Tree

Alternatives:

Array multiplier: Simple but slow (O(n²))
Wallace tree: Slightly faster, more irregular
Dadda tree: Similar to Wallace
Sequential: Much smaller area, much slower

Justification: Good speed/area balance, regular structure

4. Divider Implementation

Choice: Fully Unrolled Non-Restoring

Alternatives:

Sequential restoring: 32 cycles, tiny area
SRT division: Complex, needs lookup tables
Convergence division: Iterative, variable latency

Justification: Single-cycle requirement forced fully combinational design

The Problem: Creates massive critical path (~38 ns)

Better Alternatives (require architectural changes):

Multi-cycle sequential divider (32 cycles, ~1 MHz throughput)
Pipelined SRT divider (complex but high throughput)

💡 Lessons Learned

Technical Insights

Architecture Constrains Performance
- Single-cycle requirement fundamentally limits frequency
- Division/multiplication need pipelining for high performance
- Trade-off between CPI and clock frequency
Algorithm Selection Matters
- Brent-Kung: excellent adder choice
- Booth encoding: essential for multiplication
- Barrel shifters: simple and effective
- Non-restoring division: functional but too slow
Area-Speed Trade-offs
- Can't optimize all three: area, speed, power
- Division uses massive area for marginal performance gain
- Simple operations easily meet timing
Verification is Critical
- Assembly-level testing catches integration bugs
- Unit tests verify component correctness
- Edge cases reveal design issues

Design Recommendations

For Educational Projects:

✅ Single-cycle: Simple, easy to understand
✅ Basic operations: Meet timing easily
⚠️ Complex arithmetic: Use multi-cycle or omit

For Performance:

✅ Multi-cycle execution for DIV (32 cycles)
✅ 3-stage pipeline for MUL
✅ Single-cycle for ADD/SUB/Logic
✅ Target: 250+ MHz for simple operations

For Area Efficiency:

✅ Sequential divider (tiny area)
✅ Iterative multiplier
✅ Ripple-carry adder (if speed not critical)

🛠️ Built With

Languages

Verilog (IEEE 1364-2005) - Hardware description
Python 3.x - Toolchain development
Assembly - SimpleRISC ISA

Tools

Icarus Verilog - Open-source simulation
ModelSim/VCS - Commercial simulation (optional)
GTKWave - Waveform viewing
Make - Build automation

🎯 Performance Summary

Metric	Value	Notes
Instruction Set	21 instructions	Full RISC ISA
Data Path	32-bit	Industry standard
Registers	16 × 32-bit	Including SP, RA
Max Frequency (Full)	~23 MHz	Limited by division
Max Frequency (No DIV)	~66 MHz	Limited by multiplication
Max Frequency (Basic)	~160 MHz	ADD/SUB/Logic only
CPI	1.0	Single-cycle execution
Functional Correctness	✅ 100%	All tests pass

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
rtl		rtl
tb		tb
tools		tools
.DS_Store		.DS_Store
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
program.asm		program.asm
program.hex		program.hex

Illumanizer/Simple_RISC_processor

Folders and files

Latest commit

History

Repository files navigation

🚀 SimpleRISC Processor Core

🎯 Project Highlights

🏗️ Architecture Overview

High-Level Block Diagram

Single-Cycle Execution Model

Data Path Flow

✨ Features

Processor Capabilities

ALU Capabilities

🧩 Processor Components

1. Instruction Memory (IMEM)

2. Control Unit

3. Register File

4. Arithmetic Logic Unit (ALU) ⭐

5. Data Memory (DMEM)

6. Immediate Unit (IMMU)

7. Branch Target Calculator

📋 Instruction Set Architecture

Instruction Format

Complete Instruction Set

Arithmetic Operations

Logical Operations

Shift Operations

Data Movement

Memory Operations

Control Flow

Addressing Modes

⚡ ALU Design Deep Dive

Architecture

Component Breakdown

1. Brent-Kung Parallel-Prefix Adder

2. Radix-4 Booth Multiplier with CSA Tree

3. Non-Restoring Divider

4. Barrel Shifters

5. Logical Operations

6. Set-on-Less-Than (SLT)

ALU Operation Selection

🔧 Assembler & Toolchain

Python Assembler Overview

Usage

Assembly Syntax

Instruction Encoding Examples

Testing Workflow

📊 Performance Analysis

Timing Summary

Critical Path Analysis

Performance vs Target

Why 250 MHz is Challenging

📁 Project Structure

⚖️ Design Trade-offs

1. Single-Cycle vs Multi-Cycle

2. Adder Selection

3. Multiplier Architecture

4. Divider Implementation

💡 Lessons Learned

Technical Insights

Design Recommendations

🛠️ Built With

Languages

Tools

🎯 Performance Summary

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages