Skip to content

A complete single-cycle RISC processor implementation in Verilog featuring an optimized ALU with advanced arithmetic algorithms

Notifications You must be signed in to change notification settings

Illumanizer/Simple_RISC_processor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

14 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸš€ SimpleRISC Processor Core

A complete single-cycle RISC processor implementation in Verilog featuring an optimized ALU with advanced arithmetic algorithms

This project implements a fully functional 32-bit RISC processor with a custom instruction set (SimpleRISC ISA). The processor features a high-performance ALU utilizing Brent-Kung parallel-prefix adders, Radix-4 Booth multiplication with CSA trees, and non-restoring division, along with a complete Python-based assembler toolchain for program development and verification.

🎯 Project Highlights

  • πŸ—οΈ Complete Processor Core: Full single-cycle RISC implementation (Fetch β†’ Decode β†’ Execute β†’ Memory β†’ Writeback)
  • ⚑ Optimized ALU: Advanced arithmetic algorithms (Brent-Kung, Booth, Non-Restoring Division)
  • πŸ“ 21 Instructions: Arithmetic, logic, shifts, memory operations, branches, function calls
  • πŸ”§ 16 Registers: General purpose (r0-r13) + Stack Pointer (SP/r14) + Return Address (RA/r15)
  • πŸ› οΈ Python Toolchain: Custom assembler for ASM β†’ HEX conversion
  • βœ… Fully Verified: Comprehensive testbenches and assembly test programs

πŸ’‘ Key Innovation: High-performance arithmetic unit balancing speed, area, and complexity while maintaining full ISA compatibility


πŸ—οΈ Architecture Overview

High-Level Block Diagram

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€-───┐
β”‚                    SimpleRISC Processor Core                      β”‚
β”‚                                                                   β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
β”‚  β”‚          β”‚    β”‚          β”‚    β”‚          β”‚    β”‚          β”‚     β”‚
β”‚  β”‚  FETCH   │───>β”‚  DECODE  │───>β”‚ EXECUTE  │───>β”‚  MEMORY  β”‚     β”‚
β”‚  β”‚          β”‚    β”‚          β”‚    β”‚          β”‚    β”‚          β”‚     β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
β”‚       β”‚               β”‚                β”‚               β”‚          β”‚
β”‚       β”‚               β”‚                β”‚               β”‚          β”‚
β”‚       v               v                v               v          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚
β”‚  β”‚ Instr.  β”‚    β”‚ Control  β”‚    β”‚   ALU    β”‚   β”‚   Data   β”‚       β”‚
β”‚  β”‚ Memory  β”‚    β”‚   Unit   β”‚    β”‚ Optimizedβ”‚   β”‚  Memory  β”‚       β”‚
β”‚  β”‚ (IMEM)  β”‚    β”‚          β”‚    β”‚          β”‚   β”‚  (DMEM)  β”‚       β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚
β”‚                       β”‚                β”‚               β”‚          β”‚
β”‚                       v                v               v          β”‚
β”‚                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”‚
β”‚                  β”‚      Register File (16 regs)    β”‚              β”‚
β”‚                  β”‚  r0-r13, SP(r14), RA(r15)       β”‚              β”‚
β”‚                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β”‚
β”‚                                                                   β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚              Program Counter (PC) & Control Logic          β”‚   β”‚
β”‚  β”‚         Branch Target Calculation | Return Address         β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                   β”‚
└──────────────────────────────────────────────────────────────────-β”˜

Single-Cycle Execution Model

Each instruction completes in one clock cycle through five conceptual stages:

  1. Fetch (IF): Read instruction from instruction memory using PC
  2. Decode (ID): Extract opcode, registers, immediate values; read register file
  3. Execute (EX): Perform operation in ALU (arithmetic, logic, address calculation)
  4. Memory (MEM): Access data memory for load/store operations
  5. Writeback (WB): Write result back to register file

Data Path Flow

PC β†’ IMEM β†’ Instruction
           ↓
     [Decode Fields]
           ↓
   Control Unit β†’ Control Signals
           ↓
    Register File β†’ RS1, RS2
           ↓
    [Operand Selection: Reg vs Imm]
           ↓
         ALU β†’ Result
           ↓
    [Memory Access if LD/ST]
           ↓
    [WB Mux: ALU vs MEM]
           ↓
    Register File ← Write Data

Branch Logic β†’ PC Next (PC+4 vs Branch Target)

✨ Features

Processor Capabilities

βœ… 32-bit Architecture

  • 32-bit data path and address space
  • Word-aligned memory access
  • 4GB addressable memory

βœ… 16 General-Purpose Registers

  • r0-r13: General purpose
  • r14 (SP): Stack pointer
  • r15 (RA): Return address (link register)

βœ… 21 Instructions

  • 9 Arithmetic/Logic operations
  • 3 Shift operations
  • 2 Data movement operations
  • 2 Memory operations
  • 5 Control flow operations

βœ… Single-Cycle Execution

  • All instructions complete in one cycle
  • Simplified control logic
  • Predictable timing

βœ… Harvard Architecture

  • Separate instruction and data memories
  • Simultaneous fetch and memory access
  • No structural hazards

ALU Capabilities

πŸ”§ 14 Operations (4-bit opcode):

  • Addition, Subtraction (Brent-Kung adder)
  • Multiplication (Radix-4 Booth + CSA tree)
  • Division, Modulus (Non-restoring algorithm)
  • Logical: AND, OR, XOR, NOT
  • Shifts: SLL, SRL, SRA (Barrel shifters)
  • Comparison: SLT (Set if Less Than)
  • Data movement: PASS

🧩 Processor Components

1. Instruction Memory (IMEM)

module imem (
    input  [29:0] addr,     // Word-aligned address (PC[31:2])
    output [31:0] instr     // 32-bit instruction
);

Features:

  • Read-only memory for program storage
  • Initialized from hex file during simulation
  • Word-addressed (bottom 2 bits of PC ignored)
  • Synchronous/asynchronous read (implementation dependent)

2. Control Unit

module control_unit (
    input  [4:0] op,        // 5-bit opcode
    input        Ibit,      // Immediate flag
    output [3:0] alu_op,    // ALU operation
    output       use_imm,   // Use immediate vs register
    output       mem_read,  // Memory read enable
    output       mem_write, // Memory write enable
    output       wb_en,     // Register writeback enable
    output       wb_from_mem, // Writeback from memory
    // Branch control signals...
);

Functionality:

  • Decodes 5-bit instruction opcode
  • Generates control signals for all processor components
  • Handles special cases (MOV, NOT, CMP, branches)
  • Maps instruction opcodes to ALU operations

Instruction β†’ ALU Mapping:

OP_ADD  β†’ ALU_ADD
OP_MUL  β†’ ALU_MUL
OP_LD   β†’ ALU_ADD  (address calculation)
OP_BEQ  β†’ Set branch flags (no ALU operation)

3. Register File

module regfile (
    input         clk,
    input         we,          // Write enable
    input  [3:0]  rs1, rs2,    // Source registers
    input  [3:0]  rd,          // Destination register
    input  [31:0] wdata,       // Write data
    output [31:0] rdata1,      // RS1 data
    output [31:0] rdata2       // RS2 data
);

Features:

  • 16 Γ— 32-bit registers
  • Dual read ports (RS1, RS2)
  • Single write port (RD)
  • Synchronous write, asynchronous read
  • Special registers:
    • r14 (SP): Stack operations
    • r15 (RA): Function return addresses

4. Arithmetic Logic Unit (ALU) ⭐

The heart of the processor - See ALU Design Deep Dive for details.

module alu (
    input  [31:0] a, b,     // Operands
    input  [3:0]  op,       // Operation selector
    output [31:0] y,        // Result
    output        zero      // Zero flag
);

Sub-components:

  • bk_adder32: Brent-Kung parallel-prefix adder
  • mul_tree32: Radix-4 Booth multiplier with CSA tree
  • div_nonrestoring32_bk: Non-restoring divider
  • sll32, srl32, sra32: Barrel shifters
  • Logic gates for AND, OR, XOR, NOT

5. Data Memory (DMEM)

module dmem (
    input         clk,
    input         re,          // Read enable
    input         we,          // Write enable
    input  [31:0] addr,        // Address
    input  [31:0] wdata,       // Write data
    output [31:0] rdata        // Read data
);

Features:

  • Read/write data memory
  • Synchronous access (on clock edge)
  • Word-addressed
  • Stores program variables and stack

6. Immediate Unit (IMMU)

module immu (
    input  [17:0] imm18,    // 18-bit immediate from instruction
    output [31:0] immx      // 32-bit sign-extended immediate
);

Functionality:

  • Sign-extends 18-bit immediate to 32 bits
  • Used for:
    • Immediate arithmetic (ADDI, SUBI, etc.)
    • Memory offsets (LD, ST)
    • Small constants

7. Branch Target Calculator

module branch_target (
    input  [26:0] off27,    // 27-bit offset
    input  [31:0] pc,       // Current PC
    output [31:0] target    // Branch target address
);

Functionality:

  • Calculates: target = PC + (sign_extend(off27) << 2)
  • Supports PC-relative addressing
  • Used for: BEQ, BGT, B, CALL

πŸ“‹ Instruction Set Architecture

Instruction Format

All instructions are 32 bits:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ OP[4:0] β”‚ I β”‚ RD   β”‚ RS1  β”‚ RS2/ β”‚   IMM/OFFSET       β”‚
β”‚         β”‚   β”‚ [3:0]β”‚ [3:0]β”‚ [3:0]β”‚                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  31   27  26  25 22  21 18  17 14  13                 0

Fields:
- OP[4:0]:    5-bit operation code
- I:          Immediate flag (1 = immediate, 0 = register)
- RD[3:0]:    Destination register
- RS1[3:0]:   Source register 1
- RS2[3:0]:   Source register 2 (if I=0)
- IMM[17:0]:  18-bit immediate (if I=1), sign-extended
- OFFSET[26:0]: Branch offset (for branches)

Complete Instruction Set

Arithmetic Operations

Instruction Opcode Format Description Example
ADD 00000 R/I Add add r1, r2, r3 β†’ r1 = r2 + r3
SUB 00001 R/I Subtract sub r1, r2, r3 β†’ r1 = r2 - r3
MUL 00010 R/I Multiply mul r1, r2, r3 β†’ r1 = r2 Γ— r3
DIV 00011 R/I Divide div r1, r2, r3 β†’ r1 = r2 Γ· r3
MOD 00100 R/I Modulus mod r1, r2, r3 β†’ r1 = r2 % r3

Logical Operations

Instruction Opcode Format Description Example
AND 00110 R/I Bitwise AND and r1, r2, r3 β†’ r1 = r2 & r3
OR 00111 R/I Bitwise OR or r1, r2, r3 β†’ r1 = r2 | r3
NOT 01000 R/I Bitwise NOT not r1, r2 β†’ r1 = ~r2

Shift Operations

Instruction Opcode Format Description Example
LSL 01010 R/I Logical Shift Left lsl r1, r2, 4 β†’ r1 = r2 << 4
LSR 01011 R/I Logical Shift Right lsr r1, r2, 4 β†’ r1 = r2 >> 4
ASR 01100 R/I Arithmetic Shift Right asr r1, r2, 4 β†’ r1 = r2 >>> 4

Data Movement

Instruction Opcode Format Description Example
MOV 01001 R/I Move mov r1, r2 β†’ r1 = r2
CMP 00101 R/I Compare (sets flags) cmp r1, r2 β†’ flags = (r1 vs r2)

Memory Operations

Instruction Opcode Format Description Example
LD 01110 I Load ld r1, r2, 100 β†’ r1 = mem[r2+100]
ST 01111 I Store st r1, r2, 100 β†’ mem[r2+100] = r1

Note: Store encoding is special - the value to store is in RD field, not RS2.

Control Flow

Instruction Opcode Format Description Example
BEQ 10000 Branch Branch if Equal beq r1, r2, label β†’ if (r1 == r2) goto label
BGT 10001 Branch Branch if Greater bgt r1, r2, label β†’ if (r1 > r2) goto label
B 10010 Branch Unconditional Branch b label β†’ goto label
CALL 10011 Branch Function Call call func β†’ RA=PC+4; goto func
RET 10100 Branch Return ret β†’ goto RA
NOP 01101 - No Operation nop β†’ do nothing

Addressing Modes

Register-Register (R-type):

add r1, r2, r3    # r1 = r2 + r3

Register-Immediate (I-type):

addi r1, r2, 100  # r1 = r2 + 100

Memory Access:

ld  r1, r2, 100   # r1 = mem[r2 + 100]  (base + offset)
st  r1, r2, 100   # mem[r2 + 100] = r1

PC-Relative Branch:

beq r1, r2, label # if (r1 == r2) PC = PC + offset

⚑ ALU Design Deep Dive

The ALU is the performance-critical component of the processor, implementing all arithmetic and logical operations. It utilizes advanced algorithms to optimize speed, area, and power.

Architecture

                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    a[31:0] ───────>β”‚                             β”‚
                    β”‚         ALU Core            β”‚
    b[31:0] ───────>β”‚                             │───> y[31:0]
                    β”‚  - Brent-Kung Adder         β”‚
    op[3:0] ───────>β”‚  - Booth Multiplier         │───> zero
                    β”‚  - Non-Restoring Divider    β”‚
                    β”‚  - Barrel Shifters          β”‚
                    β”‚  - Logic Gates              β”‚
                    β”‚                             β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Component Breakdown

1. Brent-Kung Parallel-Prefix Adder

Used for: ADD, SUB (via 2's complement)

Algorithm:

Stage 1: Pre-processing (Generate & Propagate)
  P[i] = a[i] XOR b[i]
  G[i] = a[i] AND b[i]

Stage 2: Prefix Tree (logβ‚‚(32) = 5 levels)
  Level 1: Combine adjacent pairs
  Level 2: Combine groups of 4
  Level 3: Combine groups of 8
  Level 4: Combine groups of 16
  Level 5: Combine full width

Stage 3: Post-processing (Final Sum)
  Sum[i] = P[i] XOR Carry[i-1]

Complexity:

  • Time: O(log n) = 5 stages
  • Area: O(n log n) gates
  • Critical path: ~1.2 ns

Advantages:

  • Balanced fan-out
  • Low wire complexity
  • Scalable to larger widths
  • Faster than ripple-carry by ~6Γ—

Subtraction Implementation:

wire [31:0] b_neg = ~b + 1;  // Two's complement
// Use same adder: a + b_neg

2. Radix-4 Booth Multiplier with CSA Tree

Used for: MUL

Phase 1: Sign Handling

sign_a = a[31]
sign_b = b[31]
mag_a = sign_a ? (~a + 1) : a    // Absolute value
mag_b = sign_b ? (~b + 1) : b
result_sign = sign_a XOR sign_b

Phase 2: Booth Encoding

Scans multiplier in overlapping 3-bit windows, reducing 32 partial products to 16:

Window[2:0] Partial Product
000, 111 0
001, 010 +multiplicand
011 +2Γ—multiplicand
100 -2Γ—multiplicand
101, 110 -multiplicand

Phase 3: CSA Tree Reduction

16 partial products
   ↓ (3:2 CSA)
11 operands
   ↓ (3:2 CSA)
 8 operands
   ↓ (3:2 CSA)
 6 operands
   ↓ (3:2 CSA)
 4 operands
   ↓ (3:2 CSA)
 3 operands
   ↓ (3:2 CSA)
 2 operands (sum + carry)
   ↓ (CPA - final addition)
64-bit product (take lower 32 bits)

Phase 4: Sign Restoration

product = result_sign ? (~intermediate + 1) : intermediate

Complexity:

  • Time: ~12-15 logic levels (CSA tree + final CPA)
  • Area: Moderate (16 partial products + tree)
  • Critical path: ~8-10 ns

Advantages:

  • 50% reduction in partial products
  • Parallel CSA tree
  • Regular structure
  • Good synthesis results

3. Non-Restoring Divider

Used for: DIV, MOD

Algorithm: Fully unrolled 32-stage division

Initialization:

sign_dividend = dividend[31]
sign_divisor = divisor[31]
mag_dividend = sign_dividend ? (~dividend + 1) : dividend
mag_divisor = sign_divisor ? (~divisor + 1) : divisor
quotient_sign = sign_dividend XOR sign_divisor
remainder_sign = sign_dividend

Iteration (32 stages, fully combinational):

For i = 31 downto 0:
    partial_remainder = (remainder << 1) | dividend[i]
    
    if (partial_remainder >= 0):
        remainder = partial_remainder - divisor    (Brent-Kung adder)
        quotient[i] = 1
    else:
        remainder = partial_remainder + divisor    (Brent-Kung adder)
        quotient[i] = 0

Correction & Sign Restoration:

if (remainder < 0):
    remainder = remainder + divisor
    quotient = quotient - 1

quotient = quotient_sign ? (~quotient + 1) : quotient
remainder = remainder_sign ? (~remainder + 1) : remainder

Division by Zero Handling:

if (divisor == 0):
    quotient = 32'hFFFFFFFF
    remainder = dividend
    zero_flag = 1 (for DIV), 0 (for MOD)

Complexity:

  • Time: 32 stages Γ— ~5 levels per adder = ~160 logic levels
  • Area: 32 Γ— 32-bit adder = Very large
  • Critical path: ~38 ns (major bottleneck!)

Characteristics:

  • βœ… Single-cycle operation
  • βœ… No state machine required
  • βœ… Deterministic timing
  • ❌ Extremely long critical path
  • ❌ High area consumption

4. Barrel Shifters

Used for: SLL (Shift Left Logical), SRL (Shift Right Logical), SRA (Shift Right Arithmetic)

Algorithm: Logarithmic shifter with 5 stages

Stage 0: shift by 0 or 1   (controlled by shift_amount[0])
Stage 1: shift by 0 or 2   (controlled by shift_amount[1])
Stage 2: shift by 0 or 4   (controlled by shift_amount[2])
Stage 3: shift by 0 or 8   (controlled by shift_amount[3])
Stage 4: shift by 0 or 16  (controlled by shift_amount[4])

Arithmetic Shift Right:

  • Fill with sign bit: {sign_bit, data[31:1]}
  • Preserves sign for negative numbers

Complexity:

  • Time: O(log n) = 5 mux stages
  • Area: O(n log n) multiplexers
  • Critical path: ~1-2 ns

5. Logical Operations

Implemented directly:

AND:  y = a & b
OR:   y = a | b
XOR:  y = a ^ b
NOT:  y = ~b
PASS: y = b

Complexity: 1 logic level, negligible delay

6. Set-on-Less-Than (SLT)

// Perform signed subtraction
diff = a - b  (using Brent-Kung subtractor)

// Check MSB (sign bit)
slt_result = diff[31] ? 1 : 0

ALU Operation Selection

always @(*) begin
    case (op)
        `ALU_ADD:  y = bk_adder_result;
        `ALU_SUB:  y = bk_subtractor_result;
        `ALU_MUL:  y = booth_multiplier_result;
        `ALU_DIV:  y = nonrestoring_quotient;
        `ALU_MOD:  y = nonrestoring_remainder;
        `ALU_AND:  y = a & b;
        `ALU_OR:   y = a | b;
        `ALU_XOR:  y = a ^ b;
        `ALU_SLL:  y = barrel_shift_left;
        `ALU_SRL:  y = barrel_shift_right_logical;
        `ALU_SRA:  y = barrel_shift_right_arithmetic;
        `ALU_SLT:  y = {31'b0, set_if_less_than};
        `ALU_NOT:  y = ~b;
        `ALU_PASS: y = b;
        default:   y = 32'h0;
    endcase
    
    // Zero flag
    zero = (y == 32'd0);
end

πŸ”§ Assembler & Toolchain

Python Assembler Overview

The SimpleRISC toolchain includes a complete Python-based assembler for converting assembly language to machine code.

Features:

  • βœ… Full ISA support (21 instructions)
  • βœ… Label resolution for branches
  • βœ… Immediate value encoding
  • βœ… Error detection and reporting
  • βœ… Hex file generation
  • βœ… Listing file generation
  • βœ… Disassembler support

Usage

Basic Assembly:

# Convert assembly to hex
make assemble

## if make is not working
python tools/asm.py program.asm program.hex

Assembly Syntax

Comments:

# This is a comment
add r1, r2, r3    # Inline comment

Labels:

main:
    addi r1, r0, 10
loop:
    subi r1, r1, 1
    bgt  r1, r0, loop    # Branch to label

Directives:

.text               # Code section (default)
.data               # Data section

Example : Fibonacci Sequence

# fibonacci.asm - Calculate first 10 Fibonacci numbers
.text
    movi r1, 0           # fib(0) = 0
    movi r2, 1           # fib(1) = 1
    movi r3, 10          # count = 10
    movi r4, 0           # index = 0
    
loop:
    beq  r4, r3, done    # if index == count, exit
    
    # Store current number
    st   r1, r4, 0x1000  # mem[0x1000 + index*4] = fib(n)
    
    # Calculate next
    add  r5, r1, r2      # next = fib(n) + fib(n-1)
    mov  r1, r2          # fib(n) = fib(n-1)
    mov  r2, r5          # fib(n-1) = next
    
    addi r4, r4, 1       # index++
    b    loop
    
done:
    halt

# Results in memory:
# mem[0x1000] = 0
# mem[0x1001] = 1
# mem[0x1002] = 1
# mem[0x1003] = 2
# mem[0x1004] = 3
# mem[0x1005] = 5
# mem[0x1006] = 8
# mem[0x1007] = 13
# mem[0x1008] = 21
# mem[0x1009] = 34

Instruction Encoding Examples

R-Type (Register-Register):

Assembly: add r3, r1, r2
Binary:   00000 0 0011 0001 0010 00000000000000
Hex:      0x00C60000

Breakdown:
  OP=00000 (ADD)
  I=0 (use register)
  RD=0011 (r3)
  RS1=0001 (r1)
  RS2=0010 (r2)

I-Type (Immediate):

Assembly: addi r1, r0, 42
Binary:   00000 1 0001 0000 000000000000101010
Hex:      0x4884002A

Breakdown:
  OP=00000 (ADD)
  I=1 (use immediate)
  RD=0001 (r1)
  RS1=0000 (r0)
  IMM=42 (0x2A)

Testing Workflow

  1. Write Assembly Code
  2. Convert Assembly to Hex by assembler
  3. Run tb_simplerisc.v against your hex codes by:
# make command to run simplerisc
make test_simplerisc

## if make is not working
iverilog -g2012 -I rtl -o build/tb_simplerisc.vvp rtl/*.v  tb/tb_simplerisc.v
vvp build/tb_simplerisc.vvp
  1. Analyze the wave dumps by GTKWave / Verilog

πŸ“Š Performance Analysis

Timing Summary

Component Logic Levels Estimated Delay Max Frequency
Program Counter 1 ~0.5 ns N/A
Instruction Fetch 5 ~2 ns N/A
Decode 2 ~1 ns N/A
ALU - ADD/SUB 5 ~1.2 ns ~800 MHz
ALU - Logic 1 ~0.3 ns ~3 GHz
ALU - Shifts 5 ~1.5 ns ~650 MHz
ALU - MUL 30 ~10 ns ~100 MHz
ALU - DIV 160+ ~38 ns ~26 MHz
Memory Access 10 ~3 ns N/A
Writeback 1 ~0.5 ns N/A

Critical Path Analysis

Worst Case (Division):

PC β†’ IMEM β†’ Decode β†’ RegFile β†’ ALU (DIV) β†’ WB
  0.5ns + 2ns + 1ns + 1ns + 38ns + 0.5ns = ~43 ns

Maximum Frequency: ~23 MHz

Without Division (Multiplication):

PC β†’ IMEM β†’ Decode β†’ RegFile β†’ ALU (MUL) β†’ WB
  0.5ns + 2ns + 1ns + 1ns + 10ns + 0.5ns = ~15 ns

Maximum Frequency: ~66 MHz

Simple Operations (ADD/Logic):

PC β†’ IMEM β†’ Decode β†’ RegFile β†’ ALU (ADD) β†’ WB
  0.5ns + 2ns + 1ns + 1ns + 1.2ns + 0.5ns = ~6.2 ns

Maximum Frequency: ~160 MHz

Performance vs Target

Metric Target Achieved Status
Full ALU 250 MHz ~23 MHz ❌ 10Γ— too slow
Without DIV 250 MHz ~66 MHz ❌ 3.8Γ— too slow
Basic Ops 250 MHz ~160 MHz ⚠️ 1.5Γ— too slow

Why 250 MHz is Challenging

The 250 MHz target (4 ns period) is difficult for several fundamental reasons:

  1. Division Bottleneck: 38 ns critical path (9.5Γ— target)

    • 32 cascaded Brent-Kung adders
    • Fully combinational implementation
    • Unavoidable in single-cycle architecture
  2. Multiplication Path: 10 ns (2.5Γ— target)

    • Could reach ~200 MHz with aggressive optimization
    • Still short of 250 MHz
  3. Single-Cycle Constraint:

    • All operations must complete in one clock
    • Cannot pipeline or use multi-cycle execution
    • Limits optimization options

Solutions that would work:

  • βœ… Multi-cycle execution (DIV = 32 cycles, MUL = 3 cycles)
  • βœ… Pipelined ALU (3-5 stage pipeline)
  • βœ… Separate fast/slow execution paths

What doesn't work:

  • ❌ Single-cycle combinational DIV/MUL at 250 MHz
  • ❌ Faster algorithms (limited by physics)

πŸ“ Project Structure

simplerisc-processor/
β”œβ”€β”€ rtl/                          # RTL source files
β”‚   β”œβ”€β”€ simplerisc_top.v         # Top-level processor
β”‚   β”œβ”€β”€ alu.v                    # ALU (optimized)
β”‚   β”‚   β”œβ”€β”€ bk_adder32.v        # Brent-Kung adder
β”‚   β”‚   β”œβ”€β”€ bk_subtractor32.v   # Brent-Kung subtractor
β”‚   β”‚   β”œβ”€β”€ mul_tree32.v        # Booth multiplier
β”‚   β”‚   β”œβ”€β”€ div_nonrestoring32_bk.v
β”‚   β”‚   β”œβ”€β”€ sll32.v             # Barrel shifter (left)
β”‚   β”‚   β”œβ”€β”€ srl32.v             # Barrel shifter (right logical)
β”‚   β”‚   β”œβ”€β”€ sra32.v             # Barrel shifter (right arithmetic)
β”‚   β”‚   └── slt.v               # comparison of 2 var (smaller than)
β”‚   β”œβ”€β”€ control_unit.v           # Instruction decoder
β”‚   β”œβ”€β”€ regfile.v                # Register file
β”‚   β”œβ”€β”€ imem.v                   # Instruction memory
β”‚   β”œβ”€β”€ dmem.v                   # Data memory
β”‚   β”œβ”€β”€ immu.v                   # Immediate unit
β”‚   β”œβ”€β”€ decode.vh                # ISA definitions
β”‚   └── branch_target.v          # Branch target calculator
β”‚
β”œβ”€β”€ testbench/
β”‚   β”œβ”€β”€ alu_tb.v                 # ALU unit tests
β”‚   └── tb_simplerisc.v          # complete simplerisc test
β”‚
β”œβ”€β”€ tools/
β”‚   └──  asm.py                  # ASM β†’ HEX assembler
β”‚
β”œβ”€β”€ programs/
β”‚   β”œβ”€β”€ test_arithmetic.asm      # Arithmetic tests
β”‚   β”œβ”€β”€ test_memory.asm          # Load/store tests
β”‚   β”œβ”€β”€ test_branches.asm        # Control flow tests
β”‚   β”œβ”€β”€ fibonacci.asm            # Fibonacci sequence
β”‚   β”œβ”€β”€ factorial.asm            # Factorial calculation
β”‚   β”œβ”€β”€ bubble_sort.asm          # Sorting algorithm
β”‚   └── gcd.asm                  # GCD algorithm
|
β”œβ”€β”€ program.asm
β”œβ”€β”€ program.hex
β”œβ”€β”€ Makefile
└── README.md

βš–οΈ Design Trade-offs

1. Single-Cycle vs Multi-Cycle

Choice: Single-Cycle

Aspect Single-Cycle Multi-Cycle
CPI Always 1 1-32 (variable)
Clock Frequency ~23 MHz 250+ MHz
Control Logic Simple Complex (FSM)
Throughput 23 MIPS 50-250 MIPS
Latency Predictable Variable

Justification: Educational clarity, simple control, predictable timing

2. Adder Selection

Choice: Brent-Kung

Adder Type Delay Wiring Area
Ripple-Carry O(n) Simple Tiny
Brent-Kung O(log n) Moderate Medium
Kogge-Stone O(log n) Complex Large

Justification: Best balance of speed, area, and routability

3. Multiplier Architecture

Choice: Radix-4 Booth + CSA Tree

Alternatives:

  • Array multiplier: Simple but slow (O(nΒ²))
  • Wallace tree: Slightly faster, more irregular
  • Dadda tree: Similar to Wallace
  • Sequential: Much smaller area, much slower

Justification: Good speed/area balance, regular structure

4. Divider Implementation

Choice: Fully Unrolled Non-Restoring

Alternatives:

  • Sequential restoring: 32 cycles, tiny area
  • SRT division: Complex, needs lookup tables
  • Convergence division: Iterative, variable latency

Justification: Single-cycle requirement forced fully combinational design

The Problem: Creates massive critical path (~38 ns)

Better Alternatives (require architectural changes):

  • Multi-cycle sequential divider (32 cycles, ~1 MHz throughput)
  • Pipelined SRT divider (complex but high throughput)

πŸ’‘ Lessons Learned

Technical Insights

  1. Architecture Constrains Performance

    • Single-cycle requirement fundamentally limits frequency
    • Division/multiplication need pipelining for high performance
    • Trade-off between CPI and clock frequency
  2. Algorithm Selection Matters

    • Brent-Kung: excellent adder choice
    • Booth encoding: essential for multiplication
    • Barrel shifters: simple and effective
    • Non-restoring division: functional but too slow
  3. Area-Speed Trade-offs

    • Can't optimize all three: area, speed, power
    • Division uses massive area for marginal performance gain
    • Simple operations easily meet timing
  4. Verification is Critical

    • Assembly-level testing catches integration bugs
    • Unit tests verify component correctness
    • Edge cases reveal design issues

Design Recommendations

For Educational Projects:

  • βœ… Single-cycle: Simple, easy to understand
  • βœ… Basic operations: Meet timing easily
  • ⚠️ Complex arithmetic: Use multi-cycle or omit

For Performance:

  • βœ… Multi-cycle execution for DIV (32 cycles)
  • βœ… 3-stage pipeline for MUL
  • βœ… Single-cycle for ADD/SUB/Logic
  • βœ… Target: 250+ MHz for simple operations

For Area Efficiency:

  • βœ… Sequential divider (tiny area)
  • βœ… Iterative multiplier
  • βœ… Ripple-carry adder (if speed not critical)

πŸ› οΈ Built With

Languages

  • Verilog (IEEE 1364-2005) - Hardware description
  • Python 3.x - Toolchain development
  • Assembly - SimpleRISC ISA

Tools

  • Icarus Verilog - Open-source simulation
  • ModelSim/VCS - Commercial simulation (optional)
  • GTKWave - Waveform viewing
  • Make - Build automation

🎯 Performance Summary

Metric Value Notes
Instruction Set 21 instructions Full RISC ISA
Data Path 32-bit Industry standard
Registers 16 Γ— 32-bit Including SP, RA
Max Frequency (Full) ~23 MHz Limited by division
Max Frequency (No DIV) ~66 MHz Limited by multiplication
Max Frequency (Basic) ~160 MHz ADD/SUB/Logic only
CPI 1.0 Single-cycle execution
Functional Correctness βœ… 100% All tests pass

About

A complete single-cycle RISC processor implementation in Verilog featuring an optimized ALU with advanced arithmetic algorithms

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published