Technical Project
Deep Dive
A detailed breakdown of my work in Hardware Architecture, RTL Design, and Embedded Firmware.
FPGA-Based CNN Accelerator & Memory Hierarchy
Tech Stack: VHDL, Python, Vivado, ModelSim
Designed a fully pipelined hardware accelerator for 3×3 RGB convolution operations, targeting high-throughput image processing applications on FPGA fabric. The architecture prioritizes memory bandwidth efficiency through a custom buffering strategy.
- Pipelined Datapath: Implemented a high-performance datapath achieving one output pixel per clock cycle (after initial pipeline fill latency), significantly outperforming sequential CPU execution for identical workloads.
- Custom Memory Architecture: Architected a "Sliding Window" buffer and packed RGB memory format. This design minimizes off-chip memory access overhead by reusing pixel data across convolution windows, reducing bandwidth requirements and simplifying address generation logic.
- Verification Strategy: Validated cycle-accurate behavior using a Python reference model to generate golden vectors. Developed a comprehensive VHDL testbench with valid-bit tracking to verify data integrity across all pipeline stages.
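The golden-vector approach above can be sketched in miniature. Below is a minimal Python reference model for a single-channel 3×3 "valid" convolution, the kind of function used to generate expected outputs for the VHDL testbench; the names (`conv3x3`, `IDENTITY`) are illustrative, not from the project source.

```python
# Minimal golden-model sketch: 3x3 "valid" convolution over a
# single channel, usable to generate expected vectors for an RTL
# testbench. Names here are illustrative.

def conv3x3(image, kernel):
    """image: list of rows of ints; kernel: 3x3 list of weights.
    Returns one output per position where the full 3x3 window fits
    (no padding), matching a pipeline that emits a pixel per cycle
    once the window buffer is primed."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - 2):
        row = []
        for x in range(w - 2):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            row.append(acc)
        out.append(row)
    return out

# Identity kernel: output equals the centre pixel of each window,
# a handy first sanity check for the hardware pipeline.
IDENTITY = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
```

Comparing the RTL's output stream against vectors from a model like this, cycle by cycle, is what makes the valid-bit tracking in the testbench meaningful.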
32-bit MIPS Microarchitecture Design
Tech Stack: VHDL, Assembly, Quartus
Designed and implemented a soft-core 32-bit RISC processor based on the MIPS instruction set architecture. The design focuses on pipeline efficiency and hazard resolution without software intervention.
- 5-Stage Pipeline: Implemented a classic Fetch, Decode, Execute, Memory, Write-Back pipeline structure to maximize instruction throughput.
- Hazard Resolution: Engineered hardware-based hazard mitigation, including a Forwarding Unit to resolve data dependencies (Read-After-Write) and a Hazard Detection Unit to insert stall cycles only when necessary (e.g., Load-Use hazards).
- Subsystem Integration: Built and integrated word-aligned Instruction and Data Memory modules, a general-purpose Register File, and a custom ALU supporting arithmetic, logical, and branch operations.
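The forwarding and stall rules above follow the classic MIPS scheme, which can be modelled compactly in software. The sketch below (function names illustrative, not from the project) encodes the two decisions: which pipeline register the EX-stage operand mux should draw from, and when a load-use hazard forces a one-cycle stall.

```python
# Software model of classic MIPS hazard logic: EX-stage forwarding
# mux selection plus load-use stall detection. Names illustrative.

def forward_select(src_reg, exmem_writes, exmem_rd, memwb_writes, memwb_rd):
    """Return which source the EX-stage operand mux picks:
    'EX/MEM' (newest in-flight result), 'MEM/WB', or 'REG'
    (register file). Register 0 ($zero) is never forwarded."""
    if src_reg != 0:
        if exmem_writes and exmem_rd == src_reg:
            return "EX/MEM"   # most recent result takes priority
        if memwb_writes and memwb_rd == src_reg:
            return "MEM/WB"
    return "REG"

def load_use_stall(idex_memread, idex_rt, ifid_rs, ifid_rt):
    """Stall one cycle when the instruction in EX is a load whose
    destination is needed by the instruction currently decoding --
    forwarding alone cannot cover this case."""
    return idex_memread and idex_rt in (ifid_rs, ifid_rt)
```

The priority ordering matters: when both EX/MEM and MEM/WB would match, the EX/MEM value is the younger write and must win.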
Real-Time Ray Tracing on Cortex-M7
Tech Stack: C++, Arduino/Teensy
Ported a computationally intensive ray tracing engine to the Teensy 4.1 microcontroller (ARM Cortex-M7), demonstrating optimization techniques for resource-constrained embedded systems.
- Embedded Optimization: Optimized floating-point arithmetic and memory usage to run a full ray-tracing algorithm on a 600 MHz microcontroller with limited SRAM.
- Render Pipeline: Implemented sphere intersection logic, material diffusion, and camera rays within the constraints of an embedded runtime environment.
- Performance Tuning: Tuned calculation precision and recursion depth to balance image fidelity with render time, proving the viability of complex graphics algorithms on bare-metal hardware.
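At the heart of the render pipeline is the ray-sphere intersection test. A minimal sketch of the standard quadratic-discriminant form is shown below in Python for clarity (the project itself is C++ on the Teensy); the function name and tuple-based vectors are illustrative.

```python
import math

# Sketch of the core ray-sphere intersection test in its standard
# quadratic-discriminant ("half-b") form. On a Cortex-M7 this is
# the hot loop where precision and early-out tuning pay off.

def hit_sphere(center, radius, origin, direction):
    """Return the nearest positive ray parameter t, or None on a
    miss. center/origin/direction are (x, y, z) tuples."""
    oc = tuple(o - c for o, c in zip(origin, center))
    a = sum(d * d for d in direction)
    half_b = sum(o * d for o, d in zip(oc, direction))
    c = sum(o * o for o in oc) - radius * radius
    disc = half_b * half_b - a * c
    if disc < 0:
        return None               # ray misses the sphere entirely
    t = (-half_b - math.sqrt(disc)) / a
    return t if t > 0 else None   # ignore hits behind the camera
```

The early discriminant rejection is exactly the kind of branch worth keeping cheap on a microcontroller, since most rays miss most spheres.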
Quantitative Analysis of Cache Replacement Algorithms
Tech Stack: C++, ChampSim, Python
Conducted a rigorous performance analysis of memory hierarchy designs using trace-driven simulation. This project explored the architectural trade-offs between cache size, associativity, and eviction policies.
- Simulation Framework: Developed a custom C++ cache simulator compatible with ChampSim traces to model multi-level cache hierarchies.
- Algorithm Comparison: Implemented and compared Least Recently Used (LRU) and Least Frequently Used (LFU) replacement policies, analyzing their impact on hit and miss rates across diverse workload traces (10M+ instructions).
- Data Analysis: Quantified the relationship between associativity and Average Memory Access Time (AMAT), generating data-driven insights into optimal cache configurations for specific access patterns.
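To make the comparison concrete, here is a toy fully-associative LRU cache and the AMAT formula in Python. The real simulator models sets, associativity, and multi-level hierarchies over ChampSim traces; this stripped-down sketch (class and function names illustrative) shows only the bookkeeping being measured.

```python
from collections import OrderedDict

# Toy fully-associative cache with LRU eviction, plus the AMAT
# formula, in the spirit of the simulator described above.

class LRUCache:
    def __init__(self, num_lines, line_bytes=64):
        self.lines = OrderedDict()       # insertion order = LRU order
        self.capacity = num_lines
        self.line_bytes = line_bytes
        self.hits = self.misses = 0

    def access(self, addr):
        tag = addr // self.line_bytes    # one tag per cache line
        if tag in self.lines:
            self.hits += 1
            self.lines.move_to_end(tag)  # mark most recently used
        else:
            self.misses += 1
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict the LRU line
            self.lines[tag] = True

def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty
```

Swapping the eviction rule (e.g. tracking per-line frequencies for LFU) while keeping the same hit/miss accounting is what allows an apples-to-apples policy comparison over identical traces.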
Heterogeneous Computing & Custom Kernel Modules
Tech Stack: C, CUDA, Linux Kernel API
Developed a hybrid hardware-software solution on the NVIDIA Jetson Nano, interfacing user-space applications with custom kernel drivers and GPU acceleration.
- Kernel Development: Wrote custom Linux character device drivers to manage data transfer between user space and kernel space, exposing hardware resources via standard file operations (open/read/write/ioctl).
- CUDA Acceleration: Designed optimized CUDA kernels to offload intensive image processing tasks to the GPU, achieving significant speedup compared to CPU-only execution.
- System Integration: Orchestrated the full data pipeline: loading image data via kernel module, passing it to the GPU memory space, executing parallel kernels, and retrieving results for display.
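The user-space side of that pipeline follows the standard file-descriptor pattern. The sketch below shows the open → write → read round trip in Python; it is exercised here against a plain file standing in for the real `/dev` node (which only exists with the module loaded), so the `lseek` rewind is an artifact of that stand-in, and the function name and path handling are illustrative.

```python
import os

# User-space sketch of the driver interaction pattern: open the
# device node, write input data, read back the result. A plain file
# stands in for the real character device here, hence the rewind.

def roundtrip(dev_path, payload):
    """Write payload to the device and read back the response using
    the same descriptor-level calls a real driver client would."""
    fd = os.open(dev_path, os.O_RDWR)
    try:
        os.write(fd, payload)
        os.lseek(fd, 0, os.SEEK_SET)  # plain files need a rewind;
                                      # a char device would not
        return os.read(fd, len(payload))
    finally:
        os.close(fd)
```

With the real module loaded, the same calls map onto the driver's registered `open`/`read`/`write` file operations, with `ioctl` (via Python's `fcntl.ioctl`) covering control commands that don't fit the byte-stream model.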