Technical Project
Deep Dive

A detailed breakdown of my work in Hardware Architecture, RTL Design, and Embedded Firmware.

FPGA-Based CNN Accelerator & Memory Hierarchy

Tech Stack: VHDL, Python, Vivado, ModelSim

Designed a fully pipelined hardware accelerator for 3×3 RGB convolution operations, targeting high-throughput image processing applications on FPGA fabric. The architecture prioritizes memory bandwidth efficiency through a custom buffering strategy.

  • Pipelined Datapath: Implemented a high-performance datapath achieving one output pixel per clock cycle (after initial pipeline fill latency), significantly outperforming sequential CPU execution for identical workloads.
  • Custom Memory Architecture: Architected a "Sliding Window" buffer and packed RGB memory format. This design minimizes off-chip memory access overhead by reusing pixel data across convolution windows, reducing bandwidth requirements and simplifying address generation logic.
  • Verification Strategy: Validated cycle-accurate behavior using a Python reference model to generate golden vectors. Developed a comprehensive VHDL testbench with valid-bit tracking to verify data integrity across all pipeline stages.
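To illustrate the verification flow, here is a minimal sketch of the kind of Python golden model used to generate reference vectors. The kernel and image values below are illustrative placeholders, not the project's actual test data, and a single channel is shown (the RGB version applies the same window per channel).

```python
# Golden-model sketch: valid-mode 3x3 convolution over one channel.
# Values are illustrative, not the project's actual test vectors.

def conv3x3(image, kernel):
    """Apply a 3x3 kernel to every interior position of `image`."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - 2):
        row = []
        for x in range(w - 2):
            acc = 0
            for ky in range(3):          # accumulate over the window
                for kx in range(3):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            row.append(acc)
        out.append(row)
    return out

identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # passes center pixel through
img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
golden = conv3x3(img, identity)  # -> [[6, 7], [10, 11]]
```

Outputs from a model like this are compared cycle-by-cycle against the DUT's valid-qualified results in the VHDL testbench.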

32-bit MIPS Microarchitecture Design

Tech Stack: VHDL, Assembly, Quartus

Designed and implemented a soft-core 32-bit RISC processor based on the MIPS instruction set architecture. The design focuses on pipeline efficiency and hazard resolution without software intervention.

  • 5-Stage Pipeline: Implemented a classic Fetch, Decode, Execute, Memory, Write-Back pipeline structure to maximize instruction throughput.
  • Hazard Resolution: Engineered hardware-based hazard mitigation, including a Forwarding Unit to resolve read-after-write (RAW) data dependencies and a Hazard Detection Unit to insert stall cycles only when necessary (e.g., load-use hazards).
  • Subsystem Integration: Built and integrated word-aligned Instruction and Data Memory modules, a general-purpose Register File, and a custom ALU supporting arithmetic, logical, and branch operations.
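The hazard-handling logic can be summarized behaviorally. The sketch below uses the textbook signal names for the forwarding and load-use conditions, not the project's actual VHDL port names:

```python
# Behavioral sketch of classic MIPS hazard logic (textbook signal
# names, not the project's actual VHDL ports).

def forward_select(src_reg, ex_mem_regwrite, ex_mem_rd,
                   mem_wb_regwrite, mem_wb_rd):
    """Choose the ALU operand source for one input.
    Returns "EX_MEM", "MEM_WB", or "REG" (no forwarding needed)."""
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX_MEM"   # most recent producer wins
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM_WB"
    return "REG"

def load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """Stall one cycle when a load's destination feeds the next
    instruction; forwarding alone cannot cover this case."""
    return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)

# add $3,$1,$2 followed by sub $4,$3,$1: $3 forwards from EX/MEM
assert forward_select(3, True, 3, False, 0) == "EX_MEM"
```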

Real-Time Ray Tracing on Cortex-M7

Tech Stack: C++, Arduino/Teensy

Ported a computationally intensive ray tracing engine to the Teensy 4.1 microcontroller (ARM Cortex-M7), demonstrating optimization techniques for resource-constrained embedded systems.

  • Embedded Optimization: Optimized floating-point arithmetic and memory usage to run a full ray-tracing algorithm on a 600 MHz microcontroller with limited SRAM.
  • Render Pipeline: Implemented sphere-intersection logic, diffuse material scattering, and camera-ray generation within the constraints of an embedded runtime environment.
  • Performance Tuning: Tuned calculation precision and recursion depth to balance image fidelity with render time, proving the viability of complex graphics algorithms on bare-metal hardware.
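The core of the intersection logic is the standard ray-sphere quadratic. A dependency-free Python sketch of that test (the engine itself is C++; vectors here are plain tuples for illustration):

```python
import math

# Standard ray-sphere intersection: solve |o + t*d - c|^2 = r^2
# for the nearest positive t. Plain (x, y, z) tuples keep the
# sketch self-contained; the actual engine uses C++ vector types.
def hit_sphere(origin, direction, center, radius):
    oc = tuple(o - c for o, c in zip(origin, center))
    a = sum(d * d for d in direction)
    b = 2.0 * sum(o * d for o, d in zip(oc, direction))
    c = sum(o * o for o in oc) - radius * radius
    disc = b * b - 4 * a * c
    if disc < 0:
        return None                       # ray misses the sphere
    t = (-b - math.sqrt(disc)) / (2 * a)  # try the nearer root first
    if t > 1e-6:
        return t
    t = (-b + math.sqrt(disc)) / (2 * a)
    return t if t > 1e-6 else None

# A ray along +z from the origin hits a unit sphere at z=3 at t=2:
# hit_sphere((0, 0, 0), (0, 0, 1), (0, 0, 3), 1.0) -> 2.0
```

On the Cortex-M7, keeping this inner loop in single-precision floats lets it use the FPU directly, which is one reason precision tuning mattered for render time.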

Quantitative Analysis of Cache Replacement Algorithms

Tech Stack: C++, ChampSim, Python

Conducted a rigorous performance analysis of memory hierarchy designs using trace-driven simulation. This project explored the architectural trade-offs between cache size, associativity, and eviction policies.

  • Simulation Framework: Developed a custom C++ cache simulator compatible with ChampSim traces to model multi-level cache hierarchies.
  • Algorithm Comparison: Implemented and compared Least Recently Used (LRU) and Least Frequently Used (LFU) replacement policies. Analyzed their impact on hit/miss rates across diverse workload traces (10M+ instructions).
  • Data Analysis: Quantified the relationship between associativity and Average Memory Access Time (AMAT), generating data-driven insights into optimal cache configurations for specific access patterns.
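A toy version of the measurement loop conveys the idea. The trace, cache size, and latencies below are illustrative; the actual simulator modeled set-associative multi-level hierarchies driven by ChampSim traces:

```python
from collections import OrderedDict

# Toy fully-associative LRU cache: count hits on an address trace.
# Trace and capacity are illustrative; the real simulator modeled
# set-associative multi-level hierarchies over ChampSim traces.
def lru_hits(trace, capacity):
    cache, hits = OrderedDict(), 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark most recently used
        else:
            if len(cache) == capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[addr] = True
    return hits

def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = hit time + miss rate * penalty."""
    return hit_time + miss_rate * miss_penalty

trace = [1, 2, 3, 1, 2, 4, 1, 2]
h = lru_hits(trace, 3)        # 4 hits: 1, 2 stay resident; 3 is evicted
example_amat = amat(1, (len(trace) - h) / len(trace), 10)
```

An LFU variant replaces the recency ordering with per-line access counters and evicts the lowest-frequency line; sweeping capacity and associativity while recomputing AMAT gives the trade-off curves the project analyzed.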

Heterogeneous Computing & Custom Kernel Modules

Tech Stack: C, CUDA, Linux Kernel API

Developed a hybrid hardware-software solution on the NVIDIA Jetson Nano, interfacing user-space applications with custom kernel drivers and GPU acceleration.

  • Kernel Development: Wrote custom Linux character device drivers to manage data transfer between user space and kernel space, exposing hardware resources via standard file operations (open/read/write/ioctl).
  • CUDA Acceleration: Designed optimized CUDA kernels to offload intensive image processing tasks to the GPU, achieving significant speedup compared to CPU-only execution.
  • System Integration: Orchestrated the full data pipeline: loading image data via kernel module, passing it to the GPU memory space, executing parallel kernels, and retrieving results for display.
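The user-space side of this pipeline follows the standard open/write/read pattern against the character device node. In the sketch below the device path is hypothetical, and a temporary regular file stands in for the device node so the example runs without the driver loaded (hence the seek, which a real character device would not need):

```python
import os
import tempfile

# User-space access pattern for a character device: open the node,
# write a frame in, read the processed result back. A real node
# would live under /dev (path hypothetical); a temp file stands in
# here so the sketch runs without the driver loaded.
def roundtrip(dev_path, frame):
    fd = os.open(dev_path, os.O_RDWR)   # same call against a real /dev node
    try:
        os.write(fd, frame)             # hand the frame to the driver
        os.lseek(fd, 0, os.SEEK_SET)    # only needed for the file stand-in
        return os.read(fd, len(frame))  # read back the processed data
    finally:
        os.close(fd)

tmp_fd, path = tempfile.mkstemp()
os.close(tmp_fd)
echoed = roundtrip(path, b"\x00\x01\x02\x03")  # file stand-in echoes input
os.remove(path)
```

In the actual system, the driver's read/write handlers copy buffers across the user/kernel boundary, and ioctl calls configure the transfer before the CUDA kernels consume the data.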