Technical Project
Deep Dive
A detailed breakdown of my work in Hardware Architecture, RTL Design, and Embedded Firmware.
FPGA-Based CNN Accelerator & Memory Hierarchy
Tech Stack: VHDL, Python, Vivado, ModelSim
Designed a fully pipelined hardware accelerator for 3×3 RGB convolution operations, targeting high-throughput image processing applications on FPGA fabric. The architecture prioritizes memory bandwidth efficiency through a custom buffering strategy.
- Pipelined Datapath: Implemented a high-performance datapath achieving one output pixel per clock cycle (after initial pipeline fill latency), significantly outperforming sequential CPU execution for identical workloads.
- Custom Memory Architecture: Architected a "Sliding Window" buffer and packed RGB memory format. This design minimizes off-chip memory access overhead by reusing pixel data across convolution windows, reducing bandwidth requirements and simplifying address generation logic.
- Verification Strategy: Validated cycle-accurate behavior using a Python reference model to generate golden vectors. Developed a comprehensive VHDL testbench with valid-bit tracking to verify data integrity across all pipeline stages.
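The golden-vector approach above can be sketched in miniature. Below is a minimal Python reference model for a single-channel 3×3 "valid" convolution, the kind of function used to generate expected outputs for the VHDL testbench; the names (`conv3x3`, `IDENTITY`) are illustrative, not from the project source.

```python
# Minimal golden-model sketch: 3x3 "valid" convolution over a
# single channel, usable to generate expected vectors for an RTL
# testbench. Names here are illustrative.

def conv3x3(image, kernel):
    """image: list of rows of ints; kernel: 3x3 list of weights.
    Returns one output per position where the full 3x3 window fits
    (no padding), matching a pipeline that emits a pixel per cycle
    once the window buffer is primed."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - 2):
        row = []
        for x in range(w - 2):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            row.append(acc)
        out.append(row)
    return out

# Identity kernel: output equals the centre pixel of each window,
# a handy first sanity check for the hardware pipeline.
IDENTITY = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
```

Comparing the RTL's output stream against vectors from a model like this, cycle by cycle, is what makes the valid-bit tracking in the testbench meaningful.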
32-bit MIPS Microarchitecture Design
Tech Stack: VHDL, Assembly, Quartus
Designed and implemented a soft-core 32-bit RISC processor based on the MIPS instruction set architecture. The design focuses on pipeline efficiency and hazard resolution without software intervention.
- 5-Stage Pipeline: Implemented a classic Fetch, Decode, Execute, Memory, Write-Back pipeline structure to maximize instruction throughput.
- Hazard Resolution: Engineered hardware-based hazard mitigation, including a Forwarding Unit to resolve data dependencies (Read-After-Write) and a Hazard Detection Unit to insert stall cycles only when necessary (e.g., Load-Use hazards).
- Subsystem Integration: Built and integrated word-aligned Instruction and Data Memory modules, a general-purpose Register File, and a custom ALU supporting arithmetic, logical, and branch operations.
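The forwarding and stall rules above follow the classic MIPS scheme, which can be modelled compactly in software. The sketch below (function names illustrative, not from the project) encodes the two decisions: which pipeline register the EX-stage operand mux should draw from, and when a load-use hazard forces a one-cycle stall.

```python
# Software model of classic MIPS hazard logic: EX-stage forwarding
# mux selection plus load-use stall detection. Names illustrative.

def forward_select(src_reg, exmem_writes, exmem_rd, memwb_writes, memwb_rd):
    """Return which source the EX-stage operand mux picks:
    'EX/MEM' (newest in-flight result), 'MEM/WB', or 'REG'
    (register file). Register 0 ($zero) is never forwarded."""
    if src_reg != 0:
        if exmem_writes and exmem_rd == src_reg:
            return "EX/MEM"   # most recent result takes priority
        if memwb_writes and memwb_rd == src_reg:
            return "MEM/WB"
    return "REG"

def load_use_stall(idex_memread, idex_rt, ifid_rs, ifid_rt):
    """Stall one cycle when the instruction in EX is a load whose
    destination is needed by the instruction currently decoding --
    forwarding alone cannot cover this case."""
    return idex_memread and idex_rt in (ifid_rs, ifid_rt)
```

The priority ordering matters: when both EX/MEM and MEM/WB would match, the EX/MEM value is the younger write and must win.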
Real-Time Ray Tracing on Cortex-M7
Tech Stack: C++, Arduino/Teensy
Ported a computationally intensive ray tracing engine to the Teensy 4.1 microcontroller (ARM Cortex-M7), demonstrating optimization techniques for resource-constrained embedded systems.
- Embedded Optimization: Optimized floating-point arithmetic and memory usage to run a full ray-tracing algorithm on a 600 MHz microcontroller with limited SRAM.
- Render Pipeline: Implemented sphere intersection logic, material diffusion, and camera rays within the constraints of an embedded runtime environment.
- Performance Tuning: Tuned calculation precision and recursion depth to balance image fidelity with render time, proving the viability of complex graphics algorithms on bare-metal hardware.
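At the heart of the render pipeline is the ray-sphere intersection test. A minimal sketch of the standard quadratic-discriminant form is shown below in Python for clarity (the project itself is C++ on the Teensy); the function name and tuple-based vectors are illustrative.

```python
import math

# Sketch of the core ray-sphere intersection test in its standard
# quadratic-discriminant ("half-b") form. On a Cortex-M7 this is
# the hot loop where precision and early-out tuning pay off.

def hit_sphere(center, radius, origin, direction):
    """Return the nearest positive ray parameter t, or None on a
    miss. center/origin/direction are (x, y, z) tuples."""
    oc = tuple(o - c for o, c in zip(origin, center))
    a = sum(d * d for d in direction)
    half_b = sum(o * d for o, d in zip(oc, direction))
    c = sum(o * o for o in oc) - radius * radius
    disc = half_b * half_b - a * c
    if disc < 0:
        return None               # ray misses the sphere entirely
    t = (-half_b - math.sqrt(disc)) / a
    return t if t > 0 else None   # ignore hits behind the camera
```

The early discriminant rejection is exactly the kind of branch worth keeping cheap on a microcontroller, since most rays miss most spheres.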
Quantitative Analysis of Cache Replacement Algorithms
Tech Stack: C++, ChampSim, Python
Conducted a rigorous performance analysis of memory hierarchy designs using trace-driven simulation. This project explored the architectural trade-offs between cache size, associativity, and eviction policies.
- Simulation Framework: Developed a custom C++ cache simulator compatible with ChampSim traces to model multi-level cache hierarchies.
- Algorithm Comparison: Implemented and compared Least Recently Used (LRU) and Least Frequently Used (LFU) replacement policies, analyzing their impact on hit and miss rates across diverse workload traces (10M+ instructions).
- Data Analysis: Quantified the relationship between associativity and Average Memory Access Time (AMAT), generating data-driven insights into optimal cache configurations for specific access patterns.
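To make the comparison concrete, here is a toy fully-associative LRU cache and the AMAT formula in Python. The real simulator models sets, associativity, and multi-level hierarchies over ChampSim traces; this stripped-down sketch (class and function names illustrative) shows only the bookkeeping being measured.

```python
from collections import OrderedDict

# Toy fully-associative cache with LRU eviction, plus the AMAT
# formula, in the spirit of the simulator described above.

class LRUCache:
    def __init__(self, num_lines, line_bytes=64):
        self.lines = OrderedDict()       # insertion order = LRU order
        self.capacity = num_lines
        self.line_bytes = line_bytes
        self.hits = self.misses = 0

    def access(self, addr):
        tag = addr // self.line_bytes    # one tag per cache line
        if tag in self.lines:
            self.hits += 1
            self.lines.move_to_end(tag)  # mark most recently used
        else:
            self.misses += 1
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict the LRU line
            self.lines[tag] = True

def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty
```

Swapping the eviction rule (e.g. tracking per-line frequencies for LFU) while keeping the same hit/miss accounting is what allows an apples-to-apples policy comparison over identical traces.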
Heterogeneous Computing & Custom Kernel Modules
Tech Stack: C, CUDA, Linux Kernel API
Developed a hybrid hardware-software solution on the NVIDIA Jetson Nano, interfacing user-space applications with custom kernel drivers and GPU acceleration.
- Kernel Development: Wrote custom Linux character device drivers to manage data transfer between user space and kernel space, exposing hardware resources via standard file operations (open/read/write/ioctl).
- CUDA Acceleration: Designed optimized CUDA kernels to offload intensive image processing tasks to the GPU, achieving significant speedup compared to CPU-only execution.
- System Integration: Orchestrated the full data pipeline: loading image data via kernel module, passing it to the GPU memory space, executing parallel kernels, and retrieving results for display.
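The user-space side of that pipeline follows the standard file-descriptor pattern. The sketch below shows the open → write → read round trip in Python; it is exercised here against a plain file standing in for the real `/dev` node (which only exists with the module loaded), so the `lseek` rewind is an artifact of that stand-in, and the function name and path handling are illustrative.

```python
import os

# User-space sketch of the driver interaction pattern: open the
# device node, write input data, read back the result. A plain file
# stands in for the real character device here, hence the rewind.

def roundtrip(dev_path, payload):
    """Write payload to the device and read back the response using
    the same descriptor-level calls a real driver client would."""
    fd = os.open(dev_path, os.O_RDWR)
    try:
        os.write(fd, payload)
        os.lseek(fd, 0, os.SEEK_SET)  # plain files need a rewind;
                                      # a char device would not
        return os.read(fd, len(payload))
    finally:
        os.close(fd)
```

With the real module loaded, the same calls map onto the driver's registered `open`/`read`/`write` file operations, with `ioctl` (via Python's `fcntl.ioctl`) covering control commands that don't fit the byte-stream model.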