Designed a high-throughput hardware accelerator for Convolutional Neural Networks on an FPGA.
Architected a custom sliding-window memory buffer to minimize off-chip bandwidth and implemented a
fully pipelined RGB datapath in VHDL, sustaining a throughput of one pixel per clock cycle.
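
The sliding-window idea can be illustrated with a small software model. This is a Python sketch, not the VHDL design itself: the 3x3 default window size and the names `sliding_windows`, `width`, and `k` are assumptions made for illustration. Each pixel is read from the input stream exactly once, and only the most recent `(k - 1) * width + k` pixels are kept on hand, which is what keeps off-chip traffic low.

```python
from collections import deque


def sliding_windows(pixels, width, k=3):
    """Yield (row, col, window) for every full k x k window over a
    raster-order pixel stream of the given image width.

    row/col are the coordinates of the window's top-left pixel.
    """
    # Hypothetical buffer depth: (k - 1) full image rows plus k pixels,
    # the minimum history needed to form one k x k window.
    depth = (k - 1) * width + k
    buf = deque(maxlen=depth)

    for i, p in enumerate(pixels):
        buf.append(p)  # one new pixel per "cycle"; oldest pixel drops out
        row, col = divmod(i, width)
        # A full window exists only once k rows and k columns have arrived.
        if row >= k - 1 and col >= k - 1:
            # buf[0] is now the top-left pixel of the window.
            window = [[buf[r * width + c] for c in range(k)] for r in range(k)]
            yield row - (k - 1), col - (k - 1), window


if __name__ == "__main__":
    image = list(range(16))  # 4 x 4 test image, values 0..15
    for r, c, w in sliding_windows(image, width=4):
        print(r, c, w)
```

In hardware the same structure would typically map to line buffers plus a small register window, so one new window becomes available per incoming pixel once the buffer has filled.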
Designed the microarchitecture and RTL for a 5-stage pipelined MIPS processor.
Implemented hardware hazard detection and a data forwarding unit to resolve
control and data hazards without software-inserted NOPs, minimizing pipeline stalls.
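
For reference, the forwarding and load-use checks in a classic 5-stage MIPS pipeline can be modeled as below. This is a Python sketch of the standard textbook conditions, not the project's RTL; the function names and the 2-bit mux encoding (`0b10` for EX/MEM, `0b01` for MEM/WB) are assumptions.

```python
def forward_select(src_reg, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Return the ALU operand mux select for one source register.

    0b10: forward from the EX/MEM pipeline register (most recent result)
    0b01: forward from the MEM/WB pipeline register
    0b00: no forwarding, use the value read from the register file
    """
    # EX hazard: the instruction one ahead writes the register we need.
    # Register 0 is hard-wired to zero in MIPS and is never forwarded.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return 0b10
    # MEM hazard: checked second so the newer EX/MEM value takes priority.
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return 0b01
    return 0b00


def load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """Return True when the pipeline must stall for one cycle.

    A load in EX cannot forward its data in time to the instruction
    immediately behind it, so the hazard unit inserts a hardware bubble.
    """
    return bool(id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt))


if __name__ == "__main__":
    # add $1,... followed by an instruction reading $1: forward from EX/MEM.
    print(forward_select(1, True, 1, False, 0))
    # lw $2,... followed by an instruction reading $2: stall one cycle.
    print(load_use_stall(True, 2, 2, 5))
```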
Ported a ray-tracing engine to a Teensy 4.1 microcontroller (600 MHz) and optimized it for the target.
Worked within severe memory constraints and floating-point limitations to render 3D scenes
in real time, demonstrating efficient use of limited firmware resources.
Ran trace-driven simulations in ChampSim to analyze the performance trade-offs of
LRU vs. LFU cache replacement policies. Quantified the impact on hit rates and Average Memory
Access Time (AMAT) across 10 million instruction cycles to inform architectural design decisions.
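
The kind of comparison described above can be sketched in a few lines of Python. This is a toy model, not ChampSim: it assumes a single fully associative cache, a tiny synthetic trace, and hypothetical latencies (1-cycle hit, 100-cycle miss penalty). AMAT is computed with the standard formula, hit time + miss rate x miss penalty.

```python
from collections import OrderedDict


def simulate_lru(trace, capacity):
    """Return the hit rate of a fully associative LRU cache."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[addr] = True
    return hits / len(trace)


def simulate_lfu(trace, capacity):
    """Return the hit rate of a fully associative LFU cache.

    Access counts are kept only while a block is resident; ties are
    broken in favor of evicting the block that was inserted earliest.
    """
    counts = {}  # block address -> access count since insertion
    hits = 0
    for addr in trace:
        if addr in counts:
            hits += 1
            counts[addr] += 1
        else:
            if len(counts) >= capacity:
                victim = min(counts, key=counts.get)  # least frequently used
                del counts[victim]
            counts[addr] = 1
    return hits / len(trace)


def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time in the same units as the inputs."""
    return hit_time + miss_rate * miss_penalty


if __name__ == "__main__":
    trace = [1, 1, 1, 2, 3, 2, 3, 1]  # synthetic block addresses
    for name, sim in (("LRU", simulate_lru), ("LFU", simulate_lfu)):
        hit_rate = sim(trace, capacity=2)
        print(name, hit_rate, amat(1, 1 - hit_rate, 100))
```

On this trace LRU keeps the alternating blocks 2 and 3 resident, while LFU protects the frequently used block 1, so the two policies produce different hit rates and therefore different AMAT values.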
Developed custom Linux kernel modules for the NVIDIA Jetson Nano to interface with
userspace applications. Implemented a heterogeneous computing pipeline that offloads
image processing tasks to the GPU via CUDA while managing data transfer through character device drivers.