HiDe Ceramics 2019

Alec Lu
Ph.D. Student, Heterogeneous Computing System Researcher, Potter

Discovery II Building
8888 University Dr.
Simon Fraser University
Burnaby, BC V5A 1S6

Email: alec_lu [at] sfu.ca


I am a Ph.D. student in Computer Engineering at Simon Fraser University, co-advised by Prof. Zhenman Fang and Prof. Lesley Shannon. My research interests include FPGA-based custom accelerator design, big data analytics acceleration, and heterogeneous computing. I received my B.A.Sc. from Simon Fraser University in 2018. During my undergrad, I interned as a software developer at the Canadian Nuclear Laboratories (previously known as Atomic Energy of Canada Limited) and as an SoC designer at Intel. During grad school, I have interned at Meta as an ASIC designer for their AR image signal processor.

Aside from engineering, I am also a potter and a cook. I thoroughly enjoy the process of making and the full immersion it brings, much like being in the 'zone' when coding. I have worked as a cook at several local restaurants. Nowadays, I occasionally teach ceramics classes at HiDe Ceramic Works, owned by pottery master HiDe Ebina. Some of my creations are posted here.

You can find my CV here, or view my Google Scholar profile.


What's New

Mar 2023
"SQL2FPGA: Automatic Acceleration of SQL Query Processing on Modern CPU-FPGA Platforms" was accepted by FCCM'23!
Nov 2022
"ESRU: Extremely Low-Bit and Hardware-Efficient Stochastic Rounding Unit Design for 8-Bit DNN Training" was accepted by DATE'23!
Nov 2022
"SASA: A Scalable and Automatic Stencil Acceleration Framework for Optimized Hybrid Spatial and Temporal Parallelism on HBM-based FPGAs" was accepted by TRETS'22!
Oct 2022
"HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers" was accepted by HPCA'23!
Jul 2022
"You Already Have It: A Generator-Free Low-Precision DNN Training Framework using Stochastic Rounding" was accepted by ECCV'22!
Jun 2022
"Auto-ViT-Acc: FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization" was accepted by FPL'22!
Feb 2022
"Demystifying the Soft and Hardened Memory Systems of Modern FPGAs for Software Programmers through Microbenchmarking" was accepted by TRETS'22!
Nov 2021
"FILM-QNN: Efficient FPGA Acceleration of Deep Neural Networks with Intra-Layer, Mixed-Precision Quantization" was accepted by FPGA'22!
Nov 2021
"Quick-Div: Rethinking Integer Divider Design for FPGA-based Soft-Processors" was accepted by TRETS'22!
Nov 2020
"Demystifying the Memory System of Modern Datacenter FPGAs for Software Programmers through Microbenchmarking" was accepted by FPGA'21!
October 2020
"CHIP-KNN: A Configurable and High-Performance K-Nearest Neighbors Accelerator on Cloud FPGAs" was accepted by FPT'20!
July 2020
I received the Graduate Fellowship Award from Simon Fraser University.
January 2020
I transferred from the Master's program to the Ph.D. program in Computer Engineering.
October 2019
I received the Graduate Fellowship Award from Simon Fraser University.
February 2019
"Rethinking Integer Divider Design for FPGA-based Soft-Processors" was accepted by FCCM'19!
September 2018
I started my Master's degree in Computer Engineering at SFU, co-advised by Prof. Zhenman Fang and Prof. Lesley Shannon. I joined the HiAccel Lab directed by Prof. Zhenman Fang.
September 2018
I graduated from Simon Fraser University, and received my B.A.Sc. degree in Systems Engineering.
May 2018
I received the Undergraduate Research Award, and joined the Reconfigurable Computing Lab at Simon Fraser University!

Publications

Conference Publications (Full Papers)

C9

SQL2FPGA: Automatic Acceleration of SQL Query Processing on Modern CPU-FPGA Platforms FCCM'23

Alec Lu and Zhenman Fang.
The 31st IEEE International Symposium On Field-Programmable Custom Computing Machines (FCCM 2023), Marina Del Rey, CA, May 2023.
Acceptance Rate: 21.4% (15 out of 70).

Today’s big data query engines are constantly under pressure to keep up with the rapidly increasing demand for faster processing of more complex workloads. In the past few years, FPGA-based database acceleration efforts have demonstrated promising performance improvement with good energy efficiency. However, few studies target the programming and design automation support to leverage the FPGA accelerator benefits in query processing. Most of them rely on the SQL query plan generated by CPU query engines, and manually map the query plan onto the FPGA accelerators, which is tedious and error-prone. Moreover, such CPU-oriented query plans do not consider the utilization of FPGA accelerators and could lose more optimization opportunities. In this paper, we present SQL2FPGA, an FPGA accelerator-aware compiler to automatically map SQL queries onto the heterogeneous CPU-FPGA platforms. Our SQL2FPGA front-end takes an optimized logical plan of a SQL query from a database query engine, and transforms it into a unified operator-level intermediate representation. To generate an optimized FPGA-aware physical plan, SQL2FPGA implements a set of compiler optimization passes to 1) improve operator acceleration coverage by the FPGA, 2) eliminate redundant computation during physical execution, and 3) minimize data transfer overhead between operators on the CPU and FPGA. Finally, SQL2FPGA generates the associated query acceleration code that can be deployed on the heterogeneous CPU-FPGA system. Compared to the widely used Apache Spark SQL framework running on the CPU, SQL2FPGA—using two AMD/Xilinx HBM-based Alveo U280 FPGA boards—achieves an average performance speedup of 10.1x and 13.9x across all 22 TPC-H benchmark queries in a scale factor of 1GB and 30GB, respectively.
@inproceedings{lu23sql2fpga, title={SQL2FPGA: Automatic Acceleration of SQL Query Processing on Modern CPU-FPGA Platforms}, author={Alec Lu and Zhenman Fang}, year={2023}, booktitle = {The 31st IEEE International Symposium On Field-Programmable Custom Computing Machines}, series = {FCCM'23}, location = {Marina Del Rey, CA}, numpages = {11}, pages = {} }
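One of the optimization passes above decides which plan operators run on the FPGA while keeping CPU-FPGA data movement low. The toy sketch below is my own illustration of that placement idea, not SQL2FPGA's actual code; the operator names and the supported-operator set are hypothetical.

```python
# Toy sketch (not SQL2FPGA's actual code): place each operator of a
# logical query plan on CPU or FPGA, counting the data transfers
# incurred whenever execution crosses the CPU/FPGA boundary.

FPGA_SUPPORTED = {"filter", "project", "hash_join", "aggregate"}  # hypothetical coverage

def place_operators(plan):
    """Greedy placement: run an operator on the FPGA when it is supported.
    Returns (placement, num_transfers), where one transfer is counted
    each time consecutive operators land on different devices."""
    placement = ["FPGA" if op in FPGA_SUPPORTED else "CPU" for op in plan]
    transfers = sum(1 for a, b in zip(placement, placement[1:]) if a != b)
    return placement, transfers

plan = ["scan", "filter", "hash_join", "sort", "aggregate"]
placement, transfers = place_operators(plan)
print(placement)  # ['CPU', 'FPGA', 'FPGA', 'CPU', 'FPGA']
print(transfers)  # 3
```

A real compiler pass would weigh each boundary crossing by the size of the intermediate result, but the counting above captures why operator coverage and transfer minimization interact.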
C8

ESRU: Extremely Low-Bit and Hardware-Efficient Stochastic Rounding Unit Design for 8-Bit DNN Training DATE'23

Sung-En Chang, Geng Yuan, Alec Lu, Mengshu Sun, Yanyu Li, Xiaolong Ma, Zhengang Li, YanyueXie, Minghai Qin, Xue Lin, Zhenman Fang, and Yanzhi Wang.
Appeared in Design, Automation and Test in Europe Conference (DATE 2023), May 2023.
Acceptance Rate: 25%.

C7

HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers HPCA'23

Peiyan Dong, Mengshu Sun, Alec Lu, Yanyue Xie, Kenneth Liu, Zhenglun Kong, Xin Meng, Zhengang Li, Xue Lin, Zhenman Fang, and Yanzhi Wang.
To appear in the 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA 2023), Montreal, QC, Canada, Feb-Mar 2023.

C6

You Already Have It: A Generator-Free Low-Precision DNN Training Framework using Stochastic Rounding ECCV'22

Geng Yuan, Sung-En Chang, Qing Jin, Alec Lu, Yanyu Li, Yushu Wu, Zhenglun Kong, Yanyue Xie, Peiyan Dong, Minghai Qin, Xiaolong Ma, Xulong Tang, Zhenman Fang, and Yanzhi Wang.
To appear in the European Conference on Computer Vision (ECCV 2022), Tel-Aviv, Israel, Oct 2022, acceptance rate: TBD

C5

Auto-ViT-Acc: FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization FPL'22

Zhengang Li, Mengshu Sun, Alec Lu, Haoyu Ma, Geng Yuan, Yanyue Xie, Hao Tang, Yanyu Li, Miriam Leeser, Zhangyang Wang, Xue Lin, Zhenman Fang.
To appear in the 32nd International Conference on Field-Programmable Logic and Applications (FPL 2022), acceptance rate: TBD

C4

FILM-QNN: Efficient FPGA Acceleration of Deep Neural Networks with Intra-Layer, Mixed-Precision Quantization FPGA'22

Mengshu Sun, Zhengang Li, Alec Lu, Yanyu Li, Sung-En Chang, Xiaolong Ma, Xue Lin, and Zhenman Fang.
The 32nd ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA 2022), Virtual Conference, Feb/Mar 2022, pp. 134–145, acceptance rate: 15/72 = 20.8%

With the trend to deploy Deep Neural Network (DNN) inference models on edge devices with limited resources, quantization techniques have been widely used to reduce on-chip storage and improve computation throughput. However, existing DNN quantization work that deploys quantization below 8 bits may either suffer from evident accuracy loss or face a big gap between the theoretical improvement of computation throughput and the practical inference speedup. In this work, we propose a general framework, called FILM-QNN, to quantize and accelerate multiple DNN models across different embedded FPGA devices. First, we propose the novel intra-layer, mixed-precision quantization algorithm that assigns different precisions onto the filters of each layer. The candidate precision levels and assignment granularity are determined from our empirical study with the capability of preserving accuracy and improving hardware parallelism. Second, we apply multiple optimization techniques for the FPGA accelerator architecture in support of quantized computations, including DSP packing, weight reordering, and data packing, to enhance the overall throughput with the available resources. Moreover, a comprehensive resource model is developed to balance the allocation of FPGA computation resources (LUTs and DSPs) as well as data transfer and on-chip storage resources (BRAMs) to accelerate the computations in mixed precisions within each layer. Finally, to improve the portability of FILM-QNN, we implement it using Vivado High-Level Synthesis (HLS) on Xilinx PYNQ-Z2 and ZCU102 FPGA boards.
Our experimental results of ResNet-18, ResNet-50, and MobileNet-V2 demonstrate that the implementations with intra-layer, mixed-precision (95% of 4-bit weights and 5% of 8-bit weights, and all 5-bit activations) can achieve comparable accuracy (70.47%, 77.25%, and 65.67% for the three models) as the 8-bit (and 32-bit) versions and comparable throughput (214.8 FPS, 109.1 FPS, and 537.9 FPS on ZCU102) as the 4-bit designs.
@inproceedings{sunFilmQNN, title = {FILM-QNN: Efficient FPGA Acceleration of Deep Neural Networks with Intra-Layer, Mixed-Precision Quantization}, author = {Sun, Mengshu and Li, Zhengang and Lu, Alec and Li, Yanyu and Chang, Sung-En and Ma, Xiaolong and Lin, Xue and Fang, Zhenman}, year = {2022}, booktitle = {Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays}, location = {Virtual Event, USA}, series = {FPGA '22}, numpages = {12}, pages = {134–145} }
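The intra-layer split described above (most filters at 4-bit, a few at 8-bit) can be pictured with a small sketch. This is illustrative only, not the FILM-QNN algorithm: the sensitivity metric here (per-filter weight range) is a stand-in assumption for whatever criterion the paper's empirical study uses.

```python
# Illustrative sketch (not FILM-QNN itself): give 8-bit precision to the
# filters with the widest weight ranges in a layer and 4-bit to the rest,
# mirroring the intra-layer mixed-precision split the paper describes.

def assign_precisions(filters, high_bits=8, low_bits=4, high_frac=0.05):
    """filters: list of weight lists, one per output filter.
    Returns a per-filter bit-width list."""
    # Rank filters by dynamic range; wide-range filters get more bits
    # (an assumed sensitivity proxy, not the paper's actual metric).
    ranges = [max(w) - min(w) for w in filters]
    n_high = max(1, round(high_frac * len(filters)))
    order = sorted(range(len(filters)), key=lambda i: ranges[i], reverse=True)
    high_idx = set(order[:n_high])
    return [high_bits if i in high_idx else low_bits for i in range(len(filters))]

filters = [[-0.1, 0.1], [-0.9, 1.2], [-0.2, 0.15], [0.0, 0.05]]
print(assign_precisions(filters))  # [4, 8, 4, 4]
```

Keeping the assignment at filter granularity is what lets the hardware pack same-precision filters together for parallel execution.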
C3

Demystifying the Memory System of Modern Datacenter FPGAs for Software Programmers through Microbenchmarking FPGA'21

Alec Lu, Zhenman Fang, Weihua Liu, Lesley Shannon
29th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA 2021), acceptance rate: 22/111 = 19.8%

With the public availability of FPGAs from major cloud service providers like AWS, Alibaba, and Nimbix, hardware and software developers can now easily access FPGA platforms. However, it is nontrivial to develop efficient FPGA accelerators, especially for software programmers who use high-level synthesis (HLS). The major goal of this paper is to figure out how to efficiently access the memory system of modern datacenter FPGAs in HLS-based accelerator designs. This is especially important for memory-bound applications; for example, a naive accelerator design only utilizes less than 5% of the available off-chip memory bandwidth. To achieve our goal, we first identify a comprehensive set of factors that affect the memory bandwidth, including 1) the number of concurrent memory access ports, 2) the data width of each port, 3) the maximum burst access length for each port, and 4) the size of consecutive data accesses. Then we carefully design a set of HLS-based microbenchmarks to quantitatively evaluate the performance of the Xilinx Alveo U200 and U280 FPGA memory systems when changing those affecting factors, and provide insights into efficient memory access in HLS-based accelerator designs. To demonstrate the usefulness of our insights, we also conduct two case studies to accelerate the widely used K-nearest neighbors (KNN) and sparse matrix-vector multiplication (SpMV) algorithms. Compared to the baseline designs, optimized designs leveraging our insights achieve about 3.5x and 8.5x speedups for the KNN and SpMV accelerators.
@inproceedings{lu21demystify, title={Demystifying the Memory System of Modern Datacenter FPGAs for Software Programmers through Microbenchmarking}, author={Alec Lu and Zhenman Fang and Weihua Liu and Lesley Shannon}, year={2021}, booktitle = {2021 International Symposium on Field-Programmable Gate Arrays}, series = {FPGA'21}, location = {Virtual Conference}, numpages = {11}, pages = {} }
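The factors the paper enumerates (port count, port width, burst length) compose into a rough bandwidth estimate. The model below is my own back-of-envelope simplification, not uBench or the paper's model; the per-burst overhead constant is an assumption.

```python
# Back-of-envelope model (my own simplification, not the paper's) of how
# the studied factors bound effective off-chip memory bandwidth.

def effective_bandwidth_gbps(freq_mhz, num_ports, port_width_bits,
                             burst_len_beats, burst_overhead_beats=10):
    """Peak bandwidth the ports can sustain, discounted by burst
    efficiency: longer bursts amortize the fixed per-burst overhead
    (the overhead of 10 beats is an assumed, illustrative value)."""
    peak = freq_mhz * 1e6 * num_ports * port_width_bits / 8 / 1e9  # GB/s
    efficiency = burst_len_beats / (burst_len_beats + burst_overhead_beats)
    return peak * efficiency

# Single-beat bursts waste most of a 512-bit port at 300 MHz...
print(round(effective_bandwidth_gbps(300, 1, 512, burst_len_beats=1), 2))    # 1.75
# ...while long bursts approach the port's 19.2 GB/s peak.
print(round(effective_bandwidth_gbps(300, 1, 512, burst_len_beats=256), 2))  # 18.48
```

The gap between the two printed numbers is the same effect as the paper's observation that a naive design can leave over 95% of the available bandwidth unused.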
C2

CHIP-KNN: A Configurable and High-Performance K-Nearest Neighbors Accelerator on Cloud FPGAs FPT'20

Alec Lu, Zhenman Fang, Nazanin Farahpour, Lesley Shannon
2020 IEEE International Conference on Field-Programmable Technology (ICFPT 2020), acceptance rate: 21/85 = 24.7%

The k-nearest neighbors (KNN) algorithm is an essential algorithm in many applications, such as similarity search, image classification, and database query. With the rapid growth in the dataset size and the feature dimension of each data point, processing KNN becomes more compute- and memory-hungry. Most prior studies focus on accelerating the computation of KNN using the abundant parallel resource on FPGAs. However, they often overlook the memory access optimizations on FPGA platforms and only achieve a marginal speedup over a multi-thread CPU implementation for large datasets. In this paper, we design and implement CHIP-KNN---an HLS-based, configurable, and high-performance KNN accelerator---which optimizes the off-chip memory access on cloud FPGAs with multiple DRAM or HBM (high-bandwidth memory) banks. CHIP-KNN is configurable for all essential parameters used in the algorithm, including the size of the search dataset, the feature dimension of each data point, the distance metric, and the number of nearest neighbors, K. To optimize its performance, we build an analytical performance model to explore the design space and balance the computation and memory access performance. Given a user configuration of the KNN parameters, our tool can automatically generate the optimal accelerator design on the given FPGA platform. Our experimental results on the Nimbix cloud computing platform show that: Compared to a 16-thread CPU implementation, CHIP-KNN on the Xilinx Alveo U200 FPGA board with four DRAM banks and U280 FPGA board with HBM achieves an average of 7.5x and 19.8x performance speedup, and 6.1x and 16.0x performance/dollar improvement.
@inproceedings{luChipKNN, title={CHIP-KNN: A Configurable and High-Performance K-Nearest Neighbors Accelerator on Cloud FPGAs}, author={Alec Lu and Zhenman Fang and Nazanin Farahpour and Lesley Shannon}, year={2020}, booktitle = {2020 International Conference on Field-Programmable Technology}, series = {FPT'20}, location = {Virtual Conference}, numpages = {9}, pages = {} }
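For readers unfamiliar with the kernel being accelerated, here is a plain-Python reference of the KNN computation itself. This is illustrative only; the accelerator is generated from HLS code, and the default squared-Euclidean metric below is just one of the configurable distance metrics the paper mentions.

```python
# Plain-Python reference of the KNN kernel that CHIP-KNN accelerates
# (illustrative; the real accelerator is HLS-generated hardware).
import heapq

def knn(query, dataset, k,
        dist=lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))):
    """Return the indices of the k points in `dataset` closest to `query`
    under `dist` (squared Euclidean distance by default)."""
    scored = ((dist(query, p), i) for i, p in enumerate(dataset))
    return [i for _, i in heapq.nsmallest(k, scored)]

data = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0)]
print(knn((0.2, 0.1), data, k=2))  # [0, 3]
```

Every query scans the full dataset, which is why the algorithm's throughput is bounded by memory bandwidth rather than compute for large datasets.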
C1

Rethinking Integer Divider Design for FPGA-based Soft-Processors FCCM'19

Eric Matthews, Alec Lu, Lesley Shannon, Zhenman Fang
27th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM 2019), acceptance rate: 31/120 = 25.8%

Most existing soft-processors on FPGAs today support a fixed-latency instruction pipeline. Therefore, for integer division, a simple fixed-latency radix-2 integer divider is typically used, or algorithm-level changes are made to avoid integer divisions. However, for certain important application domains the simple radix-2 integer divider becomes the performance bottleneck, as every 32-bit division operation takes 32 cycles. In this paper, we explore integer divider designs for FPGA-based soft-processors, by leveraging the recent support of variable-latency execution units in their instruction pipeline. We implement a high-performance, data-dependent, variable-latency integer divider called Quick-Div, optimize its performance on FPGAs, and integrate it into a RISC-V soft-processor called Taiga that supports a variable-latency instruction pipeline. We perform a comprehensive analysis and comparison—in terms of cycles, clock frequency, and resource usage—for both the fixed-latency radix-2/4/8/16 dividers and our variable-latency Quick-Div divider with various optimizations. Experimental results on a Xilinx Virtex UltraScale+ VCU118 FPGA board show that our Quick-Div divider can provide over 5x better performance and over 4x better performance/LUT compared to a radix-2 divider for certain applications like random number generation. Finally, through a case study of integer square root, we demonstrate that our Quick-Div divider provides opportunities for reconsidering simpler and faster algorithmic choices.
@INPROCEEDINGS{8735506, author={E. {Matthews} and A. {Lu} and Z. {Fang} and L. {Shannon}}, booktitle={2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)}, title={Rethinking Integer Divider Design for FPGA-Based Soft-Processors}, year={2019}, volume={}, number={}, pages={289-297}, doi={10.1109/FCCM.2019.00046}}
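The latency advantage described above comes from the quotient often needing far fewer bits than the full 32. The sketch below is my own rough cycle-count model, not the Quick-Div RTL: it estimates the iterations of a divider that skips the quotient's leading zeros.

```python
# Rough model (my own, not the Quick-Div RTL) of a data-dependent,
# variable-latency divider: the quotient needs only about
# msb(dividend) - msb(divisor) + 1 radix-2 iterations, so small
# quotients finish in a few cycles instead of always taking 32.

def quickdiv_cycles(dividend, divisor, width=32):
    """Estimated iteration count for a divider that skips the leading
    zeros of the quotient. A fixed radix-2 divider always takes `width`
    cycles regardless of the operands."""
    if dividend < divisor:
        return 1  # quotient is 0; result is known almost immediately
    return dividend.bit_length() - divisor.bit_length() + 1

print(quickdiv_cycles(1_000_000, 997))  # 11 (vs. 32 for fixed radix-2)
print(quickdiv_cycles(7, 100))          # 1
```

This is exactly the workload-dependence the paper exploits: applications whose quotients are small, such as modulo-based random number generation, see the largest speedups.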

Journal Articles

J3

SASA: A Scalable and Automatic Stencil Acceleration Framework for Optimized Hybrid Spatial and Temporal Parallelism on HBM-based FPGAs TRETS'22

Xingyu Tian, Zhifan Ye, Alec Lu, and Zhenman Fang.
The ACM Transactions on Reconfigurable Technology and Systems (TRETS 2022).

J2

Demystifying the Soft and Hardened Memory Systems of Modern FPGAs for Software Programmers through Microbenchmarking TRETS'22

Alec Lu, Zhenman Fang, and Lesley Shannon.
The ACM Transactions on Reconfigurable Technology and Systems (TRETS 2022), Volume 15, Issue 4, December 2022, Article No.: 43, pp 1–33.

J1

Quick-Div: Rethinking Integer Divider Design for FPGA-based Soft-Processors TRETS'22

Eric Matthews, Alec Lu, Zhenman Fang, and Lesley Shannon.
ACM Transactions on Reconfigurable Technology and Systems (TRETS 2022), Volume 15, Issue 3, September 2022, Article No.: 32, pp 1–27.

Conference Abstracts

A3

FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization DAC'22 LBR

Mengshu Sun, Zhengang Li, Alec Lu, Haoyu Ma, Geng Yuan, Yanyue Xie, Hao Tang, Yanyu Li, Miriam Leeser, Zhangyang Wang, Xue Lin, and Zhenman Fang.
To appear in the Design Automation Conference 2022 Late-Breaking Results (DAC 2022 LBR), San Francisco, CA, USA, Jul 2022.

A2

Hardware-Efficient Stochastic Rounding Unit Design for DNN Training DAC'22 LBR

Sung-En Chang, Geng Yuan, Alec Lu, Mengshu Sun, Yanyu Li, Xiaolong Ma, Zhengang Li, Yanyue Xie, Minghai Qin, Xue Lin, Zhenman Fang, and Yanzhi Wang.
To appear in the Design Automation Conference 2022 Late-Breaking Results (DAC 2022 LBR), San Francisco, CA, USA, Jul 2022.

A1

You Already Have It: A Generator-Free Low-Precision DNN Training Framework using Stochastic Rounding DAC'22 WIP

Geng Yuan, Sung-En Chang, Qing Jin, Alec Lu, Yanyu Li, Yushu Wu, Zhenglun Kong, Yanyue Xie, Peiyan Dong, Xiaolong Ma, Xulong Tang, Minghai Qin, Zhenman Fang, and Yanzhi Wang.
To appear in the Design Automation Conference 2022 (DAC 2022 WIP), San Francisco, CA, USA, Jul 2022.


Projects

uBench

uBench is a set of HLS-based microbenchmarks to quantitatively evaluate the performance of the Xilinx Alveo FPGA memory systems under a comprehensive set of factors that affect the memory bandwidth, including 1) the clock frequency of the accelerator design, 2) the number of concurrent memory access ports, 3) the data width of each port, 4) the maximum burst access length for each port, and 5) the size of consecutive data accesses. uBench is open-source and publicly available on GitHub.

CHIP-KNN

CHIP-KNN is the framework for a configurable and high-performance K-Nearest Neighbors accelerator on cloud FPGAs. It automatically generates a bandwidth-optimized KNN accelerator for cloud FPGA platforms. CHIP-KNN is open-source and publicly available on GitHub.

QuickDiv

QuickDiv is a high-performance, data-dependent, variable-latency integer divider. Its architecture is optimized for FPGAs. It has been integrated as one of the functional units in a RISC-V soft-processor called Taiga, which supports a variable-latency instruction pipeline. QuickDiv is part of Taiga, both of which are open-source and publicly available on GitLab.


Awards

September 2021
Graduate Fellowship of Simon Fraser University
September 2020
Graduate Fellowship of Simon Fraser University
September 2019
Graduate Fellowship of Simon Fraser University
April 2018
Undergraduate Student Research Award, NSERC

Teaching

ENSC 251
Software Design and Analysis for Engineers, Summer 2019, Fall 2019, TA
ENSC 252
Fundamentals of Digital Logic & Design, Fall 2019, TA
ENSC 350
Digital Systems Design, Fall 2018, TA
ENSC 462/894
Programming for Heterogeneous Computing Systems, Summer 2020, Summer 2021, TA