Friday, April 21, 2023
3:50 – 4:50 p.m. (CST)
Dept. of Electrical and Computer Engineering
University of Pittsburgh
Title: “CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture”
- Which platform beats the 7nm A100 GPU in energy efficiency? The AMD Versal ACAP (FPGA + AI chip)!
- How do you program the AMD Versal ACAP, i.e., an FPGA and an AI chip on the same die, for deep learning applications in 10 lines of code? Use CHARM!
Dense matrix multiply (MM) is one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic (PL) with AI Engine processors (AIE) optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can theoretically provide up to 6.4 TFLOPs of performance for 32-bit floating-point (fp32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. In our investigation, we observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieves less than 5% of the theoretical peak performance. One key question therefore arises: How can we design accelerators that fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes? In this talk, we will discuss the CHARM framework, which composes multiple diverse MM accelerator architectures that work concurrently on different layers within one application. CHARM includes analytical models that guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate system design, CHARM automatically generates code, enabling thorough onboard design verification. We deploy the CHARM framework for four different deep learning applications, including BERT, ViT, NCF, and MLP, on the AMD/Xilinx Versal ACAP VCK190 evaluation board.
Our experiments show inference throughput of 1.46 TFLOPs, 1.61 TFLOPs, 1.74 TFLOPs, and 2.94 TFLOPs for BERT, ViT, NCF, and MLP, respectively, achieving 5.40x, 32.51x, 1.00x, and 1.00x throughput gains compared to one monolithic accelerator.
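The under-utilization argument in the abstract can be sketched with a toy occupancy model (an illustrative approximation, not CHARM's actual analytical model; the 32x32x32 tile size and the layer shapes below are assumptions):

```python
import math

def engine_utilization(m, k, n, engines=400, tile=32):
    """Toy model: a monolithic accelerator maps one tile x tile x tile
    block of an (m, k, n) matrix multiply onto each AI Engine per step.
    If the MM has fewer blocks than engines, the remaining engines idle."""
    blocks = math.ceil(m / tile) * math.ceil(k / tile) * math.ceil(n / tile)
    return min(blocks, engines) / engines

# A large, BERT-scale MM has far more blocks than engines, so all 400 stay busy.
print(engine_utilization(3072, 768, 3072))  # 1.0

# A small MM layer yields only 2 * 2 * 2 = 8 blocks, occupying 8 of 400 engines,
# consistent in spirit with the <5%-of-peak observation in the abstract.
print(engine_utilization(64, 64, 64))       # 0.02
```

This is the intuition behind composing accelerators: carving out a second, smaller accelerator sized for such layers lets small and large MMs proceed concurrently instead of serializing every layer on one oversized design.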
Peipei Zhou is an assistant professor in the Electrical and Computer Engineering (ECE) Department at the University of Pittsburgh. She has over 10 years of experience in hardware/software co-design. She has published 20+ papers in top-tier IEEE/ACM computer system and design automation conferences and journals, including FPGA, FCCM, DAC, ICCAD, ISPASS, TCAD, TECS, TODAES, IEEE Micro, etc. The algorithm and tool proposed in her FCCM'18 paper have been realized in the commercial Vitis HLS (high-level synthesis) compiler from Xilinx (acquired by AMD in Feb 2022). Her work on FPGA acceleration for deep learning won the 2019 Donald O. Pederson Best Paper Award from the IEEE Council on Electronic Design Automation (CEDA). Her work on cloud-based application optimization was a 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Best Paper Nominee, and her work on FPGA acceleration for computer vision was a 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) Best Paper Nominee. Before joining Pitt, she worked as a full-time staff software engineer at a start-up company, where she led a team of 6 members developing CNN and MM kernels in the deep learning libraries for two generations of AI training application-specific integrated circuit (ASIC) chip products.
More on Dr. Zhou:
Google Scholar: https://scholar.google.com/citations?user=px_jwFgAAAAJ&hl=en
More on CESG Seminars: HERE
Please join on Friday, 4/21/23 via Zoom (see emails or syllabus for link and password)