Hermes

Hermes: Algorithm-System Co-design for Efficient RAG at Scale

Michael Shen, Muhammad Umar, Kiwan Maeng, G. Edward Suh, Udit Gupta
  • Problem: Retrieval Step for RAG introduces tremendous overhead for at-scale datastores.
  • Solution: Hermes co-designs algorithms and systems for scalable retrieval.
  • Impact: Achieves up to 9.33× lower latency and 2.10× lower energy on large-scale RAG.
Abstract The rapid advancement of Large Language Models (LLMs), together with the constantly expanding amount of data, makes keeping models up to date a challenge. The high computational cost of continually retraining models to handle evolving data has led to the development of Retrieval-Augmented Generation (RAG). RAG presents a promising solution that enables LLMs to access and incorporate real-time information from external datastores, thus minimizing the need for retraining to update the information available to an LLM. However, as the RAG datastores used to augment information expand into the range of trillions of tokens, retrieval overheads become significant, impacting latency, throughput, and energy efficiency. To address this, we propose Hermes, an algorithm-systems co-design framework that addresses the unique bottlenecks of large-scale RAG systems. Hermes mitigates retrieval latency by partitioning and distributing datastores across multiple nodes, while also enhancing throughput and energy efficiency through an intelligent hierarchical search that dynamically directs queries to optimized subsets of the datastore. On open-source RAG datastores and models, we demonstrate that Hermes improves end-to-end latency and energy by up to 9.33× and 2.10×, respectively, without sacrificing retrieval quality on at-scale, trillion-token retrieval datastores.
GitHub · Artifacts Available · Artifacts Functional · Artifacts Reproduced
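The hierarchical search the abstract describes, routing each query to a small subset of the datastore instead of scanning all of it, can be illustrated with a minimal sketch. The code below is not Hermes's implementation or API; the partition layout, the `nprobe` routing parameter, and the `route_and_search` helper are illustrative stand-ins for the two-stage idea (coarse routing over partition centroids, then fine search within the selected partitions).

```python
# Minimal sketch of hierarchical retrieval routing (illustrative only;
# not Hermes's actual implementation or API).
import numpy as np

rng = np.random.default_rng(0)

# Datastore split into partitions, one per node; each partition is
# summarized by its centroid for coarse routing.
num_partitions, dim, docs_per_part = 8, 128, 1000
partitions = [rng.standard_normal((docs_per_part, dim)).astype(np.float32)
              for _ in range(num_partitions)]
centroids = np.stack([p.mean(axis=0) for p in partitions])

def route_and_search(query, nprobe=2, k=5):
    """Route the query to the `nprobe` closest partitions, then search
    only those partitions instead of the full datastore."""
    # Stage 1: coarse routing against partition centroids.
    dists = np.linalg.norm(centroids - query, axis=1)
    selected = np.argsort(dists)[:nprobe]

    # Stage 2: fine-grained search within the selected partitions only.
    hits = []
    for pid in selected:
        d = np.linalg.norm(partitions[pid] - query, axis=1)
        for idx in np.argsort(d)[:k]:
            hits.append((d[idx], pid, idx))
    hits.sort()
    return hits[:k]  # top-k (distance, partition, doc index) tuples

query = rng.standard_normal(dim).astype(np.float32)
print(route_and_search(query))
```

Touching only `nprobe` partitions per query is where the latency and energy savings come from; the abstract's claim is that this routing can be done without sacrificing retrieval quality.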
RAG Systems

Towards Understanding Systems Trade-offs in Retrieval-Augmented Generation Model Inference

Michael Shen, Muhammad Umar, Kiwan Maeng, G. Edward Suh, Udit Gupta
  • Focus: Analyzing performance tradeoffs of RAG for LLMs.
  • Findings: RAG can double inference latency and consume terabytes of storage.
  • Contribution: Presents taxonomy and characterization of RAG systems for LLMs.
Abstract The rapid increase in the number of parameters in large language models (LLMs) has significantly increased the cost involved in fine-tuning and retraining LLMs, a necessity for keeping models up to date and improving accuracy. Retrieval-Augmented Generation (RAG) offers a promising approach to improving the capabilities and accuracy of LLMs without the necessity of retraining. Although RAG eliminates the need for continuous retraining to update model data, it incurs a trade-off in the form of slower model inference times. Consequently, the use of RAG in enhancing the accuracy and capabilities of LLMs often involves diverse performance implications and trade-offs based on its design. To begin tackling and mitigating the performance penalties associated with RAG from a systems perspective, this paper introduces a detailed taxonomy and characterization of the different elements within the RAG ecosystem for LLMs that explores trade-offs in latency, throughput, and memory. Our study reveals underlying inefficiencies in systems deployments of RAG that can result in time-to-first-token (TTFT) latencies that are twice as long and unoptimized datastores that consume terabytes of storage.
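To make the TTFT claim concrete, here is a toy decomposition of where RAG adds to time-to-first-token; the stage latencies are placeholder numbers chosen for illustration, not measurements from the paper.

```python
# Toy decomposition of time-to-first-token (TTFT) for a RAG pipeline.
# All stage latencies are placeholders; they only illustrate where
# retrieval sits on the critical path before the first decoded token.

def ttft_baseline(prefill_ms: float) -> float:
    """Without RAG, TTFT is dominated by prefill over the raw prompt."""
    return prefill_ms

def ttft_rag(embed_ms: float, retrieve_ms: float, prefill_ms: float) -> float:
    """With RAG, query embedding and datastore retrieval precede
    generation, and the augmented (longer) prompt inflates prefill."""
    return embed_ms + retrieve_ms + prefill_ms

base = ttft_baseline(prefill_ms=120.0)
rag = ttft_rag(embed_ms=5.0, retrieve_ms=90.0, prefill_ms=145.0)
print(f"baseline TTFT: {base:.0f} ms, RAG TTFT: {rag:.0f} ms ({rag/base:.1f}x)")
```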
FHE

GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption

Kaustubh Shivdikar, Yuhui Bao, Rashmi Agrawal, Michael Shen, Gilbert Jonatan, Evelio Mora, Alexander Ingare, Neal Livesay, José L Abellán, John Kim, Ajay Joshi, David Kaeli
  • Challenge: FHE enables computation on encrypted data but is computationally expensive.
  • Approach: Simulated architectural changes to improve FHE performance on AMD GPUs.
Abstract Fully Homomorphic Encryption (FHE) enables the processing of encrypted data without decrypting it. FHE has garnered significant attention over the past decade as it supports secure outsourcing of data processing to remote cloud services. Despite its promise of strong data privacy and security guarantees, FHE introduces a slowdown of up to five orders of magnitude as compared to the same computation using plaintext data. This overhead is presently a major barrier to the commercial adoption of FHE. In this work, we leverage GPUs to accelerate FHE, capitalizing on a well-established GPU ecosystem available in the cloud. We propose GME, which combines three key microarchitectural extensions along with a compile-time optimization to the current AMD CDNA GPU architecture. First, GME integrates a lightweight on-chip compute unit (CU)-side hierarchical interconnect to retain ciphertext in cache across FHE kernels, thus eliminating redundant memory transactions. Second, to tackle compute bottlenecks, GME introduces special MOD-units that provide native custom hardware support for modular reduction operations, one of the most commonly executed sets of operations in FHE. Third, by integrating the MOD-unit with our novel pipelined 64-bit integer arithmetic cores (WMAC-units), GME further accelerates FHE workloads by 19%. Finally, we propose a Locality-Aware Block Scheduler (LABS) that exploits the temporal locality available in FHE primitive blocks. Incorporating these microarchitectural features and compiler optimizations, we create a synergistic approach achieving average speedups of 796×, 14.2×, and 2.3× over Intel Xeon CPU, NVIDIA V100 GPU, and Xilinx FPGA implementations, respectively.
Top Pick in Hardware and Embedded Security
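The modular reduction that the MOD-units accelerate can be sketched in software. The snippet below uses Barrett reduction, a common textbook scheme for replacing the division in x mod q with a multiply and a shift; whether GME's hardware uses this exact scheme is not claimed here, and the 61-bit modulus is an illustrative choice.

```python
# Sketch of software modular reduction via Barrett's method, the kind
# of operation GME's MOD-units implement natively in hardware. (This
# is a generic textbook scheme, not GME's actual circuit.)

def barrett_precompute(q: int, k: int) -> int:
    """Precompute mu = floor(2^(2k) / q) for a k-bit modulus q."""
    assert q < (1 << k)
    return (1 << (2 * k)) // q

def barrett_reduce(x: int, q: int, k: int, mu: int) -> int:
    """Compute x mod q for 0 <= x < 2^(2k) without a hardware divide:
    estimate the quotient with one multiply and a shift, then correct."""
    qhat = (x * mu) >> (2 * k)   # quotient estimate, never too large
    r = x - qhat * q
    while r >= q:                # at most a couple of corrections
        r -= q
    return r

q, k = (1 << 61) - 1, 61         # a 61-bit prime modulus (illustrative)
mu = barrett_precompute(q, k)
a, b = 123456789012345678, 987654321098765432
assert barrett_reduce(a * b, q, k, mu) == (a * b) % q
```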
MOTION

MOTION: MAV Operated Tunnel Inspection using Neural Networks

Christian Burwell*, Tianqi Huang*, Rohit Pal*, Michael Shen*, Harrison Sun*, Eagle Yuan*, Taskin Padir, Bahram Shafai
  • System: Developed drone-based tunnel inspection using R-CNN and SLAM.
  • Innovation: Integrated sensor suite for crack detection in GPS-denied environments.
Abstract Recent infrastructure collapses, such as the MBTA’s Government Center collapse, have highlighted the importance of safe and efficient methods for evaluating critical infrastructure. To combat this issue, we have developed a small Unmanned Aerial System that can detect and evaluate the risks associated with hazardous fractures within tunnel walls. Our proposed solution leverages Region-based Convolutional Neural Networks (R-CNN) for mask-based crack detection in computer vision, Simultaneous Localization And Mapping (SLAM) for global navigation in GPS-denied tunnels, and an integrated sensor suite for visualizing and interpreting crack integrity while maintaining flight capabilities in remote environments. With these techniques, we can deploy a tool that provides expert civil engineers with insightful analysis for a variety of civil infrastructure evaluations. With the developed system, we hope to provide the industrial and academic communities with a prototyped system that can help mitigate the impact of cracks in vulnerable infrastructure.
1st Place ECE Capstone Project
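For a sense of the detection pipeline, the sketch below runs an off-the-shelf Mask R-CNN (torchvision's COCO-pretrained model) as a stand-in; MOTION's actual detector would be trained on crack imagery, and the 0.5 score threshold and frame size are assumed values.

```python
# Sketch of mask-based detection with a Mask R-CNN, the model family
# the abstract describes. An off-the-shelf COCO-pretrained model is
# used here as a stand-in for a crack-trained detector.
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# A dummy tunnel-wall frame; in practice this comes from the MAV camera.
frame = torch.rand(3, 480, 640)  # CHW float tensor in [0, 1]

with torch.no_grad():
    (pred,) = model([frame])  # one prediction dict per input image

keep = pred["scores"] > 0.5          # illustrative confidence threshold
boxes = pred["boxes"][keep]          # [N, 4] detection boxes
masks = pred["masks"][keep] > 0.5    # [N, 1, H, W] per-instance masks
print(f"{len(boxes)} detections above threshold")
```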
NaviSim

NaviSim: A Highly Accurate GPU Simulator for AMD RDNA

Yuhui Bao, Yifan Sun, Zlatan Feric, Michael Shen, Micah Weston, José L Abellán, Trinayan Baruah, John Kim, Ajay Joshi, David Kaeli
  • Contribution: Developed NaviSim, the first cycle-level simulator for AMD RDNA GPUs.
  • Result: Achieved kernel execution time accuracy within 9.92% of real hardware.
Abstract As GPUs continue to grow in popularity for accelerating demanding applications, such as high-performance computing and machine learning, GPU architects need to deliver more powerful devices with updated instruction set architectures (ISAs) and new microarchitectural features. The introduction of the AMD RDNA architecture is one example where the GPU architecture was dramatically changed, modifying the underlying programming model, the core architecture, and the cache hierarchy. To date, no publicly available simulator infrastructure can model the AMD RDNA GPU, preventing researchers from exploring new GPU designs based on the state-of-the-art RDNA architecture. In this paper, we present the NaviSim simulator, the first cycle-level GPU simulator framework that models AMD RDNA GPUs. NaviSim faithfully emulates the new RDNA ISA. We extensively tune and validate NaviSim using several microbenchmarks and 10 full workloads. Our evaluation shows that NaviSim can accurately model the GPU’s kernel execution time, matching hardware execution to within 9.92% (on average), as measured on an AMD RX 5500 XT GPU and an AMD Radeon Pro W6800 GPU. To demonstrate the full utility of the NaviSim simulator, we carry out a performance study of the impact of individual RDNA features, attempting to better understand the design decisions behind these features. We carry out a number of experiments to isolate each RDNA feature and evaluate its impact on overall performance, as well as demonstrate the usability and flexibility of NaviSim.
GitLab · Artifacts Available · Artifacts Functional · Artifacts Reproduced
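The "within 9.92% (on average)" figure is the kind of number produced by averaging per-kernel relative error between simulated and measured execution times; the sketch below shows that computation with placeholder timings, not NaviSim data.

```python
# Sketch of how a simulator's execution-time accuracy is typically
# summarized: mean relative error of simulated vs. measured kernel
# times. The timings below are placeholders, not NaviSim results.
import numpy as np

measured_ms  = np.array([1.20, 3.45, 0.78, 10.1])  # hardware timings
simulated_ms = np.array([1.31, 3.20, 0.83, 9.4])   # simulator timings

rel_err = np.abs(simulated_ms - measured_ms) / measured_ms
print(f"mean relative error: {rel_err.mean():.2%}")
```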