Hermes Project
Hermes: Algorithm-System Co-design for Efficient Retrieval Augmented Generation at Scale
Michael Shen, Muhammad Umar, Kiwan Maeng, G. Edward Suh, Udit Gupta

June 2024 - June 2025

  • Problem: The retrieval step in RAG-based LLMs introduces substantial overhead for at-scale datastores.
  • Solution: Hermes co-designs algorithms and systems for scalable Retrieval-Augmented Generation (RAG).
  • Innovation: Partitions and distributes datastores, uses hierarchical search for efficiency.
  • Impact: Achieves up to 9.33× latency and 2.10× energy improvements on large-scale RAG.

The rapid advancement of Large Language Models (LLMs), together with the constantly expanding volume of data, makes keeping models up-to-date a challenge. The high computational cost of repeatedly retraining models to handle evolving data has led to the development of Retrieval-Augmented Generation (RAG). RAG presents a promising solution that enables LLMs to access and incorporate real-time information from external datastores, thus minimizing the need for retraining to update the information available to an LLM. However, as the RAG datastores used to augment information expand into the range of trillions of tokens, retrieval overheads become significant, impacting latency, throughput, and energy efficiency. To address this, we propose Hermes, an algorithm-systems co-design framework that addresses the unique bottlenecks of large-scale RAG systems. Hermes mitigates retrieval latency by partitioning and distributing datastores across multiple nodes, while also enhancing throughput and energy efficiency through an intelligent hierarchical search that dynamically directs queries to optimized subsets of the datastore. On open-source RAG datastores and models, we demonstrate that Hermes improves end-to-end latency and energy by up to 9.33× and 2.10×, respectively, without sacrificing retrieval quality on at-scale, trillion-token retrieval datastores.
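The two-level routing idea can be sketched in a few lines. This is a toy illustration of IVF-style hierarchical search, not Hermes's actual implementation: the cluster layout, similarity metric, and `nprobe` parameter are all illustrative stand-ins.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity; epsilon guards against zero-norm vectors.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-12)

# Toy datastore: vectors pre-partitioned into clusters, each with a centroid.
# In a distributed deployment each cluster would live on its own node.
clusters = {
    "c0": {"centroid": [1.0, 0.0],
           "vectors": [("doc_a", [0.9, 0.1]), ("doc_b", [1.0, 0.2])]},
    "c1": {"centroid": [0.0, 1.0],
           "vectors": [("doc_c", [0.1, 0.9]), ("doc_d", [0.2, 1.0])]},
}

def hierarchical_search(query, clusters, nprobe=1, topk=1):
    """Route the query to the nprobe nearest centroids, then search only those."""
    ranked = sorted(clusters.values(),
                    key=lambda c: cosine(query, c["centroid"]), reverse=True)
    candidates = []
    for cluster in ranked[:nprobe]:  # only a subset of the datastore is scanned
        for doc_id, vec in cluster["vectors"]:
            candidates.append((cosine(query, vec), doc_id))
    candidates.sort(reverse=True)
    return [doc_id for _, doc_id in candidates[:topk]]

print(hierarchical_search([1.0, 0.1], clusters))  # query lands near cluster c0
```

Because only `nprobe` of the clusters are scanned per query, the work per query shrinks with the number of partitions, which is the source of the latency and energy savings the project targets.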

GitHub · Artifacts Available · Artifacts Functional · Artifacts Reproduced

RAG Systems Project
Characterizing The Systems Implications of Retrieval Augmented Generation
Michael Shen, Muhammad Umar, Kiwan Maeng, G. Edward Suh, Udit Gupta

August 2023 - June 2024

  • Focus: Analyzing performance tradeoffs of RAG for LLMs.
  • Findings: RAG reduces retraining but can double inference latency and consume terabytes of storage.
  • Contribution: Presents taxonomy and characterization of RAG systems for LLMs.

The rapid increase in the number of parameters in large language models (LLMs) has significantly increased the cost involved in fine-tuning and retraining LLMs, a necessity for keeping models up to date and improving accuracy. Retrieval-Augmented Generation (RAG) offers a promising approach to improving the capabilities and accuracy of LLMs without the necessity of retraining. Although RAG eliminates the need for continuous retraining to update model data, it incurs a trade-off in the form of slower model inference times. As a result, the use of RAG to enhance the accuracy and capabilities of LLMs involves diverse performance implications and trade-offs based on its design. In an effort to begin tackling and mitigating the performance penalties associated with RAG from a systems perspective, this paper introduces a detailed taxonomy and characterization of the different elements within the RAG ecosystem for LLMs, exploring trade-offs in latency, throughput, and memory. Our study reveals underlying inefficiencies in RAG systems deployments that can result in time-to-first-token (TTFT) latencies that are twice as long and unoptimized datastores that consume terabytes of storage.
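For readers unfamiliar with the pipeline under study, a minimal sketch of RAG inference: embed the query, retrieve the nearest passages, and prepend them to the prompt before generation. The bag-of-letters `embed` and linear-scan `retrieve` below are deliberately simplified stand-ins for the neural encoders and approximate-nearest-neighbor indexes used in real deployments; it is the retrieval stage sketched here that adds to TTFT.

```python
def embed(text):
    # Stand-in embedding: letter-frequency vector. In practice this is
    # a neural encoder producing dense vectors.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def retrieve(query, datastore, k=1):
    # Exhaustive nearest-neighbor scan; real systems use ANN indexes
    # precisely because this scan dominates latency at scale.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    q = embed(query)
    ranked = sorted(datastore, key=lambda doc: dot(q, embed(doc)), reverse=True)
    return ranked[:k]

def rag_prompt(query, datastore, k=1):
    # Prepend retrieved passages to the prompt; the LLM then generates
    # conditioned on this external context.
    context = "\n".join(retrieve(query, datastore, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = ["the gpu has 40 compute units", "paris is the capital of france"]
print(rag_prompt("what is the capital of france?", docs))
```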

FHE Project
GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption
Kaustubh Shivdikar, Yuhui Bao, Rashmi Agarwal, Michael Shen, Gilbert Jonatan, Evelio Mora, Alexander Ingare, Neal Livesay, José L. Abellán, John Kim, Ajay Joshi, David Kaeli

November 2022 - April 2023

  • Focus: Accelerating Fully Homomorphic Encryption (FHE) on AMD GPUs using NaviSim simulator.
  • Challenge: FHE enables computation on encrypted data, but is computationally expensive.
  • Approach: Simulated architectural changes to improve FHE performance.
  • Result: Demonstrated microarchitectural improvements for post-quantum secure computation.

Fully Homomorphic Encryption (FHE) is an emerging technology that allows for computation on encrypted operands, making it an ideal solution for security in the cloud computing era. With the looming threat that quantum computing poses to once-trusted cryptographic schemes, lattice-based FHE schemes have the potential to provide post-quantum security against cryptanalytic attacks. Although modern FHE schemes offer unprecedented security, current implementations suffer from prohibitively high computational costs. In this research, we leverage NaviSim, a GPU simulator that faithfully models AMD architectures, to demonstrate how microarchitectural changes can accelerate the performance of FHE.

Top Pick in Hardware and Embedded Security
Odyssey Project
Odyssey: A Methodology for Rapidly Prototyping GPU Simulators
Michael Shen, Sreepathi Pai, Yifan Sun

January 2022 - March 2023

  • Goal: Develop a methodology for automatically calibrating modern GPU simulators.
  • Contribution: Wrote scripts for automatic parameter calibration and validation with OpenTuner.
  • Impact: Achieved simulator accuracy within 30% of native hardware output.

In this project we aim to develop a methodology for automatically calibrating modern GPU simulators against real GPUs. I have led this initiative by writing a script that automatically calibrates individual simulator parameters against suites of benchmarks. In addition, I structured NaviSim parameters to be configurable with the OpenTuner Python library for autotuning, and automated the running of validation microbenchmarks. Using this methodology, we have been able to calibrate simulator parameters to within 30% of native hardware measurements.
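The calibration loop can be sketched as follows. This stand-in uses plain random search rather than OpenTuner's ensemble of search techniques, and the analytic `simulator` function, parameter names, and target cycle count are all hypothetical; a real iteration invokes the simulator binary with a candidate configuration and compares its output against hardware measurements.

```python
import random

def simulator(params):
    # Stand-in for a simulator run: returns simulated cycles for one
    # benchmark under the given configuration (hypothetical model).
    return 1000.0 * params["cache_latency"] / params["issue_width"]

HARDWARE_CYCLES = 2000.0  # cycles measured on the real GPU (illustrative)

def relative_error(simulated, measured):
    return abs(simulated - measured) / measured

def calibrate(trials=200, seed=0):
    """Random search over the parameter space, keeping the best config."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        params = {"cache_latency": rng.randint(1, 20),
                  "issue_width": rng.randint(1, 8)}
        err = relative_error(simulator(params), HARDWARE_CYCLES)
        if best is None or err < best[0]:
            best = (err, params)
    return best

err, params = calibrate()
print(f"best error {err:.1%} with {params}")
```

OpenTuner replaces the random sampling with adaptive search over a declared parameter space, but the structure — propose a configuration, run, score against hardware, keep the best — is the same.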

MOTION Project
MOTION: MAV Operated Tunnel Inspection using Object-classification Neural Networks
Christian Burwell*, Tianqi Huang*, Rohit Pal*, Michael Shen*, Harrison Sun*, Eagle Yuan*, Taskin Padir, Bahram Shafai

June 2022 - December 2022

  • System: Developed drone-based tunnel inspection using R-CNN and SLAM.
  • Innovation: Integrated sensor suite for crack detection and mapping in GPS-denied environments.
  • Outcome: Provided actionable analysis for civil engineers on infrastructure integrity.

Recent infrastructure collapses, such as the MBTA’s Government Center collapse, have highlighted the importance of safe and efficient methods for evaluating critical infrastructure. To address this issue, we have developed a small Unmanned Aerial System that can detect and evaluate the risks associated with hazardous fractures within tunnel walls. Our proposed solution leverages Region-based Convolutional Neural Networks (R-CNN) for mask-based crack detection, Simultaneous Localization And Mapping (SLAM) for global navigation in GPS-denied tunnels, and an integrated sensor suite for visualizing and interpreting crack integrity while maintaining flight capabilities in remote environments. With these techniques, we can deploy a tool that provides insightful analysis of civil infrastructure to expert civil engineers. With the developed system we hope to provide the industrial and academic communities with a prototyped system that can help mitigate the impact of cracks in vulnerable infrastructure.

1st Place Electrical and Computer Engineering Capstone Project
Yori Project
Yori: Mitigating The Effects Of Side Channel Attacks With RISCV-BOOM
Michael Shen, Derek Rodriguez, David Kaeli

November 2021 - November 2022

  • Focus: Security research on high-performance CPU microarchitectures and timing side channel attacks.
  • Method: Explored branch predictor designs on SonicBOOM RISC-V architecture.
  • Insight: Evaluated performance and security tradeoffs of speculative execution features.

Security research targeting today’s high-performance CPU microarchitectures helps to ensure that tomorrow’s program execution will be secure and reliable. With the adoption of branch predictors and speculative execution to overcome data and control dependencies on nearly every microprocessor on the market today, timing side channel attacks have become a critical issue. In this project we explore how different branch predictor designs, implemented on the SonicBOOM RISC-V architecture, can improve performance, but are also susceptible to side channels.

NaviSim Project
NaviSim: A Highly Accurate GPU Simulator for AMD RDNA GPUs
Yuhui Bao, Yifan Sun, Zlatan Feric, Michael Shen, Micah Weston, José L. Abellán, Trinayan Baruah, John Kim, Ajay Joshi, David Kaeli

June 2020 - April 2022

  • Contribution: Developed NaviSim, the first cycle-level simulator for AMD RDNA GPUs.
  • Validation: Tuned and validated NaviSim with microbenchmarks and 10 full workloads.
  • Result: Achieved kernel execution time accuracy within 9.92% of real hardware.
  • Impact: Enables accurate research on next-generation GPU architectures.

As GPUs continue to grow in popularity for accelerating demanding applications, such as high-performance computing and machine learning, GPU architects need to deliver more powerful devices with updated instruction set architectures (ISAs) and new microarchitectural features. The introduction of the AMD RDNA architecture is one example where the GPU architecture was dramatically changed, modifying the underlying programming model, the core architecture, and the cache hierarchy. To date, no publicly available simulator infrastructure can model the AMD RDNA GPU, preventing researchers from exploring new GPU designs based on the state-of-the-art RDNA architecture. In this project, we present the NaviSim simulator, the first cycle-level GPU simulator framework that models AMD RDNA GPUs. NaviSim faithfully emulates the new RDNA ISA. We extensively tune and validate NaviSim using several microbenchmarks and 10 full workloads. Our evaluation shows that NaviSim can accurately model the GPU’s kernel execution time, matching hardware execution to within 9.92% on average, as measured on an AMD RX 5500 XT GPU and an AMD Radeon Pro W6800 GPU.[1][2]
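The accuracy figure above is an average relative error over the validated workloads. A minimal sketch of that metric, with hypothetical timing numbers (not the actual measurements from the evaluation):

```python
def mean_relative_error(sim_times, hw_times):
    # Average |simulated - hardware| / hardware across workloads.
    errs = [abs(s - h) / h for s, h in zip(sim_times, hw_times)]
    return sum(errs) / len(errs)

# Hypothetical kernel execution times (ms) for three workloads.
sim = [10.5, 8.0, 21.0]
hw = [10.0, 8.4, 20.0]
print(f"mean relative error: {mean_relative_error(sim, hw):.2%}")
```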

GitLab · Artifacts Available · Artifacts Functional · Artifacts Reproduced