The rapid advancement of Large Language Models (LLMs), combined with the ever-expanding volume of data, makes keeping models up to date a constant challenge. The high computational cost of repeatedly retraining models on evolving data has led to the development of Retrieval-Augmented Generation (RAG). RAG offers a promising solution that enables LLMs to access and incorporate real-time information from external datastores, minimizing the need for retraining to refresh the information available to a model. However, as the RAG datastores used to augment generation expand into the range of trillions of tokens, retrieval overheads become significant, impacting latency, throughput, and energy efficiency. To address this, we propose Hermes, an algorithm-systems co-design framework that targets the unique bottlenecks of large-scale RAG systems. Hermes mitigates retrieval latency by partitioning and distributing datastores across multiple nodes, while also improving throughput and energy efficiency through an intelligent hierarchical search that dynamically directs queries to optimized subsets of the datastore. On open-source RAG datastores and models, we demonstrate that Hermes improves end-to-end latency and energy efficiency by up to 9.33× and 2.10×, respectively, without sacrificing retrieval quality on at-scale, trillion-token retrieval datastores.
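To make the hierarchical search idea concrete, the following is a minimal sketch of routed retrieval over a partitioned datastore, not the actual Hermes implementation: each partition is summarized by a centroid, and a query is searched only against the few partitions whose centroids score highest. The dimensions, partition counts, and names are illustrative placeholders.

```python
import numpy as np

# Minimal sketch of hierarchical routed retrieval (not the actual Hermes
# implementation): the datastore is split into partitions, each of which
# would live on a different node and is summarized by a centroid. A query
# is routed only to the few partitions whose centroids score highest,
# instead of being searched against every node.

rng = np.random.default_rng(0)
DIM, N_PARTITIONS, DOCS_PER_PART = 128, 16, 10_000

# Each partition holds its own embedding matrix (one per node in a real system).
partitions = [rng.standard_normal((DOCS_PER_PART, DIM)).astype(np.float32)
              for _ in range(N_PARTITIONS)]
centroids = np.stack([p.mean(axis=0) for p in partitions])  # partition summaries


def hierarchical_search(query, n_probe=2, top_k=5):
    """Route the query to the n_probe best partitions, then search only those."""
    # Stage 1: coarse routing against centroids (cheap: N_PARTITIONS scores).
    routed = np.argsort(centroids @ query)[-n_probe:]
    # Stage 2: fine search inside the selected partitions only.
    hits = []
    for pid in routed:
        scores = partitions[pid] @ query
        best = np.argsort(scores)[-top_k:]
        hits.extend((float(scores[i]), int(pid), int(i)) for i in best)
    return sorted(hits, reverse=True)[:top_k]  # (score, partition, doc) tuples


query = rng.standard_normal(DIM).astype(np.float32)
for score, pid, doc in hierarchical_search(query):
    print(f"partition {pid:2d} doc {doc:5d} score {score:.3f}")
```

With n_probe = 2 of 16 partitions probed, the fine search touches only an eighth of the embeddings, which is where routing buys its latency and energy savings, at the cost of possibly missing documents in unprobed partitions.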
The rapid growth in the number of parameters in large language models (LLMs) has significantly increased the cost of fine-tuning and retraining them, a necessity for keeping models up to date and improving accuracy. Retrieval-Augmented Generation (RAG) offers a promising approach to improving the capabilities and accuracy of LLMs without retraining. Although RAG eliminates the need for continuous retraining to update model knowledge, it incurs a trade-off in the form of slower inference. As a result, using RAG to enhance the accuracy and capabilities of LLMs carries diverse performance implications and trade-offs that depend on its design. To begin tackling the performance penalties of RAG from a systems perspective, this paper introduces a detailed taxonomy and characterization of the elements of the RAG ecosystem for LLMs, exploring trade-offs among latency, throughput, and memory. Our study reveals underlying inefficiencies in deployed RAG systems that can double time-to-first-token (TTFT) latency and leave unoptimized datastores consuming terabytes of storage.
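As a toy illustration of why retrieval sits on the critical path to the first token, the sketch below times a stubbed generation step with and without a brute-force retrieval stage in front of it. The retriever, datastore size, and prefill delay are hypothetical stand-ins, not measurements from this study.

```python
import time
import numpy as np

# Toy illustration (not data from the paper) of why RAG inflates
# time-to-first-token (TTFT): retrieval must finish before the LLM can
# start generating, so its latency adds directly to the critical path.
# Every name and number here is a hypothetical stand-in.

rng = np.random.default_rng(0)
datastore = rng.standard_normal((200_000, 384)).astype(np.float32)  # doc embeddings


def retrieve(query_vec, k=4):
    # Brute-force inner-product search over the whole datastore; at
    # trillion-token scale this stage dominates end-to-end latency.
    scores = datastore @ query_vec
    return np.argsort(scores)[-k:]


def first_token(prompt_docs):
    time.sleep(0.05)  # stand-in for LLM prefill up to the first token
    return "<tok>"


query = rng.standard_normal(384).astype(np.float32)

start = time.perf_counter()
first_token([])                               # baseline: no retrieval
ttft_plain = time.perf_counter() - start

start = time.perf_counter()
first_token(retrieve(query))                  # RAG: retrieval on the critical path
ttft_rag = time.perf_counter() - start

print(f"TTFT without RAG: {ttft_plain * 1e3:6.1f} ms")
print(f"TTFT with RAG:    {ttft_rag * 1e3:6.1f} ms")
```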
Fully Homomorphic Encryption (FHE) is an emerging technology that allows computation on encrypted operands, making it an ideal solution for security in the cloud computing era. With the looming threat that quantum computing poses to once-trusted cryptographic schemes, lattice-based FHE schemes have the potential to provide post-quantum security against cryptanalytic attacks. Although modern FHE schemes offer unprecedented security, current implementations suffer from prohibitively high computational costs. In this research, we leverage NaviSim, a GPU simulator that faithfully models AMD architectures, to demonstrate how microarchitectural changes can accelerate FHE workloads.
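For intuition about computing on encrypted operands, here is a toy lattice-based (LWE) scheme that supports only homomorphic addition of encrypted bits. It is a pedagogical sketch with toy parameters, far removed from the full FHE schemes, and their costs, studied in this work.

```python
import numpy as np

# Toy LWE-based additively homomorphic scheme (a pedagogical sketch only,
# nothing like a production FHE implementation): ciphertexts of bits can
# be added without the secret key, and the sum decrypts to the XOR of the
# plaintexts as long as the accumulated noise stays small.

rng = np.random.default_rng(0)
N, Q = 64, 1 << 15                       # lattice dimension and modulus
secret = rng.integers(0, Q, N)           # secret key s in Z_q^n


def encrypt(bit):
    a = rng.integers(0, Q, N)            # public randomness
    e = rng.integers(-4, 5)              # small noise term
    b = (a @ secret + e + bit * (Q // 2)) % Q
    return a, b


def decrypt(ct):
    a, b = ct
    d = (b - a @ secret) % Q             # = noise + bit * Q/2  (mod Q)
    return int(Q // 4 < d < 3 * Q // 4)  # closer to Q/2 means the bit was 1


def add(ct1, ct2):
    # Homomorphic addition: performed entirely on ciphertexts, no key needed.
    return (ct1[0] + ct2[0]) % Q, (ct1[1] + ct2[1]) % Q


for m1 in (0, 1):
    for m2 in (0, 1):
        assert decrypt(add(encrypt(m1), encrypt(m2))) == (m1 ^ m2)
print("homomorphic XOR on encrypted bits verified")
```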
In this project, we aim to develop a methodology for automatically calibrating modern GPU simulators against actual GPUs. I have led this initiative by writing a script that automatically calibrates individual simulator parameters against suites of benchmarks. In addition, I restructured NaviSim's parameters to be configurable with the OpenTuner Python library for autotuning, and automated the testing of validation microbenchmarks. Using this methodology, we have calibrated simulator parameters to within 30% of native hardware output.
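A minimal sketch of how such a calibration loop can be wired up with OpenTuner is shown below. The simulator command line, the parameter name and bounds, and the hardware reference count are hypothetical placeholders rather than NaviSim's actual interface; OpenTuner simply searches for the parameter value that minimizes the error we report back.

```python
import opentuner
from opentuner import ConfigurationManipulator, IntegerParameter, MeasurementInterface, Result

# Minimal sketch of a single-parameter calibration loop with OpenTuner.
# The parameter name/bounds, simulator command, and hardware reference
# cycle count below are hypothetical placeholders, not NaviSim's actual
# configuration interface.

HARDWARE_CYCLES = 1_000_000  # reference measurement from the native GPU


class SimCalibrator(MeasurementInterface):
    def manipulator(self):
        # Expose one simulator parameter to the search; more can be added.
        m = ConfigurationManipulator()
        m.add_parameter(IntegerParameter('l1_hit_latency', 1, 64))
        return m

    def run(self, desired_result, input, limit):
        cfg = desired_result.configuration.data
        # Run the simulator with the candidate value and parse the simulated
        # cycle count (call_program is OpenTuner's subprocess helper).
        out = self.call_program(
            f"./navisim --l1-hit-latency={cfg['l1_hit_latency']} bench.bin")
        sim_cycles = int(out['stdout'].strip())
        # Minimize relative error against the hardware measurement; OpenTuner
        # minimizes Result.time, so the error is stored there.
        error = abs(sim_cycles - HARDWARE_CYCLES) / HARDWARE_CYCLES
        return Result(time=error)


if __name__ == '__main__':
    args = opentuner.default_argparser().parse_args()
    SimCalibrator.main(args)
```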
Recent infrastructure collapses, such as the partial collapse of the Government Center Garage in Boston, have highlighted the importance of safe and efficient methods for evaluating critical infrastructure. To address this need, we have developed a small Unmanned Aerial System that can detect and evaluate the risks associated with hazardous fractures in tunnel walls. Our proposed solution leverages Mask R-CNN, a region-based convolutional neural network, for crack segmentation in computer vision; Simultaneous Localization and Mapping (SLAM) for navigation in GPS-denied tunnels; and an integrated sensor suite for visualizing and interpreting crack integrity, all while maintaining flight capability in remote environments. With these techniques, we can deploy a tool that provides expert civil engineers with insightful analysis for a variety of infrastructure evaluations. We hope this prototype gives the industrial and academic communities a system that helps mitigate the impact of cracks in vulnerable infrastructure.
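As a sketch of the segmentation stage, the snippet below runs torchvision's off-the-shelf Mask R-CNN on a wall image. The COCO-pretrained weights stand in for a crack-trained model, which a real deployment would obtain by fine-tuning on labeled crack imagery; the input file name is hypothetical.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Sketch of the segmentation stage using torchvision's off-the-shelf
# Mask R-CNN, standing in for the project's crack-trained model. The COCO
# weights here do NOT know about cracks; a real deployment would fine-tune
# the heads on labeled crack imagery. The file name is hypothetical.

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("tunnel_wall.jpg").convert("RGB"))

with torch.no_grad():
    pred = model([image])[0]  # boxes, labels, scores, masks per detection

# Keep confident detections and merge their soft masks into one overlay
# that a downstream severity-analysis stage (or an engineer) can inspect.
keep = pred["scores"] > 0.7
masks = pred["masks"][keep]                  # (N, 1, H, W) values in [0, 1]
print(f"{int(keep.sum())} detections above threshold")
if len(masks):
    overlay = masks.sum(dim=0).squeeze(0) > 0.5
    print(f"merged mask covers {int(overlay.sum())} pixels")
```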
Security research targeting today's high-performance CPU microarchitectures helps ensure that tomorrow's program execution will be secure and reliable. With branch predictors and speculative execution adopted to overcome data and control dependencies on nearly every microprocessor on the market today, timing side-channel attacks have become a critical issue. In this project, we explore how different branch predictor designs, implemented on the SonicBOOM RISC-V core, can improve performance but are also susceptible to side channels.
As GPUs continue to grow in popularity for accelerating demanding applications, such as high-performance computing and machine learning, GPU architects need to deliver more powerful devices with updated instruction set architectures (ISAs) and new microarchitectural features. The introduction of the AMD RDNA architecture is one example where the GPU architecture changed dramatically, modifying the underlying programming model, the core architecture, and the cache hierarchy. To date, no publicly available simulator infrastructure can model AMD RDNA GPUs, preventing researchers from exploring new GPU designs based on this state-of-the-art architecture. In this project, we present NaviSim, the first cycle-level GPU simulator framework that models AMD RDNA GPUs. NaviSim faithfully emulates the new RDNA ISA. We extensively tune and validate NaviSim using several microbenchmarks and 10 full workloads. Our evaluation shows that NaviSim accurately models GPU kernel execution time, coming within 9.92% of hardware execution on average, as measured on an AMD RX 5500 XT GPU and an AMD Radeon Pro W6800 GPU.[1][2]
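For reference, an average-error figure like the 9.92% above is typically computed as the mean absolute percentage error (MAPE) between simulated and hardware-measured kernel times across the validation workloads. The numbers in this sketch are hypothetical placeholders, not NaviSim's reported data.

```python
# How an average-error figure like "within 9.92%" is typically computed:
# mean absolute percentage error (MAPE) between simulated and measured
# kernel execution times across the validation workloads. The numbers
# below are hypothetical placeholders, not NaviSim's reported data.

sim_ms = [4.1, 12.9, 7.8, 31.0, 2.2]   # simulator-predicted kernel times
hw_ms = [3.8, 14.0, 7.5, 28.5, 2.4]    # times measured on the physical GPU


def mape(simulated, measured):
    """Mean absolute percentage error, relative to the hardware measurement."""
    errors = [abs(s - m) / m for s, m in zip(simulated, measured)]
    return 100.0 * sum(errors) / len(errors)


print(f"MAPE: {mape(sim_ms, hw_ms):.2f}%")
```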