Publications
Thermal Behaviors in Liquid Immersion Cooling under Various Workloads: a Case Study
The growing need for energy-efficient computing has led to many novel system innovations, including liquid immersion cooling. While many myths about the technology have been dispelled, the actual impact of this cooling solution on thermal conditions in real computing scenarios remains under-reported and under-studied. In this work, we collate data from multiple system monitoring tools to perform case-study analyses of the thermal behaviors of immersed hardware, aiming to evaluate the effectiveness of liquid immersion cooling for high-performance and datacenter applications.
Poster for OMNI Internship @ Oak Ridge National Laboratory, 2024
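As a sketch of the collation step described in the abstract (illustrative only; the tool names, file paths, and column names below are assumptions, not details from the poster), two telemetry streams sampled at different rates can be aligned onto a common time grid before analysis:

```python
# Hypothetical sketch: merging temperature logs from two monitoring tools
# into one time-aligned table. Tool names, file paths, and column names
# are illustrative, not from the poster.
import pandas as pd

# Each tool writes timestamped samples, possibly at different rates.
ipmi = pd.read_csv("ipmi_temps.csv", parse_dates=["timestamp"])      # e.g., board/inlet sensors
gpu = pd.read_csv("nvidia_smi_log.csv", parse_dates=["timestamp"])   # e.g., GPU die temperature

# Resample both streams onto a common 1-second grid so thermal behavior
# under a given workload phase can be compared sensor-to-sensor.
ipmi = ipmi.set_index("timestamp").resample("1s").mean()
gpu = gpu.set_index("timestamp").resample("1s").mean()

merged = ipmi.join(gpu, how="inner", lsuffix="_ipmi", rsuffix="_gpu")
print(merged.describe())  # summary statistics per sensor across the run
```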
Large Language Models (LLMs) capture a certain amount of world knowledge spanning many general and technical topics, including programming and performance. Without fine-tuning, In-Context Learning (ICL) can specialize LLM outputs to perform complex tasks. In this work, we seek to demonstrate the regression capabilities of LLMs in a performance modeling capacity. We find initial evidence of limitations that may constrain LLM utility even after fine-tuning.
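For illustration, an ICL prompt for performance regression might embed few-shot (configuration, runtime) pairs and ask the model to complete the next value. This is a minimal sketch: `query_llm`, the parameter names, and the runtimes are hypothetical stand-ins, not taken from the work.

```python
# Sketch of in-context learning (ICL) for performance regression:
# few-shot (configuration, runtime) pairs are placed in the prompt and
# the model is asked to extrapolate. `query_llm` is a hypothetical stand-in
# for any chat-completion API; the example data are fabricated.

def build_prompt(examples, query_config):
    lines = ["Predict the runtime (seconds) of a kernel from its tuning parameters."]
    for cfg, runtime in examples:
        lines.append(f"parameters: {cfg} -> runtime: {runtime:.3f}")
    lines.append(f"parameters: {query_config} -> runtime:")
    return "\n".join(lines)

few_shot = [
    ({"block_size": 64, "unroll": 2}, 1.84),
    ({"block_size": 128, "unroll": 4}, 1.21),
    ({"block_size": 256, "unroll": 4}, 1.05),
]
prompt = build_prompt(few_shot, {"block_size": 256, "unroll": 8})
# prediction = float(query_llm(prompt))  # parse the model's numeric completion
print(prompt)
```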
Transfer-learning-based Autotuning using Gaussian Copula
As diverse high-performance computing (HPC) systems are built, many opportunities arise for applications to solve larger problems than ever before. Given the significantly increased complexity of these HPC systems and application tuning, empirical performance tuning, such as autotuning, has emerged as a promising approach in recent years. Despite its effectiveness, autotuning is often a computationally expensive approach. Transfer learning (TL)-based autotuning seeks to address this issue by leveraging the data from prior tuning. Current TL methods for autotuning spend significant time modeling the relationship between parameter configurations and performance, which is ineffective for few-shot (that is, few empirical evaluations) tuning on new tasks. We introduce the first generative TL-based autotuning approach based on the Gaussian copula (GC) to model the high-performing regions of the search space from prior data and then generate high-performing configurations for new tasks. This allows a sampling-based approach that maximizes few-shot performance and provides the first probabilistic estimation of the few-shot budget for effective TL-based autotuning. We compare our generative TL approach with state-of-the-art autotuning techniques on several benchmarks. We find that the GC is capable of achieving 64.37% of peak few-shot performance in its first evaluation. Furthermore, the GC model can determine a few-shot transfer budget that yields up to 33.39× speedup, a dramatic improvement over the 20.58× speedup using prior techniques.
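As a rough illustration of the generative idea (a minimal NumPy/SciPy sketch under assumed numeric tuning parameters and synthetic prior data, not the paper's implementation): fit a Gaussian copula to the top-performing prior configurations, then sample correlated candidates for the new task.

```python
# Minimal Gaussian-copula sketch: model the high-performing region of prior
# tuning data, then generate candidate configurations. The synthetic data,
# parameter count, and 10% filtering threshold are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Prior tuning data: rows = configurations, columns = parameters; plus runtimes.
X = rng.uniform(0, 1, size=(500, 3))
runtimes = X @ np.array([1.0, 2.0, 0.5]) + rng.normal(0, 0.1, 500)

# Keep only the high-performing region (e.g., fastest 10% of prior evaluations).
top = X[runtimes <= np.quantile(runtimes, 0.10)]

# Transform each marginal to standard-normal scores via its empirical CDF.
n, d = top.shape
ranks = stats.rankdata(top, axis=0) / (n + 1)
z = stats.norm.ppf(ranks)
corr = np.corrcoef(z, rowvar=False)  # copula correlation structure

# Generate new candidates: sample correlated normals, then map back through
# the empirical quantiles of the high-performing prior configurations.
samples = rng.multivariate_normal(np.zeros(d), corr, size=10)
u = stats.norm.cdf(samples)
candidates = np.column_stack([np.quantile(top[:, j], u[:, j]) for j in range(d)])
print(candidates)  # few-shot candidates to evaluate empirically on the new task
```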
FULL-W2V: Fully Exploiting Data Reuse for W2V on GPU-Accelerated Systems
Word2Vec remains one of the most impactful innovations in the field of Natural Language Processing (NLP), representing latent grammatical and syntactic information in human text with dense, low-dimensional vectors. Word2Vec has high computational cost due to the algorithm's inherent sequentiality, intensive memory accesses, and the large vocabularies it represents. While prior studies have investigated technologies to explore parallelism and improve memory system performance, they struggle to effectively gain throughput on powerful GPUs. We identify memory data access and latency as the primary bottlenecks in prior GPU work, preventing highly optimized kernels from attaining the architecture's peak performance. We present a novel algorithm, FULL-W2V, which maximally exploits the opportunities for data reuse in the W2V algorithm and leverages GPU architecture and resources to reduce access to low memory levels and improve temporal locality. FULL-W2V reduces accesses to GPU global memory significantly, e.g., by more than 89%, compared to prior state-of-the-art GPU implementations, resulting in significant performance improvement that scales across successive hardware generations. Our prototype implementation achieves 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state-of-the-art by 5.72X on V100 cards with the same embedding quality. In-depth analysis indicates that the reduction of memory accesses through register and shared memory caching and high-throughput shared memory reduction leads to significantly improved arithmetic intensity. FULL-W2V can potentially benefit many applications in NLP and other domains.
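The key idea behind the reuse the abstract describes: the same negative samples and the cached center-word vector can serve every context word in a window, so their data need only be fetched once. Below is a rough sequential NumPy sketch of that reuse pattern, a conceptual stand-in, not the paper's CUDA kernels; all hyperparameters are illustrative.

```python
# Conceptual sketch of data reuse in skip-gram negative sampling: negatives
# are drawn once per center word and reused across all context words, and
# the center vector stays "cached" (here, a NumPy view) for the whole window.
# Plain NumPy for illustration; FULL-W2V realizes this reuse in GPU
# registers and shared memory.
import numpy as np

rng = np.random.default_rng(0)
V, D, K, LR = 1000, 64, 5, 0.025    # vocab size, dims, negatives, learning rate
W_in = rng.normal(0, 0.1, (V, D))   # center-word embeddings
W_out = np.zeros((V, D))            # context/output embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_window(center, context_ids):
    # Draw negatives ONCE per center word and reuse them for every context
    # word, instead of re-fetching per (center, context) pair.
    negatives = rng.integers(0, V, size=K)
    h = W_in[center]                        # cached center vector (a view)
    for ctx in context_ids:
        ids = np.concatenate(([ctx], negatives))
        labels = np.array([1.0] + [0.0] * K)
        scores = sigmoid(W_out[ids] @ h)
        grad = (scores - labels) * LR
        dh = grad @ W_out[ids]              # gradient w.r.t. the center vector
        W_out[ids] -= np.outer(grad, h)     # update context/negative rows
        h -= dh                             # in-place update of W_in[center]

train_window(center=3, context_ids=[1, 2, 4, 5])
```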