KServe is an open-source, Kubernetes-native framework for deploying machine learning inference services. It supports both server-based deployments using standard Kubernetes resources and serverless deployments using Knative, enabling request-driven autoscaling and scale-to-zero capabilities. This project aims to evaluate the system-level performance (latency, throughput, resource usage) of server-based and serverless deployment strategies for classical machine learning prediction workloads using KServe. The study focuses on CPU-only inference services based on scikit-learn and XGBoost models. In the first phase, representative machine learning prediction models will be trained to generate inference workloads. In the second phase, these models will be deployed using KServe under two configurations: (i) Kubernetes-based deployments with Horizontal Pod Autoscaling (HPA), and (ii) Knative-based serverless deployments with request-driven autoscaling. A stream of controlled query workloads will be generated to simulate different traffic patterns. The evaluation will focus on latency, throughput, autoscaling responsiveness, CPU and memory utilization, and cold-start overhead. The results will highlight the trade-offs between performance, scalability, and resource efficiency in server-based and serverless ML serving environments, and show how each deployment strategy can be adapted to different types of workloads.
References: Clipper: A Low-Latency Online Prediction Serving System; SOCK: Rapid Task Provisioning with Serverless-Optimized Containers; SelfTune: Tuning Cluster Managers; Horizontal Pod Autoscaling; Knative Technical Overview; KServe Documentation
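To make the workload-generation step concrete, below is a minimal sketch of a closed-loop load generator, assuming the KServe V1 prediction protocol and a hypothetical service URL, model name, and feature vector; a full harness would also sweep arrival rates and traffic patterns.

```python
# Minimal closed-loop load generator sketch. The service URL, model name,
# and payload are placeholders (assumptions), not part of the project spec.
import statistics
import time
import requests

SERVICE_URL = "http://sklearn-model.default.example.com"  # hypothetical ingress host
MODEL_NAME = "sklearn-model"                               # hypothetical model name
PAYLOAD = {"instances": [[6.8, 2.8, 4.8, 1.4]]}            # example feature vector

def run(rate_per_s: float, duration_s: float) -> None:
    """Send requests at a fixed rate and record per-request latency."""
    latencies, errors = [], 0
    interval = 1.0 / rate_per_s
    deadline = time.time() + duration_s
    while time.time() < deadline:
        start = time.time()
        try:
            resp = requests.post(
                f"{SERVICE_URL}/v1/models/{MODEL_NAME}:predict",
                json=PAYLOAD,
                timeout=10,
            )
            resp.raise_for_status()
            latencies.append(time.time() - start)
        except requests.RequestException:
            errors += 1
        # Wait out the remainder of the inter-arrival interval; a real harness
        # would likely use open-loop (e.g. Poisson) arrivals instead.
        time.sleep(max(0.0, interval - (time.time() - start)))
    if latencies:
        p99 = statistics.quantiles(latencies, n=100)[98]
        print(f"requests={len(latencies)} errors={errors} "
              f"p50={statistics.median(latencies):.3f}s p99={p99:.3f}s")

if __name__ == "__main__":
    run(rate_per_s=5, duration_s=30)
```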
Estimating Inference Latency of Deep Learning Models Using Roofline Analysis
Accurate estimation of inference latency is critical for meeting service-level objectives (SLOs) in large language model (LLM) serving systems. While classical ML prediction methods can be leveraged for the estimation task, their accuracy depends heavily on the selected features. On the other hand, analytical performance models, such as Roofline analysis, provide a hardware-aware upper bound on achievable performance; however, their applicability to latency estimation remains an open question. This project investigates how Roofline analysis can be integrated with ML prediction methods for improved estimation of end-to-end inference latency of LLM queries on a single GPU. A small set of representative LLMs will be selected, and inference latency will be measured under controlled conditions (sequence length, batch size). Roofline-related metrics, such as arithmetic intensity and memory bandwidth utilization, will be collected using GPU profiling tools. These metrics will be used to estimate processing time and to build regression models that predict end-to-end inference latency. The evaluation will analyze prediction error, sensitivity to model size and input length, and the limitations of Roofline-based estimation.
References: Predicting LLM Inference Latency: A Roofline-Driven ML Method
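As an illustration of the Roofline component, the sketch below computes a per-token latency lower bound from assumed hardware peaks and the common 2-FLOPs-per-parameter approximation for decoding; all numbers are placeholders, not measurements.

```python
# Illustrative roofline lower bound for per-token decode latency.
# All hardware numbers and the 2*N FLOPs-per-token approximation are
# stated assumptions, not profiled values.

def roofline_time(flops: float, bytes_moved: float,
                  peak_flops: float, peak_bw: float) -> float:
    """Lower-bound execution time according to the roofline model."""
    arithmetic_intensity = flops / bytes_moved           # FLOP per byte
    attainable = min(peak_flops, arithmetic_intensity * peak_bw)
    return flops / attainable                            # seconds

# Example: 7B-parameter model in FP16 on a single GPU (assumed specs).
n_params    = 7e9
flops_token = 2 * n_params          # ~2 FLOPs per parameter per generated token
bytes_token = 2 * n_params          # FP16 weights streamed once per token
peak_flops  = 150e12                # assumed peak FP16 throughput (FLOP/s)
peak_bw     = 1.5e12                # assumed memory bandwidth (B/s)

t = roofline_time(flops_token, bytes_token, peak_flops, peak_bw)
print(f"per-token lower bound: {t*1e3:.2f} ms")  # memory-bound regime here
```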
Evaluating the Performance of vLLM and DeepSpeed for Serving LLM Inference Queries
The computational complexity of serving large language model (LLM) queries depends heavily on model size, sequence length, and memory access patterns. To address these challenges, several LLM inference serving frameworks have been proposed, employing different optimization techniques to improve throughput and reduce memory overhead. vLLM and DeepSpeed are two prominent frameworks that employ distinct techniques to serve inference efficiently. vLLM proposes PagedAttention for efficient key–value cache management. DeepSpeed, on the other hand, integrates multiple optimization techniques, such as parallelism and kernel-level optimizations, for scalable inference. This project aims to systematically evaluate the end-to-end inference performance (latency, throughput, memory footprint) of vLLM and DeepSpeed under different inference workloads. Experiments will be performed using one of the publicly available datasets, such as ShareGPT. The results will highlight the trade-offs between KV cache management, kernel-level optimizations, and parallelism strategies in LLM inference serving, providing insights into the conditions under which each framework is most effective.
References: Efficient Memory Management for Large Language Model Serving with PagedAttention; DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
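As a starting point for the vLLM side of the comparison, the sketch below measures batch latency and generated-token throughput with vLLM's offline Python API; the model name and prompts are placeholders, and an analogous harness would be built for DeepSpeed.

```python
# Minimal offline latency/throughput probe using the vLLM Python API.
# Model name and prompts are placeholders; a full experiment would replay
# ShareGPT requests and repeat the same measurement with DeepSpeed.
import time
from vllm import LLM, SamplingParams

prompts = ["Explain the roofline model in one sentence."] * 32  # toy batch
params = SamplingParams(temperature=0.0, max_tokens=128)

llm = LLM(model="facebook/opt-1.3b")   # placeholder model

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"batch latency: {elapsed:.2f}s, "
      f"throughput: {generated / elapsed:.1f} generated tokens/s")
```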
Estimating Time and Resource Usage of SLURM Jobs Using Regression Language Models
Efficient allocation of computational resources in high-performance computing (HPC) clusters requires accurate prediction of job runtime and resource requirements. Users often over-request CPU, memory, or time to avoid failures, which can lead to wasted resources and longer queue times. Therefore, predicting these requirements before job submission is critical for improving cluster utilization and scheduling efficiency. This project investigates how Regression Language Models (RLMs) can be used to estimate the time and resource usage of SLURM jobs based on submitted Bash scripts and job metadata. The study will use real job submission data from the Habrok HPC cluster.
References: Regression Language Models for Code
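One possible preprocessing step is sketched below: extracting #SBATCH directives from a submitted Bash script and packaging them, together with the raw script text, into an input for the regression model. The field names and example script are assumptions; real targets would come from the cluster's accounting data.

```python
# Sketch of turning a SLURM batch script into model input. The example script
# and the prompt format are illustrative assumptions, not the project's spec.
import re

SBATCH_RE = re.compile(r"^#SBATCH\s+(--?[\w-]+)(?:[=\s]+(\S+))?", re.MULTILINE)

def extract_directives(script_text: str) -> dict:
    """Collect #SBATCH flags such as --time, --mem, --cpus-per-task."""
    return {flag.lstrip("-"): value for flag, value in SBATCH_RE.findall(script_text)}

example = """#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH --mem=16G
#SBATCH --cpus-per-task=8
python train.py --epochs 50
"""

features = extract_directives(example)
# The RLM input could simply be the raw script text; the parsed directives
# can serve as a baseline feature set or as auxiliary metadata.
prompt = f"Requested: {features}\nScript:\n{example}"
print(prompt)
```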
In this project, you will implement a job generator process. As input, a JSON configuration file and a job generation rate (jobs per unit time) will be provided. The configuration file contains metadata about different deep learning jobs, such as the path to the executable file and the required arguments. Your task is to design and implement a generator that works as follows: randomly select a job from the JSON file, use the metadata of the selected job to prepare a YAML/batch script (a script template will be provided), and submit the prepared script to another process using an RPC protocol. The rate at which jobs are sampled and submitted should equal the given generation rate. In addition, an implementation of an RPC protocol is required; a description of the protocol will be provided. You may use any programming language, but Python is recommended. Supplementary code with helper functions, the RPC protocol description, and descriptions of the script and configuration files will be provided.
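A minimal sketch of the generator loop is shown below; the JSON layout, the script template, and the submit_script() RPC stub are placeholders until the actual specifications and helper code are provided.

```python
# Job generator sketch under stated assumptions: a "jobs" list with "path"
# and "args" fields, a trivial script template, and a stubbed RPC call.
import json
import random
import time

TEMPLATE = "#!/bin/bash\npython {path} {args}\n"   # placeholder for the provided template

def submit_script(script: str) -> None:
    """Stub for the provided RPC protocol; replace with the real client call."""
    print("submitting:\n" + script)

def generate(config_path: str, rate_per_s: float) -> None:
    with open(config_path) as f:
        jobs = json.load(f)["jobs"]               # assumed top-level key
    interval = 1.0 / rate_per_s
    while True:
        job = random.choice(jobs)                 # uniform random selection
        script = TEMPLATE.format(path=job["path"], args=" ".join(job["args"]))
        submit_script(script)
        time.sleep(interval)                      # enforce the generation rate

if __name__ == "__main__":
    generate("jobs.json", rate_per_s=2.0)
```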
In this project, we are developing a resource manager framework. Part of this framework is a set of monitoring API functions that gather hardware metrics upon invocation. You will be given a code template, and your task is to implement some of these API functions. Supplementary programs and helper functions will be provided so that you can test your implementation, along with documentation of the required API functions, including their input and output parameters. This project requires C and Python programming as well as basic operating systems knowledge.
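For orientation, the following is a minimal Python sketch of one such monitoring call, reading memory metrics from /proc/meminfo on Linux; the function name and return format are illustrative only, since the real API signatures will come from the provided documentation.

```python
# Illustrative monitoring call: read memory metrics from /proc/meminfo.
# The function name and returned fields are assumptions for this sketch.
def get_memory_metrics() -> dict:
    """Return total and available memory in kilobytes."""
    metrics = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            if key in ("MemTotal", "MemAvailable"):
                metrics[key] = int(value.strip().split()[0])  # value is in kB
    return metrics

if __name__ == "__main__":
    print(get_memory_metrics())
```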