Mahmoud Alasmar

Evaluating Server-Based and Serverless Deployment Strategies for Machine Learning Prediction Workloads in KServe

Supervisors: Mahmoud Alasmar, Alexander Lazovik
Date: 2026-01-09
Type: bachelor-project
Description:

KServe is an open-source, Kubernetes-native framework for deploying machine learning inference services. It supports both server-based deployments using standard Kubernetes resources and serverless deployments using Knative, enabling request-driven autoscaling and scale-to-zero capabilities. This project aims to evaluate the system-level performance (latency, throughput, resource usage) of server-based and serverless deployment strategies for classical machine learning prediction workloads using KServe. The study focuses on CPU-only inference services based on scikit-learn and XGBoost models. In the first phase, representative machine learning prediction models will be trained to generate inference workloads. In the second phase, these models will be deployed using KServe under two configurations: (i) Kubernetes-based deployments with Horizontal Pod Autoscaling (HPA), and (ii) Knative-based serverless deployments with request-driven autoscaling. A stream of controlled query workloads will be generated to simulate different traffic patterns. The evaluation will focus on latency, throughput, autoscaling responsiveness, CPU and memory utilization, and cold-start overhead. The results will highlight the trade-offs between performance, scalability, and resource efficiency in server-based and serverless ML serving environments, and how each deployment strategy can be adapted to different workload types.
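
As an illustration of the two configurations, the sketch below (not taken from the project materials) builds the same scikit-learn InferenceService manifest for KServe's default Knative-backed serverless mode and for a raw Kubernetes deployment with HPA. The storage URI is a placeholder, and the annotation keys and values should be checked against the KServe documentation listed below.

# Illustrative sketch only: the same scikit-learn InferenceService expressed
# for KServe's two deployment modes. The storageUri and annotation values are
# placeholders to be verified against the KServe docs.
import yaml

def inference_service(name: str, serverless: bool) -> dict:
    """Build a KServe InferenceService manifest as a plain dict."""
    annotations = (
        # Default mode: Knative-backed, request-driven autoscaling, scale-to-zero.
        {"autoscaling.knative.dev/target": "10"}  # target concurrency per pod
        if serverless
        else {
            # Raw Kubernetes Deployment with HPA instead of Knative.
            "serving.kserve.io/deploymentMode": "RawDeployment",
            "serving.kserve.io/autoscalerClass": "hpa",
        }
    )
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "annotations": annotations},
        "spec": {
            "predictor": {
                "minReplicas": 0 if serverless else 1,  # scale-to-zero only in Knative mode
                "model": {
                    "modelFormat": {"name": "sklearn"},
                    "storageUri": "gs://example-bucket/sklearn/model",  # placeholder
                },
            }
        },
    }

if __name__ == "__main__":
    print(yaml.dump(inference_service("sklearn-serverless", serverless=True)))
    print(yaml.dump(inference_service("sklearn-raw", serverless=False)))
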
References:
Clipper: A Low-Latency Online Prediction Serving System
SOCK: Rapid Task Provisioning with Serverless-Optimized Containers
SelfTune: Tuning Cluster Managers
Horizontal Pod Autoscaling
Knative Technical Overview
KServe Documentation

Estimating Inference Latency of Deep Learning Models Using Roofline Analysis

Supervisors: Mahmoud Alasmar, Alexander Lazovik
Date: 2026-01-09
Type: bachelor-project
Description:

Accurate estimation of inference latency is critical for meeting service-level objectives (SLOs) in large language model (LLM) serving systems. While classical ML prediction methods can be leveraged for the estimation task, their accuracy depends heavily on the type of selected features. On the other hand, analytical performance models, such as Roofline analysis, provide a hardware-aware upper bound on achievable performance; however, their applicability to latency estimation remains an open question. This project investigates how Roofline analysis can be integrated with ML prediction methods to improve the estimation of end-to-end inference latency of LLM queries on a single GPU. A small set of representative LLMs will be selected, and inference latency will be measured under controlled conditions (sequence length, batch size). Roofline-related metrics, such as arithmetic intensity and memory bandwidth utilization, will be collected using GPU profiling tools. These metrics will be used to estimate processing time and to build regression models that predict end-to-end inference latency. The evaluation will analyze prediction error, sensitivity to model size and input length, and the limitations of Roofline-based estimation.
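
To make the Roofline bound concrete, the sketch below computes the standard lower-bound time estimate from operation count, data movement, and hardware peaks. The hardware numbers are placeholders; in the project they would come from GPU specifications or from profiling.

# Minimal sketch of the Roofline bound used as a latency estimate.
# Hardware numbers below are placeholders; real values come from the GPU
# datasheet or from profiling tools (e.g. Nsight Compute).

def roofline_time(flops: float, bytes_moved: float,
                  peak_flops: float, mem_bw: float) -> float:
    """Lower-bound execution time of a kernel/layer under the Roofline model."""
    ai = flops / bytes_moved                    # arithmetic intensity (FLOP/byte)
    attainable = min(peak_flops, ai * mem_bw)   # FLOP/s the hardware can sustain
    return flops / attainable                   # seconds (compute- or memory-bound)

# Example: one matmul-heavy layer on a hypothetical GPU.
peak = 312e12   # 312 TFLOP/s peak (placeholder)
bw = 1.5e12     # 1.5 TB/s memory bandwidth (placeholder)
t = roofline_time(flops=2.4e12, bytes_moved=4.8e9, peak_flops=peak, mem_bw=bw)
print(f"estimated lower bound: {t * 1e3:.2f} ms")

Such per-layer lower bounds, together with profiled metrics, would then serve as features for the regression models that predict the measured end-to-end latency.
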
References:
Predicting LLM Inference Latency: A Roofline-Driven ML Method

Evaluating the Performance of vLLM and DeepSpeed for Serving LLM Inference Queries

Supervisors: Mahmoud Alasmar, Alexander Lazovik
Date: 2026-01-09
Type: master-project
Description:

The computational complexity of serving large language model (LLM) queries depends heavily on model size, sequence length, and memory access patterns. To address these challenges, several LLM inference serving frameworks have been proposed, employing different optimization techniques to improve throughput and reduce memory overhead. vLLM and DeepSpeed are two prominent examples that use distinct techniques to achieve efficient inference serving. vLLM proposes PagedAttention for efficient key–value cache management, while DeepSpeed integrates multiple optimization techniques, such as parallelism and kernel-level optimizations, for scalable inference. This project aims to systematically evaluate the end-to-end inference performance (latency, throughput, memory footprint) of vLLM and DeepSpeed under different inference workloads. Experiments will be performed using one of the publicly available datasets, such as ShareGPT. The results will highlight the trade-offs between KV cache management, kernel-level optimizations, and parallelism strategies in LLM inference serving, providing insights into the conditions under which each framework is most effective.
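
For the vLLM side, a measurement of the kind the evaluation needs might look like the sketch below (assuming vLLM is installed and the chosen model fits on the GPU; the model name and toy batch are placeholders). The DeepSpeed side would follow the same pattern using its own inference engine API.

# Illustrative sketch: batch latency and generated-token throughput with vLLM.
import time
from vllm import LLM, SamplingParams

prompts = ["Summarize the benefits of paged KV caches."] * 32  # toy batch
params = SamplingParams(temperature=0.0, max_tokens=128)

llm = LLM(model="facebook/opt-1.3b")  # placeholder model choice

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"end-to-end latency: {elapsed:.2f} s")
print(f"throughput: {generated / elapsed:.1f} generated tokens/s")
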
References:
Efficient Memory Management for Large Language Model Serving with PagedAttention
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

Estimating Time and Resource Usage of SLURM Jobs Using RLM

Supervisors: Mahmoud Alasmar, Alexander Lazovik
Date: 2026-01-09
Type: master-project/master-internship
Description:

Efficient allocation of computational resources in high-performance computing (HPC) clusters requires accurate prediction of job runtime and resource requirements. Users often over-request CPU, memory, or time to avoid failures, which can lead to wasted resources and longer queue times. Therefore, predicting these requirements before job submission is critical for improving cluster utilization and scheduling efficiency. This project investigates how Regression Language Models (RLMs) can be used to estimate the time and resource usage of SLURM jobs based on submitted Bash scripts and job metadata. The study will use real job submission data from the Habrok HPC cluster.
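
As a concrete picture of the inputs, the sketch below (illustrative only; the regex and field names are assumptions, not part of the project materials) extracts the requested resources from a SLURM batch script. An RLM-based predictor would consume the script text plus such metadata, with the actual runtime and peak memory recorded by SLURM accounting as the regression targets.

# Illustrative sketch: turning a SLURM batch script into features for a predictor.
import re

SBATCH_RE = re.compile(r"^#SBATCH\s+--?([\w-]+)[=\s]+(\S+)", re.MULTILINE)

def parse_sbatch(script: str) -> dict:
    """Extract requested resources from #SBATCH directives."""
    return {key: value for key, value in SBATCH_RE.findall(script)}

example_job = """#!/bin/bash
#SBATCH --job-name=train-resnet
#SBATCH --time=04:00:00
#SBATCH --mem=32G
#SBATCH --cpus-per-task=8
python train.py --epochs 50
"""

requested = parse_sbatch(example_job)
print(requested)  # {'job-name': 'train-resnet', 'time': '04:00:00', 'mem': '32G', 'cpus-per-task': '8'}
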
References:
Regression Language Models for Code

DL Jobs Generator

Supervisors: Kawsar Haghshenas, Mahmoud Alasmar
Date: 2024-11-01
Type: bachelor-internship
Description:

In this project you will implement a job generator process. As input, a JSON configuration file and a job generation rate (jobs per unit time) will be provided. The configuration file contains metadata about different deep learning jobs, such as the path to the executable file and its required arguments. Your task is to design and implement a generator that works as follows: randomly select a job from the JSON file, use the metadata of the selected job to prepare a YAML/batch script from a provided template, and submit the prepared script to another process using an RPC protocol. The rate at which jobs are sampled and submitted should match the given generation rate. In addition, an implementation of the RPC protocol is required; a description of the protocol will be provided. You may choose any programming language, but Python is recommended. You will be provided with supplementary code containing helper functions, the RPC protocol description, and a description of the script and configuration files.
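
A minimal sketch of the generator loop is shown below. The JSON schema, template placeholders, and RPC submit call are assumptions for illustration; the real protocol description and templates are provided with the project.

# Illustrative sketch of the generator loop described above.
import json
import random
import time
from string import Template

def run_generator(config_path: str, template_path: str, rate: float, submit) -> None:
    """Sample jobs from the config and submit them at `rate` jobs per second."""
    with open(config_path) as f:
        jobs = json.load(f)["jobs"]            # assumed schema: {"jobs": [...]}
    with open(template_path) as f:
        template = Template(f.read())          # assumed $executable / $args placeholders

    interval = 1.0 / rate
    while True:
        job = random.choice(jobs)              # uniform random selection
        script = template.substitute(
            executable=job["executable"],      # assumed metadata fields
            args=" ".join(job.get("args", [])),
        )
        submit(script)                         # RPC call, protocol-specific
        time.sleep(interval)                   # keeps the average submission rate

# `submit` would wrap the provided RPC protocol, for example an XML-RPC proxy:
#   proxy = xmlrpc.client.ServerProxy("http://scheduler:8000")
#   run_generator("jobs.json", "job.batch.tmpl", rate=0.5, submit=proxy.submit_job)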

Implementation of Hardware Monitoring APIs

Supervisors: Kawsar Haghshenas, Mahmoud Alasmar
Date: 2024-11-01
Type: bachelor-internship
Description:

In this project we are developing a resource manager framework. Part of this framework is a set of monitoring API functions that gather hardware metrics upon invocation. You will be given a code template, and your task is to fill in the code for some of these API functions. Supplementary programs and helper functions will be provided so that you can test your implementation. Documentation for the required API functions, including their input and output parameters, will be provided. This project requires C and Python programming as well as basic operating systems knowledge.
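
As an illustration of what such a monitoring function does (sketched here in Python; the assignment itself fills in implementations against the provided template), the snippet below reads basic memory and load metrics from Linux's /proc filesystem.

# Illustrative sketch: simple hardware metrics gathered on invocation (Linux only).

def read_memory_metrics() -> dict:
    """Return total/available memory in kB from /proc/meminfo."""
    metrics = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            if key in ("MemTotal", "MemAvailable"):
                metrics[key] = int(value.strip().split()[0])  # value in kB
    return metrics

def read_load_average() -> float:
    """Return the 1-minute load average from /proc/loadavg."""
    with open("/proc/loadavg") as f:
        return float(f.read().split()[0])

if __name__ == "__main__":
    print(read_memory_metrics())  # e.g. {'MemTotal': ..., 'MemAvailable': ...}
    print(read_load_average())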

Benchmarking AI Workloads on GPU Cluster

Supervisors: Kawsar Haghshenas, Mahmoud Alasmar
Date: 2025-01-21
Type: bachelor-project
Description:

Understanding the characteristics of AI workloads is essential for effective resource allocation and fault tolerance mechanisms. This project focuses on benchmarking various deep neural network (DNN) models on GPUs using different profiling and monitoring tools. You will observe and analyze their runtime behavior, identify the factors affecting model performance, and propose metrics that effectively quantify their runtime characteristics. The outcome of this project is to deliver a comprehensive study on profiling DNN models with minimal overhead and maximum accuracy.
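
As one example of low-overhead monitoring (a sketch assuming the pynvml bindings for NVIDIA's NVML library; the sampling period and duration are arbitrary), the snippet below records GPU utilization, memory usage, and power draw while a model runs.

# Illustrative sketch: periodic GPU metric sampling via NVML.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

samples = []
for _ in range(10):                            # sample for ~10 seconds
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
    samples.append((util.gpu, util.memory, mem.used, power))
    time.sleep(1.0)

pynvml.nvmlShutdown()
print(f"mean GPU utilization: {sum(s[0] for s in samples) / len(samples):.1f}%")
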
References:
Gao, Wanling, et al. "Data motifs: A lens towards fully understanding big data and AI workloads." Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, 2018. https://arxiv.org/abs/1808.08512
Xiao, Wencong, et al. "Gandiva: Introspective cluster scheduling for deep learning." 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018. https://www.usenix.org/conference/osdi18/presentation/xiao
Yang, Charlene, et al. "Hierarchical roofline performance analysis for deep learning applications." Intelligent Computing: Proceedings of the 2021 Computing Conference, Volume 2. Springer International Publishing, 2021. https://arxiv.org/abs/2009.05257

Cluster Scheduling for DLT Workloads

Supervisors: Kawsar Haghshenas, Mahmoud Alasmar
Date: 2025-01-27
Type: student-colloquium
Description:

Estimating Deep Learning GPU Memory Consumption

Supervisors: Kawsar Haghshenas, Mahmoud Alasmar
Date: 2023-12-11
Type: student-colloquium
Description:

Leveraging Structural Similarity for Performance Estimation of Deep Learning Training Jobs

Supervisors: Mahmoud Alasmar
Date: 2025-04-04
Type: master-internship
Description:

Deep learning (DL) workload-aware schedulers make decisions based on performance data collected through a process called profiling. However, profiling each job individually is computationally expensive, which reduces the practicality of such approaches. Fortunately, DL models exhibit structural similarities that can be leveraged to develop alternative methods for performance estimation. One promising approach represents DL models as graphs and measures their similarity using Graph Edit Distance (GED) [1]. By analyzing the structural similarity between models, we can potentially predict the performance of one model from the known performance of another, reducing the need for extensive profiling. In this project, you will study and implement the similarity matching mechanism proposed in [2], compare the runtime performance of similar DL models with a focus on key metrics such as GPU utilization and power consumption, and investigate the relationship between model similarity and performance predictability, answering the following question: given two similar DL models and the performance of one, what can we infer about the performance of the other? You will work with a selected set of DL training models, and performance metrics will be collected using NVIDIA's GPU profiling tools, such as DCGM.
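
To illustrate the graph-similarity idea, the toy sketch below compares two hand-written layer chains with Graph Edit Distance via networkx. In the project, model graphs would instead be extracted from real model definitions, and similarity would be computed with the matching mechanism studied in [2]; the graphs and cost function here are assumptions for illustration.

# Toy sketch: Graph Edit Distance between two small "model graphs".
import networkx as nx

def layer_graph(layers):
    """Build a simple chain graph where each node carries its layer type."""
    g = nx.DiGraph()
    for i, layer in enumerate(layers):
        g.add_node(i, op=layer)
        if i > 0:
            g.add_edge(i - 1, i)
    return g

model_a = layer_graph(["conv", "relu", "conv", "relu", "fc"])
model_b = layer_graph(["conv", "relu", "conv", "fc"])

# Nodes only match when their op types agree; mismatches require edit operations.
ged = nx.graph_edit_distance(
    model_a, model_b,
    node_match=lambda a, b: a["op"] == b["op"],
)
print(f"graph edit distance: {ged}")  # smaller => structurally more similar
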
References:
Fei Bi, Lijun Chang, Xuemin Lin, Lu Qin, and Wenjie Zhang. 2016. Efficient Subgraph Matching by Postponing Cartesian Products. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). Association for Computing Machinery, New York, NY, USA, 1199–1214.
Lai, F., Dai, Y., Madhyastha, H. V., & Chowdhury, M. (2023). ModelKeeper: Accelerating DNN training via automated training warmup. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23) (pp. 769–785).