KServe is an open-source, Kubernetes-native framework for deploying machine learning inference services. It supports both server-based deployments using standard Kubernetes resources and serverless deployments using Knative, enabling request-driven autoscaling and scale-to-zero capabilities. This project aims to evaluate the system-level performance (latency, throughput, resource usage) of server-based and serverless deployment strategies for classical machine learning prediction workloads using KServe. The study focuses on CPU-only inference services based on scikit-learn and XGBoost models. In the first phase, representative machine learning prediction models will be trained to generate inference workloads. In the second phase, these models will be deployed using KServe under two configurations: (i) Kubernetes-based deployments with Horizontal Pod Autoscaling (HPA), and (ii) Knative-based serverless deployments with request-driven autoscaling. A stream of controlled query workloads will be generated to simulate different traffic patterns. The evaluation will focus on latency, throughput, autoscaling responsiveness, CPU and memory utilization, and cold-start overhead. The results will highlight the trade-offs between performance, scalability, and resource efficiency in server-based and serverless ML serving environments, and show how each deployment strategy can be adapted to different types of workloads.
References: Clipper: A Low-Latency Online Prediction Serving System; SOCK: Rapid Task Provisioning with Serverless-Optimized Containers; SelfTune: Tuning Cluster Managers; Horizontal Pod Autoscaling; Knative Technical Overview; KServe Documentation
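To make the workload-generation step concrete, below is a minimal sketch of a closed-loop load generator, assuming the KServe V1 prediction protocol and a hypothetical service URL, model name, and feature vector; a full harness would also sweep arrival rates and traffic patterns.

```python
# Minimal closed-loop load generator sketch. The service URL, model name,
# and payload are placeholders (assumptions), not part of the project spec.
import statistics
import time
import requests

SERVICE_URL = "http://sklearn-model.default.example.com"  # hypothetical ingress host
MODEL_NAME = "sklearn-model"                               # hypothetical model name
PAYLOAD = {"instances": [[6.8, 2.8, 4.8, 1.4]]}            # example feature vector

def run(rate_per_s: float, duration_s: float) -> None:
    """Send requests at a fixed rate and record per-request latency."""
    latencies, errors = [], 0
    interval = 1.0 / rate_per_s
    deadline = time.time() + duration_s
    while time.time() < deadline:
        start = time.time()
        try:
            resp = requests.post(
                f"{SERVICE_URL}/v1/models/{MODEL_NAME}:predict",
                json=PAYLOAD,
                timeout=10,
            )
            resp.raise_for_status()
            latencies.append(time.time() - start)
        except requests.RequestException:
            errors += 1
        # Wait out the remainder of the inter-arrival interval; a real harness
        # would likely use open-loop (e.g. Poisson) arrivals instead.
        time.sleep(max(0.0, interval - (time.time() - start)))
    if latencies:
        p99 = statistics.quantiles(latencies, n=100)[98]
        print(f"requests={len(latencies)} errors={errors} "
              f"p50={statistics.median(latencies):.3f}s p99={p99:.3f}s")

if __name__ == "__main__":
    run(rate_per_s=5, duration_s=30)
```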
Estimating Inference Latency of Deep Learning Models Using Roofline Analysis
Accurate estimation of inference latency is critical for meeting service-level objectives (SLOs) in large language model (LLM) serving systems. While classical ML prediction methods can be leveraged for the estimation task, their accuracy depends heavily on the selected features. On the other hand, analytical performance models, such as Roofline analysis, provide a hardware-aware upper bound on achievable performance; however, their applicability to latency estimation remains an open question. This project investigates how Roofline analysis can be integrated with ML prediction methods for improved estimation of end-to-end inference latency of LLM queries on a single GPU. A small set of representative LLMs will be selected, and inference latency will be measured under controlled conditions (sequence length, batch size). Roofline-related metrics, such as arithmetic intensity and memory bandwidth utilization, will be collected using GPU profiling tools. These metrics will be used to estimate processing time and to build regression models that predict end-to-end inference latency. The evaluation will analyze prediction error, sensitivity to model size and input length, and the limitations of Roofline-based estimation.
References: Predicting LLM Inference Latency: A Roofline-Driven ML Method
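As an illustration of the Roofline component, the sketch below computes a per-token latency lower bound from assumed hardware peaks and the common 2-FLOPs-per-parameter approximation for decoding; all numbers are placeholders, not measurements.

```python
# Illustrative roofline lower bound for per-token decode latency.
# All hardware numbers and the 2*N FLOPs-per-token approximation are
# stated assumptions, not profiled values.

def roofline_time(flops: float, bytes_moved: float,
                  peak_flops: float, peak_bw: float) -> float:
    """Lower-bound execution time according to the roofline model."""
    arithmetic_intensity = flops / bytes_moved           # FLOP per byte
    attainable = min(peak_flops, arithmetic_intensity * peak_bw)
    return flops / attainable                            # seconds

# Example: 7B-parameter model in FP16 on a single GPU (assumed specs).
n_params    = 7e9
flops_token = 2 * n_params          # ~2 FLOPs per parameter per generated token
bytes_token = 2 * n_params          # FP16 weights streamed once per token
peak_flops  = 150e12                # assumed peak FP16 throughput (FLOP/s)
peak_bw     = 1.5e12                # assumed memory bandwidth (B/s)

t = roofline_time(flops_token, bytes_token, peak_flops, peak_bw)
print(f"per-token lower bound: {t*1e3:.2f} ms")  # memory-bound regime here
```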
Evaluating the Performance of vLLM and DeepSpeed for Serving LLM Inference Queries
The computational complexity of serving large language model (LLM) queries depends heavily on model size, sequence length, and memory access patterns. To address these challenges, several LLM inference serving frameworks have been proposed, employing different optimization techniques to improve throughput and reduce memory overhead. vLLM and DeepSpeed are two prominent frameworks that employ distinct techniques to serve inference efficiently. vLLM proposes PagedAttention for efficient key–value cache management. DeepSpeed, on the other hand, integrates multiple optimization techniques, such as parallelism and kernel-level optimizations, for scalable inference. This project aims to systematically evaluate the end-to-end inference performance (latency, throughput, memory footprint) of vLLM and DeepSpeed under different inference workloads. Experiments will be performed using one of the publicly available datasets, such as ShareGPT. The results will highlight the trade-offs between KV cache management, kernel-level optimizations, and parallelism strategies in LLM inference serving, providing insights into the conditions under which each framework is most effective.
References: Efficient Memory Management for Large Language Model Serving with PagedAttention; DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
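As a starting point for the vLLM side of the comparison, the sketch below measures batch latency and generated-token throughput with vLLM's offline Python API; the model name and prompts are placeholders, and an analogous harness would be built for DeepSpeed.

```python
# Minimal offline latency/throughput probe using the vLLM Python API.
# Model name and prompts are placeholders; a full experiment would replay
# ShareGPT requests and repeat the same measurement with DeepSpeed.
import time
from vllm import LLM, SamplingParams

prompts = ["Explain the roofline model in one sentence."] * 32  # toy batch
params = SamplingParams(temperature=0.0, max_tokens=128)

llm = LLM(model="facebook/opt-1.3b")   # placeholder model

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"batch latency: {elapsed:.2f}s, "
      f"throughput: {generated / elapsed:.1f} generated tokens/s")
```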
Estimating Time and Resource Usage of SLURM Jobs Using Regression Language Models
Efficient allocation of computational resources in high-performance computing (HPC) clusters requires accurate prediction of job runtime and resource requirements. Users often over-request CPU, memory, or time to avoid failures, which can lead to wasted resources and longer queue times. Therefore, predicting these requirements before job submission is critical for improving cluster utilization and scheduling efficiency. This project investigates how Regression Language Models (RLMs) can be used to estimate the time and resource usage of SLURM jobs based on submitted Bash scripts and job metadata. The study will use real job submission data from the Habrok HPC cluster.
References: Regression Language Models for Code
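One possible preprocessing step is sketched below: extracting #SBATCH directives from a submitted Bash script and packaging them, together with the raw script text, into an input for the regression model. The field names and example script are assumptions; real targets would come from the cluster's accounting data.

```python
# Sketch of turning a SLURM batch script into model input. The example script
# and the prompt format are illustrative assumptions, not the project's spec.
import re

SBATCH_RE = re.compile(r"^#SBATCH\s+(--?[\w-]+)(?:[=\s]+(\S+))?", re.MULTILINE)

def extract_directives(script_text: str) -> dict:
    """Collect #SBATCH flags such as --time, --mem, --cpus-per-task."""
    return {flag.lstrip("-"): value for flag, value in SBATCH_RE.findall(script_text)}

example = """#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH --mem=16G
#SBATCH --cpus-per-task=8
python train.py --epochs 50
"""

features = extract_directives(example)
# The RLM input could simply be the raw script text; the parsed directives
# can serve as a baseline feature set or as auxiliary metadata.
prompt = f"Requested: {features}\nScript:\n{example}"
print(prompt)
```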
In this project, you will implement a job generator process. As input, a JSON configuration file and a job generation rate (jobs per unit time) will be provided. The configuration file contains metadata about different deep learning jobs, such as the path to the executable file and the required arguments. Your task is to design and implement a generator that works as follows: randomly select a job from the JSON file, use the metadata of the selected job to prepare a YAML/batch script (a script template will be provided), and submit the prepared script to another process using an RPC protocol. The rate at which jobs are sampled and submitted should equal the given generation rate. In addition, an implementation of an RPC protocol is required; a description of the protocol will be provided. You may use any programming language, but Python is recommended. Supplementary code with helper functions, the RPC protocol description, and descriptions of the script and configuration files will be provided.
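A minimal sketch of the generator loop is shown below; the JSON layout, the script template, and the submit_script() RPC stub are placeholders until the actual specifications and helper code are provided.

```python
# Job generator sketch under stated assumptions: a "jobs" list with "path"
# and "args" fields, a trivial script template, and a stubbed RPC call.
import json
import random
import time

TEMPLATE = "#!/bin/bash\npython {path} {args}\n"   # placeholder for the provided template

def submit_script(script: str) -> None:
    """Stub for the provided RPC protocol; replace with the real client call."""
    print("submitting:\n" + script)

def generate(config_path: str, rate_per_s: float) -> None:
    with open(config_path) as f:
        jobs = json.load(f)["jobs"]               # assumed top-level key
    interval = 1.0 / rate_per_s
    while True:
        job = random.choice(jobs)                 # uniform random selection
        script = TEMPLATE.format(path=job["path"], args=" ".join(job["args"]))
        submit_script(script)
        time.sleep(interval)                      # enforce the generation rate

if __name__ == "__main__":
    generate("jobs.json", rate_per_s=2.0)
```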
In this project, we are developing a resource manager framework. Part of this framework is a set of monitoring API functions that gather hardware metrics upon invocation. You will be given a code template, and your task is to implement some of these API functions. Supplementary programs and helper functions will be provided so that you can test your implementation, along with documentation of the required API functions, including their input and output parameters. This project requires C and Python programming as well as basic operating systems knowledge.
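For orientation, the following is a minimal Python sketch of one such monitoring call, reading memory metrics from /proc/meminfo on Linux; the function name and return format are illustrative only, since the real API signatures will come from the provided documentation.

```python
# Illustrative monitoring call: read memory metrics from /proc/meminfo.
# The function name and returned fields are assumptions for this sketch.
def get_memory_metrics() -> dict:
    """Return total and available memory in kilobytes."""
    metrics = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            if key in ("MemTotal", "MemAvailable"):
                metrics[key] = int(value.strip().split()[0])  # value is in kB
    return metrics

if __name__ == "__main__":
    print(get_memory_metrics())
```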