Projects

Dataset Management

Supervisors: Huy Truong, Andrés Tello
Date: 2025-08-25
Type: bi
Description:

Have you ever managed a large dataset? This project provides an opportunity to handle a dataset with over 8,000 downloads each month. You will reorganize the dataset by task, document it thoroughly, and create a user-friendly interface and leaderboard. The project also involves working with HPC clusters, Hugging Face libraries, and GitHub Pages for documentation. Basic Python skills and familiarity with Linux commands are required.

DL Jobs Generator

Supervisors: Kawsar Haghshenas, Mahmoud Alasmar
Date: 2024-11-01
Type: bi
Description:

In this project you will implement a job generator process. As input, you will be given a JSON configuration file and a job generation rate (jobs per unit time). The configuration file contains metadata about different deep learning jobs, such as the path to the executable and the required arguments. Your task is to design and implement a generator that works as follows: randomly select a job from the JSON file, use its metadata to prepare a YAML/batch script from a provided template, and submit the prepared script to another process using an RPC protocol. Jobs should be sampled and submitted at the given generation rate. In addition, you will implement the RPC protocol itself; its description will be provided. You may use any programming language, but Python is recommended. You will be provided with supplementary code containing helper functions, the RPC protocol description, and descriptions of the script and configuration files.
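
As a rough illustration of the intended workflow, the sketch below samples jobs and submits filled-in scripts at a fixed rate. The configuration layout (a top-level "jobs" list with "path" and "args" fields), the template placeholders, and the `submit` callable standing in for the RPC client are illustrative assumptions; the actual formats and protocol will be provided with the project.

```python
import json
import random
import string
import time

def run_generator(config_path: str, template_path: str, rate_per_min: float, submit):
    """Sample jobs from the JSON config and submit filled-in scripts at a fixed rate.

    `submit` stands in for the RPC client call; the real protocol is provided separately.
    """
    with open(config_path) as f:
        jobs = json.load(f)["jobs"]            # assumed layout: {"jobs": [{...}, ...]}
    with open(template_path) as f:
        template = string.Template(f.read())   # e.g. "$executable $args" placeholders

    interval = 60.0 / rate_per_min             # seconds between submissions
    while True:
        job = random.choice(jobs)              # uniform random selection
        script = template.substitute(
            executable=job["path"],            # assumed metadata keys
            args=" ".join(job.get("args", [])),
        )
        submit(script)                         # hand the prepared script to the RPC client
        time.sleep(interval)
```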

Implementation of Hardware Monitoring APIs

Supervisors: Kawsar Haghshenas, Mahmoud Alasmar
Date: 2024-11-01
Type: bi
Description:

In this project we are developing a resource manager framework. Part of this framework is a set of monitoring APIs that gather hardware metrics upon invocation. You will be given a code template, and your task is to fill in the code for some of the API functions. Supplementary programs and helper functions will be provided so that you can test your implementation, along with documentation of the required API functions, including their input and output parameters. This project requires C and Python programming as well as basic operating systems knowledge.
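
As a taste of what such a metric-gathering function can look like (the actual template and function signatures will be provided), here is a minimal Python sketch that reads memory and CPU counters from the Linux /proc filesystem:

```python
def read_memory_metrics():
    """Return total and available memory in kB, parsed from /proc/meminfo (Linux)."""
    metrics = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            if key in ("MemTotal", "MemAvailable"):
                metrics[key] = int(value.strip().split()[0])  # value looks like "16384 kB"
    return metrics

def read_cpu_times():
    """Return the aggregate CPU time counters from the first line of /proc/stat."""
    with open("/proc/stat") as f:
        fields = f.readline().split()
    return dict(zip(["user", "nice", "system", "idle"], map(int, fields[1:5])))
```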

Mining sensors data for anomaly detection (with industrial partner)

Supervisors: Dilek Düştegör, Revin Alief
Date: 2024-10-28
Type: bi
Description:

In collaboration with an industrial partner, we have access to sensor data collected from various machines. We need a curious student to look at the dataset from multiple angles and see whether data mining / data science techniques can identify any patterns. This is an ideal project if you want to learn about data mining / data science techniques.

Mining sales data to identify patterns (with industrial partner)

Supervisors: Dilek Düştegör, Revin Alief
Date: 2024-10-28
Type: bi
Description:

In collaboration with an industrial partner, we have access to their sales data. We need a curious student to look at the dataset from multiple angles and see whether data mining / data science techniques can identify any patterns. This is an ideal project if you want to learn about data mining / data science techniques.

Research packages for the formal specification and verification of process compositions

Supervisors: Heerko Groefsema
Date: 2024-10-21
Type: bi
Description:

For our research we implemented and use a number of Java packages that allow us to specify, unfold, and verify process compositions such as business process models and service compositions. These packages require some work, including new functionality, replacing old dependencies, adding different output formats, replacing log functionality, refactoring to use certain programming patterns, and more. In this project, we would like a number of students to improve, refactor, and add functionality. The project is available for up to 5 students, who will work on separate sub-projects such as:
- Adding rich Event Log generation from random executions of annotated Petri net models.
- Separating embedded data annotations and allowing execution of Petri nets using data.
- Adding functionality for colored Petri nets.
- Implementing improved Prime Event Structure (PES) representations of processes and unfolding (i.e., creation of PES) from Petri nets.
- Replacing old dependencies and refactoring.

In-Network Atomic Multicast Protocol Validation and Verification

Supervisors: Bochra Boughzala
Date: 2024-10-29
Type: bi
Description:

This project involves creating a high-performance networking application using the Intel Data Plane Development Kit (DPDK). The application will receive network packets over a high-speed interface, inspect the packet header fields, and validate the protocol properties for correctness and consistency guarantees. The tool will verify protocol correctness by ensuring that all receiving nodes deliver the same set of packets in exactly the same order, which guarantees total order and a consistent state across all nodes.
Prerequisites: C/C++ programming language - Networking libraries and tools for protocol analysis - Logging library for error reporting.
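
A minimal sketch of the validation logic, independent of DPDK: given per-node delivery logs (a hypothetical mapping from node id to an ordered list of packet ids), check that every node delivered the same set of packets in the same order.

```python
from itertools import zip_longest

def check_total_order(delivery_logs: dict[str, list[str]]) -> list[str]:
    """Compare per-node delivery logs; an empty result means total order holds."""
    violations = []
    nodes = list(delivery_logs)
    reference_node = nodes[0]
    reference = delivery_logs[reference_node]
    for node in nodes[1:]:
        if set(delivery_logs[node]) != set(reference):
            violations.append(f"{node} delivered a different packet set than {reference_node}")
        for pos, (a, b) in enumerate(zip_longest(reference, delivery_logs[node])):
            if a != b:
                violations.append(f"order diverges at position {pos}: {reference_node}={a}, {node}={b}")
                break
    return violations
```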

In-Network Data Stream Processing Serialization

Supervisors: Bochra Boughzala
Date: 2024-10-29
Type: bi
Description:

This project involves creating a high-performance networking application using the Intel Data Plane Development Kit (DPDK). The application will read entries from a specified database, convert these entries into JSON format, and send the JSON entries as packets over a high-speed network. The focus is on achieving low-latency and high-throughput performance.
Prerequisites: C/C++ programming language - Database Management (e.g., PostgreSQL databases).
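
A minimal sketch of the serialization pipeline, with sqlite3 and a plain UDP socket standing in for the PostgreSQL source and the DPDK fast path used in the actual project:

```python
import json
import socket
import sqlite3

def stream_table_as_json(db_path: str, table: str, dest: tuple[str, int]):
    """Read rows from a table, serialize each to JSON, and send it as a UDP datagram."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row                 # access columns by name
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for row in conn.execute(f"SELECT * FROM {table}"):
        payload = json.dumps(dict(row)).encode()   # one JSON object per packet
        sock.sendto(payload, dest)
    conn.close()
```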

Automated Dataset Generator for Wastewater System Simulations

Supervisors: Dilek Düştegör, Revin Alief
Date: 2025-01-21
Type: bachelor
Description:

Automating dataset generation is essential for streamlining the preprocessing pipeline, enabling faster iterations and ensuring scalability as wastewater systems become increasingly complex. This project involves designing and implementing a program to automate dataset generation for wastewater simulations. The program will take inputs such as simulation parameters, geographic data, and network structures (e.g., pipe diameters and node connections) and output datasets compatible with Graph Neural Network (GNN) models. The student will focus on creating a flexible, user-friendly interface for defining input parameters, ensuring compatibility with graph-based models and simulation outputs, and testing the tool across multiple simulation scenarios. The expected outcome is a reusable dataset generation tool that significantly enhances the efficiency of wastewater simulation workflows.
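
As a sketch of what "GNN-compatible" output could look like, the snippet below turns a hypothetical node table and pipe list into node-feature, edge-index, and edge-attribute arrays; the field names ("elevation", "from", "to", "diameter") are illustrative assumptions, not the tool's final schema.

```python
import numpy as np

def build_graph_dataset(nodes: dict, pipes: list[dict]):
    """Turn a node table and a pipe list into arrays a GNN library can consume."""
    index = {node_id: i for i, node_id in enumerate(nodes)}
    x = np.array([[feats["elevation"]] for feats in nodes.values()], dtype=np.float32)
    edge_index = np.array(
        [[index[p["from"]], index[p["to"]]] for p in pipes], dtype=np.int64
    ).T                                            # shape (2, num_edges)
    edge_attr = np.array([[p["diameter"]] for p in pipes], dtype=np.float32)
    return x, edge_index, edge_attr

# Example: a toy three-node network
x, ei, ea = build_graph_dataset(
    {"A": {"elevation": 10.0}, "B": {"elevation": 8.5}, "C": {"elevation": 7.0}},
    [{"from": "A", "to": "B", "diameter": 0.3}, {"from": "B", "to": "C", "diameter": 0.25}],
)
```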
References:
Infoworks ICM Exchange
Infoworks Ruby Scripts

Optimizing Graph Neural Networks for Water Level Estimation

Supervisors: Dilek Düştegör, Revin Alief
Date: 2025-01-21
Type: bachelor
Description:

Optimizing Graph Neural Networks (GNNs) is critical for enhancing the accuracy and efficiency of water level predictions, which directly influences the reliability of wastewater management systems. This project focuses on refining specific aspects of GNNs for predicting water levels at nodes in wastewater networks. The student will explore hyperparameter tuning, feature engineering, and advanced GNN variants such as Graph Attention Networks (GAT). Key tasks include studying the impact of different GNN architectures and parameters on performance, experimenting with incorporating edge features (e.g., pipe diameters), and evaluating models on datasets generated by ICM. The expected outcome is a comprehensive analysis of optimization techniques for GNNs in the wastewater domain.
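
A minimal sketch, assuming PyTorch Geometric, of a GAT-based regressor in which pipe attributes enter the attention mechanism via the edge_dim argument; it is meant only to show where edge features plug in, not as the project's final architecture.

```python
import torch
from torch_geometric.nn import GATConv

class WaterLevelGAT(torch.nn.Module):
    """Two GAT layers; edge_dim lets pipe attributes (e.g. diameter) enter the attention."""
    def __init__(self, in_dim: int, hidden: int, edge_dim: int):
        super().__init__()
        self.conv1 = GATConv(in_dim, hidden, heads=4, edge_dim=edge_dim)
        self.conv2 = GATConv(hidden * 4, 1, heads=1, edge_dim=edge_dim)  # one level per node

    def forward(self, x, edge_index, edge_attr):
        h = torch.relu(self.conv1(x, edge_index, edge_attr))
        return self.conv2(h, edge_index, edge_attr)
```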
References:
Zhang, Z., Tian, W., Lu, C., Liao, Z., & Yuan, Z. (2024). Graph neural network-based surrogate modelling for real-time hydraulic prediction of urban drainage networks. Water Research, 263, 122142. https://doi.org/10.1016/j.watres.2024.122142
Li, M., Shi, X., Lu, Z., & Kapelan, Z. (2024). Predicting the urban stormwater drainage system state using the Graph-WaveNet. Sustainable Cities and Society, 115, 105877. https://doi.org/10.1016/j.scs.2024.105877

Benchmarking AI Workloads on GPU Cluster

Supervisors: Kawsar Haghshenas, Mahmoud Alasmar
Date: 2025-01-21
Type: bachelor
Description:

Understanding the characteristics of AI workloads is essential for effective resource allocation and fault tolerance mechanisms. This project focuses on benchmarking various deep neural network (DNN) models on GPUs using different profiling and monitoring tools. You will observe and analyze their runtime behavior, identify the factors affecting model performance, and propose metrics that effectively quantify their runtime characteristics. The outcome of this project is to deliver a comprehensive study on profiling DNN models with minimal overhead and maximum accuracy.
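
As one possible starting point (DCGM or other profilers may be used instead), the sketch below polls GPU utilization, memory use, and power draw through the NVML Python bindings while a training job runs:

```python
import time
import pynvml

def sample_gpu_metrics(duration_s: int = 60, interval_s: float = 1.0, device: int = 0):
    """Poll GPU utilization, memory use, and power draw via NVML at a fixed interval."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device)
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # NVML reports milliwatts
        samples.append({"gpu_util": util.gpu, "mem_used_mb": mem.used / 2**20, "power_w": power_w})
        time.sleep(interval_s)
    pynvml.nvmlShutdown()
    return samples
```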
References:
Gao, Wanling, et al. "Data motifs: A lens towards fully understanding big data and AI workloads." Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques. 2018. https://arxiv.org/abs/1808.08512
Xiao, Wencong, et al. "Gandiva: Introspective cluster scheduling for deep learning." 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 2018. https://www.usenix.org/conference/osdi18/presentation/xiao
Yang, Charlene, et al. "Hierarchical roofline performance analysis for deep learning applications." Intelligent Computing: Proceedings of the 2021 Computing Conference, Volume 2. Springer International Publishing, 2021. https://arxiv.org/abs/2009.05257

DiTEC project- Unsupervised Learning for Customer Profiles in Water Distribution Networks

Supervisors: Huy Truong, Dilek Düştegör
Date: 2025-01-21
Type: bachelor
Description:

Researchers studying drinking water distribution networks often rely on large-scale synthesized datasets. However, the simulation currently used to generate these datasets has limited support for retrieving the metadata that makes a dataset accessible. This missing metadata, including the customer profile at each node of the network, plays a crucial role in classifying customer types and estimating their demand, particularly during peak seasons. To address this gap, the student could apply unsupervised clustering algorithms such as K-NN, K-means, or DBSCAN to identify and retrieve these customer profiles. The resulting pipeline will eventually be used to extract the missing metadata for a large-scale dataset, enabling water experts to analyze water networks and benchmark customer behavior efficiently. This project requires a candidate with a solid background in machine learning and an interest in building robust data pipelines.
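
A minimal sketch of the clustering step, assuming scikit-learn and a hypothetical matrix of 24-hour demand patterns (one row per node); the real feature set would come from the simulation metadata.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler

# demand_patterns: one row per node, one column per hour of a typical day (assumed layout)
demand_patterns = np.random.rand(200, 24)

X = StandardScaler().fit_transform(demand_patterns)

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)   # -1 marks noise points

print("k-means cluster sizes:", np.bincount(kmeans_labels))
```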
References:
Tello, A., Truong, H., Lazovik, A., & Degeler, V. (2024). Large-scale multipurpose benchmark datasets for assessing data-driven deep learning approaches for water distribution networks. Engineering Proceedings, 69(1), 50. https://doi.org/10.3390/engproc2024069050.

Node masking in Graph Neural Networks

Supervisors: Huy Truong, Dilek Düstegör
Date: 2025-01-21
Type: bachelor
Description:

Working with real-world data often leads to missing-information problems, which can negatively affect the performance of deep learning models. When introduced deliberately, however, missing information can boost the expressiveness of Graph Neural Network (GNN) models in node representation learning through a technique known as node masking: arbitrary nodal features in a graph are hidden, and the GNN is instructed to recover the missing parts. The student can explore diverse masking strategies, such as zero masking, random node replacement, mean-neighbor substitution, shared learnable embeddings, and nodal permutation. These options should be compared and evaluated on a graph reconstruction task for a water distribution network. The study will focus on finding a generative technique that effectively enhances the performance of GNN models in semi-supervised transductive learning. Students interested in joining this project should have a machine-learning background and experience with a deep-learning framework.
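
For illustration, a PyTorch-style sketch of two of the masking strategies (zero masking and random node replacement); the remaining strategies follow the same pattern of corrupting selected feature rows before reconstruction.

```python
import torch

def mask_node_features(x: torch.Tensor, mask_ratio: float = 0.15, strategy: str = "zero"):
    """Hide a random subset of node feature rows; a GNN is then trained to reconstruct them.

    Returns the corrupted features and the boolean mask of hidden nodes.
    Mean-neighbor substitution, learnable mask tokens, and permutation follow the same idea.
    """
    num_nodes = x.size(0)
    mask = torch.rand(num_nodes) < mask_ratio
    x_corrupt = x.clone()
    if strategy == "zero":                       # zero masking
        x_corrupt[mask] = 0.0
    elif strategy == "random_node":              # replace with features of random other nodes
        donors = torch.randint(0, num_nodes, (int(mask.sum()),))
        x_corrupt[mask] = x[donors]
    return x_corrupt, mask
```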
References:
Hou, Zhenyu, Xiao Liu, Yuxiao Dong, Chunjie Wang, and Jie Tang. "GraphMAE: Self-Supervised Masked Graph Autoencoders." arXiv preprint arXiv:2205.10803 (2022).
Abboud, Ralph, Ismail Ilkan Ceylan, Martin Grohe, and Thomas Lukasiewicz. "The surprising power of graph neural networks with random node initialization." arXiv preprint arXiv:2010.01179 (2020).
Hajgató, Gergely, Bálint Gyires-Tóth, and György Paál. "Reconstructing nodal pressures in water distribution systems with graph neural networks." arXiv preprint arXiv:2104.13619 (2021).
He, Kaiming, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. "Masked autoencoders are scalable vision learners." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000-16009. 2022.

Conditional planning: an overview of approaches

Supervisors: Heerko Groefsema
Date: 2025-02-05
Type: colloquium
Description:

Verifying the data perspective of business processes

Supervisors: Heerko Groefsema
Date: 2025-02-05
Type: colloquium
Description:

Explaining Graph Neural Networks

Supervisors: Andrés Tello
Date: 2025-02-01
Type: colloquium
Description:

Generalization in Graph Neural Networks

Supervisors: Andrés Tello
Date: 2025-02-01
Type: colloquium
Description:

Multimodality Graph Foundation Models

Supervisors: Huy Truong
Date: 2025-01-21
Type: colloquium
Description:

Test-Time Training

Supervisors: Huy Truong
Date: 2025-01-21
Type: colloquium
Description:

Cluster Scheduling for DLT workloads

Supervisors: Kawsar Haghshenas, Mahmoud Alasmar
Date: 2025-01-27
Type: colloquium
Description:

Enhancing Wastewater System Monitoring through Graph Neural Networks

Supervisors: Dilek Düstegör, Revin Alief
Date: 2025-01-21
Type: colloquium
Description:

Federated Learning Approaches for Distributed Decision-Making in Wastewater System Management

Supervisors: Dilek Düstegör, Revin Alief
Date: 2025-01-21
Type: colloquium
Description:

Estimating Deep Learning GPU Memory Consumption

Supervisors: Kawsar Haghshenas, Mahmoud Alasmar
Date: 2023-12-11
Type: colloquium
Description:

Distributed Digital Twin

Supervisors: Dilek Düstegör
Date: 2023-11-30
Type: colloquium
Description:

Digital Twin for Water Network

Supervisors: Dilek Düstegör
Date: 2023-11-30
Type: colloquium
Description:

Data Driven Methods for Leakage Detection in Water Network

Supervisors: Dilek Düstegör
Date: 2023-11-30
Type: colloquium
Description:

Large Language Models to extract data from digitized archives

Supervisors: Dilek Dustegor
Date: 2025-04-11
Type: internship
Description:

This project conducts a pilot study using Large Language Models (LLMs) to extract and analyze data on the historical tortoiseshell trade from the Dutch East India Company (VOC) archives. The intern is expected to build an LLM solution that extracts all information related to marine life, including but not limited to location, quantity, date, and type. The solution involves handling large machine-translated files, serving an LLM, prompt engineering, and documenting the process. This is part of a larger project in which the extracted data will be analyzed within the framework of historical ecology, focusing on quantities, temporal patterns, and geographic distribution (in collaboration with Willemien de Kock (Faculty of Arts) and Emin Tatar (CIT)).
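
A minimal sketch of what the prompt-engineering side could look like; the prompt wording, the JSON field names, and the parsing step are illustrative assumptions, not a prescribed design.

```python
import json

# Hypothetical extraction prompt; fill with EXTRACTION_PROMPT.format(passage=page_text)
EXTRACTION_PROMPT = """You are reading a machine-translated passage from the VOC archives.
List every mention of marine life as a JSON array of objects with the keys
"type", "quantity", "location", "date". Use null when a field is absent.

Passage:
{passage}
"""

def parse_llm_output(raw_reply: str) -> list[dict]:
    """Validate the model reply; `raw_reply` is whatever the serving endpoint returned."""
    records = json.loads(raw_reply)
    required = {"type", "quantity", "location", "date"}
    return [r for r in records if required <= r.keys()]
```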
References:
5 million scans VOC archives online and searchable.

DiTEC project- Building a collection of Graph Self-Supervised Learning tasks

Supervisors: Huy Truong
Date: 2025-04-03
Type: internship
Description:

Self-supervised learning (SSL) has shown great potential in enhancing the capabilities of large foundation models, but its application to graph modalities remains underexplored. This project aims to investigate popular SSL tasks across node-level, link-level, and graph-level challenges, as well as more complex graph representation learning approaches. As a researcher of the project, the candidate will develop a framework that enables users to train deep learning models using these tasks on independent datasets. The final deliverables will include the implementation code and a report detailing the problem and the proposed solution. The ideal candidate should have a background in machine learning, and experience with at least one deep learning framework.
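
As one concrete example of a link-level SSL task, the sketch below hides a fraction of edges and pairs them with random negatives so that a GNN encoder can be trained to tell them apart; the masking ratio and negative-sampling scheme are illustrative choices.

```python
import torch

def make_link_prediction_task(edge_index: torch.Tensor, num_nodes: int, mask_ratio: float = 0.1):
    """Hide a fraction of edges and pair them with random negatives.

    The hidden positives plus sampled negatives carry the self-supervised labels;
    the GNN encoder only sees the remaining (visible) edges.
    """
    num_edges = edge_index.size(1)
    perm = torch.randperm(num_edges)
    num_hidden = int(mask_ratio * num_edges)
    hidden, visible = perm[:num_hidden], perm[num_hidden:]

    pos_edges = edge_index[:, hidden]                          # targets to predict
    neg_edges = torch.randint(0, num_nodes, pos_edges.shape)   # uniform negative sampling
    labels = torch.cat([torch.ones(num_hidden), torch.zeros(num_hidden)])
    return edge_index[:, visible], torch.cat([pos_edges, neg_edges], dim=1), labels
```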
References:
Liu, Yixin, et al. "Graph self-supervised learning: A survey." IEEE Transactions on Knowledge and Data Engineering 35.6 (2022): 5879-5900.
Wu, Lirong, et al. "Self-supervised learning on graphs: Contrastive, generative, or predictive." IEEE Transactions on Knowledge and Data Engineering 35.4 (2021): 4216-4235.

DiTEC project- Inverse problem in Water Distribution Networks

Supervisors: Huy Truong
Date: 2025-04-03
Type: internship
Description:

Water researchers have long relied on simulations to monitor the behavior of Water Distribution Networks. These simulations require a comprehensive set of parameters, such as elevation, demand, and pipe diameters, to determine hydraulic states accurately. Gathering these parameters increases labor cost and time consumption and therefore poses a significant challenge. But what if we could reverse the process and let AI infer the missing pieces? Building on this idea, the project explores an innovative approach: leveraging data-driven deep learning methods to predict the initial input conditions from the available output states. As a researcher on this project, the candidate will select and train a cutting-edge Graph Neural Network on a massive dataset; the resulting model should predict initial conditions while respecting structural and physical constraints. The candidate will submit the implementation code and a report detailing the problem and the proposed solution. The ideal candidate should have a background in machine learning and be familiar with at least one deep-learning framework.
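
A minimal sketch, assuming PyTorch Geometric, of a model that maps observed nodal states back to unknown input parameters; the architecture and the choice of GCN layers are placeholders, not the intended final design.

```python
import torch
from torch_geometric.nn import GCNConv

class InverseWDNModel(torch.nn.Module):
    """Maps observed nodal states (e.g. pressure heads) back to unknown inputs (e.g. demands)."""
    def __init__(self, state_dim: int, hidden: int, param_dim: int):
        super().__init__()
        self.enc = GCNConv(state_dim, hidden)
        self.dec = GCNConv(hidden, param_dim)

    def forward(self, states, edge_index):
        h = torch.relu(self.enc(states, edge_index))
        return torch.relu(self.dec(h, edge_index))   # predicted parameters kept non-negative
```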
References:
Truong, Huy, et al. "DiTEC-WDN: A Large-Scale Dataset of Hydraulic Scenarios across Multiple Water Distribution Networks." (2025).

DiTEC project- Bio-inspired Water Network Design

Supervisors: Huy Truong
Date: 2025-04-03
Type: internship
Description:

Designing Water Distribution Networks (WDNs) is a complex, labor-intensive, and time-consuming process. To alleviate this, the project aims to automate the design using Evolution Strategies (ES). In particular, these algorithms should search and optimize the values of hydraulic parameters, such as nodal elevation, pump speed, and pipe length, to construct a complete simulation configuration. This configuration must respect local, structural, and physical restrictions (i.e., multi-objective optimization). As a researcher on this project, the candidate will explore an ES framework to develop the optimization algorithm and apply it to the water distribution domain, so familiarity with machine-learning experiments is expected. As deliverables, the candidate should submit a report and implementation code that generates optimized configurations. These configurations will help water researchers simulate, analyze, and understand WDN behavior and enhance the monitoring capability of these systems in practice.
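
To make the idea concrete, here is a from-scratch (mu, lambda)-style ES loop in NumPy; in the project, a dedicated framework such as PyGAD, EvoTorch, or evosax (see the references) would replace this toy loop, and the fitness function would encode the hydraulic constraints.

```python
import numpy as np

def evolve(fitness, dim: int, pop_size: int = 32, elite: int = 8, sigma: float = 0.1, gens: int = 100):
    """Minimal evolution-strategy loop over a flat vector of hydraulic parameters.

    `fitness` is maximized and should fold structural/physical constraints in as penalties.
    """
    mean = np.random.rand(dim)                           # initial guess for all parameters
    for _ in range(gens):
        population = mean + sigma * np.random.randn(pop_size, dim)
        scores = np.array([fitness(ind) for ind in population])
        parents = population[np.argsort(scores)[-elite:]]   # keep the fittest individuals
        mean = parents.mean(axis=0)                      # recombine by averaging
    return mean

# Toy fitness: prefer parameters close to 0.5 (stand-in for hydraulic feasibility)
best = evolve(lambda v: -np.sum((v - 0.5) ** 2), dim=10)
```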
References:
Gad, Ahmed Fawzy. "PyGAD: An intuitive genetic algorithm Python library." Multimedia Tools and Applications 83.20 (2024): 58029-58042.
Toklu, Nihat Engin, et al. "EvoTorch: Scalable evolutionary computation in Python." arXiv preprint arXiv:2302.12600 (2023).
Lange, Robert Tjarko. "evosax: JAX-based evolution strategies." (2022). URL: http://github.com/RobertTLange/evosax

Can we train a Neural Network with Forward-Forward “harmoniously”?

Supervisors: Huy Truong
Date: 2025-04-03
Type: internship
Description:

Backpropagation (BP) is the de facto approach to training neural network models. Nevertheless, it is biologically implausible and requires complete knowledge of the model (i.e., tracking the entire flow of information from input to output) to perform a backward pass. An alternative approach called Forward-Forward (FF) replaces the backward pass with an additional forward pass and updates the model weights in an unsupervised fashion. In particular, FF performs forward passes on positive and negative inputs and uses the difference between the two activation patterns at each layer to compute the loss and update that layer's weights. This project studies the behavior of FF under different losses: (1) cross-entropy and (2) harmonic loss. It is also valuable to study the relationship between harmonic loss and FF in terms of distance metrics or geometric properties in the embedding space. As a deliverable, the candidate should submit a detailed report and implementation code. The candidate should be familiar with one of the deep learning frameworks and have experience setting up machine learning experiments.
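
A minimal PyTorch sketch of one Forward-Forward update for a single layer, using the squared-activation "goodness" from Hinton (2022); swapping this loss for a harmonic variant is exactly the kind of modification the project investigates.

```python
import torch

def ff_layer_step(layer, opt, x_pos, x_neg, threshold: float = 2.0):
    """One Forward-Forward update for a single layer.

    Goodness is the mean squared activation; positive data should exceed the
    threshold, negative data should fall below it.
    """
    g_pos = layer(x_pos).relu().pow(2).mean(dim=1)
    g_neg = layer(x_neg).relu().pow(2).mean(dim=1)
    # Softplus loss pushes positive goodness above, negative goodness below the threshold.
    loss = torch.log1p(torch.exp(torch.cat([threshold - g_pos, g_neg - threshold]))).mean()
    opt.zero_grad()
    loss.backward()      # gradients stay local to this layer; no end-to-end backward pass
    opt.step()
    return loss.item()

layer = torch.nn.Linear(784, 256)
opt = torch.optim.SGD(layer.parameters(), lr=0.01)
loss = ff_layer_step(layer, opt, torch.randn(32, 784), torch.randn(32, 784))
```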
References:
Hinton, Geoffrey. "The forward-forward algorithm: Some preliminary investigations." arXiv preprint arXiv:2212.13345 (2022).
Baek, David D., et al. "Harmonic Loss Trains Interpretable AI Models." arXiv preprint arXiv:2502.01628 (2025).

Leveraging Structural Similarity for Performance Estimation of Deep Learning Training Jobs

Supervisors: Mahmoud Alasmar
Date: 2025-04-04
Type: internship
Description:

Deep learning (DL) workload-aware schedulers make decisions based on performance data collected through a process called profiling. However, profiling each job individually is computationally expensive, reducing the practicality of such approaches. Fortunately, DL models exhibit structural similarities that can be leveraged to develop alternative methods for performance estimation. One promising approach involves representing DL models as graphs and measuring their similarity using Graph Edit Distance (GED) [1]. By analyzing the structural similarities between models, we can potentially predict the performance of one model based on the known performance of another, reducing the need for extensive profiling. In this project, you will:
- Study and implement the similarity matching mechanism proposed in [2].
- Compare the runtime performance of similar DL models, focusing on key metrics such as GPU utilization and power consumption.
- Investigate the relationship between model similarity and performance predictability, answering the question: given two similar DL models and the performance of one, what can we infer about the performance of the other?
You will work with a selected set of DL training models, and performance metrics will be collected using NVIDIA's GPU profiling tools, such as DCGM.
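
As a rough illustration of the graph-similarity idea, the sketch below compares two toy model graphs with NetworkX's graph edit distance; the "op" node attribute and the normalization are illustrative assumptions.

```python
import networkx as nx

def model_similarity(g1: nx.DiGraph, g2: nx.DiGraph, timeout: float = 30.0) -> float:
    """Structural similarity of two DL model graphs via (time-limited) graph edit distance.

    Nodes are assumed to carry an "op" attribute (layer type); exact GED is expensive,
    so the search is cut off after `timeout` seconds.
    """
    same_op = lambda a, b: a.get("op") == b.get("op")
    ged = nx.graph_edit_distance(g1, g2, node_match=same_op, timeout=timeout)
    size = max(g1.number_of_nodes() + g1.number_of_edges(),
               g2.number_of_nodes() + g2.number_of_edges())
    return 1.0 - ged / size        # 1.0 means structurally identical

# Toy example: two tiny "models" differing by one layer
a = nx.DiGraph([("conv1", "relu"), ("relu", "fc")])
b = nx.DiGraph([("conv1", "relu"), ("relu", "pool"), ("pool", "fc")])
nx.set_node_attributes(a, {n: n.rstrip("0123456789") for n in a}, "op")
nx.set_node_attributes(b, {n: n.rstrip("0123456789") for n in b}, "op")
print(model_similarity(a, b))
```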
References:
[1] Fei Bi, Lijun Chang, Xuemin Lin, Lu Qin, and Wenjie Zhang. 2016. Efficient Subgraph Matching by Postponing Cartesian Products. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). Association for Computing Machinery, New York, NY, USA, 1199–1214.
[2] Lai, F., Dai, Y., Madhyastha, H. V., & Chowdhury, M. (2023). ModelKeeper: Accelerating DNN training via automated training warmup. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23) (pp. 769-785).