New Feature
Status: Open
Resolution: Unresolved
Apache Wayang (Incubating) currently uses a cost model to choose the right platforms and optimize the plans. However, calibrating a cost model is one of the hardest problems in practice and the main reason a system underperforms. The goal is therefore to create a new optimizer component that has ML at its core: the entire plan enumeration is guided and powered by an ML model.
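To make the idea concrete, the following is a minimal sketch of plan selection in which a learned model, rather than a hand-calibrated cost formula, scores each candidate plan. Everything in it is an assumption made for illustration: the operator and platform names, the feature encoding, and the `LinearRuntimeModel` stand-in with hand-set weights are hypothetical and are not taken from Apache Wayang or from the paper [1]. It also enumerates all platform assignments exhaustively, which is a simplification and not the enumeration strategy of [1].

```python
from itertools import product

# Hypothetical operators and platforms, for illustration only.
OPERATORS = ["source", "map", "reduce"]
PLATFORMS = ["java", "spark"]


def featurize(assignment, cardinality):
    """Encode one platform assignment as a fixed-length feature vector."""
    n_java = assignment.count("java")
    n_spark = assignment.count("spark")
    uses_spark = 1.0 if n_spark else 0.0
    # Number of adjacent operator pairs that run on different platforms.
    switches = sum(a != b for a, b in zip(assignment, assignment[1:]))
    return [n_java * cardinality, n_spark * cardinality, uses_spark, float(switches)]


class LinearRuntimeModel:
    """Stand-in for a trained regressor that predicts plan runtime.

    In a real optimizer the weights would be learned from execution logs;
    here they are set by hand (an assumption) so the sketch is runnable.
    """

    def __init__(self, weights):
        self.weights = weights

    def predict(self, features):
        return sum(w * x for w, x in zip(self.weights, features))


def pick_plan(model, cardinality):
    """Return the platform assignment with the lowest predicted runtime."""
    candidates = product(PLATFORMS, repeat=len(OPERATORS))
    return min(candidates, key=lambda a: model.predict(featurize(a, cardinality)))


# Hand-set weights: per-tuple cost on java, per-tuple cost on spark,
# fixed spark start-up cost, cost per platform switch.
model = LinearRuntimeModel([1e-3, 1e-4, 50.0, 5.0])

small = pick_plan(model, 1_000)
large = pick_plan(model, 1_000_000)
print(small, large)
```

With these hand-set weights the sketch keeps a small input on the single-node platform and moves a large input to the distributed one. The point of the design is that changing this behavior means retraining the model on new data, not re-tuning cost parameters by hand.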
Benefits to Community
The community will benefit from an ML-based optimizer whose optimization quality depends on the data used to train the model, rather than on a human trying to find the best calibration of the cost model. The ML-based Query Optimizer should let more people use Apache Wayang (Incubating) with almost no configuration effort. It may also inspire other projects to incorporate similar optimization modules into their systems.
The expected deliverable is an adaptation of the paper "ML-based Cross-Platform Query Optimization" [1], in which the authors propose a machine learning model that can be used as the query optimizer inside Apache Wayang (Incubating).
The expected steps are the following:
- Understand the paper [1]
- Get familiar with the internals of the Apache Wayang (Incubating) optimizer
- Discuss and design the process for the ML Query Optimizer
- Implement the new ML-based Query Optimizer
Related Work
[1] [ML-based Cross-Platform Query Optimization](
[2] [RHEEMix in the data jungle: a cost-based optimizer for cross-platform systems](
Biographical Information of possible mentors
Rodrigo Pardo-Meza is a Senior Software Engineer at Databloom Inc. He is a member of the Apache Wayang (Incubating) PPMC. He has many years of experience developing applications that support Big Data processing, including implementing ETL processes over distributed systems to optimize inventories in supply chains. He was a research engineer at the Qatar Computing Research Institute, where he specialized in human interface interaction with big data analytics. During this time, he co-developed an ML-based cross-platform query optimizer.
Bertty Contreras-Rojas is a Senior Software Engineer at Databloom Inc. He is a member of the Apache Wayang (Incubating) PPMC. He has many years of experience developing data-intensive processing systems for several industries, such as banking. He was a research engineer at the Qatar Computing Research Institute, where he was responsible for developing the declarative query engine for Rheem and adding new underlying platforms to Rheem.
Jorge Quiané is the head of the Big Data Systems research group at the Berlin Institute for the Foundations of Learning and Data (BIFOLD) and a Principal Researcher at DIMA (TU Berlin). He also acts as the Scientific Coordinator of the IAM group at the German Research Center for Artificial Intelligence (DFKI). His current research is in the broad area of big data: mainly in federated data analytics, scalable data infrastructures, and distributed query processing. He has published numerous research papers on data management and novel system architectures. He has recently been honoured with the 2022 ACM SIGMOD Research Highlight Award and the Best Paper Award at ICDE 2021 for his work on “Efficient Control Flow in Dataflow Systems”. He holds five patents in core database areas and on machine learning. Earlier in his career, he was a Senior Scientist at the Qatar Computing Research Institute (QCRI) and a Postdoctoral Researcher at Saarland University. He obtained his PhD in computer science from INRIA (Nantes University).
Name and Contact Information
Name: Rodrigo Pardo-Meza
email: rpardomeza (at)
community: dev (at)