Details
- Type: New Feature
- Status: Open
- Priority: Critical
- Resolution: Unresolved
Description
Synopsis
Apache Wayang (Incubating) currently uses a cost model to select the right set of platforms while optimizing query plans. However, the accuracy of picking the correct configuration depends on the quality of that cost model. The idea is to build an AI pipeline capable of generating data for the current profiler of Apache Wayang (Incubating), with a second AI component driving the calibration process.
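As a rough illustration of the intended data flow, the hedged Java sketch below shows a generate-profile-calibrate loop: a generator proposes synthetic workload configurations, the profiler measures them, and the measurements recalibrate the cost model. All of the types used here (DataGenerator, ProfilerRunner, CostModel, WorkloadConfig) are hypothetical placeholders and are not part of the Apache Wayang (Incubating) code base; the real pipeline would plug into the existing profiler and learned cost model instead of the random stand-ins used here.

```java
// Hypothetical sketch only: illustrates the intended generate-profile-calibrate
// loop. None of these types belong to the Apache Wayang (Incubating) API.
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public final class CalibrationLoopSketch {

    /** A synthetic workload configuration (operator, input size, platform). */
    record WorkloadConfig(String operator, long inputCardinality, String platform) {}

    /** One labelled training point: configuration plus measured runtime. */
    record TrainingPoint(WorkloadConfig config, double runtimeMillis) {}

    /** Proposes the next batch of configurations to profile (the AI-DataGenerator role). */
    interface DataGenerator {
        List<WorkloadConfig> nextBatch(int size);
    }

    /** Wraps the existing profiler: runs a configuration and reports its runtime. */
    interface ProfilerRunner {
        double measureRuntimeMillis(WorkloadConfig config);
    }

    /** The cost model being calibrated from profiled data. */
    interface CostModel {
        void fit(List<TrainingPoint> data);
        double estimateRuntimeMillis(WorkloadConfig config);
    }

    /** A placeholder generator that samples configurations at random. */
    static final class RandomGenerator implements DataGenerator {
        private final Random random = new Random(42);
        private static final String[] OPERATORS = {"map", "filter", "join"};
        private static final String[] PLATFORMS = {"java-streams", "spark", "flink"};

        @Override
        public List<WorkloadConfig> nextBatch(int size) {
            List<WorkloadConfig> batch = new ArrayList<>();
            for (int i = 0; i < size; i++) {
                batch.add(new WorkloadConfig(
                        OPERATORS[random.nextInt(OPERATORS.length)],
                        1_000L * (1 + random.nextInt(1_000)),
                        PLATFORMS[random.nextInt(PLATFORMS.length)]));
            }
            return batch;
        }
    }

    /** Drives a few generate-profile-calibrate rounds. */
    static void calibrate(DataGenerator generator, ProfilerRunner profiler,
                          CostModel costModel, int rounds, int batchSize) {
        List<TrainingPoint> trainingData = new ArrayList<>();
        for (int round = 0; round < rounds; round++) {
            for (WorkloadConfig config : generator.nextBatch(batchSize)) {
                double runtime = profiler.measureRuntimeMillis(config);
                trainingData.add(new TrainingPoint(config, runtime));
            }
            costModel.fit(trainingData); // recalibrate after each profiling round
        }
    }
}
```

In the actual project, the random generator would be replaced by the AI component that decides which configurations are most informative to profile next.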
Benefits to Community
The community will gain the option of a well-calibrated cost model for their environments with low human effort. Since cost modelling is one of the most difficult tasks, such an AI pipeline will enrich the user experience when working with Apache Wayang (Incubating).
Deliverables
The expected deliverable is an adaptation of the approach from the paper "Expand your Training Limits! Generating Training Data for ML-based Data Management" [1]. The authors assume an ML cost model, but the idea needs modifications to run in the current setup of Apache Wayang (Incubating).
The expected steps are the following:
- Understand the paper [1]
- Get into the current profiling process of Apache Wayang (Incubating)
- Design the AI profiling pipeline, based on [1] and the current profiler
- Discuss ideas on how to integrate the designed AI pipeline into Apache Wayang (Incubating)
- Implement the AI-DataGenerator component (a hedged sketch follows this list)
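One possible shape for the AI-DataGenerator component, loosely following the idea in [1] of labelling only a small portion of generated data by real execution and forecasting the labels of the rest, is sketched below. Every class and method here (SyntheticPlan, profileRuntime, forecastRuntime, ...) is a hypothetical placeholder, and the trivial per-operator average stands in for a proper ML-based label forecaster.

```java
// Hypothetical sketch: generate many synthetic plans, profile only a small
// subset to obtain true labels, and forecast labels for the remaining plans.
// All names are illustrative; none belong to Apache Wayang (Incubating).
import java.util.*;

public final class AiDataGeneratorSketch {

    record SyntheticPlan(int numOperators, long inputCardinality) {}
    record LabelledPlan(SyntheticPlan plan, double runtimeMillis) {}

    /** Generates random plan skeletons (stand-in for structure-aware generation). */
    static List<SyntheticPlan> generatePlans(int count, Random random) {
        List<SyntheticPlan> plans = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            plans.add(new SyntheticPlan(2 + random.nextInt(8),
                                        10_000L * (1 + random.nextInt(100))));
        }
        return plans;
    }

    /** Placeholder for actually executing a plan through the profiler. */
    static double profileRuntime(SyntheticPlan plan) {
        // In the real pipeline this would run the plan on a platform and measure it.
        return plan.numOperators() * 5.0 + plan.inputCardinality() / 10_000.0;
    }

    /** A trivial forecaster: average runtime per operator learned from the
     *  labelled subset (stand-in for an ML model). */
    static double forecastRuntime(List<LabelledPlan> labelled, SyntheticPlan plan) {
        double perOperator = labelled.stream()
                .mapToDouble(lp -> lp.runtimeMillis() / lp.plan().numOperators())
                .average().orElse(1.0);
        return perOperator * plan.numOperators();
    }

    public static void main(String[] args) {
        Random random = new Random(7);
        List<SyntheticPlan> plans = generatePlans(100, random);

        // 1) Label only a small subset by actually profiling it.
        List<LabelledPlan> labelled = new ArrayList<>();
        for (SyntheticPlan plan : plans.subList(0, 10)) {
            labelled.add(new LabelledPlan(plan, profileRuntime(plan)));
        }

        // 2) Forecast labels for the remaining plans to obtain cheap training data.
        List<LabelledPlan> trainingData = new ArrayList<>(labelled);
        for (SyntheticPlan plan : plans.subList(10, plans.size())) {
            trainingData.add(new LabelledPlan(plan, forecastRuntime(labelled, plan)));
        }
        System.out.println("Training points produced: " + trainingData.size());
    }
}
```

The design question to discuss with the community is where this component should live relative to the existing profiler and how the forecasted labels feed into the cost-model calibration.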
Related Work
[1] Expand your Training Limits! Generating Training Data for ML-based Data Management
[2] [RHEEMix in the data jungle: a cost-based optimizer for cross-platform systems](https://wayang.apache.org/assets/pdf/paper/journal_vldb.pdf)
Biographical Information
Bertty Contreras-Rojas is a Senior Software Engineer at Databloom Inc. He is a member of the PPMC of Apache Wayang (Incubating). He has many years of experience developing data-intensive processing systems for several industries, such as banking. He was a research engineer at the Qatar Computing Research Institute, where he was responsible for developing the declarative query engine for Rheem and adding new underlying platforms to Rheem.
Rodrigo Pardo-Meza is a Senior Software Engineer at Databloom Inc. He is a member of the PPMC of Apache Wayang (Incubating). He has many years of experience developing applications that support Big Data processing, including implementing ETL processes over distributed systems to optimize inventories in supply chains. He was a research engineer at the Qatar Computing Research Institute, where he specialized in human interaction with big data analytics. During this time, he co-developed an ML-based cross-platform query optimizer.
Jorge Quiané is the head of the Big Data Systems research group at the Berlin Institute for the Foundations of Learning and Data (BIFOLD) and a Principal Researcher at DIMA (TU Berlin). He also acts as the Scientific Coordinator of the IAM group at the German Research Center for Artificial Intelligence (DFKI). His current research is in the broad area of big data: mainly federated data analytics, scalable data infrastructures, and distributed query processing. He has published numerous research papers on data management and novel system architectures. He was recently honoured with the 2022 ACM SIGMOD Research Highlight Award and the Best Paper Award at ICDE 2021 for his work on "Efficient Control Flow in Dataflow Systems". He holds five patents in core database areas and machine learning. Earlier in his career, he was a Senior Scientist at the Qatar Computing Research Institute (QCRI) and a Postdoctoral Researcher at Saarland University. He obtained his PhD in computer science from INRIA (Nantes University).