Description
Many widely used distributed applications run on geo-distributed data centers (DCs). Analyzing their logs in a timely manner requires keeping communication costs under control. However, processing real-time logs at a global scale is challenging because of their colossal volume and because wide area network (WAN) links are expensive and change unpredictably over time, which makes it impractical to gather all data events in a single DC.
To resolve these challenges, we aim to perform aggregation operations in a decentralized and hierarchical way within the data plane. We profile the network bandwidth and delay among the executors of the different nodes and cluster the nodes into a tree based on their network distance. Using the profiled information, we aggregate and summarize the data along this hierarchy, starting with the nodes that are closest to each other, so that the amount of data travelling over the WAN is minimized. At the same time, we aim to control the fidelity of the aggregated data across the different distances to keep network delays low.
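The clustering and hierarchical aggregation described above can be illustrated with a minimal sketch. It assumes that pairwise delay is the only distance metric, that the tree is built by average-linkage agglomerative clustering, and that the aggregation is a per-key event count; the node names, delay values, and function names are hypothetical and not taken from the actual system.

```python
from collections import Counter
from itertools import combinations


def build_tree(nodes, delay_ms):
    """Merge the two closest clusters until one tree remains.

    A tree is either a leaf (a node name) or a pair (left_subtree, right_subtree).
    delay_ms maps frozenset({a, b}) to the profiled delay between nodes a and b.
    """
    # Each cluster is (tree, list_of_member_nodes).
    clusters = [(n, [n]) for n in nodes]

    def distance(a, b):
        # Average-linkage distance (an assumption of this sketch).
        pairs = [(x, y) for x in a[1] for y in b[1]]
        return sum(delay_ms[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: distance(*ab))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


def aggregate(tree, local_counts):
    """Sum per-node event counts bottom-up along the tree."""
    if not isinstance(tree, tuple):
        return Counter(local_counts[tree])
    total = Counter()
    for child in tree:
        total.update(aggregate(child, local_counts))
    return total


# Hypothetical profile: dc-a/dc-b and dc-c/dc-d are close, the two pairs are far apart.
delay_ms = {
    frozenset(("dc-a", "dc-b")): 10, frozenset(("dc-c", "dc-d")): 12,
    frozenset(("dc-a", "dc-c")): 100, frozenset(("dc-a", "dc-d")): 110,
    frozenset(("dc-b", "dc-c")): 105, frozenset(("dc-b", "dc-d")): 120,
}
local_counts = {
    "dc-a": {"error": 3, "warn": 1}, "dc-b": {"error": 2},
    "dc-c": {"warn": 4}, "dc-d": {"error": 1, "warn": 1},
}

tree = build_tree(["dc-a", "dc-b", "dc-c", "dc-d"], delay_ms)
totals = aggregate(tree, local_counts)
```

In this sketch, nearby DCs are merged first, so only one partial summary per subtree has to cross the long-distance links rather than every raw event.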
To tune the number of levels in the hierarchy and the fidelity at each level, so that bandwidth utilization stays high and network delays stay low, our system takes an automatic, learning-based approach. Given the profiled network metrics, our model searches for the most efficient number of levels for the hierarchically clustered tree of nodes and for an adequate fidelity setting at each of those levels.
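To make the search space concrete, the following sketch enumerates every combination of level count and per-level fidelity and keeps the cheapest one. It is only an illustration: the learning model itself is not specified here, so an exhaustive search stands in for it, and the cost function, its parameters, and the interpretation of fidelity as the fraction of data forwarded at a level are all assumptions made for this example.

```python
from itertools import product
from math import prod


def estimate_cost(fidelities, input_mb, hop_delay_ms, wan_ms_per_mb, error_penalty_ms):
    """Toy cost model (hypothetical): each level adds a fixed hop delay,
    the surviving data is sent over the WAN, and lost fidelity is penalized."""
    kept = prod(fidelities)  # fraction of the original data that survives
    return (hop_delay_ms * len(fidelities)
            + wan_ms_per_mb * input_mb * kept
            + error_penalty_ms * (1.0 - kept))


def best_config(max_levels, fidelity_choices, **cost_params):
    """Exhaustively search level counts and per-level fidelities."""
    best = None
    for levels in range(1, max_levels + 1):
        for fidelities in product(fidelity_choices, repeat=levels):
            cost = estimate_cost(fidelities, **cost_params)
            if best is None or cost < best[0]:
                best = (cost, fidelities)
    return best


cost, fidelities = best_config(
    max_levels=4,
    fidelity_choices=(1.0, 0.5),
    input_mb=100, hop_delay_ms=10, wan_ms_per_mb=1, error_penalty_ms=40,
)
print(len(fidelities), fidelities, cost)
```

With these made-up parameters, adding levels reduces WAN traffic but adds hop delay and fidelity loss, so the search settles on an intermediate depth. A learned model would replace both the toy cost function and the brute-force loop.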
We aim to complete the following action items:
- Implementing the intermediate shuffle and checking its correctness
- Evaluating throughput to confirm the performance improvement
- Adding a fidelity-control layer across the different levels of the hierarchy
- Developing a learning-based approach that automatically finds the right number of levels in the hierarchy and the fidelity for each level