Travel Report: Yanfang Le, Computer Science, Hong Kong
With the rapid growth of information in applications such as social networking and bioinformatics, there is an urgent need for large-scale data analysis and processing. MapReduce has recently emerged as a powerful tool for distributed processing of voluminous data, owing to its simplicity, flexibility, fault tolerance, and scalability. Nevertheless, the MapReduce framework has also been criticized for its inefficient performance and described as "a major step backward". This is partly because the framework has not been as well studied and fine-tuned as conventional tools. Our previous work addressed the load balancing issue for MapReduce with skewed input data in an online fashion.
Revisiting the project, we found that the datacenter network could be a bottleneck in the shuffle subphase of MapReduce. This could lead to poor overall performance even with a balanced workload, and thus needed to be addressed. Recent studies have likewise suggested that the network inside the datacenter can be a bottleneck. The data movement costs and the bottlenecks within the datacenter network must therefore be considered jointly when minimizing the overall finishing time of MapReduce jobs. Professor Dan Wang of Hong Kong Polytechnic University and PolyU’s Shenzhen Research Institute in China has extensive experience in big data and networking research. HK PolyU’s Shenzhen Research Institute has a large-scale, high-performance experimental datacenter for research purposes. Hence, I visited his lab for a collaboration during summer 2014.
During the visit, we focused on measuring datacenter network traffic. Our measurement results reveal that (1) the latency of shuffling is comparable to that of the reduce function, although only the latter performs computation; (2) highly utilized links (utilization of 70% or higher) are prevalent and are located throughout the datacenter network; and (3) the excessive network traffic caused by the sheer amount of data movement can indeed negate the benefit of load balancing if the network bottlenecks are not carefully avoided. Thus, the traffic and the bottlenecks of the underlying network play critical roles in designing a practical load balancing strategy for MapReduce in real-world datacenters. These results supported our hypothesis about datacenter network traffic and confirmed that the problem we proposed is worth studying. In addition to the measurement work, we verified the performance of our solutions on real-world systems. A manuscript on this topic is in preparation, and we plan to submit it to a top-tier conference in the networking field.