TY - GEN
T1 - Approximately Optimal Distributed Data Shuffling
AU - Attia, Mohamed Adel
AU - Tandon, Ravi
N1 - Funding Information: This work was supported by the NSF grant CAREER-1651492 Funding Information: This work was supported by the NSF grant CAREER-1651492. Publisher Copyright: © 2018 IEEE.
PY - 2018/8/15
Y1 - 2018/8/15
N2 - Data shuffling between distributed workers is one of the critical steps in implementing large-scale learning algorithms. The focus of this work is to understand the fundamental trade-off between the amount of storage and the communication overhead for distributed data shuffling. We first present an information theoretic formulation for the data shuffling problem, accounting for the underlying problem parameters (i.e., number of workers, K, number of data points, N, and the available storage, S per node). Then, we derive an information theoretic lower bound on the communication overhead for data shuffling as a function of these parameters. Next, we present a novel coded communication scheme and show that the resulting communication overhead of the proposed scheme is within a multiplicative factor of at most 2 from the lower bound. Furthermore, we introduce an improved aligned coded shuffling scheme, which achieves the optimal storage vs communication trade-off for K < 5, and further reduces the maximum multiplicative gap down to 7/6, for K ≥ 5.
AB - Data shuffling between distributed workers is one of the critical steps in implementing large-scale learning algorithms. The focus of this work is to understand the fundamental trade-off between the amount of storage and the communication overhead for distributed data shuffling. We first present an information theoretic formulation for the data shuffling problem, accounting for the underlying problem parameters (i.e., number of workers, K, number of data points, N, and the available storage, S per node). Then, we derive an information theoretic lower bound on the communication overhead for data shuffling as a function of these parameters. Next, we present a novel coded communication scheme and show that the resulting communication overhead of the proposed scheme is within a multiplicative factor of at most 2 from the lower bound. Furthermore, we introduce an improved aligned coded shuffling scheme, which achieves the optimal storage vs communication trade-off for K < 5, and further reduces the maximum multiplicative gap down to 7/6, for K ≥ 5.
UR - http://www.scopus.com/inward/record.url?scp=85052431858&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85052431858&partnerID=8YFLogxK
U2 - 10.1109/ISIT.2018.8437325
DO - 10.1109/ISIT.2018.8437325
M3 - Conference contribution
SN - 9781538647806
T3 - IEEE International Symposium on Information Theory - Proceedings
SP - 721
EP - 725
BT - 2018 IEEE International Symposium on Information Theory, ISIT 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2018 IEEE International Symposium on Information Theory, ISIT 2018
Y2 - 17 June 2018 through 22 June 2018
ER -