TY - JOUR
T1 - Compressing LSTM networks with hierarchical coarse-grain sparsity
AU - Kadetotad, Deepak
AU - Meng, Jian
AU - Berisha, Visar
AU - Chakrabarti, Chaitali
AU - Seo, Jae Sun
N1 - Funding Information: This work was in part supported by NSF grant 1652866, Samsung, ONR, and C-BRIC, one of six centers in JUMP, a SRC program sponsored by DARPA. Publisher Copyright: Copyright © 2020 ISCA
PY - 2020
Y1 - 2020
AB - The long short-term memory (LSTM) network is one of the most widely used recurrent neural networks (RNNs) for automatic speech recognition (ASR), but is parametrized by millions of parameters. This makes it prohibitive for memory-constrained hardware accelerators as the storage demand causes higher dependence on off-chip memory, which bottlenecks latency and power. In this paper, we propose a new LSTM training technique based on hierarchical coarse-grain sparsity (HCGS), which enforces hierarchical structured sparsity by randomly dropping static block-wise connections between layers. HCGS maintains the same hierarchical structured sparsity throughout training and inference; this reduces weight storage for both training and inference hardware systems. We also jointly optimize in-training quantization with HCGS on 2-/3-layer LSTM networks for the TIMIT and TED-LIUM corpora. With 16× structured compression and 6-bit weight precision, we achieved a phoneme error rate (PER) of 16.9% for TIMIT and a word error rate (WER) of 18.9% for TED-LIUM, showing the best trade-off between error rate and LSTM memory compression compared to prior works.
KW - Long short-term memory
KW - Speech recognition
KW - Structured sparsity
KW - Weight compression
UR - http://www.scopus.com/inward/record.url?scp=85098177563&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098177563&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-1270
DO - 10.21437/Interspeech.2020-1270
M3 - Conference article
SN - 2308-457X
VL - 2020-October
SP - 21
EP - 25
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Y2 - 25 October 2020 through 29 October 2020
ER -
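
The abstract above describes two techniques: a static, hierarchical block-wise sparsity pattern (HCGS) applied to LSTM weight matrices, and in-training low-bit weight quantization. The following is a minimal, hedged sketch of what such a two-level block mask and a simple uniform quantizer could look like. Block sizes, keep fractions, matrix shapes, and the function names `hcgs_mask` and `quantize` are illustrative assumptions, not the authors' implementation or their exact HCGS configuration.

```python
# Sketch of hierarchical coarse-grain sparsity (HCGS)-style masking.
# Assumptions: block sizes, keep fractions, and shapes are illustrative only.
import numpy as np

def hcgs_mask(rows, cols, big_block=64, small_block=16,
              keep_big=0.25, keep_small=0.25, seed=0):
    """Two-level block mask: keep a random subset of coarse
    (big_block x big_block) blocks per block-row, then a random subset of fine
    (small_block x small_block) sub-blocks inside each kept coarse block.
    Overall density is roughly keep_big * keep_small (0.25 * 0.25 -> ~16x)."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((rows, cols), dtype=np.float32)
    n_big = cols // big_block            # coarse blocks per block-row
    n_small = big_block // small_block   # fine blocks per sub-row
    for bi in range(rows // big_block):
        # Level 1: keep a fixed number of coarse blocks in this block-row.
        kept_big = rng.choice(n_big, size=max(1, round(keep_big * n_big)),
                              replace=False)
        for bj in kept_big:
            r0, c0 = bi * big_block, bj * big_block
            for si in range(big_block // small_block):
                # Level 2: keep a fixed number of fine blocks in each sub-row.
                kept_small = rng.choice(n_small,
                                        size=max(1, round(keep_small * n_small)),
                                        replace=False)
                for sj in kept_small:
                    r = r0 + si * small_block
                    c = c0 + sj * small_block
                    mask[r:r + small_block, c:c + small_block] = 1.0
    return mask

def quantize(w, bits=6):
    """Uniform symmetric weight quantization; a simple stand-in, not the
    paper's exact in-training quantization scheme."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale if scale > 0 else w

# Example usage (shapes are assumptions): one stacked LSTM gate matrix.
w = np.random.randn(2048, 512).astype(np.float32)
m = hcgs_mask(2048, 512)                 # static mask, fixed before training
w_compressed = quantize(w * m, bits=6)   # block sparsity + 6-bit weights
print(m.mean())                          # 0.0625 -> 16x block-wise compression
```

Because the mask is generated once and held fixed, the same block structure applies during both training and inference, which is what lets a hardware accelerator store only the kept blocks; the paper's reported 16x compression with 6-bit weights corresponds to this kind of combined structured-sparsity-plus-quantization setup.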