TY - JOUR
T1 - Triplet network with attention for speaker diarization
AU - Song, Huan
AU - Willi, Megan
AU - Thiagarajan, Jayaraman J.
AU - Berisha, Visar
AU - Spanias, Andreas
N1 - Funding Information: This work was supported in part by the SenSIP center at Arizona State University. This work was performed under the auspices of the U.S. Dept. of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Publisher Copyright: © 2018 International Speech Communication Association. All rights reserved.
PY - 2018
Y1 - 2018
N2 - In automatic speech processing systems, speaker diarization is a crucial front-end component to separate segments from different speakers. Inspired by the recent success of deep neural networks (DNNs) in semantic inferencing, triplet loss-based architectures have been successfully used for this problem. However, existing work utilizes conventional i-vectors as the input representation and builds simple fully connected networks for metric learning, thus not fully leveraging the modeling power of DNN architectures. This paper investigates the importance of learning effective representations from the sequences directly in metric learning pipelines for speaker diarization. More specifically, we propose to employ attention models to learn embeddings and the metric jointly in an end-to-end fashion. Experiments are conducted on the CALLHOME conversational speech corpus. The diarization results demonstrate that, besides providing a unified model, the proposed approach achieves improved performance when compared against existing approaches.
AB - In automatic speech processing systems, speaker diarization is a crucial front-end component to separate segments from different speakers. Inspired by the recent success of deep neural networks (DNNs) in semantic inferencing, triplet loss-based architectures have been successfully used for this problem. However, existing work utilizes conventional i-vectors as the input representation and builds simple fully connected networks for metric learning, thus not fully leveraging the modeling power of DNN architectures. This paper investigates the importance of learning effective representations from the sequences directly in metric learning pipelines for speaker diarization. More specifically, we propose to employ attention models to learn embeddings and the metric jointly in an end-to-end fashion. Experiments are conducted on the CALLHOME conversational speech corpus. The diarization results demonstrate that, besides providing a unified model, the proposed approach achieves improved performance when compared against existing approaches.
KW - Attention models
KW - Metric learning
KW - Speaker diarization
KW - Triplet network
UR - http://www.scopus.com/inward/record.url?scp=85054959506&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85054959506&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2018-2305
DO - 10.21437/Interspeech.2018-2305
M3 - Conference article
SN - 2308-457X
VL - 2018-September
SP - 3608
EP - 3612
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018
Y2 - 2 September 2018 through 6 September 2018
ER -