In automatic speech processing systems, speaker diarization is a crucial front-end component to separate segments from different speakers. Inspired by the recent success of deep neural networks (DNNs) in semantic inferencing, triplet loss-based architectures have been successfully used for this problem. However, existing work utilizes conventional i-vectors as the input representation and builds simple fully connected networks for metric learning, thus not fully leveraging the modeling power of DNN architectures. This paper investigates the importance of learning effective representations from the sequences directly in metric learning pipelines for speaker diarization. More specifically, we propose to employ attention models to learn embeddings and the metric jointly in an end-to-end fashion. Experiments are conducted on the CALLHOME conversational speech corpus. The diarization results demonstrate that, besides providing a unified model, the proposed approach achieves improved performance when compared against existing approaches.

Original languageEnglish (US)
Pages (from-to)3608-3612
Number of pages5
JournalProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
StatePublished - 2018
Event19th Annual Conference of the International Speech Communication, INTERSPEECH 2018 - Hyderabad, India
Duration: Sep 2 2018Sep 6 2018


  • Attention models
  • Metric learning
  • Speaker diarization
  • Triplet network

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation


Dive into the research topics of 'Triplet network with attention for speaker diarization'. Together they form a unique fingerprint.

Cite this