Paralinguistic classification of mask wearing by image classifiers and fusion

Jeno Szep, Salim Hariri

Research output: Contribution to journal › Conference article › peer-review

19 Scopus citations

Abstract

In this study, we address the ComParE 2020 Paralinguistics Mask sub-challenge, where the task is to detect whether a speaker is wearing a surgical mask from short speech segments. We propose a computer-vision-based pipeline that takes the deep convolutional neural network image classifiers developed in recent years and applies them to a specific class of spectrograms. Several linear- and logarithmic-scale spectrograms were tested; the best performance was achieved with linear-scale, 3-channel spectrograms created from the audio segments. A single-model image classifier gave a result 6.1% better than the best single-dataset baseline model. An ensemble of our models further improves accuracy: it achieves 73.0% UAR (unweighted average recall) when trained only on the 'train' dataset and reaches 80.1% UAR on the test set when training also includes the 'devel' dataset, which is 8.3% higher than the baseline. We also provide an activation-mapping analysis to identify the frequency ranges that are critical for the 'mask' versus 'clear' classification.
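The abstract reports results as UAR for a fused ensemble. As a rough illustration only, the sketch below computes UAR (the mean of the per-class recalls) for predictions obtained by averaging the 'mask' probabilities of several models. The averaging rule, the 0.5 threshold, the function names, and the toy numbers are all assumptions made for this example; the abstract does not specify how the authors fuse their models.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so each class counts equally regardless of size."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def fuse_by_mean_probability(model_probs, threshold=0.5):
    """Hypothetical fusion rule: average each model's P(mask) per segment, then threshold."""
    n_models = len(model_probs)
    fused = [sum(column) / n_models for column in zip(*model_probs)]
    return ["mask" if p >= threshold else "clear" for p in fused]


# Made-up toy data: four segments, three models (not taken from the paper).
y_true = ["mask", "mask", "mask", "clear"]
model_probs = [
    [0.9, 0.4, 0.3, 0.2],  # model A: P(mask) for each segment
    [0.8, 0.7, 0.4, 0.6],  # model B
    [0.7, 0.6, 0.2, 0.1],  # model C
]

y_pred = fuse_by_mean_probability(model_probs)
uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(y_pred, round(uar, 3), accuracy)
```

In this toy case the 'mask' recall is 2/3 and the 'clear' recall is 1, so UAR is about 0.833 while plain accuracy is 0.75. The two metrics differ whenever the classes are unbalanced.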

Original language: English (US)
Pages (from-to): 2087-2091
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2020-October
State: Published - 2020
Event: 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 - Shanghai, China
Duration: Oct 25 2020 to Oct 29 2020

Keywords

  • Computational paralinguistics
  • Convolutional neural networks (CNN)
  • Ensemble learning
  • Image-classification
  • Spectrogram

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation
