TY - GEN
T1 - Extractive summarization using cohesion network analysis and submodular set functions
AU - Cioaca, Valentin Sergiu
AU - Dascalu, Mihai
AU - McNamara, Danielle S.
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/9
Y1 - 2020/9
AB - Numerous approaches have been introduced to automate the process of text summarization, but only a few can be easily adapted to multiple languages. This paper introduces a multilingual text processing pipeline integrated into the open-source ReaderBench framework, which can be retrofitted to cover more than 50 languages. Given the need for extensibility and the lack of labeled training data for languages other than English, an unsupervised algorithm was preferred for extractive summarization (i.e., selecting the most representative sentences from the original document). Specifically, two approaches relying on text cohesion were implemented: a) a graph-based text representation derived from Cohesion Network Analysis that extends TextRank, and b) a class of submodular set functions. Evaluations were performed on the DUC dataset, using the Gensim implementation of TextRank as the baseline. Our results using the submodular set functions outperform the baseline. In addition, two use cases, in English and Romanian, are presented with corresponding graphical representations for the two methods.
KW - Cohesion Network Analysis
KW - Extractive summarization
KW - SpaCy framework
KW - Submodular functions
KW - TextRank
KW - Word Mover's Distance
UR - http://www.scopus.com/inward/record.url?scp=85102346431&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85102346431&partnerID=8YFLogxK
U2 - 10.1109/SYNASC51798.2020.00035
DO - 10.1109/SYNASC51798.2020.00035
M3 - Conference contribution
T3 - Proceedings - 2020 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2020
SP - 161
EP - 168
BT - Proceedings - 2020 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2020
Y2 - 1 September 2020 through 4 September 2020
ER -