TY - GEN
T1 - Beyond the English web
T2 - 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, EACL 2021
AU - Repo, Liina
AU - Skantsi, Valtteri
AU - Rönnqvist, Samuel
AU - Hellström, Saara
AU - Oinonen, Miika
AU - Salmela, Anna
AU - Biber, Douglas
AU - Egbert, Jesse
AU - Pyysalo, Sampo
AU - Laippala, Veronika
N1 - Publisher Copyright: © 2021 Association for Computational Linguistics
PY - 2021
Y1 - 2021
AB - We explore cross-lingual transfer of register classification for web documents. Registers, that is, text varieties such as blogs or news, are one of the primary predictors of linguistic variation and thus affect the automatic processing of language. We introduce two new register-annotated corpora, FreCORE and SweCORE, for French and Swedish. We demonstrate that deep pre-trained language models perform strongly in these languages and outperform the previous state of the art in English and Finnish. Specifically, we show 1) that zero-shot cross-lingual transfer from the large English CORE corpus can match or surpass previously published monolingual models, and 2) that lightweight monolingual classification requiring very little training data can reach or surpass our zero-shot performance. We further analyse classification results, finding that certain registers continue to pose challenges, in particular for cross-lingual transfer.
UR - http://www.scopus.com/inward/record.url?scp=85107441042&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85107441042&partnerID=8YFLogxK
M3 - Conference contribution
T3 - EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Student Research Workshop
SP - 183
EP - 191
BT - EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Student Research Workshop
PB - Association for Computational Linguistics (ACL)
Y2 - 19 April 2021 through 23 April 2021
ER -