TY - GEN
T1 - Automated PII Extraction from Social Media for Raising Privacy Awareness
T2 - 19th Annual IEEE International Conference on Intelligence and Security Informatics, ISI 2021
AU - Liu, Yizhi
AU - Lin, Fang Yu
AU - Ebrahimi, Mohammadreza
AU - Li, Weifeng
AU - Chen, Hsinchun
N1 - Funding Information: This material is based upon work supported by the National Science Foundation (NSF) under Secure and Trustworthy Cyberspace (SaTC) (grant No. 1936370). Publisher Copyright: © 2021 IEEE.
PY - 2021
Y1 - 2021
AB - Internet users have been exposing an increasing amount of Personally Identifiable Information (PII) on social media. Such exposed PII can be exploited by cybercriminals and cause severe losses to the users. Informing users of their PII exposure on social media is crucial to raising their privacy awareness and encouraging them to take protective measures. To this end, advanced techniques are needed to extract users' exposed PII from social media automatically, yet most existing studies still rely on manual extraction. Information Extraction (IE) techniques can extract PII automatically, and Deep Learning (DL)-based IE models further alleviate the need for feature engineering and improve efficiency. However, DL-based IE models often require large-scale labeled data for training, and PII-labeled social media posts are difficult to obtain due to privacy concerns. Also, these models rely heavily on pre-trained word embeddings, while PII in social media often varies in form and thus has no fixed representation in pre-trained word embeddings. In this study, we propose the Deep Transfer Learning for PII Extraction (DTL-PIIE) framework to address these two limitations. DTL-PIIE transfers knowledge learned from publicly available PII data to social media in order to address the scarcity of PII-labeled data. Moreover, our framework leverages Graph Convolutional Networks (GCNs) to incorporate syntactic patterns that guide PII extraction without relying on pre-trained word embeddings. Evaluation against benchmark IE models indicates that our approach outperforms state-of-the-art DL-based IE models. An ablation analysis further confirms the efficacy of each component in our model. Our proposed framework can facilitate various applications, such as PII misuse prediction and privacy risk assessment, thereby protecting the privacy of internet users.
KW - PII
KW - deep transfer learning
KW - information extraction
KW - privacy
KW - social media
UR - http://www.scopus.com/inward/record.url?scp=85123473372&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85123473372&partnerID=8YFLogxK
U2 - 10.1109/ISI53945.2021.9624678
DO - 10.1109/ISI53945.2021.9624678
M3 - Conference contribution
T3 - Proceedings - 2021 IEEE International Conference on Intelligence and Security Informatics, ISI 2021
BT - Proceedings - 2021 IEEE International Conference on Intelligence and Security Informatics, ISI 2021
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 2 November 2021 through 3 November 2021
ER -