TY - GEN
T1 - Privacy-Preserving Redaction of Diagnosis Data through Source Code Analysis
AU - Zhou, Lixi
AU - Yu, Lei
AU - Zou, Jia
AU - Min, Hong
N1 - Publisher Copyright: © 2023 ACM.
PY - 2023/7/10
Y1 - 2023/7/10
AB - Protecting sensitive information in diagnostic data, such as logs, is a critical concern in industrial software diagnosis and debugging. Although many tools have been developed to automatically redact logs by identifying and removing sensitive information, they have severe limitations that can cause over-redaction and loss of critical diagnostic information (false positives), disclosure of sensitive information (false negatives), or both. To address this problem, we argue for a source code analysis approach to log redaction. To identify a log message containing sensitive information, our method locates the corresponding log statement in the source code through logger code augmentation and uses the data flow graph built from the source code to check whether the log statement outputs data from sensitive sources. Redaction rules are then applied according to the sensitivity of the data sources to protect private information in the logs. We conducted an experimental evaluation and compared our approach with popular baselines. The results demonstrate that our approach can significantly improve the precision of sensitive-information detection and reduce both false positives and false negatives.
UR - http://www.scopus.com/inward/record.url?scp=85173463366&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85173463366&partnerID=8YFLogxK
U2 - 10.1145/3603719.3603734
DO - 10.1145/3603719.3603734
M3 - Conference contribution
T3 - ACM International Conference Proceeding Series
BT - Scientific and Statistical Database Management - 35th International Conference, SSDBM 2023 - Proceedings
A2 - Schuler, Robert
A2 - Kesselman, Carl
A2 - Chard, Kyle
A2 - Bugacov, Alejandro
PB - Association for Computing Machinery
T2 - 35th International Conference on Scientific and Statistical Database Management, SSDBM 2023
Y2 - 10 July 2023 through 12 July 2023
ER -