TY - GEN
T1 - BotHunter
T2 - 2022 Mining Software Repositories Conference, MSR 2022
AU - Abdellatif, Ahmad
AU - Wessel, Mairieli
AU - Steinmacher, Igor
AU - Gerosa, Marco A.
AU - Shihab, Emad
N1 - Publisher Copyright: © 2022 ACM.
PY - 2022
Y1 - 2022
AB - Bots have become popular in software projects because they play critical roles, from running tests to fixing bugs and vulnerabilities. However, the large number of software bots makes it harder for practitioners and researchers to distinguish human accounts from bot accounts, which is necessary to avoid bias in data-driven studies. Researchers have developed several approaches to identify bots at specific activity levels (issue/pull request or commit), but these consider a single repository and disregard features shown to be effective in other domains. To address this gap, we propose a machine learning-based approach that identifies bot accounts regardless of their activity level. We selected and extracted 19 features related to the account's profile information, activities, and comment similarity. We then evaluated the performance of five machine learning classifiers on a dataset of more than 5,000 GitHub accounts. Our results show that the Random Forest classifier performs best, with an F1-score of 92.4% and an AUC of 98.7%. Furthermore, the account profile information (e.g., account login) contains the most relevant features for identifying the account type. Finally, we compare our Random Forest classifier to state-of-the-art approaches and find that our model outperforms them in identifying the account type regardless of activity level.
KW - Empirical Software Engineering
KW - Software Bots
UR - http://www.scopus.com/inward/record.url?scp=85134002263&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85134002263&partnerID=8YFLogxK
U2 - 10.1145/3524842.3527959
DO - 10.1145/3524842.3527959
M3 - Conference contribution
T3 - Proceedings - 2022 Mining Software Repositories Conference, MSR 2022
SP - 6
EP - 17
BT - Proceedings - 2022 Mining Software Repositories Conference, MSR 2022
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 23 May 2022 through 24 May 2022
ER -