TY - GEN
T1 - Enhanced dataset synthesis using CTGAN for metagenomic dataset
AU - Ince, Volkan
AU - Bader-El-Den, Mohamed
AU - Sari, Omer Faruk
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024/10/9
Y1 - 2024/10/9
N2 - The examination of bacterial communities has increasingly relied on machine learning methods and metagenomic analysis, providing novel solutions across various domains. However, the restricted size of metagenomic datasets presents challenges for robust model training. Consequently, data augmentation techniques, such as Conditional Tabular Generative Adversarial Networks (CTGAN), have gained attention. This study seeks to utilize machine learning algorithms, incorporating CTGAN, to assess the influence of microbial community composition on the growth patterns of Clostridium bacteria in a metagenomic dataset. Additionally, the study employs SHAP analysis to explain feature importance and contrast model performance pre- and post-data augmentation. The findings demonstrate notable enhancements in classification metrics subsequent to data augmentation, particularly evident when excluding the 'Day' feature. Moreover, SHAP analysis identifies pivotal features, notably the absence of the 'Day' variable post-CTGAN synthesis, emphasizing the significance of specific bacterial genera like Clostridium in bacterial growth dynamics. Overall, this study underscores the efficacy of data augmentation techniques, specifically CTGAN, in enhancing machine learning model performance for metagenomic data classification tasks, with implications for refining food safety and healthcare protocols. Further research could explore advanced data augmentation methodologies and validate outcomes on more expansive datasets for practical implementation.
AB - The examination of bacterial communities has increasingly relied on machine learning methods and metagenomic analysis, providing novel solutions across various domains. However, the restricted size of metagenomic datasets presents challenges for robust model training. Consequently, data augmentation techniques, such as Conditional Tabular Generative Adversarial Networks (CTGAN), have gained attention. This study seeks to utilize machine learning algorithms, incorporating CTGAN, to assess the influence of microbial community composition on the growth patterns of Clostridium bacteria in a metagenomic dataset. Additionally, the study employs SHAP analysis to explain feature importance and contrast model performance pre- and post-data augmentation. The findings demonstrate notable enhancements in classification metrics subsequent to data augmentation, particularly evident when excluding the 'Day' feature. Moreover, SHAP analysis identifies pivotal features, notably the absence of the 'Day' variable post-CTGAN synthesis, emphasizing the significance of specific bacterial genera like Clostridium in bacterial growth dynamics. Overall, this study underscores the efficacy of data augmentation techniques, specifically CTGAN, in enhancing machine learning model performance for metagenomic data classification tasks, with implications for refining food safety and healthcare protocols. Further research could explore advanced data augmentation methodologies and validate outcomes on more expansive datasets for practical implementation.
KW - Explainable AI
KW - Generative AI
KW - Metagenomic data
KW - Supervised machine learning
UR - http://www.scopus.com/inward/record.url?scp=85208434008&partnerID=8YFLogxK
U2 - 10.1109/IS61756.2024.10705275
DO - 10.1109/IS61756.2024.10705275
M3 - Conference contribution
AN - SCOPUS:85208434008
SN - 9798350350999
T3 - IEEE International Conference on Intelligent Systems
SP - 1
EP - 6
BT - 2024 IEEE 12th International Conference on Intelligent Systems, IS 2024 - Proceedings
A2 - Sgurev, Vassil
A2 - Jotsov, Vladimir
A2 - Piuri, Vincenzo
A2 - Doukovska, Luybka
A2 - Yoshinov, Radoslav
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 12th IEEE International Conference on Intelligent Systems, IS 2024
Y2 - 29 August 2024 through 31 August 2024
ER -