Abstract
Text classification remains a challenging task in natural language processing (NLP) due to linguistic complexity and data imbalance. This study proposes a hybrid approach that integrates grammar-based feature engineering with deep learning and transformer models to enhance classification performance. A dataset of factoid and non-factoid questions, further categorised into causal, choice, confirmation, hypothetical, and list types, is used to evaluate several models, including CNNs, BiLSTMs, MLPs, BERT, DistilBERT, Electra, and GPT-2. Grammatical and domain-specific features are explicitly extracted and leveraged to improve multi-class classification. To address class imbalance, the SMOTE algorithm is applied, significantly boosting the recall and F1-score for minority classes. Experimental results show that DistilBERT achieves the highest binary classification accuracy (94%), while BiLSTM and CNN outperform transformers in multi-class settings, reaching up to 92% accuracy. These findings confirm that grammar-based features provide critical syntactic and semantic insights, enhancing model robustness and interpretability beyond conventional embeddings.
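To make the imbalance-handling step concrete: SMOTE oversamples a minority class by interpolating each selected sample toward one of its nearest minority-class neighbours. The sketch below is a minimal NumPy illustration of that interpolation idea, not the paper's implementation; the function name and parameters are illustrative, and a real pipeline would typically apply a library implementation (e.g. imbalanced-learn's `SMOTE`) to vectorised question features.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, rng=None):
    """Minimal SMOTE-style oversampling sketch (illustrative only).

    X_min : (n, d) array of minority-class feature vectors
    n_new : number of synthetic samples to generate
    k     : number of nearest minority neighbours to interpolate toward
    """
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise distances within the minority class
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbour
    nn = np.argsort(dist, axis=1)[:, :k]      # k nearest neighbours per sample
    base = rng.integers(0, n, size=n_new)     # randomly chosen base samples
    nbr = nn[base, rng.integers(0, k, size=n_new)]  # one random neighbour each
    gap = rng.random((n_new, 1))              # interpolation factor in [0, 1)
    # Synthetic point lies on the segment between base sample and neighbour
    return X_min[base] + gap * (X_min[nbr] - X_min[base])
```

Because each synthetic point is a convex combination of two minority samples, the new points stay within the region spanned by the minority class rather than duplicating existing samples, which is what drives the recall and F1 gains reported for minority classes.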
| Original language | English |
|---|---|
| Article number | 424 |
| Number of pages | 26 |
| Journal | Information |
| Volume | 16 |
| Issue number | 6 |
| Early online date | 22 May 2025 |
| DOIs | |
| Publication status | Published - 1 Jun 2025 |
Keywords
- Text Classification
- Deep Learning
- Transformer Models
- Grammar-Based Feature Engineering
- Natural Language Processing (NLP)
- SMOTE
- Question Classification