A customizable grammar-based framework for user-intent text classification
Student thesis: Doctoral Thesis
In real-life classification problems, prior information about the problem and expert knowledge of the domain are often used to obtain reliable and consistent solutions. This is especially true in fields where the data is ambiguous, such as text, in which the same words can appear in seemingly similar texts yet carry different meanings. Many proposed approaches rely on the bag-of-words representation, which loses information about the structure of the text. This thesis provides a literature review of related work in text classification, including an overview of text classification methods and a detailed review of related work in two text classification domains: search engines and question answering systems. The core contribution is divided into three main parts. The first contribution is the Customizable Grammar Framework for user-intent text classification (CGF), which employs a formal grammar approach and exploits domain-related information in a new way to represent text as a series of syntactic categories forming syntactic patterns. Applying the proposed framework to different domains resulted in the second and third contributions. The second contribution is the Grammar-Based Framework for Query Classification (GQC), which improved query identification and classification. The third contribution is the Grammar-Based Framework for Question Categorization and Classification (GQCC), which improved question identification and classification. Results obtained with different machine learning algorithms show that the proposed approach outperforms previous ones in classification performance for both query and question classification.
Finally, the classification performance was compared with state-of-the-art approaches; the results validate that the proposed approach improves classification accuracy and the identification of the different types of queries and questions.
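As a rough illustration of what representing text as a series of syntactic categories could look like, the sketch below maps each term of a query to a category and returns the resulting pattern. It is not the thesis's implementation: the lexicon, the category names (`QW`, `ACTION_VERB`, `PLACE`, and so on), and the functions `to_pattern` and `pattern_key` are all hypothetical and chosen only for this example.

```python
# Hypothetical toy lexicon: the category names and entries are illustrative
# assumptions, not the categories defined in the thesis.
LEXICON = {
    "how": "QW", "what": "QW", "where": "QW",
    "buy": "ACTION_VERB", "download": "ACTION_VERB",
    "cheap": "ADJ",
    "flights": "NOUN", "tickets": "NOUN",
    "london": "PLACE", "paris": "PLACE",
    "to": "PREP", "in": "PREP",
}


def to_pattern(text):
    """Map each whitespace-separated term to a category, keeping word order."""
    tokens = text.lower().split()
    # Strip simple trailing/leading punctuation; unknown terms become "UNK".
    return [LEXICON.get(tok.strip("?.,!"), "UNK") for tok in tokens]


def pattern_key(text):
    """Join the categories into a single string usable as a pattern feature."""
    return " + ".join(to_pattern(text))


if __name__ == "__main__":
    print(pattern_key("Buy cheap flights to London"))
    print(pattern_key("Where to buy tickets?"))
```

Unlike a bag-of-words vector, the pattern keeps the order of the categories, so it can serve as a structural feature for a downstream classifier. How the actual frameworks define categories and patterns is specified in the thesis itself.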