The Feasibility of Automated Topic Analysis: An Empirical Evaluation of Deep Learning Techniques Applied to Skew-Distributed Chinese Text Classification

Keywords: Text categorization, Real-world corpus, Deep learning, Performance evaluation

Abstract: Text classification (TC) is the task of assigning predefined categories (or labels) to texts for information organization, knowledge management, and many other applications. Normally, the categories are topical in library science applications, although they can be any labels suitable for an application. Thus, TC often requires topical analysis, which relies on human knowledge. However, in recent decades, machine learning (ML) techniques have been applied to TC for efficiency, as long as a sufficient number of training texts are available for each category. Nevertheless, in real-world cases, the number of texts (documents) for each category is often highly skewed for a certain TC task. This leads to the problem of predicting labels for small categories, which is viable for humans but challenging for machines. Deep learning (DL) is an emerging class of machine learning that was inspired by human neural networks. This study aims to evaluate whether DL techniques are feasible for the mentioned problem by comparing the performance of four off-the-shelf DL methods (CNN, RCNN, fastText, and BERT) with four traditional ML techniques on five skew-distributed datasets (four in Chinese, and one in English for comparison). Our results show that BERT is effective for moderately skewed datasets, but is still not feasible for highly skewed TC tasks. The other three DL-aware methods (CNN, RCNN, fastText) do not show any advantage in comparison with traditional methods such as SVM for the five TC tasks, although they captured extra language knowledge in the pretrained word representation. To facilitate future study, all of the Chinese datasets used in this study have been released publicly, together with all of the adapted machine learning and evaluation source code, for verification and further study.

Classification: Since TC is a task that assigns predefined categories to text, it can be related to collections.
Retrieval and Indexing: The application of TC in information retrieval is mentioned in the article, so it is related to retrieval and indexing.
This article mainly describes text classification (TC), which is a task that assigns preset categories (or labels) to text for various applications such as information organization and knowledge management. In recent decades, machine learning technology has been applied to TC to improve efficiency, and deep learning has been an emerging machine learning technology. The purpose of the research is to evaluate whether deep learning technology is applicable to the above problem.
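The difficulty with small categories can be illustrated with a minimal sketch. The abstract does not state which evaluation measures the study used, so the micro- and macro-averaged F1 scores below are an assumption chosen for illustration, and the toy labels ("large", "small") and their 95/5 split are hypothetical, not taken from the released datasets. The sketch shows how a classifier that ignores a small category can still look strong on an average dominated by the large one.

```python
from collections import Counter


def per_class_f1(gold, pred):
    """Return {label: F1} for every label appearing in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p wrongly
            fn[g] += 1  # missed the true label g
    scores = {}
    for label in set(gold) | set(pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def micro_f1(gold, pred):
    """For single-label classification, micro-F1 equals accuracy."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1, so small classes count equally."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)


# Hypothetical skewed toy data: 95 texts in a large category, 5 in a small one.
gold = ["large"] * 95 + ["small"] * 5
# A hypothetical classifier that never predicts the small category.
pred = ["large"] * 100

micro = micro_f1(gold, pred)
macro = macro_f1(gold, pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy case the micro-averaged score stays at 0.95 while the macro-averaged score falls below 0.5, because the small category contributes an F1 of zero. This is one way to see why a skewed distribution can hide failures on small categories.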

