Automatic classification of older electronic texts into the Universal Decimal Classification–UDC

Keywords: Digital library, Artificial intelligence, Machine learning, Text classification, Older texts, Universal Decimal Classification
Abstract:
Purpose: The purpose of this study is to develop a model for automated classification of old digitised texts to the Universal Decimal Classification (UDC), using machine-learning methods.
Design/methodology/approach: The general research approach is inherent to design science research, in which the problem of UDC assignment of the old, digitised texts is addressed by developing a machine-learning classification model. A corpus of 70,000 scholarly texts, fully bibliographically processed by librarians, was used to train and test the model, which was used for classification of old texts on a corpus of 200,000 items. Human experts evaluated the performance of the model.
Findings: Results suggest that machine-learning models can correctly assign the UDC at some level for almost any scholarly text. Furthermore, the model can be recommended for the UDC assignment of older texts. Ten librarians corroborated this on 150 randomly selected texts.
Research limitations/implications: The main limitations of this study were unavailability of labelled older texts and the limited availability of librarians.
Practical implications: The classification model can provide a recommendation to the librarians during their classification work; furthermore, it can be implemented as an add-on to full-text search in the library databases.
Social implications: The proposed methodology supports librarians by recommending UDC classifiers, thus saving time in their daily work. By automatically classifying older texts, digital libraries can provide a better user experience by enabling structured searches. These contribute to making knowledge more widely available and useable.
Originality/value: These findings contribute to the field of automated classification of bibliographical information with the usage of full texts, especially in cases in which the texts are old, unstructured and in which archaic language and vocabulary are used.
This study uses machine-learning methods to automatically classify old digitised texts into the Universal Decimal Classification (UDC). The researchers used 70,000 academic texts to train and test the model, and then used this model to classify 200,000 texts.
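
The workflow described in the abstract (train on librarian-labelled texts, then recommend UDC codes for unlabelled older texts) can be illustrated with a minimal sketch. The abstract does not name the learning algorithm, so the multinomial naive Bayes classifier below is an assumed stand-in, not the authors' model. The training sentences are invented toy data, and the helper names (`tokenize`, `train`, `recommend`, `matched_depth`) are hypothetical. The last helper hints at what assigning the UDC "at some level" could mean for simple numeric codes: counting how many leading digits of a predicted code agree with the librarian's code.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def train(labelled_texts):
    """Fit a multinomial naive Bayes model from (text, udc_code) pairs."""
    word_counts = defaultdict(Counter)  # udc_code -> token frequencies
    doc_counts = Counter()              # udc_code -> number of training texts
    for text, code in labelled_texts:
        word_counts[code].update(tokenize(text))
        doc_counts[code] += 1
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def recommend(model, text, k=2):
    """Return the k most probable UDC codes for a text, best first."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    # Tokens never seen in training carry no evidence and are skipped.
    tokens = [t for t in tokenize(text) if t in vocabulary]
    scores = {}
    for code in doc_counts:
        total_words = sum(word_counts[code].values())
        score = math.log(doc_counts[code] / total_docs)
        for token in tokens:
            # Laplace smoothing keeps unseen (code, token) pairs non-zero.
            score += math.log(
                (word_counts[code][token] + 1) / (total_words + len(vocabulary))
            )
        scores[code] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]


def matched_depth(predicted, expected):
    """Count leading digits shared by two simple UDC codes (dots ignored)."""
    a, b = predicted.replace(".", ""), expected.replace(".", "")
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth


# Invented toy corpus (assumption): 51 = mathematics, 61 = medical
# sciences, 82 = literature in the UDC outline.
TRAINING = [
    ("theorem proof algebra equation geometry", "51"),
    ("equation integral algebra number theorem", "51"),
    ("patient disease treatment hospital surgery", "61"),
    ("disease therapy patient medicine symptoms", "61"),
    ("poem novel verse poet story", "82"),
    ("novel story author verse drama", "82"),
]

model = train(TRAINING)
suggestions = recommend(model, "An old treatise on algebra and the proof of a theorem")
```

Returning a short ranked list rather than a single code matches the recommendation role described under practical implications: a librarian would still make the final choice. Comparing leading digits is a simplification that only suits plain numeric codes; real UDC notation also uses auxiliary signs that this sketch ignores.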