Computer vision and machine learning approaches for metadata enrichment to improve searchability of historical newspaper collections

Keywords: Digital libraries and archives, Information extraction, Document layout analysis, Article segmentation, Named entity recognition, Digitized historical newspapers, Feuilleton extraction
Abstract:

Purpose: Historical newspaper collections provide a wealth of information about the past. Although the digitization of these collections significantly improves their accessibility, a large portion of digitized historical newspaper collections, such as those of KBR, the Royal Library of Belgium, are not yet searchable at article level. However, recent developments in AI-based research methods, such as document layout analysis, have the potential to further enrich the metadata and so improve the searchability of these historical newspaper collections. This paper aims to discuss the aforementioned issue.

Design/methodology/approach: In this paper, the authors explore how existing computer vision and machine learning approaches can be used to improve access to digitized historical newspapers. To do this, the authors propose a workflow, using computer vision and machine learning approaches, to (1) provide article-level access to digitized historical newspaper collections using document layout analysis, (2) extract specific types of articles (e.g. feuilletons, literary supplements from Le Peuple from 1938), (3) conduct image similarity analysis using unsupervised classification methods and (4) perform named entity recognition (NER) to link the extracted information to open data.

Findings: The results show that the proposed workflow improves the accessibility and searchability of digitized historical newspapers, and also contributes to the building of corpora for digital humanities research. The AI-based methods enable automatic extraction of feuilletons, clustering of similar images and dynamic linking of related articles.

Originality/value: The proposed workflow enables automatic extraction of articles, including detection of a specific type of article, such as a feuilleton or literary supplement. This is particularly valuable for humanities researchers as it improves the searchability of these collections and enables corpora to be built around specific themes. Article-level access to, and improved searchability of, KBR's digitized newspapers are demonstrated through the online tool.

and Indexing: Improves the availability and searchability of digital newspapers through article-level access and improved retrieval capabilities.
Library Automation: Automatically extracts specific types of articles and clusters of similar images through AI.
Collection classification: Automatically identifies and extracts specific types of articles, such as feuilletons or literary supplements.

This article mainly explores how to use computer vision and machine learning methods to improve access to digitized historical newspapers. The document layout analysis mentioned in the article belongs to the category of pattern recognition; the use of machine learning methods for image similarity analysis and article classification belongs to the scope of machine learning. Named entity recognition (NER) is part of natural language processing (NLP).
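
To make the image similarity step of the workflow more concrete, the following is a minimal sketch of how similar images could be grouped without labels. It is not the authors' method: the abstract does not specify the features or the clustering algorithm used. The sketch swaps in a simple average hash with single-link grouping by Hamming distance, and the function names (`average_hash`, `cluster_by_similarity`), the `max_distance` threshold and the toy 4x4 grayscale grids are all assumptions made for illustration.

```python
from itertools import combinations


def average_hash(pixels):
    """Return a bit tuple: 1 where a pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming(a, b):
    """Count the positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_by_similarity(images, max_distance=2):
    """Group image names whose hashes differ by at most max_distance bits.

    Single-link grouping via union-find; max_distance is a hypothetical
    threshold that would need tuning on real newspaper scans.
    """
    hashes = {name: average_hash(px) for name, px in images.items()}
    parent = {name: name for name in images}

    def find(name):
        # Follow parent links to the cluster root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(images, 2):
        if hamming(hashes[a], hashes[b]) <= max_distance:
            parent[find(a)] = find(b)

    clusters = {}
    for name in images:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Toy data (assumed): 4x4 grayscale grids standing in for downscaled images.
images = {
    "page1_photo": [[200, 200, 10, 10], [200, 200, 10, 10],
                    [10, 10, 200, 200], [10, 10, 200, 200]],
    "page7_photo": [[190, 210, 20, 5], [205, 195, 15, 10],
                    [12, 8, 190, 210], [5, 20, 205, 199]],
    "page3_map": [[10, 10, 10, 10], [10, 200, 200, 10],
                  [10, 200, 200, 10], [10, 10, 10, 10]],
}
clusters = cluster_by_similarity(images)
print(clusters)
```

In this toy run the two photo grids share the same hash and fall into one group, while the map grid stays on its own. A real pipeline would more plausibly compare learned feature vectors from the extracted illustrations, but the grouping logic would follow the same pattern.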