Keyphrase Extraction and Its Applications to Digital Libraries

Keywords: Deep Learning, Natural Language, Digital Libraries, Keyword Extraction, Machine Learning, Text Classification, Web Archiving
Abstract: Scholarly digital libraries provide access to scientific publications and comprise useful resources for researchers. Moreover, they are very useful in many applications such as document and citation recommendation, expert search, scientific paper summarization, collaborator recommendation, topic classification, and keyphrase extraction. Despite the advancements in search engine features, ranking methods, technologies, and the availability of programmable APIs, current-day open-access digital libraries still rely on crawl-based approaches for acquiring their underlying document collections. Furthermore, keyphrases associated with research papers provide an effective way to find useful information in the large and growing scholarly digital collections. Keyphrases are useful in many applications such as document indexing and summarization, topic tracking, contextual advertising, and opinion mining. However, keyphrases are not always provided with the papers; instead, they need to be extracted from the papers' content. A growing number of scholarly digital libraries, museums, and archives around the world are embracing web archiving as a mechanism to collect born-digital material made available via the web. To create a specialized collection from Web-archived data, there is a substantial need for automatic approaches that can distinguish the documents of interest for a collection.

In this dissertation, we first explore keyphrase extraction as a supervised task formulated as sequence labeling, and we utilize the power of Conditional Random Fields (CRFs) in capturing label dependencies through a transition parameter matrix consisting of the transition probabilities from one label to the neighboring label. Our proposed CRF-based supervised approach exploits word embeddings as features along with traditional, document-specific features. Our results on five datasets of research papers show that the word embeddings combined with document-specific features achieve high
performance and outperform strong baselines for this task. We also propose KPRank, an unsupervised graph-based algorithm for keyphrase extraction that incorporates both positional information and contextual word embeddings into a biased PageRank. Our experimental results on five benchmark datasets show that KPRank, which uses contextual word embeddings with an additional position signal, outperforms previous approaches and strong baselines for this task. Furthermore, we investigate and contrast three supervised keyphrase extraction models to explore their deployment in the CiteSeerX digital library for extracting high-quality keyphrases.

Further, we propose a novel search-driven framework for acquiring documents for such scientific portals. Within our framework, publicly available research paper titles and author names are used as queries to a Web search engine. We were able to obtain 267,000 unique research papers through our fully automated framework using 76,000 queries, resulting in almost 200,000 more papers than the number of queries. Furthermore, we propose a novel search-driven approach to build and maintain a large collection of homepages that can be used as seed URLs in any digital library, including CiteSeerX, to crawl scientific documents. We use self-training to reduce the labeling effort and to utilize unlabeled data in training an efficient researcher homepage classifier. Our experiments on a large-scale dataset highlight the effectiveness of our approach and position Web search as an effective method for acquiring authors' homepages.

Finally, we explore different learning models and feature representations to determine the best-performing ones for identifying the documents of interest from the web-archived data. Specifically, we study both machine learning and deep learning models and bag-of-words (BoW) features extracted from the entire document or from specific portions of the document, as well as structural features that capture the structure of documents. Moreover, we explore dynamic fusion models to
find, on the fly, the model or combination of models that performs best on a variety of document types. We propose two dynamic classifier selection algorithms: Dynamic Classifier Selection for Document Classification (DCSDC) and Dynamic Decision-level Fusion for Document Classification (DDFC). Our experimental results show that the approach that fuses different models outperforms individual models and other ensemble methods on all three datasets.

Library Automation: The research mentions automatic methods, such as CRF and KPRank, for keyphrase extraction.
Collection Classification: The paper explores keyphrase extraction, which can help classify documents.
Retrieval and Indexing: Because the paper involves keyphrase extraction, it is useful for indexing and searching documents.
Institutional Collection: The study mentions the CiteSeerX digital library and explores how to extract high-quality keyphrases.

This paper explores keyphrase extraction in academic digital libraries and uses machine learning methods such as conditional random fields (CRF) and KPRank. In addition, the paper explores deep learning models and bag-of-words (BoW) features to identify documents of interest in web-archived data. For NLP, the paper deals with problems such as keyphrase extraction and automatic documentation.
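
As a rough illustration of the biased PageRank idea that the abstract attributes to KPRank, the sketch below scores words in a co-occurrence graph while steering the random walk toward words that appear early in the document. It is a simplification under stated assumptions, not the dissertation's implementation: only a position signal is used (each occurrence at position p adds 1/p to a word's teleport weight), the contextual word embeddings are omitted entirely, and the function name `position_biased_pagerank`, the window size, and the damping factor are illustrative choices.

```python
from collections import defaultdict


def position_biased_pagerank(tokens, window=2, damping=0.85, iterations=50):
    """Score words with a PageRank whose teleport step favors early words.

    Hypothetical helper for illustration only; the embedding-based part of
    KPRank is NOT modeled here.
    """
    # Undirected co-occurrence graph: the edge weight counts how often two
    # distinct words fall inside a sliding window of `window` consecutive tokens.
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0

    # Position bias (assumed form): each occurrence at 1-based position p
    # contributes 1/p, then the weights are normalized to sum to 1.
    bias = defaultdict(float)
    for position, word in enumerate(tokens, start=1):
        bias[word] += 1.0 / position
    total_bias = sum(bias.values())
    bias = {word: value / total_bias for word, value in bias.items()}

    # Power iteration for the biased (personalized) PageRank.
    scores = {word: 1.0 / len(bias) for word in bias}
    for _ in range(iterations):
        updated = {}
        for word in bias:
            incoming = sum(
                scores[neighbor] * weight / sum(weights[neighbor].values())
                for neighbor, weight in weights[word].items()
            )
            updated[word] = (1 - damping) * bias[word] + damping * incoming
        scores = updated
    return scores


if __name__ == "__main__":
    example = ["keyphrase", "extraction", "digital", "libraries",
               "keyphrase", "extraction", "scholarly", "documents"]
    ranked = sorted(position_biased_pagerank(example).items(),
                    key=lambda item: item[1], reverse=True)
    for word, score in ranked:
        print(f"{word}\t{score:.3f}")
```

In a fuller pipeline, word scores like these would typically be summed over candidate noun phrases to rank keyphrases; that step, like the embedding signal, is left out of this sketch.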