Rexha Andi, Klampfl Stefan, Kröll Mark, Kern Roman
2015
The overwhelming majority of scientific publications are authored by multiple persons; yet bibliographic metrics are assigned only to individual articles as single entities. In this paper, we aim at a more fine-grained analysis of scientific authorship. We therefore adapt a text segmentation algorithm to identify potential author changes within the main text of a scientific article, which we obtain by using existing PDF extraction techniques. To capture stylistic changes in the text, we employ a number of stylometric features. We evaluate our approach on a small subset of PubMed articles consisting of approximately equal numbers of research articles for each author count considered. Our results indicate that the more authors an article has, the more potential author changes are identified. These results can be considered an initial step towards a more detailed analysis of scientific authorship, thereby extending the repertoire of bibliometrics.
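The idea of flagging potential author changes from stylometric features can be illustrated with a minimal sketch. It is not the segmentation algorithm or the feature set used in the paper, which the abstract does not specify: the four features, the Euclidean distance, the paragraph-level granularity, the `threshold` parameter, and the function names `stylometric_features` and `candidate_changes` are all assumptions chosen for illustration.

```python
import math
import re


def stylometric_features(text):
    """Return a small, illustrative feature vector for one text segment."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)  # avoid division by zero on empty input
    return [
        sum(len(w) for w in words) / n_words,       # mean word length
        len(words) / max(len(sentences), 1),        # mean sentence length in words
        len({w.lower() for w in words}) / n_words,  # type-token ratio
        text.count(",") / n_words,                  # commas per word
    ]


def euclidean(a, b):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def candidate_changes(paragraphs, threshold):
    """Return indices i where paragraph i differs in style from paragraph i-1.

    The threshold is a hypothetical tuning parameter, not a value from the paper.
    """
    feats = [stylometric_features(p) for p in paragraphs]
    return [
        i
        for i in range(1, len(feats))
        if euclidean(feats[i - 1], feats[i]) > threshold
    ]
```

Comparing only adjacent paragraphs keeps the sketch short; a real text segmentation approach would typically smooth over larger windows and normalise features so that no single one dominates the distance.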
Klampfl Stefan, Kern Roman
2015
Scholarly publishing increasingly requires automated systems that semantically enrich documents in order to support management and quality assessment of scientific output. However, contextual information, such as the authors' affiliations, references, and funding agencies, is typically hidden within PDF files. To access this information, we have developed a processing pipeline that analyses the structure of a PDF document, incorporating a diverse set of machine learning techniques. First, unsupervised learning is used to extract contiguous text blocks from the raw character stream as the basic logical units of the article. Next, supervised learning is employed to classify blocks into different metadata categories, including authors and affiliations. Then, a set of heuristics is applied to detect the reference section at the end of the paper and segment it into individual reference strings. Sequence classification is then utilised to categorise the tokens of individual references to obtain information such as the journal and the year of the reference. Finally, we make use of named entity recognition techniques to extract references to research grants, funding agencies, and EU projects. Our system is modular in nature. Some parts rely on models learnt on training data, and the overall performance scales with the quality of these data sets.
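The modular, stage-by-stage structure described above can be sketched as a chain of small functions. This is only a toy illustration operating on plain text: every stage is a simple rule-based stand-in (blank-line splitting instead of unsupervised block extraction, a regular expression instead of sequence classification), and all function names and the one-reference-per-line assumption are hypothetical rather than taken from the described system.

```python
import re


def extract_blocks(raw_text):
    """Stand-in for block extraction: split the text on blank lines."""
    return [b.strip() for b in re.split(r"\n\s*\n", raw_text) if b.strip()]


def find_reference_strings(blocks):
    """Heuristic stand-in: take everything after a 'References' heading block.

    Assumes one reference per line, which real documents often violate.
    """
    for i, block in enumerate(blocks):
        if block.lower() in ("references", "bibliography"):
            refs = []
            for later in blocks[i + 1:]:
                refs.extend(
                    line.strip() for line in later.splitlines() if line.strip()
                )
            return refs
    return []


def parse_reference(ref):
    """Toy stand-in for token categorisation: extract a four-digit year."""
    match = re.search(r"\b(?:19|20)\d{2}\b", ref)
    return {"raw": ref, "year": int(match.group(0)) if match else None}


def run_pipeline(raw_text):
    """Run the stages in order; each one can be replaced independently."""
    blocks = extract_blocks(raw_text)
    references = [parse_reference(r) for r in find_reference_strings(blocks)]
    return {"blocks": blocks, "references": references}
```

Because each stage only consumes the output of the previous one, a rule-based stage could be swapped for a learnt model without touching the rest, which is the kind of modularity the abstract points to.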