Publications

Here you can find scientific publications authored by Know-Center employees

2018

Rexha Andi, Kröll Mark, Ziak Hermann, Kern Roman

Authorship Identification of Documents with High Content Similarity

Scientometrics, Wolfgang Glänzel, Springer Link, 2018

Journal
The goal of our work is inspired by the task of associating segments of text with their real authors. In this work, we focus on analyzing the way humans judge different writing styles. This analysis can help to better understand this process and thus to simulate or mimic such behavior accordingly. Unlike the majority of the work done in this field (i.e., authorship attribution, plagiarism detection, etc.), which uses content features, we focus only on the stylometric, i.e. content-agnostic, characteristics of authors. Therefore, we conducted two pilot studies to determine whether humans can identify authorship among documents with high content similarity. The first was a quantitative experiment involving crowd-sourcing, while the second was a qualitative one executed by the authors of this paper. Both studies confirmed that this task is quite challenging. To gain a better understanding of how humans tackle such a problem, we conducted an exploratory data analysis on the results of the studies. In the first experiment, we compared the decisions against content features and stylometric features, while in the second, the evaluators described the process and the features on which their judgment was based. The findings of our detailed analysis could (i) help to improve algorithms such as automatic authorship attribution as well as plagiarism detection, (ii) assist forensic experts or linguists in creating profiles of writers, (iii) support intelligence applications to analyze aggressive and threatening messages and (iv) help editors check conformity with, for instance, a journal-specific writing style.
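As a minimal sketch of what such content-agnostic (stylometric) features can look like, the following Python snippet computes a few simple style statistics for a text snippet. The feature set and the function-word list are hypothetical illustrations, not the features used in the study.

```python
import re
from collections import Counter

# Hypothetical, minimal set of content-agnostic (stylometric) features;
# the study's actual feature set is not reproduced here.
FUNCTION_WORDS = {"the", "of", "and", "to", "in", "that", "is", "was", "for", "with"}

def stylometric_features(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = max(len(words), 1)
    counts = Counter(words)
    return {
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(counts) / n_words,
        "function_word_ratio": sum(counts[w] for w in FUNCTION_WORDS) / n_words,
        "comma_rate": text.count(",") / n_words,
    }

if __name__ == "__main__":
    print(stylometric_features("This is a short example. It is only meant to illustrate the idea."))
```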
2018

Rexha Andi, Dragoni Mauro, Federici Marco

An Unsupervised Aspect Extraction Strategy For Monitoring Real-Time Reviews Stream

Elsevier, 2018

Journal
One of the most important opinion mining research directions falls in the extraction of polarities referring to specific entities (aspects) contained in the analyzed texts. The detection of such aspects may be very critical, especially when documents come from unknown domains. Indeed, while in some contexts it is possible to train domain-specific models for improving the effectiveness of aspect extraction algorithms, in others the most suitable solution is to apply unsupervised techniques, making such algorithms domain-independent and more efficient in a real-time environment. Moreover, an emerging need is to exploit the results of aspect-based analysis for triggering actions based on these data. This led to the necessity of providing solutions supporting both an effective analysis of user-generated content and an efficient and intuitive way of visualizing collected data. In this work, we implemented an opinion monitoring service providing (i) a set of unsupervised strategies for aspect-based opinion mining together with (ii) a monitoring tool supporting users in visualizing analyzed data. The aspect extraction strategies are based on the use of an open information extraction strategy. The effectiveness of the platform has been tested on benchmarks provided by the SemEval campaign and has been compared with the results obtained by domain-adapted techniques.
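To give a flavour of what a training-free, domain-independent aspect extraction step can look like, the sketch below treats noun chunks as candidate aspects and collects adjectives linked to them in the dependency tree. This is not the open information extraction strategy of the paper, only a simplified illustration; it assumes spaCy with the en_core_web_sm model installed.

```python
import spacy

# Simplified unsupervised aspect-extraction sketch: candidate aspects are noun
# chunks, opinion words are adjectives attached to them in the dependency tree.
nlp = spacy.load("en_core_web_sm")

def extract_aspects(review: str):
    doc = nlp(review)
    pairs = []
    for chunk in doc.noun_chunks:
        # adjectival modifiers attached directly to the aspect head ("great screen")
        opinion_words = [t.text for t in chunk.root.children if t.dep_ == "amod"]
        # predicative adjectives reached via the governing verb ("the battery is great")
        opinion_words += [t.text for t in chunk.root.head.children if t.dep_ == "acomp"]
        if opinion_words:
            pairs.append((chunk.text, opinion_words))
    return pairs

print(extract_aspects("The battery life is great but the touch screen feels sluggish."))
```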
2017

Rexha Andi, Kern Roman, Ziak Hermann, Dragoni Mauro

A semantic federated search engine for domain-specific document retrieval

SAC '17 Proceedings of the Symposium on Applied Computing, Sung Y. Shin, Dongwan Shin, Maria Lencastre, ACM, Marrakech, Morocco, 2017

Conference
Retrieval of domain-specific documents became attractive for the Semantic Web community due to the possibility of integrating classic Information Retrieval (IR) techniques with semantic knowledge. Unfortunately, the gap between the construction of a full semantic search engine and the possibility of exploiting a repository of ontologies covering all possible domains is far from being filled. Recent solutions focused on the aggregation of different domain-specific repositories managed by third parties. In this paper, we present a semantic federated search engine developed in the context of the EEXCESS EU project. Through the developed platform, users are able to perform federated queries over repositories in a transparent way, i.e. without knowing how their original queries are transformed before being actually submitted. The platform implements a facility for plugging in new repositories and for creating, with the support of general purpose knowledge bases, knowledge graphs describing the content of each connected repository. Such knowledge graphs are then exploited for enriching queries performed by users.
2017

Rexha Andi, Kröll Mark, Ziak Hermann, Kern Roman

Extending Scientific Literature Search by Including the Author’s Writing Style

Fifth Workshop on Bibliometric-enhanced Information Retrieval, Atanassova, I.; Bertin, M.; Mayr, P., Springer, Aberdeen, UK, 2017

Conference
Our work is motivated by the idea to extend the retrieval of related scientific literature to cases where the relatedness also incorporates the writing style of individual scientific authors. We therefore conducted a pilot study to answer the question whether humans can identify authorship once the topological clues have been removed. As a first result, we found out that this task is challenging, even for humans. We also found some agreement between the annotators. To gain a better understanding of how humans tackle such a problem, we conducted an exploratory data analysis. Here, we compared the decisions against a number of topological and stylometric features. The outcome of our work should help to improve automatic authorship identification algorithms and to shape potential follow-up studies.
2017

Kern Roman, Falk Stefan, Rexha Andi

Know-Center at SemEval-2017 Task 10: Sequence Classification with the CODE Annotator

Proceedings of the 11th International Workshop on Semantic Evaluations (SemEval-2017), Isabelle Augenstein, Mrinal Das, Sebastian Riedel, Lakshmi Vikraman, Andrew McCallum, ACL, Vancouver, Canada, 2017

Conference
This paper describes our participation in SemEval-2017 Task 10, named ScienceIE (Machine Reading for Scientists). We competed in Subtasks 1 and 2, which consist, respectively, of identifying all the key phrases in scientific publications and labelling them with one of the three categories: Task, Process, and Material. These scientific publications are selected from the Computer Science, Material Sciences, and Physics domains. We followed a supervised approach for both subtasks by using a sequential classifier (CRF - Conditional Random Fields). For generating our solution we used a web-based application implemented in the EU-funded research project named CODE. Our system achieved an F1 score of 0.39 for Subtask 1 and 0.28 for Subtask 2.
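For readers unfamiliar with CRF-based sequence labelling, the toy sketch below shows the general setup (per-token feature dictionaries plus BIO-style tags) using the sklearn-crfsuite package. It does not reproduce the CODE annotator or the features used in the submission; the example sentence and tags are made up.

```python
import sklearn_crfsuite

# Toy illustration of CRF sequence labelling over key-phrase (BIO) tags.
def token_features(sent, i):
    word = sent[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "suffix3": word[-3:],
        "prev": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

train_sents = [["We", "apply", "conditional", "random", "fields", "."]]
train_tags = [["O", "O", "B-Process", "I-Process", "I-Process", "O"]]

X = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
y = train_tags

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```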
2017

Rexha Andi, Kröll Mark, Ziak Hermann, Kern Roman

Pilot study: Ranking of textual snippets based on the writing style

Zenodo, 2017

In this pilot study, we tried to capture humans' behavior when identifying the authorship of text snippets. At first, we selected textual snippets from the introduction of scientific articles written by single authors. Later, we presented a source and four target snippets to the evaluators and asked them to rank the target snippets from the most to the least similar in terms of writing style. The dataset is composed of 66 experiments, manually checked to ensure that the ranking offers no obvious hints to the evaluators. For each experiment, we have evaluations from three different evaluators. We present each experiment in a single line (in the CSV file), where we first give the metadata of the Source-Article (Journal, Title, Authorship, Snippet), then the metadata for the 4 target snippets (Journal, Title, Authorship, Snippet, Written From the same Author, Published in the same Journal) and the ranking given by each evaluator. This task was performed on the crowdsourcing platform CrowdFlower. The headers of the CSV are self-explanatory. In the TXT file, you can find a human-readable version of the experiment. For more information about the extraction of the data, please consider reading our paper: "Extending Scientific Literature Search by Including the Author’s Writing Style" @BIR: http://www.gesis.org/en/services/events/events-archive/conferences/ecir-workshops/ecir-workshop-2017
2016

Ziak Hermann, Rexha Andi, Kern Roman

KNOW At The Social Book Search Lab 2016 Mining Track

CLEF 2016 Social Book Search Lab, Krisztian Balog, Linda Cappellato, Nicola Ferro, Craig Macdonald, Springer, Évora, Portugal, 2016

Conference
This paper describes our system for the mining task of the Social Book Search Lab in 2016. The track consisted of two tasks: the classification of book request postings and the task of linking book identifiers with references mentioned within the text. For the classification task we used text mining features like n-grams and vocabulary size, but also included advanced features like the average number of spelling errors found within the text. Two datasets were provided by the organizers for this task, which were evaluated separately. The second task, the linking of book titles to a work identifier, was addressed by an approach based on lookup tables. For the dataset of the first task our approach was ranked third, following two baseline approaches of the organizers, with an accuracy of 91 percent. For the second dataset we achieved second place with an accuracy of 82 percent. Our approach secured the first place with an F-score of 33.50 for the second task.
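The lookup-table idea behind the linking task can be pictured in a few lines of Python; the titles, work identifiers, and normalization rule below are invented for illustration and do not reflect the actual tables used by the system.

```python
import re

# Minimal sketch of linking mentioned book titles to work identifiers
# via a lookup table over normalized titles (all entries are made up).
def normalize(title: str) -> str:
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

lookup = {
    normalize("The Name of the Wind"): "work:123",
    normalize("A Game of Thrones"): "work:456",
}

def link_titles(post: str):
    norm_post = normalize(post)
    return [work_id for norm_title, work_id in lookup.items() if norm_title in norm_post]

print(link_titles("I just finished A Game of Thrones, any similar recommendations?"))
```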
2016

Dragoni Mauro, Rexha Andi, Kröll Mark, Kern Roman

Polarity Classification for Target Phrases in Tweets: A Word2Vec approach

The Semantic Web, ESWC 2016 Satellite Events, ESWC 2016, Springer-Verlag, Crete, Greece, 2016

Conference
Twitter is one of the most popular micro-blogging services on the web. The service allows sharing, interaction and collaboration via short, informal and often unstructured messages called tweets. Polarity classification of tweets refers to the task of assigning a positive or a negative sentiment to an entire tweet. Quite similar is predicting the polarity of a specific target phrase, for instance @Microsoft or #Linux, which is contained in the tweet. In this paper we present a Word2Vec approach to automatically predict the polarity of a target phrase in a tweet. In our classification setting, we thus do not have any polarity information but use only semantic information provided by a Word2Vec model trained on Twitter messages. To evaluate our feature representation approach, we apply well-established classification algorithms such as the Support Vector Machine and Naive Bayes. For the evaluation we used the SemEval 2016 Task #4 dataset. Our approach achieves F1-measures of up to ~90 % for the positive class and ~54 % for the negative class without using polarity information about single words.
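The core representation idea, averaging the Word2Vec vectors of the target phrase's tokens and handing the result to a standard classifier, can be sketched as follows. The tiny corpus, labels, and hyperparameters are placeholders and not the setup evaluated in the paper.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import SVC

# Sketch: represent a target phrase by the averaged Word2Vec vectors of its
# tokens and train a linear SVM on these features (toy data only).
corpus = [
    ["microsoft", "releases", "a", "great", "update"],
    ["linux", "crashes", "again", "what", "a", "mess"],
]
w2v = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)

def phrase_vector(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.wv.vector_size)

X = np.vstack([phrase_vector(["microsoft"]), phrase_vector(["linux"])])
y = ["positive", "negative"]

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([phrase_vector(["microsoft"])]))
```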
2016

Falk Stefan, Rexha Andi, Kern Roman

Know-Center at SemEval-2016 Task 5: Using Word Vectors with Typed Dependencies for Opinion Target Expression Extraction

Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), SemEval 2016, ACL Anthology, San Diego, USA, 2016

Conference
This paper describes our participation in SemEval-2016 Task 5 for Subtask 1, Slot 2. The challenge demands finding domain-specific target expressions on sentence level that refer to reviewed entities. The detection of target words is achieved by using word vectors and their grammatical dependency relationships to classify each word in a sentence as target or non-target. A heuristic-based function then expands the classified target words to the whole target phrase. Our system achieved an F1 score of 56.816% for this task.
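A per-token feature in this spirit, combining a word vector with a one-hot encoding of the typed dependency relation, could be built as in the sketch below. spaCy is used purely for illustration and is not necessarily the toolchain of the paper; note that the small en_core_web_sm model falls back to context tensors rather than pretrained word vectors.

```python
import numpy as np
import spacy

# Sketch of per-token features combining a word vector with the token's typed
# dependency relation; not the task's actual feature set or classifier.
nlp = spacy.load("en_core_web_sm")
DEP_LABELS = sorted(nlp.get_pipe("parser").labels)

def token_feature(token):
    dep_onehot = np.zeros(len(DEP_LABELS))
    if token.dep_ in DEP_LABELS:
        dep_onehot[DEP_LABELS.index(token.dep_)] = 1.0
    return np.concatenate([token.vector, dep_onehot])

doc = nlp("The pizza was cold but the service was friendly.")
features = np.vstack([token_feature(t) for t in doc])
print(features.shape)  # (n_tokens, vector_dim + n_dependency_labels)
```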
2016

Rexha Andi, Dragoni Mauro, Kern Roman, Kröll Mark

An Information Retrieval Based Approach for Multilingual Ontology Matching

International Conference on Applications of Natural Language to Information Systems, Métais E., Meziane F., Saraee M., Sugumaran V., Vadera S., Springer, Salford, UK, 2016

Conference
Ontology matching in a multilingual environment consists of finding alignments between ontologies modeled by using more than one language. Such a research topic combines traditional ontology matching algorithms with the use of multilingual resources, services, and capabilities for easing multilingual matching. In this paper, we present a multilingual ontology matching approach based on Information Retrieval (IR) techniques: ontologies are indexed through an inverted index algorithm and candidate matches are found by querying such indexes. We also exploit the hierarchical structure of the ontologies by adopting the PageRank algorithm for our system. The approaches have been evaluated using a set of domain-specific ontologies belonging to the agricultural and medical domain. We compare our results with existing systems following an evaluation strategy closely resembling a recommendation scenario. The version of our system using PageRank showed an increase in performance in our evaluations.
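To make the two ingredients named above concrete, here is a toy combination of an inverted index over target-ontology labels with a PageRank prior computed on the ontology hierarchy. All concepts, edges, and the scoring formula are invented for illustration and do not reproduce the paper's system.

```python
from collections import defaultdict
import networkx as nx

# Toy target ontology: concept labels plus a (made-up) child -> parent hierarchy.
target_concepts = {"c1": "crop production", "c2": "crop", "c3": "animal production"}
hierarchy = [("c1", "c2"), ("c3", "c2")]

# Inverted index from label tokens to concept identifiers.
index = defaultdict(set)
for cid, label in target_concepts.items():
    for tok in label.split():
        index[tok].add(cid)

# Structural centrality of target concepts via PageRank on the hierarchy.
pagerank = nx.pagerank(nx.DiGraph(hierarchy))

def match(source_label: str):
    tokens = source_label.lower().split()
    candidates = set.union(*(index.get(t, set()) for t in tokens)) if tokens else set()
    # score = token overlap weighted by structural centrality (illustrative only)
    scored = {
        cid: sum(cid in index.get(t, set()) for t in tokens) * pagerank.get(cid, 0.0)
        for cid in candidates
    }
    return sorted(scored.items(), key=lambda kv: -kv[1])

print(match("crop production"))
```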
2016

Kern Roman, Klampfl Stefan, Rexha Andi

Identifying Referenced Text in Scientific Publications by Summarisation and Classification Techniques

BIRNDL 2016 Joint Workshop on Bibliometric-enhanced Information Retrieval and NLP for Digital Libraries, G. Cabanac, Muthu Kumar Chandrasekaran, Ingo Frommholz, Kokil Jaidka, Min-Yen Kan, Philipp Mayr, Dietmar Wolfram, ACM, New Jersey, USA, 2016

Conference
This report describes our contribution to the 2nd Computational Linguistics Scientific Document Summarization Shared Task (CL-SciSumm 2016), which asked to identify the relevant text span in a reference paper that corresponds to a citation in another document that cites this paper. We developed three different approaches based on summarisation and classification techniques. First, we applied a modified version of an unsupervised summarisation technique, TextSentenceRank, to the reference document, which incorporates the similarity of sentences to the citation on a textual level. Second, we employed classification to select from candidates previously extracted through the original TextSentenceRank algorithm. Third, we used unsupervised summarisation of the relevant sub-part of the document that was previously selected in a supervised manner.
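A generic TextRank-style sentence ranking, biased towards the citation text through PageRank personalization, conveys the flavour of the first approach. It is a simplified stand-in based on token overlap, not the TextSentenceRank implementation used in the report.

```python
import re
import networkx as nx

# Simplified citation-biased sentence ranking: build a sentence-similarity
# graph from token overlap and rank sentences with personalized PageRank.
def tokenize(s):
    return set(re.findall(r"[a-z]+", s.lower()))

def rank_sentences(sentences, citation):
    cit_tokens = tokenize(citation)
    toks = [tokenize(s) for s in sentences]
    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            overlap = len(toks[i] & toks[j])
            if overlap:
                g.add_edge(i, j, weight=overlap)
    # bias the random walk towards sentences similar to the citation
    personalization = {i: 1.0 + len(toks[i] & cit_tokens) for i in range(len(sentences))}
    scores = nx.pagerank(g, personalization=personalization, weight="weight")
    return sorted(range(len(sentences)), key=lambda i: -scores[i])

sents = [
    "We propose a new summarisation method.",
    "The method ranks sentences with a graph based algorithm.",
    "Experiments show improvements over the baseline.",
]
print(rank_sentences(sents, "a graph based ranking of sentences"))
```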
2016

Pimas Oliver, Rexha Andi, Kröll Mark, Kern Roman

Profiling microblog authors using concreteness and sentiment - Know-Center at PAN 2016 author profiling

PAN 2016, Krisztian Balog, Linda Cappellato, Nicola Ferro, Craig Macdonald, Springer, Evora, Portugal, 2016

Conference
The PAN 2016 author profiling task is a supervised classification problem on cross-genre documents (tweets, blog and social media posts). Our system makes use of concreteness, sentiment and syntactic information present in the documents. We train a random forest model to identify gender and age of a document’s author. We report the evaluation results received by the shared task.
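The classification setup itself is standard; a toy sketch with scikit-learn is shown below, where the three-dimensional document vectors (for example a concreteness score, a sentiment score, and a syntactic ratio) and the labels are entirely made up.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy sketch: each document reduced to a small feature vector, a random forest
# predicts an author trait (here gender); all numbers and labels are invented.
X_train = [
    [0.61, 0.20, 0.15],
    [0.45, -0.30, 0.22],
    [0.70, 0.10, 0.12],
    [0.40, -0.10, 0.25],
]
y_train = ["female", "male", "female", "male"]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(model.predict([[0.55, 0.05, 0.18]]))
```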
2016

Rexha Andi, Kröll Mark, Kern Roman

Social Media Monitoring for Companies: A 4W Summarisation Approach

European Conference on Knowledge Management, Dr. Sandra Moffett and Dr. Brendan Galbraith, Academic Conferences and Publishing International Limited, Belfast, Northern Ireland, UK, 2016

Conference
Monitoring (social) media represents one means for companies to gain access to knowledge about, for instance, competitors, products as well as markets. As a consequence, social media monitoring tools have been gaining attention to handle the amounts of data nowadays generated in social media. These tools also include summarisation services. However, most summarisation algorithms tend to focus on (i) first and last sentences respectively or (ii) sentences containing keywords. In this work we approach the task of summarisation by extracting 4W (who, when, where, what) information from (social) media texts. Presenting 4W information allows for a more compact content representation than traditional summaries. In addition, we depart from mere named entity recognition (NER) techniques to answer these four question types by including non-rigid designators, i.e. expressions which do not refer to the same thing in all possible worlds, such as “at the main square” or “leaders of political parties”. To do that, we employ dependency parsing to identify grammatical characteristics for each question type. Every sentence is then represented as a 4W block. We perform two different preliminary studies: selecting sentences that better summarise texts, achieving an F1-measure of 0.343, as well as a 4W block extraction for which we achieve F1-measures of 0.932, 0.900, 0.803 and 0.861 for the “who”, “when”, “where” and “what” categories respectively. In a next step the 4W blocks are ranked by relevance. The top three ranked blocks, for example, then constitute a summary of the entire textual passage. The relevance metric can be customised to the user’s needs, for instance, ranked by up-to-dateness where the sentences’ tense is taken into account. In a user study we evaluate different ranking strategies including (i) up-to-dateness, (ii) text sentence rank, (iii) selecting the first and last sentences or (iv) coverage of named entities, i.e. based on the number of named entities in the sentence. Our 4W summarisation method presents a valuable addition to a company’s (social) media monitoring toolkit, thus supporting decision making processes.
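A rough illustration of how dependency relations and named entities can be mapped onto a 4W block is given below. The rule set is deliberately minimal and does not reproduce the system described above; it assumes spaCy with the en_core_web_sm model installed.

```python
import spacy

# Minimal 4W sketch: map subjects to "who", objects to "what", and named
# entities to "when"/"where"; the real system uses richer dependency rules.
nlp = spacy.load("en_core_web_sm")

def four_w(sentence: str) -> dict:
    doc = nlp(sentence)
    block = {"who": [], "when": [], "where": [], "what": []}
    for token in doc:
        if token.dep_ in ("nsubj", "nsubjpass"):
            block["who"].append(" ".join(t.text for t in token.subtree))
        elif token.dep_ in ("dobj", "attr"):
            block["what"].append(" ".join(t.text for t in token.subtree))
    for ent in doc.ents:
        if ent.label_ in ("DATE", "TIME"):
            block["when"].append(ent.text)
        elif ent.label_ in ("GPE", "LOC", "FAC"):
            block["where"].append(ent.text)
    return block

print(four_w("Leaders of political parties met at the main square on Friday."))
```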
2016

Rexha Andi, Klampfl Stefan, Kröll Mark, Kern Roman

Towards a more fine grained analysis of scientific authorship: Predicting the number of authors using stylometric features

BIR 2016 Workshop on Bibliometric-enhanced Information Retrieval, Atanassova, I.; Bertin, M.; Mayr, P., Springer, Padova, Italy, 2016

Conference
To bring bibliometrics and information retrieval closer together, we propose to add the concept of author attribution into the pre-processing of scientific publications. Presently, common bibliographic metrics often attribute the entire article to all the authors, affecting author-specific retrieval processes. We envision a more fine-grained analysis of scientific authorship by attributing particular segments to authors. To realize this vision, we propose a new feature representation of scientific publications that captures the distribution of stylometric features. In a classification setting, we then seek to predict the number of authors of a scientific article. We evaluate our approach on a data set of ~6,100 PubMed articles and achieve best results by applying random forests, i.e., 0.76 precision and 0.76 recall averaged over all classes.
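The representation idea (describe an article by the distribution of a stylometric feature over its segments and classify that summary) can be sketched as follows. The single feature, the synthetic articles, and the labels are placeholders rather than the paper's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Sketch: summarise the distribution of a per-segment stylometric feature
# (here just average word length) and predict the number of authors.
def avg_word_length(segment: str) -> float:
    words = segment.split()
    return sum(len(w) for w in words) / max(len(words), 1)

def article_representation(segments):
    values = np.array([avg_word_length(s) for s in segments])
    return [values.mean(), values.std(), values.min(), values.max()]

articles = [
    ["Short words here.", "More short words."],            # pretend: 1 author
    ["Terse text.", "Substantially elaborate phrasing."],  # pretend: 2 authors
]
labels = [1, 2]

X = [article_representation(a) for a in articles]
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)
print(clf.predict([article_representation(["Another brief sample.", "Yet more text."])]))
```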
2016

Rexha Andi, Kern Roman, Dragoni Mauro, Kröll Mark

Exploiting Propositions for Opinion Mining

ESWC-16 Challenge on Semantic Sentiment Analysis, Springer Link, Springer-Verlag, Crete, Greece, 2016

Conference
With different social media and commercial platforms, users express their opinion about products in a textual form. Automatically extracting the polarity (i.e. whether the opinion is positive or negative) of a user can be useful for both actors: the online platform incorporating the feedback to improve their product as well as the client who might get recommendations according to his or her preferences. Different approaches for tackling the problem have been suggested, mainly using syntactic features. The “Challenge on Semantic Sentiment Analysis” aims to go beyond the word-level analysis by using semantic information. In this paper we propose a novel approach by employing the semantic information of a grammatical unit called proposition. We try to derive the target of the review from the summary information, which serves as an input to identify the proposition in it. Our implementation relies on the hypothesis that the proposition expressing the target of the summary usually contains the main polarity information.
2015

Rexha Andi, Klampfl Stefan, Kröll Mark, Kern Roman

Towards Authorship Attribution for Bibliometrics using Stylometric Features

Proc. of the Workshop Mining Scientific Papers: Computational Linguistics and Bibliometrics, Atanassova, I.; Bertin, M.; Mayr, P., ACL Anthology, Istanbul, Turkey, 2015

Conference
The overwhelming majority of scientific publications are authored by multiple persons; yet, bibliographic metrics are only assigned to individual articles as single entities. In this paper, we aim at a more fine-grained analysis of scientific authorship. We therefore adapt a text segmentation algorithm to identify potential author changes within the main text of a scientific article, which we obtain by using existing PDF extraction techniques. To capture stylistic changes in the text, we employ a number of stylometric features. We evaluate our approach on a small subset of PubMed articles consisting of an approximately equal number of research articles written by a varying number of authors. Our results indicate that the more authors an article has the more potential author changes are identified. These results can be considered as an initial step towards a more detailed analysis of scientific authorship, thereby extending the repertoire of bibliometrics.
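The core intuition, that stylometric profiles of adjacent text windows diverge where the author changes, can be illustrated with a deliberately simplified sketch; the paper itself adapts a proper text segmentation algorithm and uses a richer feature set, and the threshold below is arbitrary.

```python
import numpy as np

# Simplified author-change detection: describe consecutive text windows with a
# few stylometric numbers and flag boundaries where adjacent windows differ.
def window_features(window: str) -> np.ndarray:
    words = window.split()
    n = max(len(words), 1)
    return np.array([
        sum(len(w) for w in words) / n,        # average word length
        window.count(",") / n,                 # comma rate
        sum(w.isupper() for w in words) / n,   # share of all-caps tokens
    ])

def potential_boundaries(windows, threshold=0.5):
    feats = [window_features(w) for w in windows]
    return [
        i for i in range(1, len(feats))
        if np.linalg.norm(feats[i] - feats[i - 1]) > threshold
    ]

windows = ["Short, plain text, nothing fancy.",
           "Rather elaborate formulations, extensively qualified, follow here."]
print(potential_boundaries(windows))
```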