Granitzer Michael, Kienreich Wolfgang, Sabol Vedran, Lex Elisabeth
2010
Technological advances and paradigmatic changes in the utilization of the World Wide Web have transformed the information seeking strategies of media consumers and invalidated traditional business models of media providers. We discuss relevant aspects of this development and present a knowledge relationship discovery pipeline to address the requirements of media providers and media consumers. We also propose visually enhanced access methods to bridge the gap between complex media services and the information needs of the general public. We conclude that a combination of advanced processing methods and visualizations will enable media providers to take the step from content-centered to service-centered business models and, at the same time, will help media consumers to better satisfy their personal information needs.
Kern Roman, Granitzer Michael, Muhr M.
2010
Word sense induction and discrimination (WSID) identifies the senses of an ambiguous word and assigns instances of this word to one of these senses. We have built a WSID system that exploits syntactic and semantic features based on the results of a natural language parser component. To achieve high robustness and good generalization capabilities, we designed our system to work on a restricted, but grammatically rich set of features. Based on the evaluation results, our system shows promising performance and robustness.
Kern Roman, Granitzer Michael, Muhr M.
2010
Cluster label quality is crucial for browsing topic hierarchies obtained via document clustering. Intuitively, the hierarchical structure should influence the labeling accuracy. However, most labeling algorithms ignore such structural properties, and therefore the impact of hierarchical structures on the labeling accuracy is still unclear. In our work we integrate hierarchical information, i.e. sibling and parent-child relations, in the cluster labeling process. We adapt standard labeling approaches, namely Maximum Term Frequency, Jensen-Shannon Divergence, χ² Test, and Information Gain, to make use of those relationships and evaluate their impact on 4 different datasets, namely the Open Directory Project, Wikipedia, TREC Ohsumed and the CLEF-IP European Patent dataset. We show that hierarchical relationships can be exploited to increase labeling accuracy, especially on high-level nodes.
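As an illustration of how sibling information can enter a Jensen-Shannon Divergence labeler, the following is a minimal sketch and not the authors' implementation. It assumes that documents are already tokenized, that the documents of all sibling clusters serve as the contrast distribution, and that a term's score is its individual contribution to the divergence. The function names `term_distribution`, `jsd_term_scores` and `label_cluster` are hypothetical.

```python
import math
from collections import Counter


def term_distribution(docs):
    """Relative term frequencies over a list of tokenized documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def jsd_term_scores(cluster_docs, reference_docs):
    """Per-term contribution to the Jensen-Shannon divergence between the
    cluster and a reference distribution (assumption: the siblings).
    Only terms that are over-represented in the cluster are kept."""
    p = term_distribution(cluster_docs)
    q = term_distribution(reference_docs)
    scores = {}
    for term, p_t in p.items():
        q_t = q.get(term, 0.0)
        m_t = 0.5 * (p_t + q_t)  # mixture probability of the term
        contrib = 0.5 * p_t * math.log2(p_t / m_t)
        if q_t > 0.0:
            contrib += 0.5 * q_t * math.log2(q_t / m_t)
        if p_t > q_t:
            scores[term] = contrib
    return scores


def label_cluster(cluster_docs, sibling_docs, k=2):
    """Return the k highest-scoring terms as candidate labels."""
    scores = jsd_term_scores(cluster_docs, sibling_docs)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


cluster = [["python", "code", "programming"], ["python", "snake", "code"]]
siblings = [["java", "code", "programming"], ["java", "code", "coffee"]]
labels = label_cluster(cluster, siblings)
```

In this toy example, terms shared equally with the siblings ("code", "programming") receive no score, so only terms that distinguish the cluster from its siblings remain as label candidates. A parent-based variant could pass the parent's documents as the reference instead.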
Lex Elisabeth, Granitzer Michael, Juffinger A.
2010
In the blogosphere, the amount of digital content is expanding, which poses new challenges for search engines. Due to changing information needs, automatic methods are needed to help blog search users filter information by different facets. In our work, we aim to support blog search with genre and facet information. Since we focus on the news genre, our approach is to classify blogs into news versus rest. Also, we assess the emotionality facet in news-related blogs to enable users to identify people's feelings towards specific events. Our approach is to evaluate the performance of text classifiers with lexical and stylometric features to determine the best performing combination for our tasks. Our experiments on a subset of the TREC Blogs08 dataset reveal that classifiers trained on lexical features perform consistently better than classifiers trained on the best stylometric features.
Lex Elisabeth, Granitzer Michael, Juffinger A.
2010
In this paper, we outline our experiments carried out at the TREC 2009 Blog Distillation Task. Our system is based on a plain text index extracted from the XML feeds of the TREC Blogs08 dataset. This index was used to retrieve candidate blogs for the given topics. The resulting blogs were classified using a Support Vector Machine that was trained on a manually labelled subset of the TREC Blogs08 dataset. Our experiments included three runs on different features: firstly on nouns, secondly on stylometric properties, and thirdly on punctuation statistics. The facet identification based on our approach was successful, although a significant number of candidate blogs were not retrieved at all.
Granitzer Michael, Kienreich Wolfgang
2010
Granitzer Michael
2010
Term weighting strongly influences the performance of text mining and information retrieval approaches. Usually term weights are determined through statistical estimates based on static weighting schemes. Such static approaches lack the capability to generalize to different domains and different data sets. In this paper, we introduce an on-line learning method for adapting term weights in a supervised manner. Via stochastic optimization we determine a linear transformation of the term space to approximate expected similarity values among documents. We evaluate our approach on 18 standard text data sets and show that the performance improvement of a k-NN classifier ranges between 1% and 12% by using adaptive term weighting as preprocessing step. Further, we provide empirical evidence that our approach is efficient to cope with larger problems.
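The idea of learning term weights on-line can be sketched as follows. This is a simplified illustration under stated assumptions, not the method from the paper: the linear transformation is restricted to a diagonal one (one non-negative weight per term), the expected similarity of two documents is taken to be 1 if they share a class label and 0 otherwise, and the weights are fitted with plain stochastic gradient descent on the squared error. The names `weighted_similarity` and `train_term_weights` and all parameter values are hypothetical.

```python
import random


def weighted_similarity(w, x, y):
    """Dot product of two sparse documents (dict: term -> value),
    with each shared term scaled by its learned weight."""
    return sum(w[t] * x[t] * y[t] for t in x.keys() & y.keys())


def train_term_weights(docs, labels, epochs=100, lr=0.1, seed=0):
    """Fit one weight per term by stochastic gradient descent so that the
    weighted similarity approximates the target similarity
    (assumption: 1.0 for same-class pairs, 0.0 otherwise)."""
    rng = random.Random(seed)
    w = {t: 1.0 for doc in docs for t in doc}  # start from uniform weights
    pairs = [(i, j) for i in range(len(docs)) for j in range(i + 1, len(docs))]
    for _ in range(epochs):
        rng.shuffle(pairs)  # visit document pairs in random order
        for i, j in pairs:
            target = 1.0 if labels[i] == labels[j] else 0.0
            err = weighted_similarity(w, docs[i], docs[j]) - target
            for t in docs[i].keys() & docs[j].keys():
                # gradient step on 0.5 * err**2, weights clipped at zero
                w[t] = max(0.0, w[t] - lr * err * docs[i][t] * docs[j][t])
    return w


docs = [
    {"the": 1.0, "goal": 1.0},
    {"the": 1.0, "goal": 1.0},
    {"the": 1.0, "vote": 1.0},
    {"the": 1.0, "vote": 1.0},
]
labels = ["sport", "sport", "politics", "politics"]
weights = train_term_weights(docs, labels)
```

On this toy data the term "the" occurs in both classes, so its weight is driven towards zero, while the class-specific terms "goal" and "vote" keep weights close to one. The reweighted vectors could then be passed to a k-NN classifier as a preprocessing step.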