Lex Elisabeth, Granitzer Michael, Juffinger A.
2010
A Comparison of Stylometric and Lexical Features for Web Genre Classification and Emotion Classification in Blogs
IEEE Computer Society: 7th International Workshop on Text-based Information Retrieval, in Proceedings of the 21st International Conference on Database and Expert Systems Applications (DEXA 10). IEEE
The amount of digital content in the blogosphere is growing, which poses new challenges for search engines. As information needs change, automatic methods are required to help blog search users filter information by different facets. In our work, we aim to support blog search with genre and facet information. Since we focus on the news genre, we classify blogs into news versus rest. We also assess the emotionality facet in news-related blogs so that users can identify people’s feelings towards specific events. We evaluate the performance of text classifiers with lexical and stylometric features to determine the best-performing combination for our tasks. Our experiments on a subset of the TREC Blogs08 dataset reveal that classifiers trained on lexical features consistently outperform classifiers trained on the best stylometric features.
Lex Elisabeth, Granitzer Michael, Juffinger A.
2010
Facet Classification of Blogs: Know-Center at the TREC 2009 Blog Distillation Task
Proceedings of the 18th Text REtrieval Conference
In this paper, we outline our experiments carried out at the
TREC 2009 Blog Distillation Task. Our system is based on a plain text
index extracted from the XML feeds of the TREC Blogs08 dataset. This
index was used to retrieve candidate blogs for the given topics. The
resulting blogs were classified using a Support Vector Machine that was
trained on a manually labelled subset of the TREC Blogs08 dataset. Our
experiments included three runs on different features: firstly on nouns,
secondly on stylometric properties, and thirdly on punctuation statistics.
The facet identification based on our approach was successful, although
a significant number of candidate blogs were not retrieved at all.
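The abstract above names stylometric properties and punctuation statistics as feature sets but does not list the individual features. The following is a minimal sketch of what such features could look like, under the assumption that they are simple per-document ratios; the function names and the selection of punctuation marks are hypothetical and not taken from the paper.

```python
import string
from collections import Counter


def punctuation_features(text: str) -> dict[str, float]:
    """Relative frequency of selected punctuation marks per character.

    The set of marks is an illustrative assumption.
    """
    total = len(text)
    counts = Counter(ch for ch in text if ch in string.punctuation)
    # Guard against empty input to avoid division by zero.
    return {mark: (counts[mark] / total if total else 0.0) for mark in "!?.,;:"}


def stylometric_features(text: str) -> dict[str, float]:
    """Two simple style ratios; both are illustrative assumptions."""
    words = text.split()
    n_words = len(words)
    # Word length is measured after stripping surrounding punctuation.
    total_length = sum(len(w.strip(string.punctuation)) for w in words)
    n_upper = sum(1 for ch in text if ch.isupper())
    return {
        "avg_word_length": total_length / n_words if n_words else 0.0,
        "uppercase_ratio": n_upper / len(text) if text else 0.0,
    }
```

Feature dictionaries of this kind could be turned into numeric vectors and passed to a classifier such as a Support Vector Machine.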
Granitzer Michael, Lex Elisabeth, Juffinger A.
2009
Blog Credibility Ranking by Exploiting Verified Content
Proceedings of the 3rd Workshop on Information Credibility on the Web at 18th World Wide Web Conference
People use weblogs to express thoughts, present ideas and
share knowledge. However, weblogs can also be misused to
influence and manipulate the readers. Therefore the credibility
of a blog has to be validated before the available information
is used for analysis. The credibility of a blog entry
is derived from the content, the credibility of the author or
blog itself, respectively, and the external references or trackbacks.
In this work we introduce an additional dimension
for assessing credibility, namely the quantity structure. Our
blog analysis system therefore derives credibility
from two dimensions. Firstly, the quantity structure of a set
of blogs is compared with that of a reference corpus, and secondly, we
analyse the content of each blog separately and examine its similarity
to a verified news corpus. From the content similarity
values we derive a ranking function. Our evaluation showed
that non-credible blogs can be filtered out by quantity structure
without deeper analysis. In addition, the content-based ranking
function sorts the blogs by credibility with high accuracy.
Our blog analysis system is therefore capable of providing
credibility levels per blog.
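The content-based ranking described above can be sketched as follows. The abstract does not state which similarity measure or aggregation the authors used, so this example assumes cosine similarity over raw term counts and scores each blog by its best match in the verified news corpus; all names are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Bag-of-words term counts over lowercased alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def rank_by_credibility(blogs: dict[str, str],
                        verified_news: list[str]) -> list[tuple[str, float]]:
    """Rank blogs by their highest similarity to any verified news text.

    Taking the maximum is an assumption made for this sketch.
    """
    news_vectors = [term_vector(n) for n in verified_news]
    scores = {}
    for blog_id, text in blogs.items():
        vec = term_vector(text)
        scores[blog_id] = max((cosine(vec, nv) for nv in news_vectors),
                              default=0.0)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A blog whose text shares no terms with the verified corpus receives a score of 0.0 and ends up at the bottom of the ranking.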
Lex Elisabeth, Juffinger A.
2009
People use weblogs to express thoughts, present ideas and
share knowledge. Weblogs are therefore extraordinarily valuable
resources for, among other things, trend analysis. Trends are
derived from the chronological sequence of blog post count
per topic. The comparison with a reference corpus allows
qualitative statements about identified trends. We propose a
cross-language blog mining and trend visualisation system to
analyse blogs across languages and topics. The trend visualisation
facilitates the identification of trends and the comparison
with the reference news article corpus. To prove the
correctness of our system we computed the correlation between
trends in blogs and news articles for a subset of blogs
and topics. The evaluation corroborated our hypothesis of
a high correlation coefficient for these subsets, and
the correctness of our system for different languages and
topics is therefore proven.
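The correlation between blog and news trends mentioned above can be illustrated with a short sketch. The abstract does not name the correlation measure, so this example assumes the Pearson coefficient computed over two equally long series of post counts per time interval; the function name is hypothetical.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length of at least 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # The coefficient is undefined for a constant series.
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold the blog post counts per interval for one topic and `ys` the news article counts for the same intervals; a value close to 1 indicates that the two trends rise and fall together.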