Over the last decades, the amount of information available to researchers has increased
several-fold, making searches more difficult. Information Retrieval (IR) systems are
therefore needed.
In this master's thesis, a tool has been developed to create a dataset with metadata of
scientific articles. This tool parses articles from PubMed, extracts metadata from them
and saves the metadata in a relational database. Once all the articles have been parsed, the
tool generates three XML files with that metadata: Articles.xml, ExtendedArticles.xml and
Citations.xml. The first file contains the title, authors and publication date of the parsed
articles and the articles referenced by them. The second one contains the abstract,
keywords, body and reference list of the parsed articles. Finally, the Citations.xml file
contains the citations found within the articles and their context.
The tool has been used to parse 45,000 articles. After the parsing, the database contains
644,906 articles with their title, authors and publication date, a count that includes the
articles referenced by the parsed ones. The articles of the dataset
form a digraph where the articles are the nodes and the references are the arcs. The
in-degree distribution of the network follows a power law: a small set of articles is
referenced very often, while most of the articles are rarely referenced.
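
As an illustration of the in-degree analysis described above, the following minimal Python sketch counts how often each article is referenced and builds the histogram of in-degrees. The function names, the representation of references as (citing, cited) pairs and the toy data are assumptions made for this example, not the code of the thesis.

```python
from collections import Counter

def in_degrees(edges):
    """Return {article_id: number of incoming references}.

    `edges` is an iterable of (citing_id, cited_id) pairs (assumed format).
    """
    degrees = Counter()
    for citing, cited in set(edges):  # duplicate arcs are counted once
        degrees[citing] += 0          # keep articles that are never cited
        degrees[cited] += 1
    return dict(degrees)

def degree_histogram(degrees):
    """Return {in_degree: number of articles with that in-degree}."""
    return dict(Counter(degrees.values()))

# Toy citation graph (hypothetical data): article D is cited most often.
edges = [("A", "D"), ("B", "D"), ("C", "D"), ("A", "B"), ("C", "B"), ("A", "C")]
degrees = in_degrees(edges)
histogram = degree_histogram(degrees)
```

When such a histogram is plotted on log-log axes, a power law distribution appears approximately as a straight line.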
Two IR systems have been developed to search the dataset: the Title Based IR and the
Citation Based IR. The first one compares the query of the user to the title of the articles,
computes the Jaccard index as a similarity measure and ranks the articles according to
their similarity. The second system compares the query to the paragraphs where the
citations were found. The analysis of both systems showed that the Citation Based IR
needed a longer execution time. Nevertheless, its recommendations were much better,
which showed that parsing the citations was worthwhile.
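
The ranking performed by the Title Based IR can be sketched as follows. The Jaccard index of two sets A and B is |A ∩ B| / |A ∪ B|. The tokenization (lowercased alphanumeric words), the function names and the sample titles are assumptions made for this example; the thesis may preprocess the text differently.

```python
import re

def tokenize(text):
    """Split text into a set of lowercase alphanumeric words (assumed preprocessing)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard index of two sets; defined here as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def rank_by_title(query, titles):
    """Return (similarity, title) pairs sorted from most to least similar."""
    query_terms = tokenize(query)
    scored = [(jaccard(query_terms, tokenize(title)), title) for title in titles]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

# Hypothetical titles, used only to show the ranking.
titles = [
    "Deep learning for images",
    "Gene expression in yeast cells",
    "Protein folding in yeast",
]
ranked = rank_by_title("yeast protein folding", titles)
```

Here the title "Protein folding in yeast" shares three of the four distinct words with the query, so it obtains a similarity of 0.75 and is ranked first.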