Here you will find scientific publications written by Know-Center staff


Lovric Mario, Molero Perez Jose Manuel, Kern Roman

PySpark and RDKit: moving towards Big Data in QSAR

Molecular Informatics, Wiley, 2019

The authors present an implementation of the cheminformatics toolkit RDKit in a distributed computing environment, Apache Hadoop. Together with the Apache Spark analytics engine, wrapped by PySpark, scalable commodity hardware can be employed for cheminformatics calculations and query operations, requiring only basic knowledge of Python programming and an understanding of resilient distributed datasets (RDD). Three use cases of cheminformatics computing in Spark on the Hadoop cluster are presented: querying substructures, calculating fingerprint similarity, and calculating molecular descriptors. The source code for the PySpark‐RDKit implementation is provided. The use cases showed that Spark scales reasonably well, depending on the use case, and can be a suitable choice for datasets too big to be processed with current low‐end workstations.
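The fingerprint similarity use case can be illustrated with a minimal sketch. It is not the authors' PySpark‐RDKit code: it assumes fingerprints are already available as sets of on-bit indices, and it assumes the Tanimoto coefficient as the similarity measure, which the abstract does not specify. The names `tanimoto` and `most_similar` are hypothetical. A per-record function of this kind is the sort of operation that could be mapped over an RDD.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices.

    Assumption: Tanimoto is used here only as a common choice; the abstract
    does not name the similarity measure.
    """
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def most_similar(query_fp, library, top_n=3):
    """Rank library entries (name -> fingerprint set) by similarity to the query."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


if __name__ == "__main__":
    # Toy fingerprints, invented purely for illustration.
    library = {"mol_a": {1, 2, 3}, "mol_b": {2, 3, 4}, "mol_c": {7, 8}}
    print(most_similar({1, 2, 3}, library, top_n=2))
```

Because each comparison depends only on the query and one library entry, the work is independent per record and can be distributed without coordination between records.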

Lovric Mario, Žuvela Petar, Kern Roman, Lucic Bono, J. Jay Liu, Tomasz Bączek

Machine learning methods for cross-column prediction of retention time in reversed-phased liquid chromatography

8th World Conference on Physico Chemical Methods in Drug Discovery and Development, IAPC, Split, Croatia, 2019

Quantitative structure-retention relationships (QSRR) were employed to build global models for predicting the chromatographic retention time of synthetic peptides across six RP-LC-MS/MS columns and varied experimental conditions. The global QSRR models were based on only three a priori selected molecular descriptors related to the retention mechanism of RP-LC separation of peptides: sum of gradient retention times of the 20 natural amino acids (logSumAA), van der Waals volume (logvdWvol.), and hydrophobicity (clogP). Three machine learning regression methods were compared: random forests (RF), partial least squares (PLS), and adaptive boosting (ADA). All models were optimized through 3-fold cross-validation (CV) and validated on an external validation set. The chemical domain of applicability was also defined. Percentage root mean square error of prediction (%RMSEP) was used as the external validation metric. RF achieved a %RMSEP of 14.99 %, PLS 40.561 %, and ADA 26.35 %. The ensemble models thus considerably outperformed the conventional PLS-based QSRR model. Novel methods of tree-based model explainability were employed to reveal the mechanisms behind the black-box global ensemble QSRR models. The models revealed the highest feature importance for sum of gradient retention times (logSumAA), followed by van der Waals volume (logvdWvol.) and hydrophobicity (clogP). The promising results of this study show the potential of machine learning for improved peptide identification, retention time standardization, and integration into state-of-the-art LC-MS/MS proteomics workflows.
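As a rough illustration of the validation metric, the sketch below computes a percentage RMSEP from observed and predicted retention times. The abstract does not give the formula; this assumes the RMSEP is normalized by the mean observed value, which is one common convention and may differ from the authors' definition. The function name `percent_rmsep` and the example values are hypothetical.

```python
import math


def percent_rmsep(observed, predicted):
    """Root mean square error of prediction as a percentage of the mean observed value.

    Assumption: normalization by the mean of the observed values; the source
    abstract does not state which normalization was used.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equally long")
    n = len(observed)
    mean_sq_error = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n
    return 100.0 * math.sqrt(mean_sq_error) / (sum(observed) / n)


if __name__ == "__main__":
    # Invented retention times (minutes), for illustration only.
    observed = [10.0, 20.0, 30.0]
    predicted = [12.0, 18.0, 30.0]
    print(f"%RMSEP = {percent_rmsep(observed, predicted):.2f} %")
```

Expressing the error relative to the mean observed retention time makes values comparable across columns with different retention time ranges, which is presumably why a percentage metric was chosen for cross-column models.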