GSoC '17 - MSDK-IO Library

Introduction

Mass Spectrometry Development Kit (MSDK; http://msdk.github.io ) is a Java library of algorithms for processing mass spectrometry data. It provides a flexible data model with Java interfaces for mass-spectrometry related objects, including raw spectra, processed data sets, identifications, etc. The goal of MSDK is to integrate multiple existing algorithms that are currently scattered around various Java-based graphical mass spectrometry tools (MZmine, Maltcms, MassCascade, Guineu, mzMatch, OpenChrom, mzJava, and others).

There are many file formats in use for mass spectrometry data. For interoperability of algorithms in Java-based processing workflows, there was an urgent need for a high-performance Java library that would support reading and writing these common formats. In the past, several projects have developed custom parsers for various formats. However, none of the existing libraries provides high-performance random access to the data.

As part of my Google Summer of Code 2017 project, and with the help of my mentors Tomáš Pluskal, Dmitry Avtonomov and Adam Tenderholt, I built native Java parsers that can read mzML, mzXML and netCDF files using random access. We also added functionality to the API to write to these formats, which makes it possible to convert data from one format to another.

The pull requests opened during the project, and the discussion that took place before they were merged, can be found here - Pull Requests opened during the project

Description

The mzML file parser and writer

MSDK's mzML IO package, which existed before the start of this project, used a library called jmzML to parse the data. However, it could only read the data sequentially and was very slow.

We started working on a new mzML parser to address the drawbacks of the existing one. The new parser reads the given input mzML data as a stream. The metadata is parsed first and, alongside it, the positions of the intensity, m/z and retention time arrays are noted. This data is then parsed further on demand. All these operations are done without converting the stream, which saves computation time and memory.
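The two-phase idea described above (note positions first, decode later) can be sketched roughly as follows. This is not the MSDK implementation: the class and method names are hypothetical, the file is assumed to be already in memory as a byte array, and the binary payload is assumed to be uncompressed Base64 holding little-endian 64-bit floats.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

// Hypothetical sketch, not part of MSDK.
final class LazyBinarySketch {

    /** Byte range [start, end) of one <binary> payload inside the file. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** First pass: record where each <binary> payload sits, without decoding it. */
    static List<Span> indexBinarySpans(byte[] data) {
        byte[] open = "<binary>".getBytes(StandardCharsets.US_ASCII);
        byte[] close = "</binary>".getBytes(StandardCharsets.US_ASCII);
        List<Span> spans = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = indexOf(data, open, pos);
            if (start < 0) break;
            start += open.length;
            int end = indexOf(data, close, start);
            if (end < 0) break;
            spans.add(new Span(start, end));
            pos = end + close.length;
        }
        return spans;
    }

    /** On demand: decode one payload (assumed Base64, little-endian doubles). */
    static double[] decode(byte[] data, Span span) {
        byte[] encoded = Arrays.copyOfRange(data, span.start, span.end);
        ByteBuffer buffer = ByteBuffer.wrap(Base64.getDecoder().decode(encoded))
                .order(ByteOrder.LITTLE_ENDIAN);
        double[] values = new double[buffer.remaining() / Double.BYTES];
        buffer.asDoubleBuffer().get(values);
        return values;
    }

    /** Naive byte-pattern search starting at index 'from'; returns -1 if absent. */
    private static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) continue outer;
            }
            return i;
        }
        return -1;
    }
}
```

The point of the design is that the first pass stores only two integers per array, so a spectrum that is never requested never pays the cost of Base64 decoding.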

A working mzML file writer, which writes out indexed mzML files along with the checksum, can also be found in this package.
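To illustrate how a checksum can be computed while the file is being written, here is a minimal sketch using the JDK's `DigestOutputStream`. It assumes, as I understand the indexed mzML schema, that the checksum is a SHA-1 hash covering everything up to and including the opening `<fileChecksum>` tag. The class and method names are hypothetical and this is not the MSDK writer.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch, not part of MSDK.
final class ChecksumSketch {

    /** Writes 'body', then a fileChecksum element; returns the hex digest. */
    static String writeWithChecksum(OutputStream target, String body)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        DigestOutputStream out = new DigestOutputStream(target, sha1);

        // Everything written while the digest is on contributes to the hash.
        out.write(body.getBytes(StandardCharsets.UTF_8));
        out.write("<fileChecksum>".getBytes(StandardCharsets.UTF_8));

        // Stop hashing: the checksum text itself is not part of the hash
        // (assumption stated in the lead-in).
        out.on(false);
        String hex = toHex(sha1.digest());
        out.write((hex + "</fileChecksum>").getBytes(StandardCharsets.UTF_8));
        out.flush();
        return hex;
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}
```

Hashing in the output stream avoids a second pass over the finished file.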

The mzXML file parser

The old mzXML parser present in MSDK used a different parsing algorithm. A new parser, which reads the data as a stream, was written and added to the package.

The netCDF file parser and writer

Random access support was added to the existing netCDF parser, and a new netCDF writer was built and added to the package.

Others

As a part of this project, we had to make a few changes to the XML parsing and writing API itself to suit our needs. You can find the API here - Javolution MSFTBX.

The existing Javolution XML writing API cannot report the location in the file it is writing to. Since the current position must be tracked in order to store the indices of the scans, we added fields to the XML writer that track the location. This enabled us to support exporting indexed mzML files.
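The general technique of tracking the write position can be sketched with a plain byte-counting stream wrapper. This is only an illustration of the idea and not the change we made to the Javolution writer. All names are hypothetical, and the XML written here is a simplified stand-in for real mzML.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the Javolution or MSDK code.
final class OffsetIndexSketch {

    /** Output stream wrapper that counts the bytes written so far. */
    static final class CountingOutputStream extends FilterOutputStream {
        private long position;

        CountingOutputStream(OutputStream out) {
            super(out);
        }

        @Override
        public void write(int b) throws IOException {
            out.write(b);
            position++;
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len);
            position += len;
        }

        long position() {
            return position;
        }
    }

    /** Writes one element per id and returns the byte offset where each starts. */
    static Map<String, Long> writeSpectra(OutputStream target, List<String> ids)
            throws IOException {
        CountingOutputStream out = new CountingOutputStream(target);
        Map<String, Long> offsets = new LinkedHashMap<>();
        out.write("<run>\n".getBytes(StandardCharsets.UTF_8));
        for (String id : ids) {
            // Note the position before writing the element it refers to.
            offsets.put(id, out.position());
            out.write(("<spectrum id=\"" + id + "\"/>\n").getBytes(StandardCharsets.UTF_8));
        }
        out.write("</run>\n".getBytes(StandardCharsets.UTF_8));
        out.flush();
        return offsets;
    }
}
```

The recorded offsets are what an index at the end of the file would list, so a reader can seek straight to a given scan.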

Testing

Apart from the test cases included in the main MSDK project, we set up a separate repository with large data sets to allow testing with a higher number of files. The data parsed by the new parsers was verified by comparing it with the actual values found in the file or with the results obtained by the old parser. The JUnit test coverage results are as follows -

Benchmark comparisons of the old mzML parser and the new mzML parser were conducted. The runtimes are those of the same JUnit tests, averaged over 10 runs. The results are as follows -

Test File	Runtime with old parser (ms)	Runtime with new parser (ms)
5peptideFT.mzml	361	97
MzMLFile_7.mzml (Compressed and uncompressed)	58	8
emptyScan.mzml	37	10
mzML_with_UV.mzml	274	198
RawCentriodCidWithMsLevelInRefParamGroup.mzml	225	44
tiny.pwiz.mzml	18	15
SRM.mzml	114	47
MzValues_Zlib+Numpress.mzml	77	46

Future Work

IO modules for the following formats -

imzML
mz5
mzDB

IO modules for the following feature list formats -

mzTab
featureXML
peakML

Multi-threading support for existing modules
Support for raw vendor formats
On-disk caching for better performance

Useful links

Commits to main repo during the project
Original Javolution library