Mass Spectrometry Development Kit (MSDK; http://msdk.github.io ) is a Java library of algorithms for processing mass spectrometry data. It provides a flexible data model with Java interfaces for mass-spectrometry related objects, including raw spectra, processed data sets, identifications, etc. The goal of MSDK is to integrate multiple existing algorithms that are currently scattered around various Java-based graphical mass spectrometry tools (MZmine, Maltcms, MassCascade, Guineu, mzMatch, OpenChrom, mzJava, and others).
There are many file formats in use for mass spectrometry data. For algorithms to interoperate in Java-based processing workflows, there was an urgent need for a high-performance Java library that reads and writes these common formats. Several projects have developed custom parsers for various formats in the past, but none of the existing libraries provides high-performance random access to the data.
As part of my Google Summer of Code 2017 project, and with the help of my mentors Tomáš Pluskal, Dmitry Avtonomov and Adam Tenderholt, I built native Java parsers that can read mzML, mzXML and netCDF files using random access. We also added functionality to the API for writing these formats, which makes it possible to convert data from one format to another.
The pull requests opened during the project, together with the discussion that took place before they were merged, can be found here - Pull Requests opened during the project
MSDK's mzML IO package, which existed before the start of this project, used a library called jmzML to parse the data. However, it could only read the data sequentially and was very slow.
We started working on a new mzML parser to address the drawbacks of the existing one. The new parser converts the given input mzML data into a stream. The metadata is parsed first and, alongside it, the positions of the intensity, m/z and retention time arrays are noted. The array data is then parsed on demand. All of these operations are done without converting the stream, which saves computation time and memory.
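The on-demand step can be illustrated with a minimal sketch. It is not the MSDK implementation: the class name `LazyDoubleArray` and its fields are hypothetical, and it assumes the array is stored as uncompressed, base64-encoded, little-endian 64-bit floats whose byte offset and length were noted during the first pass.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Base64;

/** Hypothetical holder for the position of one base64-encoded array in a file. */
class LazyDoubleArray {
  private final String path;
  private final long offset;  // byte offset of the base64 text, noted during the first pass
  private final int length;   // number of base64 bytes
  private double[] cached;    // decoded values, filled on first access

  LazyDoubleArray(String path, long offset, int length) {
    this.path = path;
    this.offset = offset;
    this.length = length;
  }

  /** Seeks to the recorded position and decodes the array the first time it is requested. */
  double[] get() throws IOException {
    if (cached == null) {
      byte[] encoded = new byte[length];
      try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
        raf.seek(offset);
        raf.readFully(encoded);
      }
      // Assumption: uncompressed, little-endian, 64-bit values.
      byte[] raw = Base64.getDecoder().decode(encoded);
      ByteBuffer buffer = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
      cached = new double[raw.length / Double.BYTES];
      buffer.asDoubleBuffer().get(cached);
    }
    return cached;
  }
}
```

Only the offset and length are kept in memory until `get()` is called, so a file with many scans can be opened without decoding any of its arrays up front.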
A working mzML file writer, which writes indexed mzML files along with the checksum, can also be found in this package.
The old mzXML parser in MSDK used a different parsing algorithm. A new parser, which reads the data as a stream, was added to the package.
Random access support was added to the existing netCDF parser, and a new netCDF writer was built and added to the package.
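One way to compute a checksum while writing is to wrap the output in a `java.security.DigestOutputStream`. The sketch below is an illustration, not the MSDK writer: the class `ChecksumSketch` and its methods are hypothetical, and it assumes a SHA-1 digest that covers everything written up to and including the opening `<fileChecksum>` tag.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper that appends a SHA-1 checksum element to written content. */
class ChecksumSketch {

  /** Formats a digest as lowercase hexadecimal. */
  static String toHex(byte[] digest) {
    StringBuilder sb = new StringBuilder(digest.length * 2);
    for (byte b : digest) {
      sb.append(String.format("%02x", b & 0xff));
    }
    return sb.toString();
  }

  /** Writes the body, then the checksum of everything up to the opening tag. */
  static String writeWithChecksum(OutputStream target, String body)
      throws IOException, NoSuchAlgorithmException {
    MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
    DigestOutputStream out = new DigestOutputStream(target, sha1);
    out.write((body + "<fileChecksum>").getBytes(StandardCharsets.UTF_8));
    out.on(false);  // stop digesting: the checksum value itself is not covered
    String hex = toHex(sha1.digest());
    out.write((hex + "</fileChecksum>\n").getBytes(StandardCharsets.UTF_8));
    out.flush();
    return hex;
  }
}
```

Digesting while writing avoids a second pass over the output file to compute the checksum.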
As a part of this project, we had to make a few changes to the XML parsing and writing API itself to suit our needs. You can find the API here - Javolution MSFTBX.
The existing Javolution XML writing API cannot report its current location in the file it is writing. Because the current position must be tracked in order to store the indices of the scans, we added fields to the XML writer that track the location, which enabled us to support exporting indexed mzML files.
Apart from the test cases included in the main MSDK project, we set up a separate repository with large data sets to allow testing with a higher number of files. The data parsed by the new parsers was verified by comparing it with the actual values found in the file or with the results obtained by the old parser. The JUnit test coverage results are as follows -
Benchmark comparisons of the old mzML parser against the new mzML parser were conducted. The runtimes are those of the same JUnit tests, averaged over 10 runs. The results are as follows -
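The idea of tracking the write position can be sketched with plain `java.io` classes. This does not use the Javolution API or reproduce the fields we added to it. The names `CountingOutputStream` and `IndexedWriterSketch` are hypothetical, and the `<spectrum>` element is simplified for illustration.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stream wrapper that counts the bytes written so far. */
class CountingOutputStream extends FilterOutputStream {
  private long position = 0;

  CountingOutputStream(OutputStream out) {
    super(out);
  }

  @Override
  public void write(int b) throws IOException {
    out.write(b);
    position++;
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    out.write(b, off, len);
    position += len;
  }

  long getPosition() {
    return position;
  }
}

/** Hypothetical writer that records the byte offset at which each scan starts. */
class IndexedWriterSketch {
  static Map<String, Long> writeScans(OutputStream target, Map<String, String> scans)
      throws IOException {
    CountingOutputStream out = new CountingOutputStream(target);
    Map<String, Long> offsets = new LinkedHashMap<>();
    for (Map.Entry<String, String> scan : scans.entrySet()) {
      // Note the position before writing the element: this is the index entry.
      offsets.put(scan.getKey(), out.getPosition());
      String element =
          "<spectrum id=\"" + scan.getKey() + "\">" + scan.getValue() + "</spectrum>\n";
      out.write(element.getBytes(StandardCharsets.UTF_8));
    }
    out.flush();
    return offsets;
  }
}
```

The returned offsets are what an index section at the end of the file would list, so that a reader can later seek straight to a given scan.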
|Test File|Runtime with old parser (ms)|Runtime with new parser (ms)|
|---|---|---|
|MzMLFile_7.mzml (Compressed and uncompressed)|58|8|
- IO modules for the following formats -
- IO modules for the following feature list formats -
- Multi-threading support for existing modules
- Support for raw vendor formats
- On-disk caching for better performance