We've spoken a lot at Science: Disrupt about what science as an industry could do to improve, and these discussions typically revolve around three themes: openness, accessibility and effective communication. We chatted to OpenML, an organisation attempting to bridge the gap between experts and casual users by making state-of-the-art machine learning processes easily accessible and shareable.
OpenML was born out of Joaquin Vanschoren's (Assistant Professor of Machine Learning at TU Eindhoven) desire to share the experiments he was running during his PhD (and, as we are finding, dissatisfaction with how research is disseminated is a powerful motivator to become a founder). Shifting such experiments away from the unactionable 'numbers in a table' style was paramount. The development of OpenML has been driven by volunteer work, taking place at workshops and in the spare time of a number of researchers. Heidi Seibold (a Computational Biostatistics PhD student at the University of Zurich), for example, wanted to get involved after hearing about an OpenML workshop (you can check out her perspectives on OpenML here).
There are a number of code repositories widely used in research; what sets OpenML apart (beyond the obvious focus on machine learning) is the linking of repositories and their respective algorithms and datasets. Data no longer lives on inaccessible islands, and maximising accessibility has clearly been a focus during the build. With OpenML, the user can not only access any and all datasets on the platform but also the algorithms that run on that data.
Many in the machine learning research community are eager to offer up their data and algorithms online, which is of course a great symbolic gesture in the context of pushing Open Science and reproducibility forwards. However, there's a difference between reproducibility in principle and in practice, as Bernd Bischl (Professor of Computational Statistics at LMU Munich) pointed out. In principle it's all well and good having a paper or lab microsite that outlines an algorithm in detail; in practice, rebuilding and deploying the program from that description is an unnecessary time sink for the PhD researcher assigned to the project. It is the illusion of Open Access: an admirable attempt, but something that can be improved upon, and this is where OpenML steps in. The platform would also be essentially impenetrable if every dataset on the site had its own formatting, but OpenML takes care of the pre-processing upon upload – as such, all data is machine readable, meaning the user can actually, well, use it.
So who is OpenML targeting? The core users at the moment are machine learning developers, those testing the algorithms, but as Heidi was keen to point out, it's the domain specialists they're eager to grab. After all, they are the ones with the interesting data and the big questions. Machine learning is already regarded as a powerful utility in a researcher's arsenal, but awareness of how to pre-process datasets and which algorithms are appropriate is the rate-limiting step for the uptake of machine learning in fields that could flourish with these new techniques. Bernd noted that machine learning as a field is maturing, from something that was deemed too mathsy to a field that is generating huge interest within companies and the public. OpenML is designed to help guide those who want to be involved but lack a knowledge bank of complex statistical methods. Joaquin's goal with OpenML is that those without specialised knowledge in machine learning can retrieve a dataset and, with a line of code, run many algorithms on that data, with the results appearing online for the ML community to weigh in on. The team are looking to emulate the success that simple initiatives such as Tim Gowers' Polymath Project have had in fostering collaboration in traditionally isolated and complex domains of science.
Ok, so researchers typically unfamiliar with these techniques can now easily deploy complex routines on their datasets, with rapid feedback from machine learning experts on the platform. But one thing that can hold back research transparency is the protection of researchers' IP, and as such, a paper trail back to the researcher must be available. OpenML tackles this challenge in a number of ways. All data and all results made available on OpenML have an attribution included within the upload process, and authors can select the licence they desire. Even objects that aren't traditionally citable can be collected as part of an 'online study' that is assigned an ID, with a citation autogenerated when writing a paper.
The future looks bright for OpenML, and with the Dutch Data Prize under their belts, it's clear they're finally building a level of recognition appropriate to a venture such as this. We're excited to see how OpenML develops and would encourage any researcher who has wondered about the kind of impact machine learning could have on their research to give the platform a shot!