ArXiv Preprint
The need for data privacy and security -- enforced through increasingly
strict data protection regulations -- renders the use of healthcare data for
machine learning difficult. In particular, the transfer of data between
different hospitals is often not permissible, and thus cross-site pooling of
data is not an option. The Personal Health Train (PHT) paradigm proposed within
the GO-FAIR initiative implements an 'algorithm to the data' approach that
ensures that distributed data can be accessed for analysis without transferring
any sensitive data. We present PHT-meDIC, an open-source implementation of the
PHT concept that is deployed in production. Containerization allows us to easily deploy
even complex data analysis pipelines (e.g., genomics, image analysis) across
multiple sites in a secure and scalable manner. We discuss the underlying
technological concepts, security models, and governance processes. The
implementation has been successfully applied to distributed analyses of
large-scale data, including applications of deep neural networks to medical
image data.
Marius de Arruda Botelho Herr, Michael Graf, Peter Placzek, Florian König, Felix Bötte, Tyra Stickel, David Hieber, Lukas Zimmermann, Michael Slupina, Christopher Mohr, Stephanie Biergans, Mete Akgün, Nico Pfeifer, Oliver Kohlbacher
2022-12-07