The Medical Data Works Railway service implements a privacy-preserving federated infrastructure as proposed
by the Personal Health Train. The main principle of the Personal Health Train is to bring questions to the data rather than moving data.
This concept is called Federated Learning.
In this section, some Frequently Asked Questions are answered.
The Personal Health Train metaphor defines the following components:
Personal Health Train on Vimeo.
The hardware requirements can be found under IT requirements.
All Trains are based on Docker. Inside the Train (a Docker image) is an algorithm that needs the data.
The basic requirement is therefore that the Station has to tell the Train where the data is.
This is typically done by providing the Train Docker container with the information like
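As a minimal sketch of this idea, the Python snippet below shows an algorithm inside a Train looking up where the data is before reading it. The environment variable name `DATABASE_URI` and the assumption that the data is a CSV file are hypothetical choices made for this illustration; the actual mechanism depends on the infrastructure and on the project.

```python
import csv
import os


def load_rows(environ=os.environ):
    """Read the CSV file whose location the Station passed to the container."""
    # DATABASE_URI is a hypothetical variable name chosen for this sketch.
    path = environ.get("DATABASE_URI")
    if path is None:
        raise RuntimeError("The Station did not tell the Train where the data is")
    with open(path, newline="") as handle:
        # Each row becomes a dict keyed by the column names in the header line.
        return list(csv.DictReader(handle))
```

The point of the sketch is that the Train carries no knowledge of the Station's file system: it only works if the Station supplies the location at run time.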
In general, no. Most projects we support use a dump from the database (e.g. CSV or RDF) rather than setting up a SQL query.
See also the answer here. This does not mean it is impossible, but the station will need the SQL database connection URI,
so that this can later be shared with the Train.
We have tested such a scenario and it works. The choice is ultimately up to the consortium/project.
The Station will communicate its results back to the server using the HTTPS protocol. The results returned by the algorithm should be JSON serializable in the case of vantage6.
Note that for some applications (such as federated deep learning) the JSON may contain many model parameters and can thus be quite big.
For this reason a stable and reasonably fast internet connection is desirable.
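To give a feeling for the difference in size, the following sketch serializes two results to JSON and measures them. Both results are invented for illustration: a small summary statistic, and a hypothetical model with one million weights. The helper name `payload_size_bytes` is also an assumption of this sketch and not part of vantage6.

```python
import json


def payload_size_bytes(result):
    """Return the size in bytes of a result once serialized to JSON.

    json.dumps raises TypeError if the result is not JSON serializable.
    """
    return len(json.dumps(result).encode("utf-8"))


# A small aggregate statistic versus a hypothetical model with one million weights.
summary = {"count": 120, "mean_age": 63.4}
model = {"weights": [0.123456789] * 1_000_000}

print(payload_size_bytes(summary))  # a few dozen bytes
print(payload_size_bytes(model))    # on the order of ten megabytes
```

Running a check like this on a realistic result before starting a project gives a rough idea of how much each Station has to upload per round.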
This is technically possible but puts the burden on the Train provider to sort these differences out. So it is not advisable. See also the answer here.
Yes. The most important mechanism is trust between the Stations and the Trains. What the Train can and cannot do is laid down in a mandatory project agreement.
This limits the Train provider to the (research) question at hand.
Also each Train and Station provider signs an infrastructure user agreement with Medical Data Works explicitly forbidding the sharing of personal data on the Track.
Individual users are, per that agreement, bound to individual terms of use (see Terms)
that explicitly forbid this. If Medical Data Works becomes aware that personal data has been shared on the Track, it will report this as a
data breach, per the Infrastructure User Agreement. Note that we have never encountered a data breach.
Besides these legal and trust mechanisms, every Station can see in the log what algorithms have been executed. Node administrators are additionally encouraged to set strong policies for their nodes.
They can restrict which organizations or users are allowed to send Trains to their station.
More importantly, they can specify with different levels of granularity which Trains are allowed to run on their Station by their image name, including the registry, repository and tag.
For example, they could allow only a registry that is under the node administrator's sole control, where vetted or otherwise trusted algorithm images are stored.
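The matching idea behind such an image-name policy can be sketched in a few lines of Python. This is an illustration only and not the vantage6 configuration format: the registry `registry.example.org`, the repository names and the function `is_allowed` are all hypothetical.

```python
import re

# Hypothetical allow-list; registry and repository names are placeholders.
ALLOWED_IMAGE_PATTERNS = [
    # Any image in the vetted repository, but only with an explicit x.y.z tag.
    r"registry\.example\.org/vetted/[a-z0-9_-]+:[0-9]+\.[0-9]+\.[0-9]+",
    # Exactly one image at exactly one version.
    r"registry\.example\.org/project-x/average:1\.0\.0",
]


def is_allowed(image_name, patterns=ALLOWED_IMAGE_PATTERNS):
    """Return True if the full image name matches at least one allowed pattern."""
    return any(re.fullmatch(pattern, image_name) for pattern in patterns)


print(is_allowed("registry.example.org/vetted/kaplan-meier:2.1.0"))  # True
print(is_allowed("docker.io/someone/kaplan-meier:2.1.0"))            # False
```

Matching the whole name, rather than a prefix, is a deliberate choice in this sketch: it keeps an image from an untrusted registry from passing just because part of its name resembles a trusted one.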
Also there is the option to code review and subsequently make the Train immutable. This ensures that the Train does what it is supposed to do.
One might wonder why there are no additional technical measures on the Track, such as a limit on the volume of data being shared across the Track.
In earlier versions we had this limit, but it proved unworkable because (1) for some models (e.g. deep learning) the number of parameters is large
and thus the size of the file being shared on the Track is large, and (2) it gives a false sense of security, as the data volume of a single patient
may be small and below this threshold.
Another option to prevent this from happening is to use advanced cryptographic technologies such as homomorphic encryption (HE) or secure multi-party computation (sMPC).
Using this technology, the machine learning is aggregated on encrypted data,
or otherwise computed while keeping the partials private. This means that even if personal data is shared on the Track, it is
virtually impossible to decrypt and use for those that lack the key (HE).
Some versions of this might be possible to implement as algorithms with vantage6.
However, it comes with significant drawbacks in terms of performance
and is not feasible except for the simplest of questions. This is an active area of research.
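To illustrate what "keeping the partials private" can mean, the sketch below simulates additive secret sharing, one of the simplest building blocks of sMPC, for computing a sum. It is a single-process toy example under stated assumptions (honest parties, non-negative integer inputs smaller than the modulus) and not a vantage6 algorithm or a production protocol. The names `make_shares` and `secure_sum` are hypothetical.

```python
import secrets

PRIME = 2**61 - 1  # all arithmetic is done modulo this prime


def make_shares(value, n_parties):
    """Split an integer into n random shares that sum to value modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def secure_sum(private_values):
    """Simulate a sum in which no party reveals its own value directly."""
    n = len(private_values)
    # Row i holds the shares created by party i; column j is what party j receives.
    all_shares = [make_shares(v, n) for v in private_values]
    # Each party adds up the shares it received and publishes only that partial sum.
    partial_sums = [sum(row[j] for row in all_shares) % PRIME for j in range(n)]
    # Only the combination of all published partial sums yields the true total.
    return sum(partial_sums) % PRIME


print(secure_sum([12, 30, 8]))  # 50
```

Each individual share looks like a random number, so a single share says nothing about the underlying value. Even this simple scheme multiplies the number of messages exchanged between parties, which hints at why the approach becomes costly for more complex questions.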
Typically this is the project's choice, and it depends on whether or not the Trains and Stations trust each other.
As said in this answer, trust is still an important condition in federated learning.
If trust is lacking, the best approach is for the Stations to appoint someone to review the code of the Train.
After review of the code, it is possible to ensure that the Train uses that particular version of the code.
Someone checking the code should be versed in the data and the question at hand, and also understand the specific programming language
used in the Train. Medical Data Works has no a priori accepted role in this, as promising this code review would mean Medical Data Works staff
have to be experts in every research domain, question and programming language, which is infeasible.
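One general way to tie a Train to the exact version that was reviewed is to refer to its Docker image by content digest (`name@sha256:...`) instead of by a tag, since a tag can later be pointed at different code. The sketch below checks an image reference against a reviewed digest. Whether and how a given project enforces this is an assumption here, and the function name and image names are hypothetical.

```python
import re

# An image reference pinned by digest looks like: name@sha256:<64 hex characters>
DIGEST_RE = re.compile(r"^(?P<name>[^@]+)@(?P<digest>sha256:[0-9a-f]{64})$")


def is_reviewed_version(image_ref, reviewed_digest):
    """Return True only if image_ref is pinned to the digest that was reviewed."""
    match = DIGEST_RE.match(image_ref)
    # A reference by tag (e.g. ":latest") is rejected: it names no fixed content.
    return match is not None and match.group("digest") == reviewed_digest


reviewed = "sha256:" + "a" * 64  # placeholder digest recorded by the reviewer
print(is_reviewed_version("registry.example.org/vetted/average@" + reviewed, reviewed))  # True
print(is_reviewed_version("registry.example.org/vetted/average:latest", reviewed))       # False
```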
We are working on a library of trusted and certified Trains. These are Trains that have been vetted by the vantage6 community, including Medical Data Works,
and will be certified to be safe. In those Trains, the researcher only needs to configure certain parameters, but cannot change the code inside the Train.
This will make it easier for researchers to use federated learning and for node administrators or data owners to trust executions of such tasks on their Station.
The Station is able to read the (intermediate) results that are being shared about their Station by the Train. However, note that for some
machine learning algorithms such as deep learning this is a long file of parameters which may not make sense.
The software to be installed can be found here: IT requirements. The free version of Docker is sufficient.
Network requirements for a node
The node does not require open ports for incoming connections, nor does it need to be reachable from the Internet via a public IP. It establishes and maintains a WebSocket connection to the vantage6 server hosted by MDW. Note, however, that the server might only allow certain public IPs to connect to it, in which case the public IP ultimately used by the node to connect to the Internet (e.g. the router's IP) will need to be communicated to MDW.
To manage the node (start, stop, create config files, etc.) at the VM/host level, the vantage6 CLI component will need to be downloaded and installed, for example via pip from PyPI.
The vantage6 node itself is packaged in a Docker image. That image can be fetched from the official vantage6 Docker registry at harbor2.vantage6.ai; the vantage6 CLI component will do this for you by default.
When a request for the execution of an algorithm (task) comes in for the node, the algorithm is provided as a Docker image name. This name specifies the registry from where the algorithm image containing the code should be fetched.
Therefore, a firewall would, for example, need to allow the following outgoing connections to port 443 (HTTPS):
Domain | Description |
---|---|
example.medicaldataworks.nl | Vantage6 server hosted by MDW |
harbor2.vantage6.ai | Official vantage6 Docker image registry hosting node image |
pypi.org, files.pythonhosted.org | PyPI repositories hosting vantage6 CLI tool |
registry-1.docker.io, auth.docker.io, production.cloudflare.docker.com | DockerHub domains hosting algorithm, if the algorithm is hosted there |
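A quick way to check the firewall rules from the node host is to attempt an outgoing TCP connection to each domain on port 443. The sketch below does this with the Python standard library. The host list is taken from the table above, where `example.medicaldataworks.nl` is a placeholder for the server address of your project. The function names are hypothetical, and a successful TCP connection does not prove that TLS or a proxy in between is configured correctly.

```python
import socket

# Domains from the table above; the first entry is the placeholder used there.
REQUIRED_HOSTS = [
    "example.medicaldataworks.nl",
    "harbor2.vantage6.ai",
    "pypi.org",
    "files.pythonhosted.org",
    "registry-1.docker.io",
    "auth.docker.io",
    "production.cloudflare.docker.com",
]


def can_connect(host, port=443, timeout=5.0):
    """Return True if an outgoing TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


def check_hosts(hosts=REQUIRED_HOSTS, connect=can_connect):
    """Map each host to whether the outgoing connection succeeded."""
    return {host: connect(host) for host in hosts}
```

Calling `check_hosts()` on the node host returns a dictionary such as `{"pypi.org": True, ...}`; any entry that is `False` points to a domain that is blocked or otherwise unreachable from that machine.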