Railway FAQ

A Personal Health Train implementation


The Medical Data Works Railway service implements a privacy-preserving federated infrastructure as proposed by the Personal Health Train. The main principle of the Personal Health Train is to bring questions to the data rather than moving the data. This concept is called Federated Learning.

In this section, some Frequently Asked Questions are answered.

Definitions


The Personal Health Train metaphor defines the following components:

  • Station: An information system that contains sensitive data, such as from patients. Typically a hospital or registry provides a Station. In vantage6, this corresponds to a node hosted by an organization as part of a collaboration.
  • Train: Software that encodes a question to be asked to a Station and outputs anonymous, statistical data. Typically a university or a company provides a Train.
  • Track: Software and infrastructure that allow Trains to enter Stations and ask questions. Medical Data Works B.V. provides the Track.
Below is a video that explains the Personal Health Train.

Personal Health Train on Vimeo.


What are the hardware requirements for a Station?

The hardware requirements can be found here: IT requirements.

How does the Train get its data from the Station?

All Trains are based on Docker. Inside the Train (a Docker image) is an algorithm that needs the data. The basic requirement is therefore that the Station has to tell the Train where the data is.

This is typically done by providing the Train Docker container with information such as:

  • what the file location of the CSV file is (for simple clinical data) or
  • what the SPARQL endpoint URL is (for more complex data) or
  • what the URL is of the XNAT instance (for imaging data) or
  • the connection string to a relational database for SQL execution.
Note that the above (CSV, SPARQL, XNAT, SQL) are examples we have encountered; the Track does not prescribe how to do this. Typically a project decides for itself what the best data syntax is.
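For illustration, a vantage6 node declares these data locations in its configuration file. The sketch below shows what a `databases` section could look like; the labels, paths and endpoint are made up, and the exact schema should be checked against the node documentation for your vantage6 version.

```yaml
databases:
  # Simple clinical data as a CSV file on the Station's host
  - label: default
    uri: /data/project_x/datamart.csv
    type: csv
  # More complex data exposed through a SPARQL endpoint
  - label: rdf
    uri: https://sparql.station.example.org/repositories/project_x
    type: sparql
```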

It is best practice to choose one syntax per project. Otherwise the Train has to deal with many different ways of getting the data (e.g. reading a CSV file in Station 1 and performing a SPARQL query in Station 2), and somebody (the Train provider) has to program each of these connectors. However, it's good to keep in mind that vantage6 also offers an algorithm wrapper that facilitates reading data from the following sources: 'csv', 'parquet', 'sql', 'sparql', 'excel' and 'omop'. This can be useful for writing algorithms (Trains) more conveniently. For more information see: Vantage6 Wrapper documentation.
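As a sketch of what such a connector amounts to, the hypothetical helper below dispatches on the declared source type. It only implements the CSV case and mirrors the idea of the wrapper without reproducing vantage6's actual API:

```python
import csv

def load_data(uri: str, db_type: str):
    """Hypothetical connector: read Station data based on its declared type.

    The real vantage6 wrapper supports csv, parquet, sql, sparql, excel
    and omop; here only csv is implemented for illustration.
    """
    if db_type == "csv":
        with open(uri, newline="") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"unsupported database type: {db_type}")
```

With one syntax per project, the Train only ever exercises one branch of such a dispatch; with mixed syntaxes, every branch has to be written and tested.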
Note that the Track does not perform any tasks in this interaction between the Station and the Train.

It is possible to establish a direct database connection between the Train and the Station's database. But remember that the database may contain data elements and data subjects that the Train does not need. The GDPR principle of data minimization suggests creating a dataset that contains only the data elements needed to answer the Train's specific question, and this is what we see the vast majority of projects do. However, we have also supported projects that query a source database containing more information than the Train needs directly, so in the end it is up to the consortium/project.

As an example, suppose the Station has the data of cancer patients in an OMOP instance. If for a given question only specific data elements from lung cancer patients are required, it is advised to export from this OMOP database a separate OMOP instance or a CSV file (or RDF graph / SPARQL endpoint) with only the lung cancer patients and only those specific data elements. It is then the location of this "datamart" that is shared by the Station with the Train, rather than the location (and connector/driver) of the OMOP source database.
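A minimal sketch of building such a datamart, with SQLite standing in for the source database and an illustrative `patients` table whose column names are made up:

```python
import csv
import sqlite3

def export_datamart(conn: sqlite3.Connection, out_path: str) -> None:
    """Export only lung-cancer patients and only the needed columns
    (table and column names are illustrative) into a CSV 'datamart'."""
    rows = conn.execute(
        "SELECT patient_id, age, tumour_stage "
        "FROM patients WHERE diagnosis = 'lung cancer'"
    )
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["patient_id", "age", "tumour_stage"])
        writer.writerows(rows)
```

The resulting CSV contains only the lung cancer patients and only the three selected columns; its location is what the Station would then share with the Train.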

Does the Train execute SQL commands to the Station's database?

In general, no. Most projects we support use a dump from the database (e.g. CSV or RDF) rather than setting up SQL queries. See also the answer here. This does not mean it is impossible, but the Station will need to share the SQL database connection URI with the Train. We have tested such a scenario and it works. The choice is ultimately up to the consortium/project.

What is the recommended file format for the Train to communicate its (intermediate) results across the Track?

The Station communicates its results back to the server using the HTTPS protocol. In the case of vantage6, the results returned by the algorithm should be JSON serializable. Note that for some applications (such as federated deep learning) the JSON may contain many model parameters and can thus be quite large. For this reason a stable and reasonably fast internet connection is desirable.
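A small sketch of why the payload size differs so much between simple statistics and model parameters (the numbers are dummy values):

```python
import json

# A Station returning simple aggregate statistics produces a tiny payload.
small_result = {"n": 412, "mean_age": 63.7}
small_payload = json.dumps(small_result)

# A federated deep learning Station may return millions of model
# parameters (dummy zeros here), producing a payload of several megabytes.
large_result = {"weights": [0.0] * 1_000_000}
large_payload = json.dumps(large_result)
```

Both payloads are plain JSON and travel over the same HTTPS connection; only their size differs.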

Can different source file syntaxes be used in a given project? E.g. some Stations using a SQL database and others a CSV file?

This is technically possible but puts the burden on the Train provider to sort these differences out. So it is not advisable. See also the answer here.

Is there any mechanism that stops subject level data being shared from the Station to the Track?

Yes. The most important mechanism is trust between the Stations and the Trains. What the Train can and cannot do is laid down in a mandatory project agreement. This limits the Train provider to the (research) question at hand.

Also, each Train and Station provider signs an infrastructure user agreement with Medical Data Works explicitly forbidding the sharing of personal data on the Track. Individual users are, per that agreement, bound to individual terms of use (see Terms) that explicitly forbid this. If Medical Data Works becomes aware that personal data has been shared on the Track, it will report this as a data breach, per the Infrastructure User Agreement. Note that we have never encountered a data breach.

Besides these legal and trust mechanisms, every Station can see in the log which algorithms have been executed. Node administrators are additionally encouraged to set strong policies for their nodes. They can restrict which organizations or users are allowed to send Trains to their Station. More importantly, they can specify, with different levels of granularity, which Trains are allowed to run on their Station by image name, including the registry, repository and tag. They could, for example, allow only a registry uniquely under the node administrator's control, where vetted or otherwise trusted algorithm images are stored. There is also the option to code-review a Train and subsequently make it immutable, which ensures that the Train does what it is supposed to do.
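As an illustration, such policies are set in the node configuration file. The sketch below uses key names from recent vantage6 node documentation, but the names and values are illustrative and should be verified against your vantage6 version:

```yaml
policies:
  # Only run algorithm images matching these patterns, e.g. a registry
  # under the node administrator's control
  allowed_algorithms:
    - ^harbor2\.vantage6\.ai/algorithms/.*
  # Only accept tasks from these organizations and users (illustrative names)
  allowed_organizations:
    - my-university
  allowed_users:
    - trusted-researcher
```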

One might wonder why there are no additional technical measures on the Track, such as a limit on the volume of data being shared across it. Earlier versions had such a limit, but it proved unworkable because (1) for some models (e.g. deep learning) the number of parameters is large, and thus so is the file shared on the Track, and (2) it gives a false sense of security, as the data volume of a single patient may be small and fall below the threshold.

Another option to prevent this from happening is using advanced cryptographic technologies such as homomorphic encryption (HE) or secure multi-party computation (sMPC). With these technologies, the machine learning is aggregated on encrypted data, or otherwise computed while keeping the partial results private. This means that even if personal data were shared on the Track, it would be virtually impossible to decrypt and use for anyone lacking the key (HE). Some versions of this might be implementable as algorithms with vantage6. However, it comes with significant performance drawbacks and is not feasible except for the simplest of questions. This is an active area of research.

Is the software code / script of the Train and/or result shared on the Track checked by someone from the Station? If so, what skillset/role should they have?

Typically this is a choice of the project and whether or not the Trains and Stations trust each other. As said in this answer, trust is still an important condition in federated learning.

If trust is lacking, the best approach is for the Stations to appoint someone to review the code of the Train. After review, the Train can be pinned to that particular version of the code.

Someone checking the code should be versed in the data and the question at hand, and also understand the specific programming language used in the Train. Medical Data Works has no a priori accepted role in this, as promising this code review would mean Medical Data Works staff would have to be experts in every research domain, question and programming language, which is infeasible.

We are working on a library of trusted and certified Trains. These are Trains that have been vetted by the vantage6 community, including Medical Data Works, and certified to be safe. In such Trains, the researcher only needs to configure certain parameters but cannot change the code inside the Train. This will make it easier for researchers to use federated learning, and for node administrators or data owners to trust the execution of such tasks on their Station.

The Station is able to read the (intermediate) results that the Train shares about it. However, note that for some machine learning algorithms, such as deep learning, this is a long list of parameters that may not be meaningful to a human reader.

Which software does each center need to download to use vantage6? E.g. do you recommend Docker Pro?

The software to be installed can be found here: IT requirements. The free version of Docker is sufficient.

What are some typical network requirements for a Node (station)?

Network requirements for a node

The node does not require open ports for incoming connections, nor does it need to be reachable from the Internet via a public IP. It establishes and maintains a WebSockets connection to the vantage6 server hosted by MDW. Note however that the server might only allow certain public IPs to connect to it, in which case the public IP ultimately used by the node to connect to the Internet (e.g. the router's IP) will need to be communicated to MDW.

To manage the node (start, stop, create config files, etc.) at the VM/host level, the vantage6 CLI component will need to be downloaded and installed, for example via pip from PyPI.

The vantage6 node itself is packaged as a Docker image that can be fetched from the official vantage6 Docker registry at harbor2.vantage6.ai; the vantage6 CLI component will do this for you by default.

When a request for the execution of an algorithm (a task) comes in for the node, the algorithm is provided as a Docker image name. This name specifies the registry from which the algorithm image containing the code should be fetched.
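Docker's own naming rules determine which registry an image name points to: the first path component is treated as a registry only when it looks like a hostname. The small helper below (not part of vantage6, purely illustrative) makes this concrete:

```python
def registry_of(image: str) -> str:
    """Return the registry implied by a Docker image reference.

    Docker treats the first path component as a registry only if it
    contains a '.' or ':' (or is 'localhost'); otherwise Docker Hub
    (registry-1.docker.io) is implied.
    """
    first = image.split("/", 1)[0]
    if "." in first or ":" in first or first == "localhost":
        return first
    return "registry-1.docker.io"
```

For example, `harbor2.vantage6.ai/algorithms/average:latest` resolves to the harbor2 registry, while a bare `myorg/myalgo:1.0` implies Docker Hub.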


Therefore, a firewall would, for example, need to allow the following outgoing connections to port 443 (HTTPS):

  • example.medicaldataworks.nl: vantage6 server hosted by MDW
  • harbor2.vantage6.ai: official vantage6 Docker image registry hosting the node image
  • pypi.org, files.pythonhosted.org: PyPI repositories hosting the vantage6 CLI tool
  • registry-1.docker.io, auth.docker.io, production.cloudflare.docker.com: Docker Hub domains hosting the algorithm, if the algorithm is hosted there
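A quick, illustrative way to verify that the firewall allows these outgoing connections is to attempt a TCP connection to port 443 for each domain (the helper below is not part of vantage6):

```python
import socket

def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if an outbound TCP connection to host:port succeeds,
    e.g. to check that the firewall allows the domains listed above."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running `can_connect("harbor2.vantage6.ai")` from the node's network, for instance, should return True if the firewall rules are in place.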

This is in addition to any other requirements (e.g. APT repositories for OS updates, SSH access).