Strengths and limitations of non-disclosive data analysis: a comparison of breast cancer survival classifiers using VisualSHIELD

VisualSHIELD is an open-source, extensible web interface. It provides both a standardized deployment of the DataSHIELD infrastructure and a graphical user interface to dsSwissKnife and other R packages, simplifying the definition of analysis workflows and the visualization of results.

It is implemented as a Shiny module, a graphical R package that can be embedded into any user-defined Shiny app to provide federated analysis capabilities. Its open-source architecture makes it extensible and provides a clear framework for adding user-defined federated analyses.

The tool provides a simple graphical user interface that integrates DataSHIELD analysis methods such as

  • histograms
  • contour plots
  • heatmaps (Figure 1)
  • boxplots
  • correlation matrices

Figure 1

A novel interactive linear regression functionality was implemented in VisualSHIELD (Figure 2) by augmenting the DataSHIELD GLM functionality with statistics that DataSHIELD does not provide, such as R², adjusted R², and F-score.
Variables can also be converted automatically by appending the target type to the variable name, separated by a ‘#’ sign.

Example: IMP3#num
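
The naming convention above can be illustrated with a minimal Python sketch of how a `name#type` specification might be split. This is not VisualSHIELD's actual code: the function name `parse_variable_spec` and the set `KNOWN_TYPES` are hypothetical, and only the ‘#’ separator and the `num` suffix come from the text.

```python
# Hypothetical parser for "name#type" variable specifications.
# Only the "#" separator and the "num" suffix appear in the text;
# the other type names listed here are assumptions for illustration.
KNOWN_TYPES = {"num", "factor", "char"}


def parse_variable_spec(spec):
    """Split 'IMP3#num' into ('IMP3', 'num').

    Returns (name, None) when no target type is given.
    """
    name, sep, target = spec.partition("#")
    name = name.strip()
    target = target.strip()
    if not name:
        raise ValueError(f"missing variable name in {spec!r}")
    if not sep:
        # No '#' present: leave the variable type unchanged.
        return name, None
    if target not in KNOWN_TYPES:
        raise ValueError(f"unknown target type {target!r} in {spec!r}")
    return name, target


name, target = parse_variable_spec("IMP3#num")
print(name, target)
```

A specification without a ‘#’ is returned with no target type, so a caller could leave such variables unconverted.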

Figure 2
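
For readers unfamiliar with the added regression statistics, the following Python sketch shows the textbook definitions of R², adjusted R², and the overall F-statistic (assumed here to be what the text calls F-score). It works on plain local lists and does not reproduce how VisualSHIELD derives these values from federated DataSHIELD output. The function name `regression_summary` is hypothetical.

```python
import math


def regression_summary(y, y_hat, n_predictors):
    """Return (R2, adjusted R2, F) for a least-squares fit with an intercept.

    y            observed responses
    y_hat        fitted values from the model
    n_predictors number of predictors, excluding the intercept
    """
    n = len(y)
    p = n_predictors
    dof = n - p - 1  # residual degrees of freedom
    if p < 1 or dof <= 0:
        raise ValueError("need at least one predictor and n > p + 1")

    mean_y = sum(y) / n
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    if ss_tot == 0:
        raise ValueError("response is constant; R2 is undefined")

    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / dof
    # Explained sum of squares equals ss_tot - ss_res for OLS with an intercept.
    f_stat = math.inf if ss_res == 0 else ((ss_tot - ss_res) / p) / (ss_res / dof)
    return r2, adj_r2, f_stat


# Fitted values of the least-squares line y = 2.2 + 0.6 x for x = 1..5.
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
print(regression_summary(y, y_hat, n_predictors=1))
```

Adjusted R² penalizes R² for the number of predictors, which makes it more suitable for comparing models of different sizes.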

Further, VisualSHIELD integrates dsSwissKnife analysis methods such as

  • K-nearest neighbors
  • principal component analysis
  • random forest

and a custom feature selection method (Figure 3).

Figure 3

We used VisualSHIELD to compare traditional machine learning methods with equivalent methods implemented within DataSHIELD. Specifically, we trained the methods using unrestricted data access, compared the resulting classifiers with those obtained in DataSHIELD, and found that

  • the classifiers under consideration do not generalize well when applied to unseen data
  • logistic regression performed best on average, closely followed by random forest
  • some classifiers cannot be trained because they are disclosive of individual-level data
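
To make the notion of a non-disclosive analysis concrete, the following Python sketch shows one way a federated mean could be pooled from per-site aggregates without any individual record leaving a site. It is an illustration under stated assumptions, not DataSHIELD's implementation: the names `site_summary` and `pooled_mean` and the threshold `MIN_COUNT` are hypothetical.

```python
MIN_COUNT = 3  # hypothetical disclosure threshold, chosen only for illustration


def site_summary(values, min_count=MIN_COUNT):
    """Run at a data site: release only (sum, count), never the rows.

    Returns None when the site holds too few records, because an
    aggregate over a tiny group could reveal individual values.
    """
    if len(values) < min_count:
        return None
    return sum(values), len(values)


def pooled_mean(summaries):
    """Run by the analyst: combine the aggregates released by the sites."""
    usable = [s for s in summaries if s is not None]
    if not usable:
        raise ValueError("no site released a summary")
    total = sum(s for s, _ in usable)
    count = sum(n for _, n in usable)
    return total / count


sites = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0], [100.0]]
summaries = [site_summary(v) for v in sites]
print(pooled_mean(summaries))  # the single-record site is excluded
```

A method whose fitted model must retain individual records cannot be reduced to aggregates in this way, which is one general reason a method may be considered disclosive.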

Based on our results, we conclude that the smaller choice of models trainable in a privacy-preserving environment has an acceptably low impact on performance, ideally compensated by the larger selection of federated datasets that researchers may gain access to.

If you are going to cite or use this software, please use DOI


Tomasoni D, Lombardo R and Lauria M (2024) Strengths and limitations of non-disclosive data analysis: a comparison of breast cancer survival classifiers using VisualSHIELD. Front. Genet. 15. doi: 10.3389/fgene.2024.1270387