
Personalized Privacy-Preserving Framework for Cross-Silo Federated Learning

IEEE Transactions on Emerging Topics in Computing · 2024
VinUni-Illinois Smart Health Center (VISHC)  ·  VinUniversity, Hanoi, Vietnam
Figure 1. Overall architecture of the PPPFL framework, consisting of three consecutive stages. First, each participant locally synthesizes a DP-guaranteed dataset D* by training its own DP data generator on its secret dataset D. Next, during federated training, the synthetic data are used to update the global model parameters θ. Lastly, after the federated training rounds, each client updates its local model parameters θ* based on its secret training data.

Abstract

Federated learning (FL) has recently emerged as a promising decentralized deep learning (DL) framework that enables models to be trained collaboratively across clients without sharing private data. However, when the central party is active and dishonest, the data of individual clients can be perfectly reconstructed, creating a high risk of sensitive information being leaked. Moreover, FL also suffers from non-independent and identically distributed (non-IID) data among clients, which degrades inference performance on local clients' data.

In this paper, we propose a novel framework, Personalized Privacy-Preserving Federated Learning (PPPFL), with a focus on cross-silo FL to overcome these challenges. Specifically, we introduce a stabilized variant of the Model-Agnostic Meta-Learning (MAML) algorithm to collaboratively train a global initialization from clients' synthetic data generated by Differentially Private Generative Adversarial Networks (DP-GANs). After convergence, each client locally adapts the global initialization to its private data.

Motivation

Cross-silo federated learning faces two critical, intertwined challenges that existing methods address only in isolation:

Challenge 1

Privacy Leakage from Dishonest Servers

In a semi-honest scenario, private information can be leaked through gradient inversion attacks on the updated gradients. In an active-and-dishonest scenario, the server can fully eavesdrop on and modify the shared global model, planting trap weights that recover clients' data with zero reconstruction loss.

Challenge 2

Non-IID Data Distribution

In cross-silo FL settings, clients (companies, institutions, hospitals) possess vastly different sizes and partitions of private data. This statistical heterogeneity leads to a decline in the performance of the aggregated global model, especially when a single model must serve all clients.

Key Insight: PPPFL is the first framework to integrate two critical FL research areas — privacy preservation and handling non-IID data — into a single cohesive solution, providing differential privacy guarantees while achieving improved convergence and personalized performance.

PPPFL Framework

The proposed framework involves three consecutive stages:

Stage 1

Local Data Generation

Each client trains a differentially private data generator (based on DataLens) to synthesize artificial datasets. This ensures instance-level DP, providing stronger privacy guarantees than user-level DP. The original private data never leaves the client.

Stage 2

Federated Training

Clients use synthetic data for collaborative training via a stabilized MAML variant. The server meta-learns a generalized initialization θ from clients' feedback signals on synthetic data. Model EMA and Cosine Annealing Learning Rate prevent gradient instability.

Stage 3

Local Adaptation

After collaborative training, each client performs gradient descent updates on the well-initialized model using their secret private data, producing a personalized model θ* tailored to their local distribution.
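The three stages above can be sketched end-to-end with toy stand-ins. This is our illustration only, not the paper's implementation: a real system would plug in a DP-GAN for Stage 1, the stabilized MAML loop for Stage 2, and gradient-based fine-tuning of a real model for Stage 3. All function names and the scalar "model" are assumptions for exposition.

```python
import random

def dp_synthesize(secret, noise=0.5, rng=None):
    """Stage 1 stand-in: a noisy 'synthetic' view of the secret data
    (a DP-GAN such as DataLens would be used in practice)."""
    rng = rng or random.Random(0)
    return [x + rng.gauss(0.0, noise) for x in secret]

def federated_round(theta, synthetic_sets, lr=0.3):
    """Stage 2 stand-in: one round of moving the global model theta
    toward the average of the clients' synthetic-data objectives."""
    grads = [theta - sum(s) / len(s) for s in synthetic_sets]
    return theta - lr * sum(grads) / len(grads)

def local_adapt(theta, secret, lr=0.5, steps=10):
    """Stage 3: fine-tune the global initialization on the client's
    secret data, yielding the personalized parameters theta*."""
    for _ in range(steps):
        theta -= lr * (theta - sum(secret) / len(secret))
    return theta
```

Here the "model" is a single scalar fitted to each client's data mean; the point is the data flow D → D* → θ → θ*, with the secret dataset D never leaving the client.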

Key Technical Innovations

Innovation 1

DataLens Integration for Instance-level DP

We consolidate DataLens — a state-of-the-art DP data generative model based on the PATE framework — into FL. DataLens uses multiple teacher discriminators with top-k gradient compression, stochastic sign quantization, and calibrated Gaussian noise to generate high-utility synthetic data while maintaining rigorous privacy guarantees.
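A minimal NumPy sketch of the gradient-sanitization step this paragraph describes, under our simplified reading of DataLens: each teacher's gradient is compressed to its top-k signed coordinates, the teachers' sign votes are aggregated with calibrated Gaussian noise, and only the sign of the noisy tally is released. The real mechanism also performs the privacy accounting that calibrates σ; function and parameter names here are ours.

```python
import numpy as np

def topk_sign(grad, k):
    """Per-teacher compression (sketch): keep the k largest-magnitude
    coordinates, reduce them to their signs, and zero out the rest."""
    flat = np.asarray(grad, dtype=float).ravel()
    out = np.zeros_like(flat)
    idx = np.argsort(np.abs(flat))[-k:]
    out[idx] = np.sign(flat[idx])
    return out

def dp_aggregate(teacher_grads, k, sigma, rng=None):
    """Tally the teachers' sign votes, add calibrated Gaussian noise
    for the DP guarantee, and release only the sign of the result."""
    rng = rng or np.random.default_rng(0)
    votes = np.sum([topk_sign(g, k) for g in teacher_grads], axis=0)
    noisy = votes + rng.normal(0.0, sigma, size=votes.shape)
    return np.sign(noisy)
```

Because only a noisy, quantized aggregate of many teachers is ever released, no single training example dominates the published gradient direction.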

Innovation 2

Stabilized MAML for Federated Training

The original MAML optimization is unstable due to multiple differentiations through the model. We propose a modified variant that employs Model Exponential Moving Average (EMA) in the inner loop and Cosine Annealing Learning Rate scheduling in the outer loop, enabling fast and stable convergence.
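A toy first-order sketch of this stabilized loop, using quadratic stand-in losses L_i(θ) = ||θ − c_i||² in place of the clients' synthetic-data objectives. The quadratic losses, the first-order meta-gradient, and where exactly the EMA is applied are our illustrative assumptions, not the paper's exact algorithm.

```python
import math
import numpy as np

def cosine_lr(lr0, t, T):
    """Cosine-annealed outer-loop learning rate."""
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * t / T))

def stabilized_maml(targets, rounds=30, inner_lr=0.1,
                    outer_lr0=0.5, ema_decay=0.9):
    """Meta-learn an initialization for toy losses L_i(θ) = ||θ - c_i||²."""
    theta = np.zeros_like(targets[0])
    ema = np.zeros_like(targets[0])
    for t in range(rounds):
        meta_grad = np.zeros_like(theta)
        for c in targets:
            adapted = theta - inner_lr * 2.0 * (theta - c)  # inner GD step
            meta_grad += 2.0 * (adapted - c)                # first-order meta-grad
        theta = theta - cosine_lr(outer_lr0, t, rounds) * meta_grad / len(targets)
        ema = ema_decay * ema + (1.0 - ema_decay) * theta   # smoothed global model
    return ema
```

The cosine schedule shrinks the outer step as training proceeds, and the EMA damps round-to-round oscillation in the broadcast weights, which is the stabilization role the two components play in the framework.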

Innovation 3

Personalized Privacy Budgets

Clients can independently select their preferred privacy budget ε and incorporate it into the DP-GAN training. The server integrates clients' privacy budgets via a softmax-weighted objective, improving model performance while respecting individual privacy requirements.
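A sketch of softmax-weighted aggregation over per-client privacy budgets. The direction of the weighting (larger ε, i.e. looser privacy and typically higher-utility synthetic data, receives more weight) and the temperature parameter are our assumptions; the paper's exact objective may differ.

```python
import numpy as np

def budget_weights(epsilons, temperature=1.0):
    """Softmax weights over clients' privacy budgets ε (sketch)."""
    z = np.asarray(epsilons, dtype=float) / temperature
    z = z - z.max()              # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def aggregate(updates, epsilons):
    """Combine client updates, weighted by their privacy budgets."""
    w = budget_weights(epsilons)
    return sum(wi * u for wi, u in zip(w, updates))
```

The softmax keeps every client in the aggregate (weights are strictly positive and sum to one) while letting looser-budget clients contribute more.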

Main Results

PPPFL is evaluated on four benchmarks: MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100, with 5 clients and 30 communication rounds under non-IID settings.

Performance on DP-Synthetic Data (Server-Client FL)

Table 1 · Average performance over clients: models trained on synthetic data, fine-tuned and tested on secret data

| Method | MNIST BMT-F1 | MNIST BMTA | FMNIST BMT-F1 | FMNIST BMTA | CIFAR-10 BMT-F1 | CIFAR-10 BMTA | CIFAR-100 BMT-F1 | CIFAR-100 BMTA |
|---|---|---|---|---|---|---|---|---|
| FedAvg | 0.9223 | 0.9051 | 0.7267 | 0.7938 | 0.4898 | 0.6562 | 0.2142 | 0.3028 |
| FedNova | 0.8773 | 0.8996 | 0.6303 | 0.7607 | 0.3847 | 0.6627 | 0.2855 | 0.3374 |
| FedProx | 0.9006 | 0.9040 | 0.6994 | 0.7721 | 0.4856 | 0.6718 | 0.3014 | 0.3404 |
| SCAFFOLD | 0.8906 | 0.9024 | 0.6698 | 0.8061 | 0.3732 | 0.6614 | 0.2096 | 0.3333 |
| FedMeta | 0.9399 | 0.9616 | 0.8367 | 0.8800 | 0.6897 | 0.7466 | 0.2737 | 0.4133 |
| PPPFL (ours) | 0.9420 | 0.9640 | 0.8622 | 0.9199 | 0.7163 | 0.8000 | 0.3526 | 0.4800 |
PPPFL outperforms all baselines across all four datasets in both BMT-F1 and BMTA metrics, while maintaining differential privacy guarantees.

Comparison to Decentralized FL Frameworks

Table 2 · BMTA comparison with decentralized FL frameworks
| Method | MNIST | FMNIST | CIFAR-10 | CIFAR-100 |
|---|---|---|---|---|
| AvgPush | 0.9872 | 0.8900 | 0.6071 | 0.2975 |
| ProxyFL | 0.9870 | 0.8942 | 0.6292 | 0.3144 |
| Dis-PFL | 0.9000 | 0.9333 | 0.8380 | 0.2612 |
| PPPFL (ours) | 0.9640 | 0.9199 | 0.8000 | 0.4800 |
Despite using DP-synthetic data (not real private data), PPPFL achieves competitive performance and significantly outperforms on CIFAR-100, demonstrating strong utility-privacy balance.

Key Findings

Finding 1

Superior Accuracy with Privacy Guarantees

PPPFL outperforms all server-client FL baselines (FedAvg, FedNova, FedProx, SCAFFOLD, FedMeta) across all four benchmarks, achieving up to 8.0% improvement in BMT-F1 on CIFAR-100 over the second-best method.

Finding 2

Stabilized Convergence

The original MAML diverges after a few communication rounds. The stabilized MAML variant with Model EMA and Cosine Annealing LR achieves fast convergence within 10 rounds, validated on CIFAR-10.

Finding 3

Privacy-Utility Trade-off

Higher privacy budgets ε (i.e., weaker privacy) yield better model accuracy and higher synthetic data quality (lower FID). PPPFL also exhibits a strong correlation (Spearman's ρ = −0.90) between local model performance and collaboration gains, providing a natural incentive for protocol participation.

Citation

If you find this work useful in your research, please consider citing:

@article{tran2024personalized,
  title     = {Personalized Privacy-Preserving Framework
               for Cross-Silo Federated Learning},
  author    = {Tran, Van-Tuan and Pham, Huy-Hieu
               and Wong, Kok-Seng},
  journal   = {IEEE Transactions on Emerging Topics
               in Computing},
  volume    = {12},
  number    = {4},
  pages     = {1014--1024},
  year      = {2024},
  publisher = {IEEE}
}

Acknowledgments

This work was supported by VinUni-Illinois Smart Health Center (VISHC), VinUniversity.