
UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models

ICML 2026
Trinity College Dublin  ·  University College Dublin
[Figure 1 diagram: a cycle linking Universal Pseudo-Gradient (PG), which reconstructs gradients for non-activated experts, and Dynamic Modulated Routing (DMR), which rebalances expert utilization via global statistics. Arrow labels: "maintains expert viability", "generates real gradients". Center label: "Self-reinforcing cycle".]
Figure 1. The UB-SMoE self-reinforcing cycle. Universal Pseudo-Gradient (PG) reconstructs learning signals for non-activated experts, maintaining their viability. This enables Dynamic Modulated Routing (DMR) to effectively balance expert utilization, which in turn generates real gradients that further refine all experts — sustaining every expert across heterogeneous clients.
8.7×
better low-resource performance over heterogeneous LoRA-rank methods
45.0%
computation reduction on low-resource clients
0.4267
best avg. accuracy on Commonsense-15K (OLMoE-1B-7B)

Abstract

Heterogeneous LoRA-rank methods address system heterogeneity in federated fine-tuning of foundation models by assigning client-specific ranks based on computational capabilities. However, these methods achieve only marginal computational savings, as dense feed-forward computations dominate. Sparse Mixture-of-Experts (SMoE) provides a promising alternative through conditional computation, yet we identify that its naive application to heterogeneous federated settings introduces two critical discordances: (i) expert utilization imbalance and (ii) non-differentiability of Top-K routing.

Our convergence analysis demonstrates that these discordances lead to degraded convergence, particularly for resource-constrained clients. To address these challenges, we propose Universally Balanced Sparse Mixture-of-Experts (UB-SMoE), which introduces Dynamic Modulated Routing (DMR) to rebalance expert utilization, and a Universal Pseudo-Gradient (PG) to reconstruct learning signals for non-activated experts. These mechanisms form a self-reinforcing cycle that maintains expert viability across heterogeneous clients. Experiments on benchmarks show that UB-SMoE achieves up to 45.0% computational reduction on low-resource clients while improving their performance by 8.7× compared to existing heterogeneous LoRA-rank methods.

Motivation

Real-world federated networks comprise devices with vastly different computational budgets. A single model configuration for all devices is ineffective: we cannot exploit high-end clients if the global model is shrunk for edge devices, and a large model cannot run on low-resource clients. We need resource-adaptive fine-tuning, where models scale their capacity to match client capabilities.

Limitation of Heterogeneous LoRA-rank

Dense FFN Computation Dominates

LoRA injects low-rank updates whose cost scales as $\mathcal{O}(r_c(d{+}l))$ per layer, but this is negligible against the FFN cost of $\mathcal{O}(d{\cdot}l)$, which is unchanged regardless of rank. Heterogeneous LoRA-rank methods therefore yield only ~5% computation reduction for low-resource clients, and the merged weights $W_0 + \Delta W$ remain dense at inference — deployment latency is unchanged.
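To make the gap concrete, the sketch below plugs illustrative dimensions into the two asymptotic costs. The widths `d` and `l` and the ranks are assumptions chosen for illustration, not values from the paper, and the constants hidden by the O-notation are ignored.

```python
# Rough cost model: LoRA adds about r * (d + l) multiply-adds per token,
# while the dense FFN projection costs about d * l regardless of rank.
# d, l and the ranks below are illustrative assumptions, not paper values.

def lora_cost(rank: int, d: int, l: int) -> int:
    return rank * (d + l)

def ffn_cost(d: int, l: int) -> int:
    return d * l

d, l = 2048, 8192  # hypothetical hidden and intermediate widths
for rank in (4, 16, 64):
    ratio = lora_cost(rank, d, l) / ffn_cost(d, l)
    print(f"rank={rank:>2}: LoRA cost is {ratio:.2%} of the dense FFN cost")
```

Under these assumed sizes, even a 16-fold change in rank moves total compute by only a few percent of the FFN cost, which is the effect the paragraph above describes.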

Limitation of Naive Federated SMoE

Two Optimization Discordances

SMoE offers natural resource adaptability (high-resource clients activate more experts, low-resource clients fewer), but its naive use in heterogeneous FL triggers (i) expert utilization imbalance and (ii) non-differentiability of Top-K routing. Together these degrade convergence for the clients that can least afford it.

The Two Critical Discordances

Discordance 1

Expert Utilization Imbalance

Experts activated by high-resource clients receive frequent updates and become over-specialized, while those relevant to low-resource clients remain severely under-utilized. This causes a “rich-get-richer” dynamic where a few experts dominate and others collapse.

Discordance 2

Non-Differentiability of Top-K Routing

The gating function $\gamma_i(x)$ is non-zero only for selected experts, so non-activated experts receive zero gradients. Low-resource clients with high sparsity therefore leave most experts without any learning signal round after round.
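The zero-gradient effect can be seen in a toy example. This is a minimal sketch, assuming a gate that applies a softmax over the selected logits and scalar "experts" of the form $w_i x$; the function names and numbers are hypothetical and the paper's exact gate may differ.

```python
import math

def top_k_gates(logits, k):
    """Softmax over the k largest logits; every other gate is exactly 0."""
    selected = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    z = sum(math.exp(logits[i]) for i in selected)
    gates = [0.0] * len(logits)
    for i in selected:
        gates[i] = math.exp(logits[i]) / z
    return gates

def expert_weight_grads(gates, x, upstream=1.0):
    """Gradient of y = sum_i gate_i * (w_i * x) with respect to each scalar w_i."""
    return [upstream * g * x for g in gates]

logits = [2.0, 0.5, 1.5, -1.0]    # hypothetical router scores for 4 experts
gates = top_k_gates(logits, k=1)  # a low-resource client activating one expert
grads = expert_weight_grads(gates, x=3.0)
print(gates)  # only the selected expert has a non-zero gate
print(grads)  # so only the selected expert receives a non-zero gradient
```

Because each expert's gradient is multiplied by its gate, every expert outside the Top-K set gets exactly zero update, no matter how relevant it was to the input.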

Key Insight: Our convergence analysis (Theorem 4.1) proves these discordances introduce a bias in the stochastic-gradient estimator that creates an irreducible error floor in the global objective — one that scales inversely with client computational budgets, explaining why resource-constrained clients systematically underperform in federated SMoE systems.

Theoretical Foundation

We formally analyze the convergence of heterogeneous federated SMoE and show that the two discordances are not merely empirical nuisances — they impose a fundamental limit on accuracy.

Gradient Bias

Biased Sparse-MoE Gradient Estimator

Because Top-K routing zeroes out non-activated experts, the local stochastic gradient is a biased estimator of the true gradient. The bias is proportional to the mass placed on experts a client never activates — which grows with sparsity (i.e., with how resource-constrained the client is).
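As a rough illustration of how the unactivated mass grows with sparsity, the sketch below uses the dense softmax probability outside the Top-K set as a stand-in for that mass. The affinity scores are hypothetical, and this is not the bias term from the paper's analysis.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mass_outside_top_k(scores, k):
    """Dense routing probability assigned to experts that Top-k does not activate."""
    probs = sorted(softmax(scores), reverse=True)
    return sum(probs[k:])

# Hypothetical affinities for 8 experts.
scores = [1.2, 0.9, 0.7, 0.4, 0.1, -0.2, -0.5, -0.8]
for k in (8, 4, 2, 1):
    print(f"k={k}: unactivated mass = {mass_outside_top_k(scores, k):.3f}")
```

The unactivated mass is zero when all experts are active and rises monotonically as `k` shrinks, matching the claim that tighter budgets carry more bias.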

Theorem 4.1

Irreducible Error Floor

This bias translates into an irreducible error floor that scales inversely with the client's computation budget. Tighter budgets (fewer activated experts) ⇒ larger floor ⇒ worse convergence. This is precisely why simply plugging SMoE into heterogeneous FL fails the clients it was meant to help.

UB-SMoE Framework

UB-SMoE resolves both discordances with two complementary mechanisms that form a self-reinforcing cycle: PG maintains expert viability through approximate gradients, enabling DMR to balance utilization, which generates real gradients that further refine all experts.

Mechanism 1

Dynamic Modulated Routing (DMR)

DMR regulates expert selection using global utilization statistics through a learnable modulation vector $\boldsymbol{\phi}^{(l)}$. Rather than applying Top-K on raw affinities $\mathbf{s}^{(l)}$, it identifies a small candidate set $\mathcal{T}^{(l)}$ (top $N_p{=}2$ experts) and modulates only those logits: $m_i^{(l)} = s_i^{(l)} + \phi_i^{(l)}$ for $i \in \mathcal{T}^{(l)}$. An $L_2$-regularized $\phi$ with bounded range prevents the modulation from overriding semantic relevance, while a server-side utilization-aware update with momentum keeps utilization balanced.
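The candidate-set modulation can be sketched as follows. This is a minimal illustration under stated assumptions: non-candidate logits keep their raw score, and the final selection is a plain Top-K over the modulated logits. The function name `dmr_select` and all numbers are hypothetical, and the server-side update of $\boldsymbol{\phi}^{(l)}$ is not shown.

```python
def dmr_select(scores, phi, k, n_p=2):
    """Pick k experts after modulating only the top-n_p candidate logits."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    candidates = set(order[:n_p])  # candidate set T
    # m_i = s_i + phi_i for i in T; other logits are left unchanged (assumption).
    modulated = [s + phi[i] if i in candidates else s
                 for i, s in enumerate(scores)]
    chosen = sorted(range(len(scores)), key=lambda i: modulated[i], reverse=True)[:k]
    return sorted(chosen), modulated

scores = [2.0, 1.8, 1.7, 0.3]  # hypothetical raw affinities s
phi = [-0.5, 0.0, 0.0, 0.0]    # hypothetical modulation: expert 0 is over-used
chosen, modulated = dmr_select(scores, phi, k=2)
print(chosen)     # experts selected after modulation
print(modulated)  # logits seen by the final Top-K
```

In this toy case the negative modulation demotes the over-used expert 0 just enough for the near-tied expert 2 to be selected instead, while the clearly irrelevant expert 3 stays out. Keeping the modulation small and restricted to the candidate set is what stops it from overriding semantic relevance.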

Mechanism 2

Universal Pseudo-Gradient (PG)

PG approximately reconstructs gradients for non-activated experts, ensuring every expert receives a meaningful update in every round regardless of client sparsity. By feeding learning signals to dormant experts, PG keeps them viable — so that DMR has real experts to route to and balance. The global utilization rate $\tilde{u}_i^{(l)} = \sum_c p_c \, a_{c,i}^{(l)} / n_c^{(l)}$ closes the loop between client activations and server-side balancing.
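The utilization formula can be evaluated directly. In this sketch the symbol meanings are assumptions, since they are not defined on this page: $p_c$ is taken to be the aggregation weight of client $c$, $a_{c,i}^{(l)}$ the number of times client $c$ activated expert $i$ in layer $l$, and $n_c^{(l)}$ that client's total number of activations in the layer. All numbers are hypothetical.

```python
def global_utilization(weights, activations, totals):
    """u_i = sum_c p_c * a_{c,i} / n_c for one layer."""
    num_experts = len(activations[0])
    return [
        sum(p * a[i] / n for p, a, n in zip(weights, activations, totals))
        for i in range(num_experts)
    ]

# Hypothetical round: 2 clients, 4 experts.
p = [0.5, 0.5]           # assumed aggregation weights p_c
a = [[40, 40, 10, 10],   # assumed activation counts of a high-resource client
     [20, 0, 0, 0]]      # a low-resource client that activates a single expert
n = [100, 20]            # assumed total activations n_c per client
u = global_utilization(p, a, n)
print(u)
```

Here expert 0 ends up with most of the global utilization because both clients route to it, which is the kind of imbalance the server-side balancing is meant to detect.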

Why the cycle works: PG keeps experts alive → DMR can safely promote under-utilized experts → balanced utilization produces real (not pseudo) gradients → experts are refined → viability is sustained. The two mechanisms are individually helpful but jointly far stronger than either alone (see ablation below).

Main Results

We evaluate UB-SMoE on Commonsense-15K (8 commonsense-reasoning datasets) and the TeleQuAD telecommunications question-answering benchmark, using OLMoE-1B-7B and OLMo-1B base models. Baselines span heterogeneous LoRA-rank methods (HetLoRA, FlexLoRA, FLoRA, FLoRIST) and heterogeneous sparsity methods (A$^3$SMoE, SMoE-LLB).

Commonsense-15K — SMoE Backbone (OLMoE-1B-7B)

Table 1 · Average accuracy on Commonsense-15K with OLMoE-1B-7B, averaged over all computation budgets
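| Dataset | HetLoRA | FlexLoRA | FLoRA | FLoRIST | SMoE-LLB | A³SMoE | UB-SMoE |
|---|---|---|---|---|---|---|---|
| ARC-Challenge | 0.1284 | 0.3012 | 0.0868 | 0.0960 | 0.3080 | 0.3347 | 0.3611 |
| ARC-Easy | 0.1582 | 0.4136 | 0.1004 | 0.1097 | 0.4213 | 0.4724 | 0.5017 |
| BoolQ | 0.3709 | 0.4573 | 0.3472 | 0.3407 | 0.5122 | 0.4301 | 0.4952 |
| HellaSwag | 0.0714 | 0.1607 | 0.0674 | 0.0605 | 0.2096 | 0.2448 | 0.2258 |
| OpenBookQA | 0.1200 | 0.3225 | 0.1160 | 0.1380 | 0.3360 | 0.3925 | 0.4030 |
| PIQA | 0.2278 | 0.3327 | 0.2285 | 0.2319 | 0.4244 | 0.4219 | 0.5118 |
| Social IQa | 0.1452 | 0.3099 | 0.0885 | 0.0814 | 0.3801 | 0.4193 | 0.4486 |
| WinoGrande | 0.2770 | 0.3445 | 0.1786 | 0.1257 | 0.4410 | 0.3733 | 0.4665 |
| Average | 0.1874 | 0.3303 | 0.1517 | 0.1480 | 0.3791 | 0.3861 | 0.4267 |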
UB-SMoE achieves the best average (0.4267), beating the strongest sparsity baseline A³SMoE by 10.5% and the strongest LoRA-rank baseline FlexLoRA by 29.1%.

Commonsense-15K — Dense Backbone (OLMo-1B)

Table 2 · Accuracy on Commonsense-15K with dense OLMo-1B; UB-SMoE evaluated at budget β₄ with matched trainable activated parameters
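| Dataset | FlexLoRA | HetLoRA | FLoRA | FLoRIST | UB-SMoE |
|---|---|---|---|---|---|
| ARC-Challenge | 0.2654 | 0.2159 | 0.1212 | 0.0734 | 0.5333 |
| ARC-Easy | 0.2740 | 0.2180 | 0.1077 | 0.0614 | 0.7184 |
| BoolQ | 0.5101 | 0.5523 | 0.2651 | 0.1917 | 0.4697 |
| HellaSwag | 0.1358 | 0.1104 | 0.1911 | 0.1294 | 0.3536 |
| OpenBookQA | 0.2660 | 0.2340 | 0.1000 | 0.0780 | 0.6320 |
| PIQA | 0.4777 | 0.5005 | 0.4918 | 0.4913 | 0.6931 |
| Social IQa | 0.3557 | 0.3362 | 0.1029 | 0.0583 | 0.6039 |
| WinoGrande | 0.4775 | 0.4854 | 0.2928 | 0.2092 | 0.5043 |
| Average | 0.3453 | 0.3316 | 0.2091 | 0.1616 | 0.5636 |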
Even against dense LoRA-rank adaptation under matched trainable parameters, UB-SMoE (0.5636) surpasses FlexLoRA by 63.2%, HetLoRA by 70.0%, FLoRA by 169.5%, and FLoRIST by 248.6%.

Performance Across Resource Budgets (the 8.7× story)

Across four computation budgets β₁–β₄, heterogeneous LoRA-rank methods collapse under tight budgets while sparsity-based methods stay effective.

Table 3 · Average accuracy on Commonsense-15K across computation budgets (heterogeneous setting)
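| Budget | HetLoRA | FlexLoRA | FLoRA | FLoRIST | SMoE-LLB | A³SMoE | UB-SMoE |
|---|---|---|---|---|---|---|---|
| β₁ (low) | 0.0079 | 0.0456 | 0.0094 | 0.0112 | 0.3531 | 0.3629 | 0.3936 |
| β₂ | 0.0699 | 0.2818 | 0.0480 | 0.0538 | 0.3847 | 0.4310 | 0.3359 |
| β₃ | 0.2137 | 0.5375 | 0.2497 | 0.2545 | 0.3961 | 0.4096 | 0.4716 |
| β₄ (high) | 0.4580 | 0.4563 | 0.2996 | 0.2724 | 0.3824 | 0.3410 | 0.5240 |
| Average | 0.1874 | 0.3303 | 0.1517 | 0.1480 | 0.3791 | 0.3861 | 0.4313 |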
At the tightest budget β₁, UB-SMoE (0.3936) outperforms FlexLoRA (0.0456), the strongest LoRA-rank baseline at that budget, by 8.7×, while the remaining LoRA-rank methods (HetLoRA 0.0079, FLoRA 0.0094, FLoRIST 0.0112) are essentially non-functional.

TeleQuAD — Telecommunications QA (non-IID)

BERTScore F1 on the domain-specific TeleQuAD benchmark under non-IID client data, where UB-SMoE is best at every budget.

Table 4 · TeleQuAD BERTScore F1 (×100) under non-IID data distribution
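| Budget | HetLoRA | FlexLoRA | SMoE-LLB | A³SMoE | UB-SMoE |
|---|---|---|---|---|---|
| β₁ (low) | 35.84 | 37.73 | 44.56 | 37.24 | 47.66 |
| β₂ | 40.76 | 44.13 | 59.68 | 41.71 | 60.04 |
| β₃ | 44.63 | 42.87 | 65.98 | 41.33 | 66.31 |
| β₄ (high) | 46.26 | 42.10 | 68.23 | 42.52 | 68.29 |
| Average | 41.87 | 41.71 | 59.61 | 40.70 | 60.58 |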
Under non-IID data, UB-SMoE improves from 47.66 at β₁ to 68.29 at β₄ and leads at every budget — demonstrating robustness to both resource constraints and data heterogeneity.

Ablation — Contribution of Each Mechanism

Table 5 · Ablation on Commonsense-15K (avg. over 8 reasoning tasks). PG = Universal Pseudo-Gradient; DMR components are φ-regularization and the utilization-aware update.
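| Pseudo-Gradient | φ Regularization | Utilization-aware Update | Avg. Accuracy |
|---|---|---|---|
| ✗ | ✗ | ✗ | 0.1701 |
| ✓ | ✗ | ✗ | 0.2659 |
| … | … | … | 0.2839 |
| … | … | … | 0.3591 |
| … | … | … | 0.4009 |
| ✓ | ✓ | ✓ | 0.4267 |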
PG alone lifts accuracy from 0.1701 → 0.2659 by providing learning signals to all experts; adding the DMR components resolves the utilization imbalance, reaching 0.4267 — confirming the two mechanisms are complementary.

Scalability to 32 Clients

Table 6 · Scalability on Commonsense-15K with 32 clients under different heterogeneity patterns (avg. over budgets)
| Heterogeneity Pattern | HetLoRA | FlexLoRA | A³SMoE | SMoE-LLB | UB-SMoE |
|---|---|---|---|---|---|
| Uniform (50% participation) | 0.1505 | 0.2070 | 0.3998 | 0.3928 | 0.4036 |
| Long-tail (75% low-resource) | 0.1832 | 0.2585 | 0.2988 | 0.2961 | 0.3047 |
| Reverse long-tail (75% high-resource) | 0.2242 | 0.2601 | 0.4148 | 0.3928 | 0.4272 |
UB-SMoE stays ahead across client scale and resource distribution, including the challenging long-tail setting dominated by low-resource clients.

Key Findings

Finding 1

Dramatic Gains for Low-Resource Clients

Where heterogeneous LoRA-rank methods effectively fail (β₁ accuracies between 0.0079 and 0.0456), UB-SMoE stays strong (0.3936): an 8.7× improvement over the best LoRA-rank baseline, with up to 45% computation reduction on low-resource clients.

Finding 2

Balanced Expert Utilization

UB-SMoE achieves the highest global utilization entropy ($H \approx 6.5$) with lower variance, progressively improving balance across communication rounds — eliminating the “rich-get-richer” collapse of naive federated SMoE.
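For readers unfamiliar with the metric, utilization entropy can be computed as below. This is a minimal sketch that assumes natural-log Shannon entropy over a normalized utilization vector. The log base and the set of experts behind the reported $H \approx 6.5$ are not stated on this page, so the toy numbers are not comparable to that value.

```python
import math

def utilization_entropy(utilization):
    """Shannon entropy (in nats) of a normalized utilization vector."""
    total = sum(utilization)
    probs = [u / total for u in utilization if u > 0]
    return -sum(p * math.log(p) for p in probs)

balanced = [1.0] * 8              # hypothetical: 8 experts used equally
collapsed = [0.93] + [0.01] * 7   # hypothetical: one expert dominates
print(f"balanced:  {utilization_entropy(balanced):.3f}")
print(f"collapsed: {utilization_entropy(collapsed):.3f}")
```

Entropy is maximal when every expert is used equally and drops sharply when a few experts dominate, so higher values indicate better balance.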

Finding 3

Robust to Scale & Heterogeneity

UB-SMoE generalizes across backbones (OLMoE-1B-7B, dense OLMo-1B), domains (commonsense reasoning, telecom), and data distributions (IID and non-IID), and it scales gracefully to 32 clients under uniform, long-tail, and reverse long-tail patterns.

Citation

If you find this work useful in your research, please consider citing:

@inproceedings{tran2026ubsmoe,
  title     = {UB-SMoE: Universally Balanced Sparse
               Mixture-of-Experts for Resource-adaptive
               Federated Fine-tuning of Foundation Models},
  author    = {Tran, Van-Tuan and Nguyen-Le, Hong-Hanh
               and Ruffini, Marco and Dzaferagic, Merim},
  booktitle = {Proceedings of the International Conference
               on Machine Learning (ICML)},
  year      = {2026}
}