The Federated Compute (FC) Server is part of the Federated Learning capability offered by On-Device Personalization (ODP). This document introduces the FC Server, its components, and the technology used. It provides a high-level overview of the architecture, then dives into each component in detail, discusses how the components work together to provide a federated learning environment, and offers strategies for scaling and sharding workloads.
Training flow
Training consists of data flows between the FC Client and FC Server. The FC Client is a core Android module that trains ML models on-device and interacts with the FC Server. The FC Server processes and aggregates the results from the FC Client securely in a Trusted Execution Environment (TEE).
Training consists of the following steps:
1. The FC Client on the device downloads a public encryption key from the Key Services.
2. The FC Client checks in with the FC Server and gets a training task.
3. The FC Client downloads the training plan, plus the latest version of the model, version N.
4. The FC Client trains using the local data and the plan.
5. The FC Client encrypts this device's contributions with the public key obtained in step 1 and uploads them to the FC Server.
6. The FC Client notifies the FC Server that its training has completed.
7. The FC Server waits until enough clients have submitted their contributions.
8. A round of aggregation is triggered.
9. The Aggregator loads the encrypted contributions into a Trusted Execution Environment (TEE).
10. The Aggregator attests itself to the coordinators, following the IETF's RFC 9334 Remote ATtestation procedureS (RATS) architecture. Upon successful attestation, the Key Services grant it the decryption keys. These keys may be split across multiple key providers in a Shamir secret sharing scheme.
11. The Aggregator performs cross-device aggregation, clips and noises the results per the appropriate Differential Privacy (DP) mechanisms, and outputs the noised results (see the sketch after this list).
12. The Aggregator triggers the Model Updater.
13. The Model Updater loads the aggregated contribution and applies it to model version N to create model version N + 1. The new model is pushed to model storage.
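To make steps 11 and 13 concrete, the following is a minimal Java sketch of clipping each contribution to a fixed L2 norm, summing, adding Gaussian noise, and applying the result to the current weights. The class, method names, and parameters are hypothetical; a production Aggregator would run this inside the TEE with vetted DP parameters and a secure noise source.

```java
import java.util.List;
import java.util.Random;

/** Minimal sketch of cross-device aggregation with DP clipping and noising. */
public class AggregationSketch {

  /** Scales a contribution so its L2 norm is at most clipNorm. */
  static double[] clip(double[] contribution, double clipNorm) {
    double norm = 0;
    for (double v : contribution) norm += v * v;
    norm = Math.sqrt(norm);
    double scale = norm > clipNorm ? clipNorm / norm : 1.0;
    double[] out = new double[contribution.length];
    for (int i = 0; i < contribution.length; i++) out[i] = contribution[i] * scale;
    return out;
  }

  /** Sums clipped contributions, then adds Gaussian noise calibrated to clipNorm. */
  static double[] aggregate(List<double[]> contributions, double clipNorm, double noiseMultiplier) {
    Random rng = new Random(); // a real implementation would use a secure, vetted noise source
    double[] sum = new double[contributions.get(0).length];
    for (double[] c : contributions) {
      double[] clipped = clip(c, clipNorm);
      for (int i = 0; i < sum.length; i++) sum[i] += clipped[i];
    }
    double stddev = clipNorm * noiseMultiplier;
    for (int i = 0; i < sum.length; i++) sum[i] += rng.nextGaussian() * stddev;
    return sum;
  }

  /** Applies the averaged, noised update to model version N to produce version N + 1. */
  static double[] applyUpdate(double[] modelN, double[] noisedSum, int numClients, double learningRate) {
    double[] modelNext = new double[modelN.length];
    for (int i = 0; i < modelN.length; i++) {
      modelNext[i] = modelN[i] - learningRate * (noisedSum[i] / numClients);
    }
    return modelNext;
  }
}
```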
The FC Server can be deployed on any cloud service that supports TEEs and related security features. We are evaluating public cloud providers and underlying technologies, but for now the following sections present an example Google Cloud implementation using Confidential Space.
High-level architecture
The FC Server has the following components deployed in Google Cloud:
| Component | Description |
| --- | --- |
| Task Management Service | A web service for managing training tasks. Partners should use the Task Management API to create a training task, list all existing training tasks, cancel a task, and retrieve all training statuses. |
| Task Assignment Service | An HTTPS-based web service where client devices periodically check in to obtain training tasks and report training status. |
| Aggregator | A background service running in Confidential Space. It runs ODP-authored workloads and must attest to the coordinators, which guard access to the decryption keys. Only successfully attested Aggregators can decrypt contributions submitted by client devices and carry out cross-device aggregation. |
| Model Updater | A background service running in Confidential Space that applies the aggregated gradients to the model. |
Component details
The following sections describe each component of the high-level architecture in more detail.
Task Management Service
The Task Management Service contains two subcomponents, the Task Management Web Service and the Task Scheduler Service, both deployed on GKE.
Task Management
This is a set of frontend web services that accept HTTPS requests and create or retrieve tasks in the Task Database.
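As an illustration, creating a training task might look like the following sketch using Java's built-in HTTP client. The endpoint URL, path, and JSON fields are hypothetical placeholders, not the published Task Management API surface.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateTaskExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical endpoint and payload; the real Task Management API may differ.
    String body = """
        {
          "populationName": "my_population",
          "minAggregationSize": 500,
          "maxAggregationSize": 1000,
          "modelUri": "gs://fc-models/my_model/v0"
        }""";
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://fcserver.example.com/taskmanagement/v1/tasks"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}
```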
Task Scheduler
A background service that continuously scans the Task Database. It manages the training flow, for example creating new training rounds and iterations.
Task Database
An ANSI SQL-compliant database that stores the Task, Iteration, and Assignment information. In this implementation, Google Cloud Spanner is used as the underlying database service.
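To illustrate how services read this state, here is a minimal Cloud Spanner query sketch in Java. The database IDs, table name (`Iteration`), columns, and status value are hypothetical stand-ins for the actual schema.

```java
import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.DatabaseId;
import com.google.cloud.spanner.ResultSet;
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;
import com.google.cloud.spanner.Statement;

public class TaskDbQuerySketch {
  public static void main(String[] args) {
    Spanner spanner = SpannerOptions.newBuilder().build().getService();
    try {
      DatabaseClient db = spanner.getDatabaseClient(
          DatabaseId.of("my-project", "fc-instance", "task-db")); // hypothetical IDs
      // Hypothetical schema: one row per training iteration.
      Statement stmt = Statement.newBuilder(
              "SELECT TaskId, IterationId, Status FROM Iteration WHERE Status = @status")
          .bind("status").to("COLLECTING")
          .build();
      try (ResultSet rs = db.singleUse().executeQuery(stmt)) {
        while (rs.next()) {
          System.out.printf("task=%d iteration=%d%n", rs.getLong("TaskId"), rs.getLong("IterationId"));
        }
      }
    } finally {
      spanner.close();
    }
  }
}
```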
Task Assignment Service
The Task Assignment Service is a frontend web service hosted on GKE. It accepts requests from FC Clients and distributes training tasks when applicable.
The Task Database here is the same database instance used by the Task Management Service.
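For illustration, a device check-in could look like the sketch below. The endpoint path and payload are hypothetical; in production this exchange is performed by the FC Client module, not hand-written code.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CheckInSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical check-in endpoint; the actual protocol is implemented by the FC Client.
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://fcserver.example.com/taskassignment/v1/population/my_population:checkin"))
        .POST(HttpRequest.BodyPublishers.ofString("{\"clientVersion\": \"1.0\"}"))
        .build();
    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    // Expect either a task assignment (plan and model URIs) or an empty
    // assignment when no training task is currently available.
    System.out.println(response.body());
  }
}
```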
Aggregator Service
Aggregator and Model Updater
The Aggregator and Model Updater are similar: both are background services that process data securely in Confidential Space. Communication between these offline jobs happens through Pub/Sub.
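As an illustration, the Aggregator could hand off to the Model Updater with a Pub/Sub message like the one below. This uses the standard Google Cloud Pub/Sub Java client; the project ID, topic name, and message payload are hypothetical.

```java
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class AggregationDoneNotifier {
  public static void main(String[] args) throws Exception {
    // Hypothetical project and topic names.
    TopicName topic = TopicName.of("my-project", "model-updater-topic");
    Publisher publisher = Publisher.newBuilder(topic).build();
    try {
      // Hypothetical payload pointing the Model Updater at the aggregated result.
      PubsubMessage msg = PubsubMessage.newBuilder()
          .setData(ByteString.copyFromUtf8(
              "{\"taskId\":42,\"iterationId\":7,"
                  + "\"aggregatedGradientUri\":\"gs://fc-aggregated-gradients/task42/round7/aggregate.bin\"}"))
          .build();
      publisher.publish(msg).get(); // block until the message is acknowledged
    } finally {
      publisher.shutdown();
    }
  }
}
```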
Gradients, aggregated gradients, model and plan
The design uses three storage locations (see the sketch after this list):
- A gradient storage for encrypted gradients uploaded by client devices.
- An aggregated gradient storage for aggregated, clipped, and noised gradients.
- A model and plan storage for training plans, models, and weights.
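In the Google Cloud example implementation, these map to Cloud Storage locations. Below is a minimal sketch of how the Aggregator might read an encrypted contribution and write back the aggregated result using the Cloud Storage Java client; the bucket names and object layout are hypothetical.

```java
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class GradientStorageSketch {
  public static void main(String[] args) {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    // Hypothetical bucket and object layout, keyed by task and round.
    byte[] encryptedContribution =
        storage.readAllBytes(BlobId.of("fc-device-contributions", "task42/round7/device-abc.bin"));
    // ... decrypt inside the TEE, aggregate, clip, and noise ...
    byte[] aggregate = new byte[0]; // placeholder for the noised aggregate
    storage.create(
        BlobInfo.newBuilder(BlobId.of("fc-aggregated-gradients", "task42/round7/aggregate.bin")).build(),
        aggregate);
  }
}
```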
Collector
The Collector is a background service that periodically counts client device submissions during a training round. Once enough submissions are available, it notifies the Aggregator to kick off aggregation.
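The Collector's core check reduces to counting submissions and comparing against a threshold, roughly as sketched below. The `Assignment` table, its columns, and the status value are hypothetical.

```java
import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.ResultSet;
import com.google.cloud.spanner.Statement;

/** Minimal sketch of the Collector's threshold check for one training round. */
public class CollectorSketch {
  static boolean roundReady(DatabaseClient db, long taskId, long iterationId, long minSubmissions) {
    // Hypothetical schema: one Assignment row per device contribution.
    Statement stmt = Statement.newBuilder(
            "SELECT COUNT(*) AS n FROM Assignment "
                + "WHERE TaskId = @task AND IterationId = @iteration AND Status = 'UPLOADED'")
        .bind("task").to(taskId)
        .bind("iteration").to(iterationId)
        .build();
    try (ResultSet rs = db.singleUse().executeQuery(stmt)) {
      rs.next();
      // If true, notify the Aggregator (for example, over Pub/Sub) to start aggregation.
      return rs.getLong("n") >= minSubmissions;
    }
  }
}
```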
Service hosts
All services that don't have access to sensitive information are hosted on GKE.
All services that may touch sensitive information are hosted in Confidential Space.
All sensitive data is encrypted with keys managed by Key Services owned by multiple parties. Only successfully attested, ODP-authored open source code running in legitimate, confidential-computing-enabled versions of Confidential Space can access the decryption keys.
Within one service unit, the compute resources are therefore split between the GKE cluster and Confidential Space.
Scalability
The previously described infrastructure focuses on one service unit.
One service unit uses one Cloud Spanner instance. See Spanner Quotas & limits for notable limitations.
Each component of this architecture can be scaled independently by scaling capacity within Confidential Space or within the GKE cluster using standard scaling mechanisms. Effectively, the processing capacity can be increased by adding more instances of the:
- Task Assignment Web Service
- Task Management Web Service
- Aggregator
- Model Updater
Resilience
The FC Server's resilience is handled by disaster recovery using replicated storage. If disaster recovery matters to you, enable cross-region data replication: if a disaster happens (such as a weather event disrupting a data center), the service resumes from the last round of training.
Spanner
The FC Server's default implementation uses Google Cloud Spanner as the database to store the task status that controls the training flow. You should evaluate the tradeoffs between consistency and availability according to your business needs before selecting a multi-region configuration.
No user data or derivatives of it, raw or encrypted, are stored in any Spanner instance, so you are free to use any of the disaster recovery features offered by Spanner.
Spanner records the change history, and the Aggregator and Model Updater store data per training round, so each round's result is kept separately and never overwritten. Because of this, the service can resume from the last round of training in the event of a disaster.
Google Cloud Storage
The FC Server's default implementation uses Google Cloud Storage to store blob data such as models, training plans, and encrypted device contributions.
There are three GCS buckets in the design:
- Device contributions: encrypted device contributions uploaded from devices.
- Models: training plans, models, and their weights.
- Aggregated gradients: the aggregated gradients produced by the Aggregator.
The data stored in GCS falls into one of three categories:
- Developer-provided data, such as training plans.
- Potentially private data, derived from user signals and protected by multiple-coordinator-backed encryption, such as device-uploaded gradients and aggregated gradients.
- Non-private data derived from user signals after Differential Privacy has been applied, such as model weights.
You should evaluate the tradeoffs between consistency and availability and select the appropriate GCS data availability and durability features. You should also specify your own data retention policies.
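For example, a retention policy that deletes encrypted device contributions after a fixed period could be expressed as a GCS lifecycle rule. The sketch below uses the standard Cloud Storage Java client; the bucket name and 30-day retention period are hypothetical choices.

```java
import com.google.cloud.storage.Bucket;
import com.google.cloud.storage.BucketInfo.LifecycleRule;
import com.google.cloud.storage.BucketInfo.LifecycleRule.LifecycleAction;
import com.google.cloud.storage.BucketInfo.LifecycleRule.LifecycleCondition;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.util.List;

public class RetentionPolicySketch {
  public static void main(String[] args) {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    // Hypothetical policy: delete encrypted device contributions older than 30 days.
    LifecycleRule deleteAfter30Days = new LifecycleRule(
        LifecycleAction.newDeleteAction(),
        LifecycleCondition.newBuilder().setAge(30).build());
    Bucket bucket = storage.get("fc-device-contributions"); // hypothetical bucket name
    bucket.toBuilder().setLifecycleRules(List.of(deleteAfter30Days)).build().update();
  }
}
```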
Replication and backups
Apart from the data replication mechanisms provided by Google Cloud, you may also choose to periodically back up the data in Spanner and GCS, for example using cross-cloud replication services and offerings. ODP doesn't provide a sample because these configurations depend heavily on business needs. The current design takes into consideration developers' potential need for such replications and backups, and as a result it is compatible with third-party replication and backup services and products.