Training PPFL
To run PPFL with decentralized data on multiple machines, we use gRPC, which allows clients on different platforms to seamlessly connect to the server for federated learning. This contrasts with MPI, where the server and all clients must reside in the same cluster. gRPC communicates over the HTTP/2 protocol.
A server hosts a service specified by a URI (e.g., moonshot.cels.anl.gov:50051, where 50051 is the port number), and clients send requests and receive responses via that URI. The communication protocol between the server and clients is defined via Protocol Buffers, which reside in the appfl/protos directory.
For more details, we refer readers to the gRPC documentation.
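For example, the sketch below shows how such a service URI might be assembled from an OmegaConf configuration. The server.host and server.port keys are illustrative assumptions, not the actual APPFL configuration schema:

```python
# A minimal sketch of assembling the service URI via OmegaConf.
# NOTE: the key names "server.host" and "server.port" are illustrative
# assumptions; consult the APPFL configuration schema for the real keys.
from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "server": {
        "host": "moonshot.cels.anl.gov",  # hostname of the gRPC server
        "port": 50051,                    # port the service listens on
    }
})

# Clients reach the service at "<host>:<port>".
uri = f"{cfg.server.host}:{cfg.server.port}"
print(uri)  # -> moonshot.cels.anl.gov:50051
```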
The API function to launch the gRPC server is defined as follows:
- appfl.run_grpc_server.run_server(cfg: omegaconf.DictConfig, model: torch.nn.Module, loss_fn: torch.nn.Module, num_clients: int, test_data: Dataset = torch.utils.data.Dataset, metric: Any | None = None) -> None

  Launches the gRPC server, which listens on the configured port to serve requests from clients. The service URI is set in the configuration. The server does not start training until the specified number of clients have connected.
- Parameters:
  - cfg (DictConfig) – the configuration for this run
  - model (nn.Module) – the neural network model to train
  - loss_fn (nn.Module) – the loss function
  - num_clients (int) – the number of clients used in the PPFL simulation
  - test_data (Dataset) – optional test data; if given, validation is performed on this data
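As a usage illustration, the following sketch launches the server with a toy setup. The config.yaml file, model, and loss function are placeholders standing in for your application's own objects, not APPFL-provided defaults:

```python
# A minimal sketch of launching the gRPC server. The configuration file,
# model, and loss function below are illustrative placeholders; substitute
# your own application's objects.
import torch
from omegaconf import OmegaConf
from appfl.run_grpc_server import run_server

cfg = OmegaConf.load("config.yaml")  # hypothetical config carrying the service URI

model = torch.nn.Sequential(         # toy classifier standing in for a real model
    torch.nn.Flatten(),
    torch.nn.Linear(28 * 28, 10),
)
loss_fn = torch.nn.CrossEntropyLoss()

# Blocks until `num_clients` clients have connected to the service URI,
# then coordinates the federated-learning rounds over gRPC.
run_server(cfg, model, loss_fn, num_clients=2)
```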