Training PPFL

To run PPFL with decentralized data on multiple machines, we use gRPC, which allows clients on different platforms to seamlessly connect to the server for federated learning. This contrasts with MPI, where all clients and the server must reside in the same cluster.

gRPC uses the HTTP/2 protocol. A server hosts a service at a URI (e.g., moonshot.cels.anl.gov:50051, where 50051 is the port number), and clients send requests and receive responses via that URI. The communication protocol between the server and clients is specified with Protocol Buffers, whose definitions reside in the appfl/protos directory. For more details, we refer readers to the gRPC documentation.
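To illustrate, a client reaches the service URI through a standard gRPC channel. The snippet below is a minimal sketch using only the public grpc Python package; the stub and message classes generated from appfl/protos are omitted, and the address is the example URI from above:

    import grpc

    # Open an HTTP/2 channel to the service URI (host:port); the address
    # below is the example URI from the text above.
    channel = grpc.insecure_channel("moonshot.cels.anl.gov:50051")

    # Block until the connection is ready (or time out after 10 seconds).
    grpc.channel_ready_future(channel).result(timeout=10)
    print("Connected to the gRPC server.")

    channel.close()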

The API functions to run gRPC are defined as follows:

appfl.run_grpc_server.run_server(cfg: omegaconf.DictConfig, model: torch.nn.Module, loss_fn: torch.nn.Module, num_clients: int, test_data: Dataset = torch.utils.data.Dataset, metric: Any | None = None) → None

Launch a gRPC server that listens on the configured port and serves requests from clients. The service URI is set in the configuration. The server does not start training until the specified number of clients have connected. A usage sketch follows the parameter list below.

Parameters:
  • cfg (DictConfig) – the configuration for this run

  • model (nn.Module) – neural network model to train

  • loss_fn (nn.Module) – loss function

  • num_clients (int) – the number of clients used in the PPFL simulation

  • test_data (Dataset) – optional test dataset; if given, validation runs on this data.

  • metric (Any, optional) – optional function to compute the validation metric.
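For reference, launching the server might look like the sketch below. This is a hypothetical example, not APPFL's canonical driver script: the configuration file path and its schema, the toy model, and the random test dataset are all placeholders.

    import torch
    import torch.nn as nn
    from torch.utils.data import TensorDataset
    from omegaconf import OmegaConf
    from appfl.run_grpc_server import run_server

    # Hypothetical: load the run configuration, which supplies the service
    # URI/port and federated learning settings; "config.yaml" is a placeholder.
    cfg = OmegaConf.load("config.yaml")

    # A toy model and loss function for illustration.
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
    loss_fn = nn.CrossEntropyLoss()

    # Placeholder test dataset; any torch.utils.data.Dataset works here.
    test_data = TensorDataset(
        torch.randn(100, 1, 28, 28), torch.randint(0, 10, (100,))
    )

    # Launch the server; training starts once num_clients clients connect.
    run_server(cfg, model, loss_fn, num_clients=2, test_data=test_data)

Clients on other machines then connect to the same service URI using the corresponding client-side API.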