Pysyft: Error in Federated Learning when Batch Normalization layer is used

Created on 13 Aug 2019 · 17Comments · Source: OpenMined/PySyft

Hi, I am trying to replicate the tutorial (Part 06 - Federated Learning on MNIST using a CNN.ipynb) with CIFAR10 dataset and with a network model I defined myself. However, it gives an error like this when it tries to execute nn.BatchNorm2d() function:

RuntimeError: running_mean should contain 3 elements not [32]

If I don't use batch normalization layers in my model, it works fine. If I use batch normalization layer without any federated learning approach, it also works perfectly. I need to use batch normalization layer to get a good accuracy and also I need to work this example with ResNet (also having batch normalization layer), so I am not sure how to work around this.

I am using google colab netbook with the installation example you gave (and also tested with my own computer with Python3.6 and PyTorch 1.1.0)

The example code and the related error is below:


from __future__ import print_function
import argparse
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
torch.__version__

from torchvision import datasets, transforms
from torch.autograd import Variable


import syft as sy  # <-- NEW: import the Pysyft library
hook = sy.TorchHook(torch)  # <-- NEW: hook PyTorch ie add extra functionalities to support Federated Learning
bob = sy.VirtualWorker(hook, id="bob")  # <-- NEW: define remote worker bob
alice = sy.VirtualWorker(hook, id="alice")  # <-- NEW: and alice


class Arguments():
    def __init__(self):
        self.batch_size = 64
        self.test_batch_size = 1000
        self.epochs = 10
        self.lr = 0.01
        self.momentum = 0.5
        self.no_cuda = False
        self.seed = 1
        self.log_interval = 30
        self.save_model = False

args = Arguments()

use_cuda = not args.no_cuda and torch.cuda.is_available()

torch.manual_seed(args.seed)

device = torch.device("cuda" if use_cuda else "cpu")

kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}

federated_train_loader = sy.FederatedDataLoader( # <-- this is now a FederatedDataLoader 
    datasets.CIFAR10('../data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.5,), (0.5,), (0.5,))
                   ]))
    .federate((bob, alice)), # <-- NEW: we distribute the dataset across all the workers, it's now a FederatedDataset
    batch_size=args.batch_size, shuffle=True, **kwargs)

test_loader = torch.utils.data.DataLoader(
    datasets.CIFAR10('../data', train=False, transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.5,), (0.5,), (0.5,))
                   ])),
    batch_size=args.test_batch_size, shuffle=True, **kwargs)

class Net(nn.Module):
    def __init__(self, dropout=0.0):
        super(Net, self).__init__()

        self.dropout = dropout
        self.conv_layer = nn.Sequential(

            # Conv Layer block 1
            nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            # Conv Layer block 2
            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Dropout2d(p=0.05),  # 0.05

            # Conv Layer block 3
            nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

        self.fc_layer = nn.Sequential(
            nn.Dropout(p=0.1),  # 0.1
            nn.Linear(4096, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.1),  # 0.1
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.conv_layer(x)
        x = x.view(x.size(0), -1)
        x = self.fc_layer(x)

        return x

def train(args, model, device, federated_train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(federated_train_loader): # <-- now it is a distributed dataset
        model.send(data.location) # <-- NEW: send the model to the right location
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.cross_entropy(output, target)
        loss.backward()
        optimizer.step()
        model.get() # <-- NEW: get the model back
        if batch_idx % args.log_interval == 0:
            loss = loss.get() # <-- NEW: get the loss back
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * args.batch_size, len(federated_train_loader) * args.batch_size,
                100. * batch_idx / len(federated_train_loader), loss.item()))

def test(args, model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.cross_entropy(output, target, reduction='sum').item() # sum up batch loss
            pred = output.argmax(1, keepdim=True) # get the index of the max log-probability 
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))


model = Net().to(device)
optimizer = optim.SGD(model.parameters(), lr=args.lr) # TODO momentum is not supported at the moment

for epoch in range(1, args.epochs + 1):
    train(args, model, device, federated_train_loader, optimizer, epoch)
    test(args, model, device, test_loader)

if (args.save_model):
    torch.save(model.state_dict(), "CIFAR10_cnn.pt")

The complete output (from google colab):


WARNING: Logging before flag parsing goes to stderr.
W0813 08:46:34.190028 140520788572032 secure_random.py:26] Falling back to insecure randomness since the required custom op could not be found for the installed version of TensorFlow. Fix this by compiling custom ops. Missing file was '/usr/local/lib/python3.6/dist-packages/tf_encrypted/operations/secure_random/secure_random_module_tf_1.14.0.so'
W0813 08:46:34.211576 140520788572032 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tf_encrypted/session.py:26: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

0it [00:00, ?it/s]

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ../data/cifar-10-python.tar.gz

170500096it [00:04, 40609204.86it/s]                               

---------------------------------------------------------------------------

PureTorchTensorFoundError                 Traceback (most recent call last)

/content/PySyft/syft/frameworks/torch/tensors/interpreters/native.py in handle_func_command(cls, command)
    300             new_args, new_kwargs, new_type, args_type = syft.frameworks.torch.hook_args.unwrap_args_from_function(
--> 301                 cmd, args, kwargs, return_args_type=True
    302             )

16 frames

PureTorchTensorFoundError: 


During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)

/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in batch_norm(input, running_mean, running_var, weight, bias, training, momentum, eps)
   1695     return torch.batch_norm(
   1696         input, weight, bias, running_mean, running_var,
-> 1697         training, momentum, eps, torch.backends.cudnn.enabled
   1698     )
   1699 

RuntimeError: running_mean should contain 3 elements not 32

One possible workaround for this problem is to divide the net and use batch norm in client side. However, I do not want to rewrite big ResNet architectures and make the code more complex.

I hope you can help me with this issue.

Type

Source

bussfromspace

All 17 comments

b-sayah on 20 Aug 2019

Williamongh on 24 Sep 2019

hayreenlee on 20 Nov 2019

anmakon on 2 Dec 2019

+10086

codergan on 4 Jan 2020

👍1

flo257 on 25 Feb 2020

meryemJanatiIdrissi on 19 Mar 2020

Is this bug solve? I have the same problem
even use the pytorch resnet (model= torchvision.models.resnet34(pretrained=False).to(device))can't train

johnnylin110 on 21 Mar 2020

So the answer here is that BatchNormalization has internal statistics (means and stds, etc.) which themselves contain private information but are not parameters.

This means that when you move a model from one machine to another (iterating through model.parameters()), those statistics probably aren't moving with it.

Solution: extend nn.Module() with a .parameters_and_statistics() iterator which will look for this kind of information so that we can make sure that federated learning properly moves everything from machien to machine.

iamtrask on 21 Mar 2020

I tried re creating this issue but it did not occur, So I dug a bit into the BatchNorm.

here I could see these running statistics are being able to be registered as parameters or states.
which extends to these lines if it is just a buffer
def register_buffer(self, name, tensor):

But I suspect either way these are now taken care by syft in moving.

So should we still look into this issue https://github.com/OpenMined/PySyft/issues/3236

ratmcu on 23 Mar 2020

@ratmcu
I use the code above , and still this issue happen , can you please tell what your code is ?
Because in my code and the code above both get this problem , and when i change the net without batchnorm2d , the error is gone . So I think this issue is still here , or someone can try it?

johnnylin110 on 23 Mar 2020

@johnnylin110 what is the pytorch version you are using? Can you check your source to see if it is the same as I mentioned? colab notebook here has the code, It runs on pytorch '1.4.0'

ratmcu on 23 Mar 2020

@ratmcu I think my syft version is the old one , and I udpate it to the new version which use pytorch1.4.0 . This issue is solve . Thanks for helping !

johnnylin110 on 24 Mar 2020

👍1

@bussfromspace @karlhigley This issue seems to be solved. Can you close it ?

sachin-101 on 25 Mar 2020

👍1

@ratmcu Hi again, I am looking at the code you paste, the different between is you modify the forward part into

if x.location != None:
            print(x.location)
            loc = x.location
            x = x.get()
            print(x.sixe(0))
            x = x.view(x.size(0), -1)
            x = x.send(loc)
        else:
            x = x.view(x.size(0), -1)
        x = self.fc_layer(x)

however , this network i try it on pysyft get 10% accuracy only , but in Pytorch version, this can achieve high accuracy , and if use the same code from above(without the part you modifiy) , the pysyft will pop out error message about "shape [0,-1] is invalid for input of size XXXX pytorch"
Is this still a bug here?

johnnylin110 on 30 Mar 2020

@johnnylin110 this is a separate bug due to not being able to call .size() method remotely, batchnorm layer running remotely was the goal. so it was running in this example.

ratmcu on 30 Mar 2020

@ratmcu I know there is a bug so need to call .size() method here, but still, why the accuracy of the same approach(same network) will lead different ?
In my Pytorch CIFAR with the net above is 65% at final , but in pysyft version it only 10% at the end(even when worker set to 1) . Why will this happen?

Thanks for reply ! very appreciate!

johnnylin110 on 31 Mar 2020

Was this page helpful?

0 / 5 - 0 ratings