Pytorch_geometric: Support for heterogeneous graphs

Created on 31 Mar 2020 · 6 Comments · Source: rusty1s/pytorch_geometric

❓ Questions & Help

I'm new to GNNs and was wondering if PyG has any support for modelling heterogeneous graphs. I've seen HinSAGE and metapath2vec in the StellarGraph library, but I am a PyTorch diva! Any plans to integrate these? Or could you describe how I can use existing building blocks to create them?

Also, is there support for the data loading process? DGL has heterogeneous graphs written on top of networkx. Is there an easy way to port these data structures over into PyG?

Any feedback helps!

nabsabraham

Most helpful comment

To the best of my knowledge, pytorch-geometric does not have any specific prebuilt ways of dealing with heterogeneous graphs.

That being said, it's still fairly easy to implement, and I've done so previously on a power grid dataset. I'll use a simplified form of this as an example.

From memory, what you have to do is:

Create a corresponding Data object
- This simply means not just using Data's x, edge_index, and edge_attr attributes, but adding whatever attributes you need.
- These will be batched automatically if you define the corresponding increments in your Data object. For example,
  
      class PowerData(Data):
          def __inc__(self, key, value):
              increasing_funcs = ["load_to_bus", "branch_index", "gen_to_bus"]
              if key in increasing_funcs:
                  return len(self["bus"])
              else:
                  return 0
  
  ensures that the keys in increasing_funcs are treated just like edge_index, i.e. their indices are shifted on batching. For example, two branch_index tensors (my equivalent of edge_index), each equal to [[0, 1], [1, 0]], will be batched automatically to [[0, 1, 2, 3], [1, 0, 3, 2]]. (For me, this was fairly simple because load_to_bus and gen_to_bus were one-dimensional. You might have to add different values to different dimensions; maybe something like return np.array([len(self["gen"]), len(self["bus"])]).)
- My Data objects had bus, generator and load features, and bus_to_bus, gen_to_bus, and load_to_bus edges (with the same format as edge_index). Additionally, there were branch features (related to the bus_to_bus edges).
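
To make the batching rule concrete, here is a minimal pure-Python sketch of what incrementing index attributes amounts to. It does not call PyG: `batch_indices`, the dictionary layout, and the keys `branch_src`/`branch_dst` are hypothetical names used only for illustration, and PyG's actual collation code is more general.

```python
def batch_indices(graphs, index_keys):
    """Concatenate index lists from several graphs, shifting each graph's
    indices by the number of buses in all graphs that came before it.

    graphs: list of dicts with a "num_bus" entry and one list per index key.
    index_keys: names of the attributes that hold bus indices.
    """
    batched = {key: [] for key in index_keys}
    offset = 0  # running total of buses seen so far
    for graph in graphs:
        for key in index_keys:
            batched[key].extend(i + offset for i in graph[key])
        # Plays the role of __inc__: every index key grows by the bus count.
        offset += graph["num_bus"]
    return batched


# Two identical graphs with two buses each (hypothetical toy data).
g = {"num_bus": 2, "branch_src": [0, 1], "branch_dst": [1, 0], "gen_to_bus": [1]}
batched = batch_indices([g, g], ["branch_src", "branch_dst", "gen_to_bus"])
print(batched["branch_src"])  # [0, 1, 2, 3]
print(batched["branch_dst"])  # [1, 0, 3, 2]
print(batched["gen_to_bus"])  # [1, 3]
```

The two branch rows reproduce the [0, 1, 2, 3], [1, 0, 3, 2] result quoted above, and the one-dimensional gen_to_bus list is shifted by the same bus count.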

Create a model to deal with heterogeneous graphs
- This might be as simple as having one encoding model for each node type, then concatenating node features and edge indices, or might be as complicated as having different GNNs for each of the node interactions. For example, one layer that I evaluated looked as follows:

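    # Assumes `import torch` and `from torch_scatter import scatter_add`.
    # self.subnets maps names to modules; self._scatter_items is a helper not shown here.
    def forward(self, data):
        bus = data.bus
        # Update generators and loads from their own features plus those of the attached bus.
        gen = self.subnets["gen"](torch.cat([data.gen, bus[data.gen_to_bus]], dim=-1))
        load = self.subnets["load"](torch.cat([data.load, bus[data.load_to_bus]], dim=-1))
        # Update buses from their own features plus the generator and load features scattered onto them.
        bus = self.subnets["bus"](torch.cat([
            bus,
            self._scatter_items(gen, data.gen_to_bus, bus.shape[0]),
            self._scatter_items(load, data.load_to_bus, bus.shape[0]),
        ], dim=-1))
        src, dest = data.branch_index
        # Update branch features from both endpoint buses.
        branch = self.subnets["branch"](torch.cat([bus[src], data.branch_attr, bus[dest]], dim=-1))
        # One message per branch, built from its destination bus and summed at its source bus.
        bus_neighbours = self.subnets["bus_and_branch"](torch.cat([bus[dest], data.branch_attr], dim=-1))
        bus_neighbours = scatter_add(bus_neighbours, src, dim=0, dim_size=bus.shape[0])
        bus = self.subnets["bus_and_neighbours"](torch.cat([bus_neighbours, bus], dim=-1))
        # Write the updated features back onto the data object.
        data.bus = bus
        data.gen = gen
        data.load = load
        data.branch_attr = branch
        return data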
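
The neighbour aggregation near the end of this layer relies on a scatter-add: each per-branch message is added to the row of its source bus. A minimal pure-Python sketch of that operation, independent of `torch_scatter` (the helper name `scatter_add_rows` and the toy numbers are made up for this illustration):

```python
def scatter_add_rows(messages, index, dim_size):
    """Sum the rows of `messages` into `dim_size` output rows.

    Row k of `messages` is added to output row index[k]; rows that receive
    no message stay at zero.
    """
    width = len(messages[0]) if messages else 0
    out = [[0.0] * width for _ in range(dim_size)]
    for row, target in zip(messages, index):
        for j, value in enumerate(row):
            out[target][j] += value
    return out


# Three branches sending 2-dimensional messages to buses 0, 0 and 2.
messages = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
src = [0, 0, 2]
aggregated = scatter_add_rows(messages, src, dim_size=4)
print(aggregated)  # [[4.0, 6.0], [0.0, 0.0], [5.0, 6.0], [0.0, 0.0]]
```

Passing `dim_size` explicitly matters for the same reason as in the layer above: buses without any outgoing branch still need a (zero) row so the result can be concatenated with the bus features.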

fgerzer on 1 Apr 2020

All 6 comments

Hi @fdiehl, thanks so much for your detailed response! It's taken me some time to familiarize myself with PyG, so I hope you don't mind this delayed follow-up:
In your PowerData object, I get that you had multiple types of edges, but did you have more than one type of node? I have a graph with two types of nodes, Person nodes and Address nodes, and I'm wondering how to connect the Persons to their Addresses. I've modified a homogeneous dataset creation example a bit, but I'm stuck on how to give the Person nodes and the Address nodes their own features. What would I change in the Data class? Any help appreciated!

```
import torch
from torch_geometric.data import Data, InMemoryDataset
from tqdm import tqdm


class Custom(InMemoryDataset):
    def __init__(self, transform=None, pre_transform=None):
        super(Custom, self).__init__(transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return []

    @property
    def processed_file_names(self):
        return ['/Users/nababraham/Desktop/Github/data_50000.dataset']

    def download(self):
        pass

    def process(self):
        data_list = []

        grouped = sub_df.groupby('PostalCode')
        for pcode, group in tqdm(grouped):

            group = group.reset_index(drop=True)
            group['pcode'] = pcode
            feat_cols = ['StreetName', 'SteetNumber']

            person_node_feat = group.loc[group.pcode == pcode, feat_cols].values
            address_node_feat = address_feat_dict[group['PostalCode'].values[0]]

            node_features = torch.LongTensor(person_node_feat).float().unsqueeze(1)
            source_nodes = person_dict[group.Person.values[0]]
            target_nodes = address_dict[group.PostalCode.values[1:]]
            edge_index = torch.tensor([source_nodes, target_nodes], dtype=torch.long)
            x = node_features

            #### Need to create data object with Person-LIVES_AT->Address and each node has their own features
            # data = Data(x=x, edge_index=edge_index)
            # data_list.append(data)

        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])
```

nabsabraham on 8 Apr 2020

Let me write this as a Data object, since this makes it easier for me to wrap my head around it.

If I understood you correctly, you will have a bipartite graph mapping persons to addresses, and each person and address has features (which differ in kind between the two types, obviously). I'm not sure whether a person might live at multiple addresses. If each person had exactly one address, you could somewhat simplify the code; however, I didn't want to make that assumption.

That's something of a special case, then - from what I gather, pytorch_geometric always assumes that all entries in a key will grow the same (@rusty1s correct me if I'm wrong). That's decidedly not the case here: Stacking a data object onto a batch with 50 addresses and 100 persons should increase the first row of the edge index by 100, and the second row by 50. I don't think there's a way to do that automatically.

However!

What you can do is cheat, and create two indices. Consider the following data object:

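    class AddressData(Data):
        # Defaults of None let the class also be constructed without arguments.
        def __init__(self, person_features=None, address_features=None, person_to_address=None):
            super().__init__()
            self.person_features = person_features
            self.address_features = address_features
            if person_to_address is not None:
                # Store the two rows separately so each can get its own increment.
                self.edge_idx_person = person_to_address[0]
                self.edge_idx_address = person_to_address[1]

        # *args/**kwargs absorb extra arguments that newer PyG versions pass.
        def __inc__(self, key, value, *args, **kwargs):
            if key == "edge_idx_address":
                return len(self.address_features)
            elif key == "edge_idx_person":
                return len(self.person_features)
            return 0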

This will store features for persons and addresses, and will store the edge index in a disassembled form. If you have a data object (actually a Batch), the mapping from person to address can be retrieved by simply stacking the corresponding indices:

    person_to_address = torch.stack([data.edge_idx_person, data.edge_idx_address])

You need to be careful that you don't change the ordering of the two edge_index attributes, but I can't think of any way that might happen accidentally.
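
To illustrate why splitting the edge index helps, here is a small pure-Python sketch of the batching arithmetic with a separate offset per row. It does not use PyG; `batch_bipartite`, the dictionary keys `num_persons`/`num_addresses`, and the toy graphs are assumptions made up for this example.

```python
def batch_bipartite(graphs):
    """Batch person->address edges, shifting each row by its own node count."""
    edge_idx_person, edge_idx_address = [], []
    person_offset = address_offset = 0
    for g in graphs:
        edge_idx_person += [p + person_offset for p in g["edge_idx_person"]]
        edge_idx_address += [a + address_offset for a in g["edge_idx_address"]]
        # Counterpart of the two branches in __inc__: each row grows by the
        # number of nodes of the type it points at.
        person_offset += g["num_persons"]
        address_offset += g["num_addresses"]
    return edge_idx_person, edge_idx_address


# Graph 1: 3 persons, 2 addresses. Graph 2: 2 persons, 1 address.
g1 = {"num_persons": 3, "num_addresses": 2,
      "edge_idx_person": [0, 1, 2], "edge_idx_address": [0, 0, 1]}
g2 = {"num_persons": 2, "num_addresses": 1,
      "edge_idx_person": [0, 1], "edge_idx_address": [0, 0]}
persons, addresses = batch_bipartite([g1, g2])
print(persons)    # [0, 1, 2, 3, 4]
print(addresses)  # [0, 0, 1, 2, 2]
```

The second graph's person indices are shifted by 3 while its address indices are shifted by 2, which a single shared increment could not express.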

Bonus: How do you now go and create a graph from that?

Assume that you have a person_encoder and an address_encoder, which map those two feature sets to the same feature space (e.g. a 64-dimensional vector). Then you can encode them individually, concatenate the results, and glue the edge index back together:

    def forward(self, data):
        person_encoded = self.person_encoder(data.person_features)
        address_encoded = self.address_encoder(data.address_features)
        # Node matrix: all persons first, then all addresses.
        x = torch.cat([person_encoded, address_encoded], dim=0)
        person_to_address = torch.stack([data.edge_idx_person, data.edge_idx_address])
        # correct for the offset address got when concatenating the corresponding nodes
        person_to_address[1] = person_to_address[1] + len(person_encoded)
        edge_index = person_to_address
        return x, edge_index

Depending on your task, you might also have to build up a batch attribute.

(Note: None of the above has been tested, so there might be some errors in there)
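
As a rough illustration of the index bookkeeping in the forward pass, including one possible reading of the "batch attribute" remark, here is a pure-Python sketch. It assumes the batch attribute means a per-node graph assignment in the style of PyG's `batch` vector; `merge_node_types`, `batch_vector`, and the `node_type` list are hypothetical additions that do not appear in the code above.

```python
def merge_node_types(num_persons, num_addresses, edge_idx_person, edge_idx_address):
    """Build one edge index over the concatenated [persons, addresses] node list."""
    # Persons keep their indices; addresses come after all persons.
    shifted_addresses = [a + num_persons for a in edge_idx_address]
    edge_index = [list(edge_idx_person), shifted_addresses]
    # 0 marks a person node, 1 an address node, in concatenation order.
    node_type = [0] * num_persons + [1] * num_addresses
    return edge_index, node_type


def batch_vector(nodes_per_graph):
    """Give every node the position of the graph it came from."""
    return [g for g, n in enumerate(nodes_per_graph) for _ in range(n)]


# A batch of two graphs: 3 + 2 persons and 2 + 1 addresses.
edge_index, node_type = merge_node_types(
    5, 3, [0, 1, 2, 3, 4], [0, 0, 1, 2, 2])
print(edge_index)  # [[0, 1, 2, 3, 4], [5, 5, 6, 7, 7]]
print(node_type)   # [0, 0, 0, 0, 0, 1, 1, 1]

# Graph assignment per node, in the same order as the concatenated x.
batch = batch_vector([3, 2]) + batch_vector([2, 1])
print(batch)       # [0, 0, 0, 1, 1, 0, 0, 1]
```

Because persons and addresses are concatenated by type rather than by graph, the resulting batch vector is not sorted, which is worth keeping in mind if a pooling operation expects sorted graph assignments.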

fgerzer on 16 Apr 2020

Thanks @fgerzer for this really nice example! @rusty1s, if you can offer some advice I'd be very grateful. I'm working on a link prediction task for a heterogeneous graph (h and v being the two node types) where the feature vectors for h nodes are pretty huge (~2e5 features for each of 40000 nodes), while those of the v nodes are much smaller (3e2 features for each of 250 nodes). There are 3 types of edges (h2h, v2v, v2h), and the idea is to convolve v and h separately (using v2v and h2h respectively), then concatenate both outputs for a joint convolution; finally, we use the output for v2h link prediction. Something like this. I can set thresholds to control the number of edges, which is ~5e6 as of now.

The model works occasionally on CPU with memory 223G (I think it uses half or a third, not sure). I have access to GPUs but I can't seem to get past memory issues on a setup of 4 x 16G V100 GPUs (unshared). Pre-loading the entire set of feature vectors to be converted into a pyg dataset/list with a single large data object is proving challenging on different machines. I suspect it is because of pytorch's memory allocation, so I tried declaring a large empty tensor and just adding in data instead of using torch.cat since that allocates new memory each time, and a bunch of other things which unfortunately didn't seem to make a difference. Since this is just a subset of our full data, scalability is proving a major roadblock so I'm asking for your help in finding alternative solutions.

I carefully read through the PyG advanced mini-batching examples, but because I have one large graph, node h_i can be connected to v_j 'across batches', which is why I thought batching isn't an option. Or am I missing something simple that would allow me to split a large graph into smaller batches while maintaining the 'global' edge structure (across batches) for learning?

If I am not missing anything, the issue is that I cannot load all the feature vectors at once, since that creates a huge tensor (~40G). Is there a way to deal with data this large, or do I necessarily need to reduce dimensionality in order to work with such a graph? I also noticed you built data loaders for OGB and was wondering if there's specific code I should look at to resolve this type of issue.

SwapneelM on 4 Sep 2020

It seems like your major bottleneck is the feature vector dimensionality, not the number of nodes. You should be able to process this graph in a full-batch fashion with ease (given a lower feature dimensionality), but your node feature matrix should be consuming about 32GB alone.
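
The 32GB figure follows from the numbers given in the question. A quick back-of-the-envelope check, assuming dense float32 storage (4 bytes per value, which is an assumption about the setup rather than something stated in the thread):

```python
num_h_nodes = 40_000
h_feature_dim = 200_000  # ~2e5 features per h node
bytes_per_value = 4      # assumption: dense float32 features

h_bytes = num_h_nodes * h_feature_dim * bytes_per_value
print(h_bytes / 1e9)  # 32.0 (GB) for the h node features alone

v_bytes = 250 * 300 * bytes_per_value
print(v_bytes / 1e6)  # 0.3 (MB) for the v node features
```

The node count contributes only a factor of 4e4; it is the 2e5-wide feature dimension that pushes the matrix past the 16G of a single V100.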

In general, I do not think that using such a large input feature dimensionality is a good idea, and you may want to look into how you can compress them into a lower dimensionality before inputting them into a GNN, e.g., via autoencoders or autodecoders.

rusty1s on 4 Sep 2020

@rusty1s sorry for my delayed response, and thanks for replying so quickly. Yeah, that's basically the issue. I was able to train a 4-layer GCN for a label-prediction task using these high-dimensional features, and I think I can use it (minus the last layer) as an encoder to reduce the input feature dimensionality. I was just wondering whether I am missing an obvious solution that's already present in the pytorch-geometric framework. From your response, I assume there isn't one and dimensionality reduction is the way to go.

SwapneelM on 9 Sep 2020
