I'm new to GNNs and was wondering if PyG has any support for modelling heterogeneous graphs. I've seen HinSAGE and metapath2vec in the StellarGraph library, but I am a PyTorch diva! Any plans to integrate these? Or could you describe how I can use existing blocks to create them?
Also, is there support for the dataloading process? DGL has heterogeneous graphs written on top of networkx. Is there any easy way to port these data structures over into PyG?
Any feedback helps!
To the best of my knowledge, pytorch-geometric does not have any specific prebuilt ways of dealing with heterogeneous graphs.
That being said, it's still fairly easy to implement, and I've done so previously on a power grid dataset. I'll use a simplified form of this as an example.
From memory, what you have to do is:
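- Create your own `Data` object by subclassing `Data`. You are not limited to `Data`'s `x`, `edge_index`, and `edge_attr` attributes; add whatever attributes you need.
- Override the `__inc__` method of your `Data` object. For example,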
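```python
from torch_geometric.data import Data


class PowerData(Data):
    # *args/**kwargs absorb the extra arguments newer PyG versions pass to __inc__.
    def __inc__(self, key, value, *args, **kwargs):
        # Keys that index into the bus nodes and must be shifted on batching.
        increasing_funcs = ["load_to_bus", "branch_index", "gen_to_bus"]
        if key in increasing_funcs:
            return len(self["bus"])
        else:
            return 0
```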
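This ensures that the keys in `increasing_funcs` are treated just like `edge_index`, i.e. their indices are incremented on batching. For example, two `branch_index` tensors (my equivalent to `edge_index`) of `[[0, 1], [1, 0]]` and `[[0, 1], [1, 0]]` will be automatically batched to `[[0, 1, 2, 3], [1, 0, 3, 2]]`. (For me, this was fairly simple because `load_to_bus` and `gen_to_bus` were one-dimensional. For you, you might have to add different values to different dimensions; maybe something like `return np.array([len(self["gen"]), len(self["bus"])])`.)

In the network itself, you then use these attributes directly. In my case, the `Data` objects had `bus`, `generator` and `load` features, and `bus_to_bus`, `gen_to_bus`, and `load_to_bus` edges (with the same format as `edge_index`). Additionally, there were `branch` features (related to the `bus_to_bus` edges).

```python
import torch
from torch_scatter import scatter_add


# Method of an nn.Module that keeps one sub-network per entity type in self.subnets.
# self._scatter_items is a helper of the original model and is not shown here.
def forward(self, data):
    bus = data.bus
    # Generators and loads see their own features plus those of the bus they attach to.
    gen = self.subnets["gen"](torch.cat([data.gen, bus[data.gen_to_bus]], dim=-1))
    load = self.subnets["load"](torch.cat([data.load, bus[data.load_to_bus]], dim=-1))
    # Buses combine their own features with the generators and loads attached to them.
    bus = self.subnets["bus"](torch.cat([
        bus,
        self._scatter_items(gen, data.gen_to_bus, bus.shape[0]),
        self._scatter_items(load, data.load_to_bus, bus.shape[0]),
    ], dim=-1))
    # Branches (bus-to-bus edges) combine both endpoint buses with the branch features.
    src, dest = data.branch_index
    branch = self.subnets["branch"](torch.cat([bus[src], data.branch_attr, bus[dest]], dim=-1))
    # Message passing: sum the messages of all neighbouring buses into each source bus.
    bus_neighbours = self.subnets["bus_and_branch"](torch.cat([bus[dest], data.branch_attr], dim=-1))
    bus_neighbours = scatter_add(bus_neighbours, src, dim=0, dim_size=bus.shape[0])
    bus = self.subnets["bus_and_neighbours"](torch.cat([bus_neighbours, bus], dim=-1))
    # Write the updated features back onto the data object.
    data.bus = bus
    data.gen = gen
    data.load = load
    data.branch_attr = branch
    return data
```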
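To make the batching behaviour of `__inc__` described above concrete, here is a minimal pure-Python sketch of the index shifting. It does not call PyG; `batch_indices` is a made-up helper that only mimics the idea that each graph's indices are shifted by the number of nodes in all graphs that came before it.

```python
def batch_indices(index_lists, num_nodes_per_graph):
    """Concatenate per-graph index lists, shifting each by the node count seen so far."""
    batched, offset = [], 0
    for indices, num_nodes in zip(index_lists, num_nodes_per_graph):
        batched.extend(i + offset for i in indices)
        # The per-graph increment plays the role of the value returned by __inc__.
        offset += num_nodes
    return batched


# Two graphs with two buses each; both rows of branch_index get the same shift.
src = batch_indices([[0, 1], [0, 1]], [2, 2])
dst = batch_indices([[1, 0], [1, 0]], [2, 2])
print(src, dst)  # [0, 1, 2, 3] [1, 0, 3, 2]
```

The second graph's indices are shifted by 2, the number of buses in the first graph, which reproduces the batched `branch_index` from the example in the text.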
Hi @fdiehl, thanks so much for your detailed response! It's taken me some time to familiarize myself with PyG so I hope you don't mind this delayed followup:
In your PowerData object, I get that you had multiple types of edges, but did you have more than one type of node? I have a graph with two types of nodes, Person and Address, and I'm wondering how to connect the Persons to their Addresses. I've modified a homogeneous dataset class a bit, but I'm stuck on how to give the Person and Address nodes their own features. What would I change in the Data class? Any help appreciated!
```python
import torch
from torch_geometric.data import Data, InMemoryDataset
from tqdm import tqdm

# sub_df, address_feat_dict, person_dict and address_dict are defined elsewhere.


class Custom(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        # The super() call must name this class (it was CustomIDM), and
        # InMemoryDataset expects root as its first argument.
        super(Custom, self).__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return []

    @property
    def processed_file_names(self):
        return ['/Users/nababraham/Desktop/Github/data_50000.dataset']

    def download(self):
        pass

    def process(self):
        data_list = []
        grouped = sub_df.groupby('PostalCode')
        for pcode, group in tqdm(grouped):
            group = group.reset_index(drop=True)
            group['pcode'] = pcode
            # A plain list of column names has no .values attribute.
            feat_cols = ['StreetName', 'SteetNumber']
            person_node_feat = group.loc[group.pcode == pcode, feat_cols].values
            address_node_feat = address_feat_dict[group['PostalCode'].values[0]]
            node_features = torch.LongTensor(person_node_feat).float().unsqueeze(1)
            source_nodes = person_dict[group.Person.values[0]]
            target_nodes = address_dict[group.PostalCode.values[1:]]
            edge_index = torch.tensor([source_nodes, target_nodes], dtype=torch.long)
            x = node_features
            #### Need to create data object with Person-LIVES_AT->Address and each node has their own features
            #data = Data(x=x, edge_index=edge_index)
            #data_list.append(data)
        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])
```
Let me write this as a Data object, since this makes it easier for me to wrap my head around it.
If I understood you correctly, you will have a bipartite graph mapping persons to addresses, and each person and address has features (which differ in kind between the two types, obviously). I'm not sure whether a person might live at multiple addresses. If not, you could somewhat simplify the code, since a one-dimensional person-to-address index would then suffice; however, I didn't want to make that assumption.
That's something of a special case, then: from what I gather, pytorch_geometric always assumes that all entries of a key are incremented by the same amount (@rusty1s correct me if I'm wrong). That's decidedly not the case here: stacking a data object onto a batch with 50 addresses and 100 persons should increase the first row of the edge index by 100, and the second row by 50. I don't think there's a way to do that automatically.
However!
What you can do is cheat, and create two indices. Consider the following data object:
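```python
from torch_geometric.data import Data


class AddressData(Data):
    # Defaults of None: PyG may instantiate the class without arguments when batching.
    def __init__(self, person_features=None, address_features=None, person_to_address=None):
        super().__init__()
        self.person_features = person_features
        self.address_features = address_features
        if person_to_address is not None:
            # Store the two rows separately so that each row gets its own offset.
            self.edge_idx_person = person_to_address[0]
            self.edge_idx_address = person_to_address[1]

    def __inc__(self, key, value, *args, **kwargs):
        if key == "edge_idx_address":
            return len(self.address_features)
        elif key == "edge_idx_person":
            return len(self.person_features)
        return 0
```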
This will store features for persons and addresses, and will store the edge index in a disassembled form. If you have a data object (actually a Batch), the mapping from person to address can be retrieved by simply stacking the corresponding indices:

```python
person_to_address = torch.stack([data.edge_idx_person, data.edge_idx_address])
```
You need to be careful that you don't change the ordering of the two edge_index attributes, but I can't think of any way that might happen accidentally.
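To see why two separate indices give the right offsets, here is a minimal sketch that mimics the batching by hand with plain `torch`. It does not call PyG; `batch_bipartite_edges` and the two toy graphs are made up for illustration, and the person and address counts stand in for `len(person_features)` and `len(address_features)`.

```python
import torch


def batch_bipartite_edges(graphs):
    """graphs: list of (num_persons, num_addresses, edge_idx_person, edge_idx_address)."""
    person_rows, address_rows = [], []
    person_offset, address_offset = 0, 0
    for num_persons, num_addresses, idx_person, idx_address in graphs:
        # Each row is shifted by the count of its own node type only.
        person_rows.append(idx_person + person_offset)
        address_rows.append(idx_address + address_offset)
        person_offset += num_persons
        address_offset += num_addresses
    return torch.stack([torch.cat(person_rows), torch.cat(address_rows)])


# Graph 1: 3 persons, 2 addresses. Graph 2: 2 persons, 1 address.
g1 = (3, 2, torch.tensor([0, 1, 2]), torch.tensor([0, 0, 1]))
g2 = (2, 1, torch.tensor([0, 1]), torch.tensor([0, 0]))
person_to_address = batch_bipartite_edges([g1, g2])
print(person_to_address.tolist())  # [[0, 1, 2, 3, 4], [0, 0, 1, 2, 2]]
```

The person row of the second graph is shifted by 3 and its address row by 2, which is the per-row behaviour a single shared increment cannot provide.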
Bonus: How do you now go and create a graph from that?
Assume that you have a person_encoder and an address_encoder, which map those two feature sets to the same feature space (e.g. a 64-dimensional vector). Then you can encode them individually, concatenate the results, and glue the two edge index rows together:
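```python
import torch


# Method of an nn.Module that owns self.person_encoder and self.address_encoder.
def forward(self, data):
    person_encoded = self.person_encoder(data.person_features)
    address_encoded = self.address_encoder(data.address_features)
    # Persons occupy the first rows of x, addresses the rows after them.
    x = torch.cat([person_encoded, address_encoded], dim=0)
    person_to_address = torch.stack([data.edge_idx_person, data.edge_idx_address])
    # Correct for the offset the address nodes got when concatenating the node features.
    person_to_address[1] = person_to_address[1] + len(person_encoded)
    edge_index = person_to_address
    # x and edge_index now describe a single homogeneous graph.
    return x, edge_index
```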
Depending on your task, you might also have to build up a batch attribute.
(Note: None of the above has been tested, so there might be some errors in there)
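As a rough illustration of the batch attribute mentioned above, the sketch below builds a graph-assignment vector for the merged node set, with all persons first and all addresses after them. It uses plain `torch` only; `merged_batch_vector` and the per-graph counts are hypothetical. In a real PyG pipeline the per-type vectors could probably come from the `follow_batch` argument of the `DataLoader` instead of being built by hand.

```python
import torch


def merged_batch_vector(persons_per_graph, addresses_per_graph):
    """Graph id for every node, in the order [all persons, all addresses]."""
    graph_ids = torch.arange(len(persons_per_graph))
    # Repeat each graph id once per node of that type in that graph.
    person_batch = torch.repeat_interleave(graph_ids, torch.tensor(persons_per_graph))
    address_batch = torch.repeat_interleave(graph_ids, torch.tensor(addresses_per_graph))
    return torch.cat([person_batch, address_batch])


# Two graphs: 3 persons + 2 addresses, then 2 persons + 1 address.
batch = merged_batch_vector([3, 2], [2, 1])
print(batch.tolist())  # [0, 0, 0, 1, 1, 0, 0, 1]
```

The ordering matches the `torch.cat([person_encoded, address_encoded], dim=0)` layout, so the vector can be handed to a pooling function that expects one graph id per row of `x`.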
Thanks @fgerzer for this really nice example! @rusty1s, if you can offer some advice I'd be very grateful. I'm working on a link prediction task for a heterogeneous graph (h and v being the two node types) where the feature vectors for h nodes are pretty huge (~2e5 for each of 40000 nodes), while those of the v nodes (3e2 each for 250 nodes) are not. There are 3 types of edges (h2h, v2v, v2h), and the idea is to convolve v and h separately (using v2v and h2h respectively) and then cat both outputs for a joint convolution; finally, we use the output for v2h link prediction. Something like this. I can set thresholds to control the number of edges, which are ~5e6 as of now.
The model works occasionally on CPU with memory 223G (I think it uses half or a third, not sure). I have access to GPUs but I can't seem to get past memory issues on a setup of 4 x 16G V100 GPUs (unshared). Pre-loading the entire set of feature vectors to be converted into a pyg dataset/list with a single large data object is proving challenging on different machines. I suspect it is because of pytorch's memory allocation, so I tried declaring a large empty tensor and just adding in data instead of using torch.cat since that allocates new memory each time, and a bunch of other things which unfortunately didn't seem to make a difference. Since this is just a subset of our full data, scalability is proving a major roadblock so I'm asking for your help in finding alternative solutions.
I carefully read through the PyG advanced mini-batching examples, but since I have one large graph, node h_i can be connected to v_j 'across batches', which is why I was thinking batching isn't an option. Or am I missing something simple that would allow me to split a large graph into smaller batches while maintaining the 'global' edge structure (across batches) for learning?
If I am not missing anything, then the issue is that I cannot load all the feature vectors at once, since that creates a huge tensor (~40G). Is there a way to deal with data this large, or do I necessarily need to reduce dimensionality in order to work with such a graph? I also noticed you built data loaders for OGB and was wondering if there's specific code I should look up to resolve this type of issue.
It seems like your major bottleneck is the feature vector dimensionality, not the number of nodes. You should be able to process this graph in a full-batch fashion with ease (given a lower feature dimensionality), but your node feature matrix should be consuming about 32GB alone.
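The 32GB figure can be checked with a quick back-of-the-envelope calculation. The sketch below assumes dense storage with 4 bytes per value (float32); `dense_feature_bytes` is a made-up helper, and the node counts and dimensionalities are the ones quoted earlier in the thread.

```python
def dense_feature_bytes(num_nodes, feature_dim, bytes_per_value=4):
    """Memory of a dense [num_nodes, feature_dim] matrix, float32 by default."""
    return num_nodes * feature_dim * bytes_per_value


h_bytes = dense_feature_bytes(40_000, 200_000)  # h nodes: ~2e5 features each
v_bytes = dense_feature_bytes(250, 300)         # v nodes: 3e2 features each
print(h_bytes / 1e9, "GB for h,", v_bytes / 1e6, "MB for v")  # 32.0 GB for h, 0.3 MB for v
```

The h features alone exceed the 16G of a single V100, while the v features are negligible, so the feature dimensionality of h dominates the memory budget.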
In general, I do not think that using such a large input feature dimensionality is a good idea, and you may want to look into how you can compress them into a lower dimensionality before inputting them into a GNN, e.g., via autoencoders or autodecoders.
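As one possible way to do such a compression, here is a minimal autoencoder sketch in plain PyTorch. Everything in it is an assumption for illustration: the `FeatureAutoencoder` name, the single-layer encoder and decoder, the 64-dimensional code, and the random stand-in features (1000-dimensional here, far smaller than the real ~2e5). With the real data, the features would be fed in mini-batches of rows rather than as one tensor.

```python
import torch
from torch import nn


class FeatureAutoencoder(nn.Module):
    def __init__(self, in_dim, code_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU())
        self.decoder = nn.Linear(code_dim, in_dim)

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code


torch.manual_seed(0)
features = torch.rand(32, 1000)  # stand-in for one mini-batch of node features
model = FeatureAutoencoder(in_dim=1000, code_dim=64)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(5):  # a few reconstruction steps, just to show the training loop
    optimizer.zero_grad()
    reconstruction, _ = model(features)
    loss = nn.functional.mse_loss(reconstruction, features)
    loss.backward()
    optimizer.step()

# After training, only the encoder is needed to produce the GNN input.
with torch.no_grad():
    compressed = model.encoder(features)
print(compressed.shape)  # torch.Size([32, 64])
```

The compressed matrix has one 64-dimensional row per node and can replace the original features as the GNN input.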
@rusty1s sorry for my delayed response, and thanks for replying so quickly. Yeah, that's basically the issue. I was able to train a 4-layer GCN for a label-prediction task using these high-dimensional features, and I think I can use it (minus the last layer) as an encoder to reduce the input feature dimensionality. I was just wondering if I am missing any obvious solution that's already present in the pytorch-geometric framework. From your response, I assume there isn't, and dimensionality reduction is the way to go.