Pytorch_geometric: Heterogeneous Information Network Classification

Created on 5 Aug 2019 · 13 Comments · Source: rusty1s/pytorch_geometric

Hello,

I'm interested in representing a network with multiple node types, each with its own set of features, where the classification task is performed on only one type of node. Are there any examples of similar use cases, or suggestions on how I can proceed?

Can I create a mask and define the node targets to include only the type of node I would like to classify? And how would I incorporate different numbers of features for different node types, given that the node feature matrix has shape [num_nodes, num_node_features]?

Thank you for your help!

juliecwang

All 13 comments

This is a tricky problem. One general idea is to use multiple bipartite graphs that link your different node types and send messages along those edges from one node type to another. This way you do not need a unified embedding space, but your message propagation is inherently sequential.
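A minimal sketch of the bipartite idea in plain PyTorch, assuming two hypothetical node types (authors with 2 features, papers with 5) and a single author-to-paper edge set. All names, sizes and the mean aggregation are illustrative assumptions, not part of the library:

```python
import torch

torch.manual_seed(0)

# Hypothetical toy data: 3 authors (2 features) and 4 papers (5 features).
x_author = torch.randn(3, 2)
x_paper = torch.randn(4, 5)

# Bipartite edges author -> paper: row 0 indexes authors, row 1 indexes papers.
edge_index = torch.tensor([[0, 1, 1, 2],
                           [0, 0, 2, 3]])

hidden = 8
lin_msg = torch.nn.Linear(2, hidden)   # turns author features into messages
lin_self = torch.nn.Linear(5, hidden)  # transforms the papers' own features

src, dst = edge_index
messages = lin_msg(x_author[src])      # [num_edges, hidden]

# Sum the incoming messages per paper, then divide by the in-degree (mean).
agg = torch.zeros(x_paper.size(0), hidden).index_add_(0, dst, messages)
deg = torch.zeros(x_paper.size(0)).index_add_(0, dst, torch.ones(dst.size(0)))
agg = agg / deg.clamp(min=1).unsqueeze(-1)

# Papers combine their own features with what they received from authors.
h_paper = torch.relu(lin_self(x_paper) + agg)  # [4, hidden]
```

Note that the author and paper features never share an input space here: each linear layer only sees one node type. A second bipartite step (for example paper to author) would have to run after this one, which is the sequential aspect mentioned above.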

Another approach is to interpret your heterogeneous graph as a single big graph, where each edge type denotes the pair of node types it connects. To account for different node feature dimensions, you could separately pre-transform them into the same embedding space using some MLPs.
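The pre-transformation step could look roughly like the following sketch. It assumes two hypothetical node types with 5 and 2 input features and a `node_type` vector of shape [num_nodes]; the dictionary `x_by_type` and the layer sizes are made up for illustration:

```python
import torch

torch.manual_seed(0)

# Hypothetical setup: nodes 0-2 are type 0 (5 features), nodes 3-4 are type 1 (2 features).
node_type = torch.tensor([0, 0, 0, 1, 1])
x_by_type = {0: torch.randn(3, 5), 1: torch.randn(2, 2)}

hidden = 16
# One small MLP per node type; each maps its own input size to the shared size.
encoders = torch.nn.ModuleList([
    torch.nn.Sequential(torch.nn.Linear(5, hidden), torch.nn.ReLU(),
                        torch.nn.Linear(hidden, hidden)),
    torch.nn.Sequential(torch.nn.Linear(2, hidden), torch.nn.ReLU(),
                        torch.nn.Linear(hidden, hidden)),
])

# Write each type's embeddings into a single [num_nodes, hidden] matrix.
x = torch.zeros(node_type.size(0), hidden)
for t, encoder in enumerate(encoders):
    x[node_type == t] = encoder(x_by_type[t])
```

The resulting `x` has the usual [num_nodes, num_node_features] shape, so it can be passed to a regular GNN layer, and gradients still flow back into both encoders.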

rusty1s on 5 Aug 2019

Yes, my current approach is to model the data as a single graph with multiple node and edge types. I will explore the method of pre-transforming node feature vectors into the same embedding space, thanks for the suggestion.

Is there a way to explicitly declare different node types rather than using edge types? My understanding is that R-GCN, SAGE and GAT should all be able to support multiple node and edge types. However, I'm not sure whether I can set up the model so that semi-supervised classification is performed on only one type of node.

juliecwang on 5 Aug 2019

You can just add a tensor of shape [num_nodes] to your data holding the type of the node. For semi-supervised classification, just compute the loss based on a specific node type:

import torch.nn.functional as F
out = model(...)
mask = data.node_type == x  # x: index of the node type to classify
loss = F.nll_loss(out[mask], data.y[mask])

rusty1s on 6 Aug 2019

Thank you for the advice!
Just have a few more follow-up questions:

If I include multiple node types in the data, each with a different set of features, and compute the loss only on a specific node type, would the model still be able to take advantage of the heterogeneous network structure (i.e. information from the other nodes as well)?
For my data, only the nodes of a specific type have labels. In that case, can I simply use a placeholder for the labels of the other node types?

juliecwang on 6 Aug 2019

Yes sure, that is the point of the semi-supervised setup. Information from other node types will get propagated to your final node type, and hence the model will also train the weights of all other node types.
Yes, you can do this. Alternatively, you can just save data.y[data.node_type == x] directly.
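To make the two labeling options concrete, here is a small sketch on made-up data. The random `out` tensor stands in for the model output, and the placeholder values are assumptions chosen for illustration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_classes = 3
node_type = torch.tensor([0, 1, 0, 1, 0])  # type 0 is the labeled type
mask = node_type == 0

# Stand-in for the model output: log-probabilities for every node.
out = F.log_softmax(torch.randn(5, num_classes), dim=-1)

# Option 1: full-length labels with a placeholder (-1) for unlabeled nodes.
# The mask removes the placeholders before the loss sees them.
y_full = torch.tensor([2, -1, 0, -1, 1])
loss_full = F.nll_loss(out[mask], y_full[mask])

# Option 2: store only the labels of the typed nodes, in node order.
y_compact = torch.tensor([2, 0, 1])
loss_compact = F.nll_loss(out[mask], y_compact)

# Option 3: use -100 as the placeholder, which is the default
# ignore_index of F.nll_loss, so no mask is needed.
y_ignore = torch.tensor([2, -100, 0, -100, 1])
loss_ignore = F.nll_loss(out, y_ignore)
```

All three losses are the same value, so the choice is mostly a matter of bookkeeping. Option 2 relies on the compact labels being stored in the same order as the masked nodes.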

rusty1s on 6 Aug 2019

Sounds great, thank you!

juliecwang on 6 Aug 2019

Sorry, I just wanted to quickly follow up on unifying the embedding space for the different feature sets of each node type. Do you have any specific recommendations on how to approach the problem?

juliecwang on 6 Aug 2019

Just separately transform your node features to the same dimensionality.

rusty1s on 6 Aug 2019

I might be understanding this incorrectly, but let me explain my data a bit further.

Each feature set corresponds to a different set of measurements/features. For example, node type A may represent a paper and have the feature set {category, published_year, abstract}, and node type B may represent an author and have the feature set {name, age}. Each feature set will contain a mix of categorical, text and continuous numerical data. If I simply reduce the dimensionality of node type A's feature set, then even if the feature set dimensions are unified, they would not correspond to the same features? Again, let me know if I'm misunderstanding this.

I was thinking I can concatenate the features so that node type A and B both have the feature set {category, published_year, abstract, name, age} where the irrelevant features for each type of node are given some type of placeholder value?

juliecwang on 6 Aug 2019

I do not think it is necessary to manually design your features to fit. For each node type, convert your features to a fixed dimensionality (e.g. by using an MLP, an LSTM or whatever). After that, you can use a GNN where node features of different node types are transformed separately (in close analogy to the RGCN) and aggregated.
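As a rough sketch of such per-type encoders for the paper/author example from the question: the classes `PaperEncoder` and `AuthorEncoder` are hypothetical, the numeric inputs are assumed to be rescaled beforehand, and the text fields (abstract, name) are left out for brevity, although an LSTM or similar could be added in the same way:

```python
import torch

torch.manual_seed(0)
hidden = 8

class PaperEncoder(torch.nn.Module):
    """Hypothetical encoder: categorical `category` plus numeric `published_year`."""
    def __init__(self, num_categories, hidden):
        super().__init__()
        self.category_emb = torch.nn.Embedding(num_categories, hidden)
        self.year_lin = torch.nn.Linear(1, hidden)

    def forward(self, category, year):
        return self.category_emb(category) + self.year_lin(year.unsqueeze(-1))

class AuthorEncoder(torch.nn.Module):
    """Hypothetical encoder: numeric `age` only."""
    def __init__(self, hidden):
        super().__init__()
        self.age_lin = torch.nn.Linear(1, hidden)

    def forward(self, age):
        return self.age_lin(age.unsqueeze(-1))

# Made-up inputs: 3 papers and 2 authors.
category = torch.tensor([0, 2, 1])
year = torch.tensor([0.1, 0.5, 0.9])  # assumed to be rescaled to [0, 1]
age = torch.tensor([0.3, 0.6])        # assumed to be rescaled to [0, 1]

h_paper = PaperEncoder(num_categories=3, hidden=hidden)(category, year)
h_author = AuthorEncoder(hidden)(age)

# Papers first, then authors: one feature matrix plus a node type vector.
x = torch.cat([h_paper, h_author], dim=0)  # [5, hidden]
node_type = torch.tensor([0, 0, 0, 1, 1])
```

With this layout no placeholder features are needed: each node type only encodes the fields it actually has, and the two outputs agree only in their dimensionality.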

rusty1s on 6 Aug 2019

I see, that makes sense, thank you!

I believe GAT, SAGE or RGCN may work to process multiple node types with different feature sets? But I haven't seen any examples in this library of these models taking in multiple node types (nor of RGCN with node features). Would I need to extend the implementation of these models to work for a dataset with multiple node types (beyond adding a node type tensor to my data)? Or do you have any suggestions for other GNNs that support this functionality out of the box?

juliecwang on 6 Aug 2019

You can either define your own operator or use the RGCN operator. You will definitely want to process neighboring nodes of different types differently, and operators like GAT, SAGE or GCN do not support this feature natively. IMO something like this (untested):

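import torch
from torch_geometric.nn import MessagePassing

class NodeTypeConv(MessagePassing):  # hypothetical name; __init__ is an assumed completion
    def __init__(self, in_channels, out_channels, num_node_types):
        super().__init__(node_dim=0)  # node_dim=0 so the 1-D node_type tensor can be indexed per edge
        self.weight = torch.nn.Parameter(0.1 * torch.randn(num_node_types, in_channels, out_channels))
        self.lin = torch.nn.Linear(in_channels + out_channels, out_channels)

    def forward(self, x, edge_index, node_type):
        return self.propagate(edge_index, x=x, node_type=node_type)

    def message(self, x_j, node_type_j):
        # Select the correct weight matrix (like in the RGCN operator)
        weight_j = self.weight[node_type_j]  # [num_edges, in_channels, out_channels]
        # Batched product so that each edge is multiplied by its own matrix
        return torch.bmm(x_j.unsqueeze(1), weight_j).squeeze(1)

    def update(self, aggr_out, x):
        # Integrate central node information, e.g.:
        return self.lin(torch.cat([x, aggr_out], dim=-1))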
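The same computation can be written out in plain PyTorch to check the shapes step by step. This is only a sketch on made-up data: the sizes, the sum aggregation and the random `weight` tensor are assumptions, and `torch.bmm` is used so that each edge is multiplied by the weight matrix of its source node's type:

```python
import torch

torch.manual_seed(0)

num_nodes, in_dim, out_dim, num_types = 5, 4, 6, 2
x = torch.randn(num_nodes, in_dim)
node_type = torch.tensor([0, 0, 1, 1, 0])
edge_index = torch.tensor([[0, 2, 3, 4],   # source nodes j
                           [1, 1, 0, 0]])  # target nodes i

weight = torch.randn(num_types, in_dim, out_dim)  # one matrix per node type
lin = torch.nn.Linear(in_dim + out_dim, out_dim)

src, dst = edge_index
x_j = x[src]                       # [num_edges, in_dim]
weight_j = weight[node_type[src]]  # [num_edges, in_dim, out_dim]

# Message: one matrix product per edge.
msg = torch.bmm(x_j.unsqueeze(1), weight_j).squeeze(1)  # [num_edges, out_dim]

# Aggregate: sum the messages arriving at each target node.
aggr_out = torch.zeros(num_nodes, out_dim).index_add_(0, dst, msg)

# Update: combine each node's own features with its aggregated messages.
out = lin(torch.cat([x, aggr_out], dim=-1))  # [num_nodes, out_dim]
```

A plain `torch.matmul` between a `[num_edges, in_dim]` tensor and a `[num_edges, in_dim, out_dim]` tensor would broadcast to `[num_edges, num_edges, out_dim]`, which is why the batched product is used here.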

rusty1s on 6 Aug 2019

Got it, just to confirm my assumptions about the implemented operators (for the node classification task):

GAT & SAGE support node features but do not support multiple node types or multiple edge types natively. They perform inductive classification (generalizing to unseen graph structures).
RGCN supports multiple edge types and node features but does not support multiple node types natively, and it performs transductive classification (training and classifying on a fixed graph).

I will probably end up defining my own operator to support both multi-node and multi-edge functionalities, thank you for all your help.

juliecwang on 6 Aug 2019

👍3
