Hello,
I'm interested in representing a network with multiple node types, each with its own set of features, where the classification task is performed on only one type of node. Are there any examples of similar use cases, or suggestions on how I can proceed?
Can I create a mask and define the node targets to include only the type of node I would like to classify? And how would I incorporate a different number of features for each node type, given that the node feature matrix has shape [num_nodes, num_node_features]?
Thank you for your help!
This is a tricky problem. A general idea might be to have multiple bipartite graphs, one for each pair of linked node types, and to send messages along those edges from one node type to another. This way you do not need a unified embedding space, but your message propagation is inherently sequential.
Another approach might be to interpret your heterogeneous graph as a single big graph, where each edge type denotes the pair of node types it connects. To account for different node feature dimensions, you could separately pre-transform them into the same embedding space using some MLPs.
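A rough illustration of the single-graph idea, assuming a node type tensor of shape [num_nodes] and edges stored in COO format. The toy graph and the variable names are made up for this sketch. Each edge gets an integer type derived from the types of its two endpoints:

```python
import torch

# Hypothetical toy graph: 5 nodes, type 0 (e.g. paper) and type 1 (e.g. author).
node_type = torch.tensor([0, 0, 0, 1, 1])
num_node_types = 2

# Edges in COO format: row 0 holds source nodes, row 1 holds target nodes.
edge_index = torch.tensor([[0, 3, 4, 1, 3],
                           [1, 0, 2, 4, 4]])

src, dst = edge_index
# Encode each (source type, target type) pair as a single integer edge type.
edge_type = node_type[src] * num_node_types + node_type[dst]
print(edge_type.tolist())
```

With two node types this gives at most four edge types. The resulting edge_type tensor has one entry per edge, which is the kind of input a relational operator such as RGCN expects alongside edge_index.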
Yes, my current approach is to model the data as a single graph with multiple node and edge types. I will explore the method of pre-transforming node feature vectors into the same embedding space, thanks for the suggestion.
Is there a way to explicitly declare different node types rather than using edge types? My understanding is that R-GCN, SAGE and GAT should all be able to support multiple node and edge types? But I'm not sure if I can set up the model so that I only perform semi-supervised classification on one type of node.
You can just add a tensor of shape [num_nodes] to your data holding the type of the node. For semi-supervised classification, just compute the loss based on a specific node type:
out = model(...)
loss = F.nll_loss(out[data.node_type == x], data.y[data.node_type == x])
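A slightly fuller sketch of the same idea, using plain tensors in place of a real model and data object. The training mask, the toy labels, and the random "model output" are assumptions made only for illustration. The point is that nodes of other types contribute nothing to the loss or its gradient:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_nodes, num_classes = 6, 3
# Hypothetical node types: 0 is the type to classify, 1 is any other type.
node_type = torch.tensor([0, 1, 0, 1, 0, 0])
# Labels only matter for type-0 nodes; the other entries are placeholders.
y = torch.tensor([2, 0, 1, 0, 0, 1])
# Hypothetical training mask over all nodes.
train_mask = torch.tensor([True, True, True, False, False, True])

# Stand-in for the model output: log-probabilities, [num_nodes, num_classes].
logits = torch.randn(num_nodes, num_classes, requires_grad=True)
out = F.log_softmax(logits, dim=-1)

# Keep only training nodes of the target type.
target_mask = (node_type == 0) & train_mask
loss = F.nll_loss(out[target_mask], y[target_mask])
loss.backward()
```

Here three nodes enter the loss. The rows of logits.grad that belong to the other nodes stay zero, although in a real GNN those nodes still influence the prediction through message passing.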
Thank you for the advice!
Just have a few more follow-up questions:
data.y[data.node_type == x] directly.
Sounds great, thank you!
Sorry just wanted to quickly follow up on unifying the embedding space for the different feature sets of each node type, do you have any specific recommendations on how to approach the problem?
Just separately transform your node features to the same dimensionality.
I might be understanding this incorrectly, but let me explain my data a bit further.
Each feature set corresponds to a different set of measurements/features. For example, node type A may represent a paper and have the feature set {category, published_year, abstract}, and node type B may represent an author and have the feature set {name, age}. Each feature set will contain a mix of categorical, text and continuous numerical data. If I simply reduce the dimensionality of node type A's feature set, then even if the feature set dimensions are unified, they would not correspond to the same features? Again, let me know if I'm misunderstanding this.
I was thinking I can concatenate the features so that node type A and B both have the feature set {category, published_year, abstract, name, age} where the irrelevant features for each type of node are given some type of placeholder value?
I do not think it is necessary to manually design your features to fit. For each node type, convert your features to a fixed dimensionality (e.g. by using an MLP, an LSTM or whatever). After that, you can use a GNN where node features of different node types are transformed separately (in close analogy to the RGCN) and aggregated.
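A minimal sketch of the per-type encoding step, assuming the raw features of each node type are already numeric (text or categorical fields would need their own embedding first). The class name PerTypeEncoder, the feature sizes, and the two-layer MLP are illustrative choices, not part of any library:

```python
import torch
import torch.nn as nn


class PerTypeEncoder(nn.Module):
    """Projects each node type's raw features to one shared hidden size."""

    def __init__(self, in_dims, hidden_dim):
        super().__init__()
        self.hidden_dim = hidden_dim
        # One small MLP per node type, each with its own input size.
        self.encoders = nn.ModuleList([
            nn.Sequential(nn.Linear(d, hidden_dim), nn.ReLU(),
                          nn.Linear(hidden_dim, hidden_dim))
            for d in in_dims
        ])

    def forward(self, xs, node_type):
        # xs[t] holds the features of all type-t nodes, in the order in
        # which those nodes appear in node_type.
        out = xs[0].new_zeros(node_type.size(0), self.hidden_dim)
        for t, (encoder, x_t) in enumerate(zip(self.encoders, xs)):
            out[node_type == t] = encoder(x_t)
        return out


node_type = torch.tensor([0, 1, 0, 1, 1])
x_paper = torch.randn(2, 16)   # hypothetical: 2 papers, 16 raw features
x_author = torch.randn(3, 4)   # hypothetical: 3 authors, 4 raw features

encoder = PerTypeEncoder(in_dims=[16, 4], hidden_dim=8)
h = encoder([x_paper, x_author], node_type)
print(h.shape)
```

The output h has shape [num_nodes, hidden_dim] and can be passed to a GNN as an ordinary node feature matrix. The encoders are trained end to end with the rest of the model, so the shared space is learned rather than designed by hand.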
I see, that makes sense, thank you!
I believe GAT, SAGE or RGCN may work to process multiple node types with different feature sets? But I haven't seen any examples of these models taking in multiple node types (nor RGCN with node features) in this library. Would I need to extend the implementation of these models to work for a dataset with multiple node types (other than adding a node type tensor to my data)? Or do you have any suggestions for other GNNs that support this functionality out of the box?
You can either define your own operator or use the RGCN operator. You will definitely want to process neighboring nodes of different types differently, and operators like GAT, SAGE or GCN do not support this feature natively. IMO something like this (untested):
def forward(self, x, edge_index, node_type):
    return self.propagate(edge_index, x=x, node_type=node_type)

def message(self, x_j, node_type_j):
    # Select the correct weight matrix per edge (like in the RGCN operator).
    # self.weight is assumed to have shape [num_node_types, in_channels, out_channels].
    weight_j = self.weight[node_type_j]
    # Batched product with one weight matrix per edge: [E, 1, in] x [E, in, out]
    return torch.bmm(x_j.unsqueeze(1), weight_j).squeeze(1)

def update(self, aggr_out, x):
    # Integrate central node information, e.g.:
    return self.lin(torch.cat([x, aggr_out], dim=-1))
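For anyone who wants to try the idea without writing a full operator, here is a self-contained sketch in plain PyTorch. It assumes sum aggregation, one weight matrix per source node type, and inputs that already share one feature size. The class name TypedNeighborLayer and the toy graph are made up for the example:

```python
import torch
import torch.nn as nn


class TypedNeighborLayer(nn.Module):
    """Transforms each neighbor with a weight chosen by the neighbor's type."""

    def __init__(self, in_dim, out_dim, num_node_types):
        super().__init__()
        # One [in_dim, out_dim] matrix per node type.
        self.weight = nn.Parameter(
            torch.randn(num_node_types, in_dim, out_dim) * 0.1)
        self.lin = nn.Linear(in_dim + out_dim, out_dim)

    def forward(self, x, edge_index, node_type):
        src, dst = edge_index
        # Pick a weight matrix per edge according to the source node type.
        w = self.weight[node_type[src]]                      # [E, in, out]
        msg = torch.bmm(x[src].unsqueeze(1), w).squeeze(1)   # [E, out]
        # Sum incoming messages at each target node.
        aggr = msg.new_zeros(x.size(0), msg.size(1)).index_add(0, dst, msg)
        # Combine the central node's own features with the aggregate.
        return self.lin(torch.cat([x, aggr], dim=-1))


torch.manual_seed(0)
node_type = torch.tensor([0, 0, 0, 1, 1])
edge_index = torch.tensor([[0, 3, 4, 1, 3],
                           [1, 0, 2, 4, 4]])
x = torch.randn(5, 8)

layer = TypedNeighborLayer(in_dim=8, out_dim=4, num_node_types=2)
out = layer(x, edge_index, node_type)
print(out.shape)
```

Node 3 has no incoming edges in this toy graph, so its output depends only on its own features. Swapping the sum for a mean, or indexing the weights by edge type instead of node type, are straightforward variations.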
Got it, just to confirm my assumptions about the implemented operators (for the node classification task):
GAT & SAGE support node features but do not support multi-node or multi-edge types natively. They perform inductive classification (generalize to unseen graph structures).
RGCN supports multi-edge types and node features but does not support multi-node types natively and performs transductive classification (trains and classifies on fixed graph).
I will probably end up defining my own operator to support both multi-node and multi-edge functionalities, thank you for all your help.