H2GB.datasets.OAGDataset

class OAGDataset(root: str, name: str, transform: Optional[Callable] = None, pre_transform: Optional[Callable] = None)

Bases: InMemoryDataset

A collection of heterogeneous graph benchmark datasets built from subsets of the Open Academic Graph (OAG), introduced in the paper “OAG: Toward Linking Large-scale Heterogeneous Entity Graphs”. Each dataset contains papers from one of three subject domains: computer science (oag-cs), engineering (oag-eng), and chemistry (oag-chem). The datasets contain four types of entities: papers, authors, institutions, and fields of study. Each paper is associated with a 768-dimensional feature vector generated by applying a pre-trained XLNet to the paper title. The representation of each word in the title is weighted by that word’s attention to obtain the title representation of the paper. Each paper node is labeled with its publication venue (journal or conference).

Papers published up to 2016 form the training set, papers published in 2017 form the validation set, and papers published in 2018 and 2019 form the test set. The publication year of each paper is also included in these datasets.
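The year-based split described above can be illustrated with a small sketch. This is not the library's implementation; the helper name `split_by_year` and the plain list of years are assumptions made for illustration only.

```python
def split_by_year(years):
    """Return (train, valid, test) lists of paper indices.

    `years` is a hypothetical list of publication years, one per paper.
    """
    train, valid, test = [], [], []
    for idx, year in enumerate(years):
        if year <= 2016:
            train.append(idx)      # published up to 2016
        elif year == 2017:
            valid.append(idx)      # published in 2017
        elif year in (2018, 2019):
            test.append(idx)       # published in 2018 or 2019
        # Papers from any other year are left unassigned in this sketch.
    return train, valid, test


years = [2010, 2016, 2017, 2018, 2019, 2015]
train_idx, valid_idx, test_idx = split_by_year(years)
print(train_idx, valid_idx, test_idx)  # [0, 1, 5] [2] [3, 4]
```

Splitting by time rather than at random means the model is evaluated on papers that are newer than anything it saw during training.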

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – The name of the dataset (one of "oag-cs", "oag-eng", "oag-chem").

  • rand_split (bool, optional) – Whether to randomly re-split the dataset. This option is only applicable to mag-year. (default: False)

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.HeteroData object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.HeteroData object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
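
A minimal usage sketch, assuming the H2GB package is installed and that the accepted dataset names are the three domain identifiers given in the description (oag-cs, oag-eng, oag-chem). The wrapper `load_oag` and the `VALID_NAMES` tuple are hypothetical helpers, not part of the library; only the `OAGDataset(root, name)` constructor comes from the signature above.

```python
VALID_NAMES = ("oag-cs", "oag-eng", "oag-chem")  # assumed from the description


def load_oag(root, name):
    """Load one OAG subset and return its graph object (hypothetical helper)."""
    if name not in VALID_NAMES:
        raise ValueError(f"name must be one of {VALID_NAMES}, got {name!r}")
    # Imported lazily so that name validation works without H2GB installed.
    from H2GB.datasets import OAGDataset

    dataset = OAGDataset(root=root, name=name)
    # An InMemoryDataset is indexable; the first item is the graph itself.
    return dataset[0]
```

Calling `load_oag("./data", "oag-cs")` would build the dataset under `./data` on first use and return the graph as a single data object.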