H2GB.datasets.PDNSDataset

class PDNSDataset(root: str, transform: Optional[Callable] = None, pre_transform: Optional[Callable] = None, start: int = 0, end: int = 5, test_list: List[int] = [6], balance_gt: bool = False, domain_file: str = 'domains.csv')

Bases: InMemoryDataset

PDNS is a heterogeneous cybersecurity graph of passive DNS data from the “PDNS-Net: A Large Heterogeneous Graph Benchmark Dataset of Network Resolutions for Graph Learning” paper.

The dataset is constructed from a seed set of malicious domains collected from VirusTotal. The hosting infrastructure behind these seed domains is extracted from a popular passive DNS repository that passively records most of the domain resolutions occurring around the world. The graph consists of two kinds of entities, domain nodes and IP nodes, and four types of relations, such as "domain is similar to domain" and "domain resolves to an IP". Each domain node is associated with a 10-dimensional feature vector extracted from the pre-processed domain name, including the number of subdomains, impersonation of a popular top brand, etc. Each domain node carries a binary label indicating whether it is a malicious domain. We follow the official dataset split, in which the test set is obtained over time.
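To make the structure described above concrete, the following is a minimal plain-Python sketch of such a graph. It is not the H2GB or PyG implementation: the class `ToyPDNSGraph`, the relation names `similar_to` and `resolves_to`, and the encoding of the label as 1 for malicious are all assumptions made for illustration. Only the two relations named in the description are shown; the other two are not specified here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

NUM_DOMAIN_FEATURES = 10  # feature dimension stated in the description

# (source node type, relation name, destination node type).
# The relation names used below are illustrative, not the dataset's own.
EdgeType = Tuple[str, str, str]


@dataclass
class ToyPDNSGraph:
    """Hypothetical stand-in for a two-node-type heterogeneous graph."""

    domain_features: List[List[float]] = field(default_factory=list)
    domain_labels: List[int] = field(default_factory=list)  # assumed: 1 = malicious
    num_ip_nodes: int = 0
    edges: Dict[EdgeType, List[Tuple[int, int]]] = field(default_factory=dict)

    def add_domain(self, features: List[float], malicious: bool) -> int:
        """Add a domain node and return its index."""
        if len(features) != NUM_DOMAIN_FEATURES:
            raise ValueError(f"expected {NUM_DOMAIN_FEATURES} features")
        self.domain_features.append(list(features))
        self.domain_labels.append(int(malicious))
        return len(self.domain_labels) - 1

    def add_ip(self) -> int:
        """Add an IP node (no features in this sketch) and return its index."""
        self.num_ip_nodes += 1
        return self.num_ip_nodes - 1

    def add_edge(self, edge_type: EdgeType, src: int, dst: int) -> None:
        """Record one edge under the given typed relation."""
        self.edges.setdefault(edge_type, []).append((src, dst))


graph = ToyPDNSGraph()
d0 = graph.add_domain([0.0] * NUM_DOMAIN_FEATURES, malicious=True)
d1 = graph.add_domain([1.0] * NUM_DOMAIN_FEATURES, malicious=False)
ip0 = graph.add_ip()
graph.add_edge(("domain", "similar_to", "domain"), d0, d1)
graph.add_edge(("domain", "resolves_to", "ip"), d0, ip0)
```

In the actual dataset this information is held in a torch_geometric.data.HeteroData object, which keys edges by the same kind of (source type, relation, destination type) triple.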

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.HeteroData object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.HeteroData object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • force_reload (bool, optional) – Whether to re-process the dataset. (default: False)
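A possible way to construct the dataset with the defaults from the signature is sketched below. The root path `data/pdns`, the helper `load_pdns`, and the transform `add_snapshot_tag` are hypothetical names introduced for this example. Taking index 0 to obtain the graph is an assumption based on how InMemoryDataset subclasses usually expose a single graph. The import is placed inside the function so that the snippet can be loaded even where H2GB is not installed.

```python
from typing import Any


def add_snapshot_tag(data: Any) -> Any:
    """Hypothetical transform: mark the object each time it is accessed."""
    data.tagged = True
    return data


def load_pdns(root: str = "data/pdns") -> Any:
    """Build PDNSDataset with the signature defaults and return one graph.

    `root` is an illustrative path, not one mandated by the library.
    """
    # Lazy import: requires the H2GB package and its dependencies.
    from H2GB.datasets import PDNSDataset

    dataset = PDNSDataset(
        root=root,
        transform=add_snapshot_tag,  # applied before every access
        pre_transform=None,          # would be applied once, before saving to disk
        start=0,
        end=5,
        test_list=[6],
        balance_gt=False,
        domain_file="domains.csv",
    )
    # Assumption: the dataset holds a single HeteroData graph at index 0.
    return dataset[0]
```

Calling `load_pdns()` processes the data on first use. Because `pre_transform` runs before the processed data is saved to disk, changing it later generally requires re-processing the dataset.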