Datasets Used in My Papers
Plain Graphs
Name | #nodes | #edges | #labels | Type | URL |
---|---|---|---|---|---|
Youtube | 1,138,499 | 2,990,443 | 47 | undirected | [raw] [preprocessed] |
TWeibo | 2,320,895 | 50,655,143 | 100 | directed | [raw] [preprocessed] |
Orkut | 3,072,441 | 117,185,084 | 100 | undirected | [raw] [preprocessed] |
In-2004 | 1,382,908 | 16,539,643 | - | directed | [raw] [preprocessed] |
DBLP | 5,425,963 | 17,298,032 | - | undirected | [raw] [preprocessed] |
Pokec | 1,632,803 | 30,622,564 | - | directed | [raw] [preprocessed] |
LiveJournal | 4,847,571 | 68,475,391 | - | directed | [raw] [preprocessed] |
IT-2004 | 41,291,594 | 1,135,718,909 | - | directed | [raw] [preprocessed] |
Twitter | 41,652,230 | 1,468,365,182 | - | directed | [raw] [preprocessed] |
Friendster | 65,608,366 | 1,806,067,135 | - | undirected | [raw] [preprocessed] |
UK-2007 | 105,896,555 | 3,738,733,648 | - | directed | [raw] [preprocessed] |
UK-union | 133,633,040 | 5,475,109,924 | - | directed | [raw] [preprocessed] |
ClueWeb12 | 978,408,098 | 42,574,107,469 | - | directed | [raw] |
ClueWeb09 | 1,684,868,322 | 7,939,635,651 | - | directed | [raw] [preprocessed] |
Please cite our papers if you publish results based on our preprocessed datasets.
@article{yang13homogeneous,
title={Homogeneous Network Embedding for Massive Graphs via Reweighted Personalized PageRank},
author={Yang, Renchi and Shi, Jieming and Xiao, Xiaokui and Yang, Yin and Bhowmick, Sourav S},
journal={Proceedings of the VLDB Endowment},
volume={13},
number={5},
pages={670--683},
year={2020},
publisher={VLDB Endowment}
}
@article{shi13realtime,
title={Realtime Index-Free Single Source SimRank Processing on Web-Scale Graphs},
author={Shi, Jieming and Jin, Tianyuan and Yang, Renchi and Xiao, Xiaokui and Yang, Yin},
journal={Proceedings of the VLDB Endowment},
volume={13},
number={7},
pages={966--980},
year={2020},
publisher={VLDB Endowment}
}
Attributed Graphs
Name | Type | #nodes | #edges | #attributes | #labels | URL |
---|---|---|---|---|---|---|
Wiki | directed | 2405 | 17981 | 4973 | 19 | [raw] [preprocessed] |
Cora | directed | 2708 | 5429 | 1433 | 7 | [raw] [preprocessed] |
Citeseer | directed | 3312 | 4660 | 3703 | 6 | [raw] [preprocessed] |
Pubmed | directed | 19717 | 44338 | 500 | 3 | [raw] [preprocessed] |
BlogCatalog | undirected | 5196 | 343486 | 8189 | 6 | [raw] [preprocessed] |
PPI | undirected | 56944 | 818716 | 50 | 121 | [raw] [preprocessed] |
Flickr | undirected | 7575 | 479476 | 12047 | 9 | [raw] [preprocessed] |
Facebook | undirected | 4039 | 88234 | 1283 | 193 | [raw] [preprocessed] |
Twitter | directed | 81306 | 1768149 | 216839 | 4065 | [raw] [preprocessed] |
Google+ | directed | 107614 | 13673453 | 15907 | 468 | [raw] [preprocessed] |
TWeibo | directed | 2320895 | 50655143 | 1657 | 8 | [raw] [preprocessed] |
MAG | directed | 59249719 | 978147253 | 2000 | 100 | [raw] [preprocessed] |
MAG-SC | directed | 10541560 | 265219994 | 2784240 | 8 | [raw] [preprocessed] |
Our datasets are also available in PyTorch Geometric. Node attributes can be loaded as a sparse matrix using the following code:
from scipy import sparse
features = sparse.load_npz("attrs.npz")
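As a quick sanity check after loading, the matrix can be inspected before it is passed to a model. The sketch below is illustrative only: it assumes rows correspond to nodes and columns to attributes, the helper name `describe_attributes` is hypothetical, and the tiny matrix is a synthetic stand-in for `attrs.npz`, not data from any of the datasets above.

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def describe_attributes(path):
    """Load a SciPy .npz sparse matrix and report basic statistics.

    Assumption: rows are nodes and columns are attributes.
    """
    # CSR format gives fast row slicing, i.e. fast access to one node's attributes.
    features = sparse.load_npz(path).tocsr()
    n_nodes, n_attrs = features.shape
    return {
        "n_nodes": n_nodes,
        "n_attrs": n_attrs,
        "nnz": features.nnz,
        "density": features.nnz / (n_nodes * n_attrs),
    }


# Synthetic stand-in for attrs.npz: 3 nodes, 4 attributes (not real data).
toy = sparse.csr_matrix(
    np.array([[1, 0, 0, 2], [0, 0, 3, 0], [0, 0, 0, 0]], dtype=np.float32)
)
toy_path = os.path.join(tempfile.mkdtemp(), "attrs.npz")
sparse.save_npz(toy_path, toy)

stats = describe_attributes(toy_path)
print(stats)
```

Comparing `n_nodes` and `n_attrs` against the #nodes and #attributes columns in the table above is a simple way to confirm that the right file was loaded.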
Please cite our paper if you publish results based on our preprocessed datasets.
@article{yang2020scaling,
title={Scaling Attributed Network Embedding to Massive Graphs},
author={Yang, Renchi and Shi, Jieming and Xiao, Xiaokui and Yang, Yin and Liu, Juncheng and Bhowmick, Sourav S},
journal={Proceedings of the VLDB Endowment},
volume={14},
number={1},
pages={37--49},
year={2021},
publisher={VLDB Endowment}
}
Bipartite Graphs
Name | \|U\| | \|V\| | \|E\| | URL |
---|---|---|---|---|
Avito | 27736 | 16589 | 67029 | [raw] [preprocessed] |
AOL | 4811647 | 1632788 | 10741954 | [raw] [preprocessed] |
DBLP | 6001 | 1524 | 29257 | [raw] [preprocessed] |
Movielens-1M | 6040 | 3706 | 1000210 | [raw] [preprocessed] |
KDDCup2012 | 255170 | 1848114 | 2766394 | [raw] [preprocessed] |
Last.fm | 359349 | 160168 | 17559531 | [raw] [preprocessed] |
Amazon-games | 826767 | 50210 | 1324754 | [raw] [preprocessed] |
DBLP | 6,001 | 1,308 | 29,256 | [raw] [preprocessed] |
Wikipedia | 15,000 | 3,214 | 64,095 | [raw] [preprocessed] |
Pinterest | 55,187 | 9,916 | 1,500,809 | [raw] [preprocessed] |
Yelp | 31,668 | 38,048 | 1,561,406 | [raw] [preprocessed] |
MovieLens-10M | 69,878 | 10,677 | 10,000,054 | [raw] [preprocessed] |
Last.fm | 359,349 | 160,168 | 17,559,530 | [raw] [preprocessed] |
MIND | 876,956 | 97,509 | 18,149,915 | [raw] [preprocessed] |
Netflix | 480,189 | 17,770 | 100,480,507 | [raw] [preprocessed] |
Orkut | 2,783,196 | 8,730,857 | 327,037,487 | [raw] [preprocessed] |
MAG | 10,541,560 | 2,784,240 | 1,095,315,106 | [raw] [preprocessed] |
Please cite our papers if you publish results based on our preprocessed datasets.
@inproceedings{yang2022efficient,
title={Efficient and Effective Similarity Search over Bipartite Graphs},
author={Yang, Renchi},
booktitle={Proceedings of the ACM Web Conference 2022},
pages={308--318},
year={2022}
}
@inproceedings{yang2022scalable,
title={Scalable and Effective Bipartite Network Embedding},
author={Yang, Renchi and Shi, Jieming and Huang, Keke and Xiao, Xiaokui},
booktitle={Proceedings of the 2022 International Conference on Management of Data},
pages={1977--1991},
year={2022}
}
Dataset Repositories
Name | Type | Collected by |
---|---|---|
SNAP | Graphs & Networks | Stanford |
LAW | Graphs & Networks | UNIMI |
BioSNAP | Biomedical Networks | Stanford |
KONECT | Graphs & Networks | Jérôme Kunegis |
AMiner | Academic Networks | AMiner |
UCI Network Data Repository | Graphs & Networks | UCI Datalab |
Network Repository | Graphs & Networks | - |
Open Academic Graph | Academic Networks | Microsoft |
Open Graph Benchmark | Graphs & Networks | Stanford |
TUDataset | Graphs & Networks | Christopher Morris et al. |
StreamingGraphs | Streaming Graphs | Yibo Yao |
ARB | Graphs & Networks | Austin R. Benson |
SuiteSparse Matrix Collection | Matrix/Graphs | TAMU |
Web Data Commons | Hyperlink Graphs/Web Tables/RDFa | University of Mannheim |
Yahoo Webscope Datasets | Graphs/Ratings/Languages/Advertising | Yahoo |
UCI Machine Learning Repository | Multivariate/Text/Time-Series | UCI |
Yelp Open Dataset | businesses/reviews/user data | Yelp |
Recommender Systems Datasets | graphs/interactions/reviews/ratings | UCSD |
Microsoft News Dataset | user behavior logs | Microsoft |
Search Query Logs | query logs | Jeff Huang |
AOL DS | query logs | Ricardo Campos |
AWS | - | Amazon |
Kaggle Datasets | - | Kaggle |
OpenML | - | OpenML |