Implementation of our solution to the KDD Cup challenge. The goal of the challenge was to predict the subject areas of papers situated in the heterogeneous graph of the MAG240M-LSC dataset.
Practical Relevance: The volume of scientific publications has been increasing exponentially, doubling every 12 years. Currently, the subject areas of arXiv papers are manually determined by the papers' authors and arXiv moderators. An accurate automatic predictor of papers' subject categories not only reduces the significant burden of manual labeling, but can also be used to classify the vast number of non-arXiv papers, thereby allowing better search and organization of academic papers.
Graph: 121M academic papers in English are extracted from MAG to construct a heterogeneous academic graph. The resulting paper set is written by 122M author entities, who are affiliated with 26K institutes. Among these papers, there are 1.3B citation links captured by MAG. Each paper is associated with its natural-language title, and most papers' abstracts are also available. We concatenate the title and abstract with a period and pass the result to a RoBERTa sentence encoder [2,3], generating a 768-dimensional vector for each paper node. Among the 121M paper nodes, approximately 1.4M are arXiv papers annotated with one of 153 arXiv subject areas, e.g., cs.LG (Machine Learning).
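For orientation, the statistics and features described above can be inspected directly through the `ogb` package. A minimal sketch, assuming `ogb` is installed and using a placeholder dataset root (the actual root used by this repository is defined in root.py):

```python
# Minimal sketch of inspecting MAG240M-LSC; the root path here is an assumption.
from ogb.lsc import MAG240MDataset

dataset = MAG240MDataset(root="./mag240m")   # downloads the dataset on first use

print(dataset.num_papers)          # ~121M paper nodes
print(dataset.num_authors)         # ~122M author nodes
print(dataset.num_institutions)    # ~26K institution nodes
print(dataset.num_classes)         # 153 arXiv subject areas

paper_feat = dataset.paper_feat                          # memmapped (num_papers, 768) RoBERTa features
labels = dataset.paper_label                             # arXiv labels (NaN for unlabelled papers)
split = dataset.get_idx_split()                          # train / valid / test arXiv paper indices
cites = dataset.edge_index('paper', 'cites', 'paper')    # citation edges, shape (2, ~1.3B)
```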
- Python 3.8
- Install requirements:
pip install -r requirements.txt
- GPU for training
- SSD drive for fast reading of memmap files
- 400 GB RAM
- Download a binary Cleora release, then add execution permission to run it. Refer to the Cleora GitHub page for more details about Cleora.
Steps 1-4 can be run simultaneously
1. Data preparation. The MAG240M-LSC dataset will be automatically downloaded, if it does not already exist, to the path denoted in `root.py`. This takes a while (several hours to a day) on the first run, so please be patient. After decompression, the file size will be around 202GB. Please change the content of `root.py` accordingly if you want to download the dataset to a custom hard drive or folder. This script creates preprocessed data that is then used during training (see the sketch after this step):

   - `data/edges_paper_cites_paper_sorted_by_second_column.npy` - numpy array with paper->cites->paper edges sorted by the cited paper. Used for fast retrieval of the papers that cite a selected paper.
   - `data/edge_author_paper_sorted_by_paper.npy` - numpy array with author->writes->paper edges sorted by paper. Used for fast retrieval of all authors of a paper.
   - `data/paper2thesameauthors_papers` - pickled dict that contains all other papers written by the same authors as a selected paper.
   - `data/edge_author_paper_small` - author->paper edges, but only for authors with labelled papers (for faster searching during training).

   `python preprocessing.py`

   Estimated time of preprocessing, without downloading data: 60 minutes
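The sorted edge arrays produced in step 1 allow neighbour lookups by binary search instead of scanning 1.3B edges. A minimal sketch of the idea on a toy edge list (the exact layout produced by preprocessing.py may differ):

```python
import numpy as np

# Toy edge list of shape (2, num_edges): row 0 = citing paper, row 1 = cited paper.
edges = np.array([[5, 1, 9, 1, 7],
                  [2, 4, 2, 2, 4]])

# Sort edges by the second column (the cited paper), as in
# data/edges_paper_cites_paper_sorted_by_second_column.npy.
order = np.argsort(edges[1])
edges_sorted = edges[:, order]

def papers_citing(paper_id: int) -> np.ndarray:
    """Return all papers that cite `paper_id`, via binary search on the sorted column."""
    lo = np.searchsorted(edges_sorted[1], paper_id, side="left")
    hi = np.searchsorted(edges_sorted[1], paper_id, side="right")
    return edges_sorted[0, lo:hi]

print(papers_citing(2))   # papers 5, 9 and 1 cite paper 2 (order may vary)
```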
2. Compute paper sketches from BERT features using EMDE; a simplified coding example follows this step.

   `python compute_paper_sketches.py`

   It creates a memmap file with paper sketches: `data/codes_bert_memmap`

   Estimated time of computing paper sketches: 105 minutes
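EMDE assigns each paper's dense BERT vector to a bucket in each of several independent locality-sensitive partitions; these bucket indices ("codes") are what gets stored in the memmap. A simplified illustration using plain random-hyperplane hashing (the actual EMDE coder is density-aware, and the partition sizes here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

N_SKETCHES = 10   # number of independent partitions (hypothetical value)
N_BITS = 7        # 2**7 = 128 buckets per partition (hypothetical value)
DIM = 768         # RoBERTa feature dimensionality

# One set of random hyperplanes per independent partition.
planes = rng.standard_normal((N_SKETCHES, N_BITS, DIM))

def encode(features: np.ndarray) -> np.ndarray:
    """Map (num_papers, DIM) features to (num_papers, N_SKETCHES) bucket codes."""
    bits = (np.einsum("kbd,nd->nkb", planes, features) > 0).astype(np.int64)
    weights = 2 ** np.arange(N_BITS)          # interpret the sign bits as an integer bucket index
    return (bits * weights).sum(axis=-1)

paper_feat = rng.standard_normal((4, DIM)).astype(np.float32)   # stand-in for the BERT features
codes = encode(paper_feat)
print(codes.shape)   # (4, 10): one bucket index per partition, stored in a memmap in practice
```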
3. Compute institution sketches using Cleora and EMDE; a rough edge-list export example follows this step.

   `python compute_institutions_sketches.py`

   It creates:

   - `data/inst_codes.npy` - memmap file with institution sketches
   - `data/paper2inst` - pickled dict that contains all institutions for a given paper
   - `data/codes_inst2id` - pickled dict that maps an institution to its index in `data/inst_codes.npy`

   Estimated time of computing institution sketches: 55 minutes
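Cleora consumes a plain-text edge list and produces dense node embeddings, which are then quantised into sketches with the same EMDE coding as in step 2. A rough sketch of how an edge list could be exported for the Cleora binary, assuming the author-institution relation from MAG240M-LSC (the file name and node prefixes are illustrative, and the exact graph fed to Cleora by compute_institutions_sketches.py may differ):

```python
# Illustrative export of an author-institution edge list for the Cleora binary.
from ogb.lsc import MAG240MDataset

dataset = MAG240MDataset(root="./mag240m")
aff = dataset.edge_index('author', 'affiliated_with', 'institution')   # shape (2, num_edges)

with open("data/author_institution_edges.tsv", "w") as f:
    for author_id, inst_id in zip(aff[0], aff[1]):
        f.write(f"author_{author_id}\tinstitution_{inst_id}\n")

# The Cleora binary is then run on this file to obtain institution embeddings
# (see the Cleora README for the exact CLI flags), and the embeddings are
# quantised into EMDE sketches stored in data/inst_codes.npy.
```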
4. Create the adjacency matrix of the graph with paper and author nodes; a minimal construction example follows this step.

   `python create_graph.py`

   It creates the file `data/adj.pt`, which represents the sparse adjacency matrix.

   Estimated time: 60 minutes
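A minimal sketch of how such a sparse adjacency matrix over paper and author nodes can be assembled and saved; the node ordering and any normalisation used by create_graph.py may differ:

```python
import torch

# Toy graph: paper nodes come first, author nodes are offset by the number of
# papers so that both node types share one index space (an assumption; the
# actual ordering in create_graph.py may differ).
num_papers, num_authors = 4, 3
writes = torch.tensor([[0, 1, 2],    # author ids
                       [0, 2, 3]])   # paper ids they wrote

rows = writes[1]                      # paper indices
cols = writes[0] + num_papers         # author indices shifted past the papers
edge_index = torch.stack([torch.cat([rows, cols]),    # symmetric adjacency
                          torch.cat([cols, rows])])

n = num_papers + num_authors
adj = torch.sparse_coo_tensor(edge_index,
                              torch.ones(edge_index.size(1)),
                              size=(n, n)).coalesce()

torch.save(adj, "data/adj.pt")        # sparse adjacency matrix used during training
```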
5. Train the model for 2 epochs.

   `python train.py`

   The final model was trained with 60 ensembles (sketched schematically after this step):

   `python train.py --num-ensembles 60`

   Predictions for the test set from each ensemble are saved as `data/ensemble_{ensemble_id}`.

   Two-epoch training time: 40 minutes per ensemble on a Tesla V100 GPU
   Inference time for all test data: 7 minutes
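Schematically, each ensemble member is an independently trained model whose test-set scores are written to its own file; a toy sketch of this loop (the training routine, random scores, and .npy format are placeholders, not the actual train.py internals):

```python
import numpy as np

NUM_ENSEMBLES = 60                  # matches --num-ensembles 60
NUM_TEST, NUM_CLASSES = 5, 153      # toy test size; 153 is the real number of subject areas

def train_one_model(seed: int) -> np.ndarray:
    """Placeholder for the training routine in train.py; returns per-class test scores."""
    return np.random.default_rng(seed).random((NUM_TEST, NUM_CLASSES)).astype(np.float32)

for ensemble_id in range(NUM_ENSEMBLES):
    test_scores = train_one_model(seed=ensemble_id)
    np.save(f"data/ensemble_{ensemble_id}", test_scores)   # one prediction file per member
```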
6. Merge the ensemble predictions and save the test submission to the file `y_pred_mag240m.npz`; a sketch of the merging logic follows.

   `python inference.py`
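A sketch of what the merging step amounts to: average the per-ensemble scores, take the argmax, and write the submission with the OGB evaluator. The exact weighting in inference.py may differ, and depending on the installed ogb version save_test_submission may also take a mode argument:

```python
import numpy as np
from ogb.lsc import MAG240MEvaluator

NUM_ENSEMBLES = 60

# Average the per-ensemble test scores saved by train.py (assuming .npy files here).
scores = sum(np.load(f"data/ensemble_{i}.npy") for i in range(NUM_ENSEMBLES)) / NUM_ENSEMBLES
y_pred = scores.argmax(axis=1).astype(np.int16)   # predicted arXiv subject area per test paper

evaluator = MAG240MEvaluator()
evaluator.save_test_submission({'y_pred': y_pred}, dir_path='.')   # writes y_pred_mag240m.npz
```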