Implementation of our solution to the KDD Cup challenge. The goal of the challenge was to predict the subject areas of papers situated in the heterogeneous graph of the MAG240M-LSC dataset.
Practical Relevance: The volume of scientific publications has been increasing exponentially, doubling every 12 years. Currently, the subject areas of arXiv papers are manually determined by the papers' authors and arXiv moderators. An accurate automatic predictor of papers' subject categories not only reduces the significant burden of manual labeling, but can also be used to classify the vast number of non-arXiv papers, thereby allowing better search and organization of academic papers.
Graph: 121M academic papers in English are extracted from MAG to construct a heterogeneous academic graph. The resulting paper set is written by 122M author entities, who are affiliated with 26K institutes. Among these papers, there are 1.3B citation links captured by MAG. Each paper is associated with its natural-language title, and most papers' abstracts are also available. We concatenate the title and abstract with a period and pass the result to a RoBERTa sentence encoder [2,3], generating a 768-dimensional vector for each paper node. Among the 121M paper nodes, approximately 1.4M are arXiv papers annotated with one of 153 arXiv subject areas, e.g., cs.LG (Machine Learning).
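For orientation, the statistics and features described above can be inspected directly through the `ogb` package. A minimal sketch, assuming `ogb` is installed and using a placeholder dataset root (the actual root used by this repository is defined in root.py):

```python
# Minimal sketch of inspecting MAG240M-LSC; the root path here is an assumption.
from ogb.lsc import MAG240MDataset

dataset = MAG240MDataset(root="./mag240m")   # downloads the dataset on first use

print(dataset.num_papers)          # ~121M paper nodes
print(dataset.num_authors)         # ~122M author nodes
print(dataset.num_institutions)    # ~26K institution nodes
print(dataset.num_classes)         # 153 arXiv subject areas

paper_feat = dataset.paper_feat                          # memmapped (num_papers, 768) RoBERTa features
labels = dataset.paper_label                             # arXiv labels (NaN for unlabelled papers)
split = dataset.get_idx_split()                          # train / valid / test arXiv paper indices
cites = dataset.edge_index('paper', 'cites', 'paper')    # citation edges, shape (2, ~1.3B)
```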
- Python 3.8
- Install requirements:
pip install -r requirements.txt
- GPU for training
- SSD drive for fast reading of memmap files
- 400 GB RAM
- Download a binary Cleora release, then add execution permission to run it. Refer to the Cleora GitHub page for more details about Cleora.
Steps 1-4 can be run simultaneously
1. Data preparation. The MAG240M-LSC dataset will be automatically downloaded, if it does not already exist, to the path denoted in `root.py`. This takes a while (several hours to a day) on the first run, so please be patient. After decompression, the file size will be around 202GB. Please change the content of `root.py` accordingly if you want to download the dataset to a custom hard drive or folder. This script creates preprocessed data that is then used during training (see the sketch after this step):

   - `data/edges_paper_cites_paper_sorted_by_second_column.npy` - numpy array with paper->cites->paper edges sorted by the cited paper. Used for fast retrieval of the papers that cite a selected paper.
   - `data/edge_author_paper_sorted_by_paper.npy` - numpy array with author->writes->paper edges sorted by paper. Used for fast retrieval of all authors of a paper.
   - `data/paper2thesameauthors_papers` - pickled dict that contains all other papers written by the same authors as a selected paper.
   - `data/edge_author_paper_small` - author->paper edges, but only for authors with labelled papers (for faster searching during training).

   `python preprocessing.py`

   Estimated time of preprocessing, without downloading data: 60 minutes
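The sorted edge arrays produced in step 1 allow neighbour lookups by binary search instead of scanning 1.3B edges. A minimal sketch of the idea on a toy edge list (the exact layout produced by preprocessing.py may differ):

```python
import numpy as np

# Toy edge list of shape (2, num_edges): row 0 = citing paper, row 1 = cited paper.
edges = np.array([[5, 1, 9, 1, 7],
                  [2, 4, 2, 2, 4]])

# Sort edges by the second column (the cited paper), as in
# data/edges_paper_cites_paper_sorted_by_second_column.npy.
order = np.argsort(edges[1])
edges_sorted = edges[:, order]

def papers_citing(paper_id: int) -> np.ndarray:
    """Return all papers that cite `paper_id`, via binary search on the sorted column."""
    lo = np.searchsorted(edges_sorted[1], paper_id, side="left")
    hi = np.searchsorted(edges_sorted[1], paper_id, side="right")
    return edges_sorted[0, lo:hi]

print(papers_citing(2))   # papers 5, 9 and 1 cite paper 2 (order may vary)
```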
2. Compute paper sketches from BERT features using EMDE; a simplified coding example follows this step.

   `python compute_paper_sketches.py`

   It creates a memmap file with paper sketches: `data/codes_bert_memmap`

   Estimated time of computing paper sketches: 105 minutes
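EMDE assigns each paper's dense BERT vector to a bucket in each of several independent locality-sensitive partitions; these bucket indices ("codes") are what gets stored in the memmap. A simplified illustration using plain random-hyperplane hashing (the actual EMDE coder is density-aware, and the partition sizes here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

N_SKETCHES = 10   # number of independent partitions (hypothetical value)
N_BITS = 7        # 2**7 = 128 buckets per partition (hypothetical value)
DIM = 768         # RoBERTa feature dimensionality

# One set of random hyperplanes per independent partition.
planes = rng.standard_normal((N_SKETCHES, N_BITS, DIM))

def encode(features: np.ndarray) -> np.ndarray:
    """Map (num_papers, DIM) features to (num_papers, N_SKETCHES) bucket codes."""
    bits = (np.einsum("kbd,nd->nkb", planes, features) > 0).astype(np.int64)
    weights = 2 ** np.arange(N_BITS)          # interpret the sign bits as an integer bucket index
    return (bits * weights).sum(axis=-1)

paper_feat = rng.standard_normal((4, DIM)).astype(np.float32)   # stand-in for the BERT features
codes = encode(paper_feat)
print(codes.shape)   # (4, 10): one bucket index per partition, stored in a memmap in practice
```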
3. Compute institution sketches using Cleora and EMDE; a rough edge-list export example follows this step.

   `python compute_institutions_sketches.py`

   It creates:

   - `data/inst_codes.npy` - memmap file with institution sketches
   - `data/paper2inst` - pickled dict that contains all institutions for a given paper
   - `data/codes_inst2id` - pickled dict that maps an institution to its index in `data/inst_codes.npy`

   Estimated time of computing institution sketches: 55 minutes
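Cleora consumes a plain-text edge list and produces dense node embeddings, which are then quantised into sketches with the same EMDE coding as in step 2. A rough sketch of how an edge list could be exported for the Cleora binary, assuming the author-institution relation from MAG240M-LSC (the file name and node prefixes are illustrative, and the exact graph fed to Cleora by compute_institutions_sketches.py may differ):

```python
# Illustrative export of an author-institution edge list for the Cleora binary.
from ogb.lsc import MAG240MDataset

dataset = MAG240MDataset(root="./mag240m")
aff = dataset.edge_index('author', 'affiliated_with', 'institution')   # shape (2, num_edges)

with open("data/author_institution_edges.tsv", "w") as f:
    for author_id, inst_id in zip(aff[0], aff[1]):
        f.write(f"author_{author_id}\tinstitution_{inst_id}\n")

# The Cleora binary is then run on this file to obtain institution embeddings
# (see the Cleora README for the exact CLI flags), and the embeddings are
# quantised into EMDE sketches stored in data/inst_codes.npy.
```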
4. Create the adjacency matrix of the graph with paper and author nodes; a minimal construction example follows this step.

   `python create_graph.py`

   It creates the file `data/adj.pt`, which represents the sparse adjacency matrix.

   Estimated time: 60 minutes
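A minimal sketch of how such a sparse adjacency matrix over paper and author nodes can be assembled and saved; the node ordering and any normalisation used by create_graph.py may differ:

```python
import torch

# Toy graph: paper nodes come first, author nodes are offset by the number of
# papers so that both node types share one index space (an assumption; the
# actual ordering in create_graph.py may differ).
num_papers, num_authors = 4, 3
writes = torch.tensor([[0, 1, 2],    # author ids
                       [0, 2, 3]])   # paper ids they wrote

rows = writes[1]                      # paper indices
cols = writes[0] + num_papers         # author indices shifted past the papers
edge_index = torch.stack([torch.cat([rows, cols]),    # symmetric adjacency
                          torch.cat([cols, rows])])

n = num_papers + num_authors
adj = torch.sparse_coo_tensor(edge_index,
                              torch.ones(edge_index.size(1)),
                              size=(n, n)).coalesce()

torch.save(adj, "data/adj.pt")        # sparse adjacency matrix used during training
```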
5. Train the model for 2 epochs.

   `python train.py`

   The final model was trained with 60 ensembles (sketched schematically after this step):

   `python train.py --num-ensembles 60`

   Predictions for the test set from each ensemble are saved as `data/ensemble_{ensemble_id}`.

   Two-epoch training time: 40 minutes per ensemble on a Tesla V100 GPU
   Inference time for all test data: 7 minutes
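Schematically, each ensemble member is an independently trained model whose test-set scores are written to its own file; a toy sketch of this loop (the training routine, random scores, and .npy format are placeholders, not the actual train.py internals):

```python
import numpy as np

NUM_ENSEMBLES = 60                  # matches --num-ensembles 60
NUM_TEST, NUM_CLASSES = 5, 153      # toy test size; 153 is the real number of subject areas

def train_one_model(seed: int) -> np.ndarray:
    """Placeholder for the training routine in train.py; returns per-class test scores."""
    return np.random.default_rng(seed).random((NUM_TEST, NUM_CLASSES)).astype(np.float32)

for ensemble_id in range(NUM_ENSEMBLES):
    test_scores = train_one_model(seed=ensemble_id)
    np.save(f"data/ensemble_{ensemble_id}", test_scores)   # one prediction file per member
```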
6. Merge the ensemble predictions and save the test submission to the file `y_pred_mag240m.npz`; a sketch of the merging logic follows.

   `python inference.py`
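A sketch of what the merging step amounts to: average the per-ensemble scores, take the argmax, and write the submission with the OGB evaluator. The exact weighting in inference.py may differ, and depending on the installed ogb version save_test_submission may also take a mode argument:

```python
import numpy as np
from ogb.lsc import MAG240MEvaluator

NUM_ENSEMBLES = 60

# Average the per-ensemble test scores saved by train.py (assuming .npy files here).
scores = sum(np.load(f"data/ensemble_{i}.npy") for i in range(NUM_ENSEMBLES)) / NUM_ENSEMBLES
y_pred = scores.argmax(axis=1).astype(np.int16)   # predicted arXiv subject area per test paper

evaluator = MAG240MEvaluator()
evaluator.save_test_submission({'y_pred': y_pred}, dir_path='.')   # writes y_pred_mag240m.npz
```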