Implementation of our solution to the KDD Cup challenge. The goal of the challenge is to predict the subject areas of papers situated in the heterogeneous graph of the MAG240M-LSC dataset.
- Python 3.8
- Install requirements:
pip install -r requirements.txt
- GPU for training
- SSD drive for fast reading of memmap files
- 400 GB RAM
- Download the binary Cleora release, then add execution permission to run it. Refer to the Cleora GitHub page for more details about Cleora.
Steps 1-4 can be run simultaneously
- Data preparation. The MAG240M-LSC dataset will be automatically downloaded, if it is not already present, to the path denoted in root.py. This takes a while (several hours to a day) on the first run, so please be patient. After decompression, the dataset size is around 202 GB. Please change the content of root.py accordingly if you want to download the dataset to a custom hard drive or folder. Running preprocessing.py creates preprocessed data that is then used during training:
  - data/edges_paper_cites_paper_sorted_by_second_column.npy - numpy array with paper->cites->paper edges sorted by the cited paper, used for fast retrieval of the papers that cite a selected paper (a lookup sketch follows this step)
  - data/edge_author_paper_sorted_by_paper.npy - numpy array with author->writes->paper edges sorted by paper, used for fast retrieval of all authors of a paper
  - data/paper2thesameauthors_papers - pickled dict that contains, for each paper, all other papers written by the same authors
  - data/edge_author_paper_small - author->paper edges restricted to authors with labelled papers (for faster searching during training)
python preprocessing.py
Estimated time of preprocessing, without downloading data: 60 minutes
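For illustration, this is roughly how the sorted citation-edge array can be queried with binary search; the column layout and the helper function are assumptions made for the example, not code from the repository:

```python
import numpy as np

# Citation edges sorted by the cited paper (second column), produced by preprocessing.py.
# Assumption: column 0 holds the citing paper id, column 1 the cited paper id.
edges = np.load("data/edges_paper_cites_paper_sorted_by_second_column.npy", mmap_mode="r")

def papers_citing(paper_id: int) -> np.ndarray:
    """Return the ids of papers that cite `paper_id` via binary search on the sorted column."""
    lo = np.searchsorted(edges[:, 1], paper_id, side="left")
    hi = np.searchsorted(edges[:, 1], paper_id, side="right")
    return np.asarray(edges[lo:hi, 0])

print(papers_citing(42))  # 42 is a hypothetical paper id
```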
- Compute paper sketches from BERT features using EMDE (an illustrative coding sketch follows this step)
python compute_paper_sketches.py
It creates a memmap file with paper sketches: data/codes_bert_memmap
Estimated time of computing paper sketches: 105 minutes
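compute_paper_sketches.py applies EMDE to the provided BERT paper embeddings. As a rough, simplified illustration of the coding idea only (not the actual implementation): each embedding is hashed into a bucket in several independent random partitionings of the embedding space, and the bucket ids form the paper's sketch codes. The function name, hyperparameter values, and the demo output path below are assumptions:

```python
import numpy as np

def emde_codes(embeddings: np.ndarray, n_sketches: int = 10, depth: int = 7, seed: int = 0) -> np.ndarray:
    """Toy LSH-style coding: for each of `n_sketches` independent partitionings,
    hash every embedding with `depth` random hyperplanes into one of 2**depth buckets.
    Returns bucket ids of shape (n_items, n_sketches)."""
    rng = np.random.default_rng(seed)
    codes = np.empty((embeddings.shape[0], n_sketches), dtype=np.int32)
    for s in range(n_sketches):
        planes = rng.standard_normal((embeddings.shape[1], depth))
        bits = (embeddings @ planes > 0).astype(np.int32)  # sign bit per random hyperplane
        codes[:, s] = bits @ (1 << np.arange(depth))        # pack bits into a bucket id
    return codes

# Stand-in embeddings; the real script codes the MAG240M BERT features and
# stores the result in the memmap file data/codes_bert_memmap.
demo = emde_codes(np.random.randn(1000, 768).astype(np.float32))
mm = np.memmap("codes_demo_memmap", dtype=np.int32, mode="w+", shape=demo.shape)
mm[:] = demo
mm.flush()
```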
- Compute institution sketches using Cleora and EMDE
python compute_institutions_sketches.py
It creates (a lookup sketch follows this step):
  - data/inst_codes.npy - memmap file with institution sketches
  - data/paper2inst - pickled dict that contains all institutions for a given paper
  - data/codes_inst2id - pickled dict that maps an institution to its index in data/inst_codes.npy
Estimated time of computing institutions sketches: 55 minutes
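For illustration, the three institution artifacts could be combined to look up the institution sketches of a paper roughly as follows; the exact file layouts and this helper are assumptions, not the script's actual behaviour:

```python
import pickle
import numpy as np

inst_codes = np.load("data/inst_codes.npy", mmap_mode="r")  # one sketch row per institution (assumed)
with open("data/paper2inst", "rb") as f:
    paper2inst = pickle.load(f)                             # paper id -> list of institution ids
with open("data/codes_inst2id", "rb") as f:
    inst2id = pickle.load(f)                                # institution id -> row index in inst_codes

def institution_sketches(paper_id: int) -> np.ndarray:
    """Stack the sketches of all institutions linked to `paper_id` (assumed layout)."""
    rows = [inst2id[i] for i in paper2inst.get(paper_id, [])]
    return inst_codes[rows] if rows else np.empty((0, inst_codes.shape[1]))

print(institution_sketches(42).shape)  # 42 is a hypothetical paper id
```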
- Create the adjacency matrix of the graph with paper and author nodes
python create_graph.py
It creates the data/adj.pt file that represents the sparse adjacency matrix (see the sketch after this step).
Estimated time: 60 minutes
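A minimal sketch of building and saving a sparse paper-author adjacency matrix with PyTorch; which edge files create_graph.py actually uses and how node ids are laid out are assumptions made for the example:

```python
import numpy as np
import torch

# Author->writes->paper edges from preprocessing; author ids are offset past the
# paper ids so both node types share one index space (assumed scheme).
edges = np.load("data/edge_author_paper_sorted_by_paper.npy")
authors = torch.from_numpy(edges[:, 0].astype(np.int64))
papers = torch.from_numpy(edges[:, 1].astype(np.int64))
authors = authors + int(papers.max()) + 1

# Symmetric sparse adjacency in COO format over the joint paper+author node space.
row = torch.cat([papers, authors])
col = torch.cat([authors, papers])
n = int(row.max()) + 1
adj = torch.sparse_coo_tensor(
    torch.stack([row, col]),
    torch.ones(row.numel(), dtype=torch.float32),
    size=(n, n),
).coalesce()

torch.save(adj, "adj_demo.pt")  # the real script writes data/adj.pt
```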
- Train the model for 2 epochs
python train.py
The final model was trained with 60 ensembles:
python train.py --num-ensembles 60
Test-set predictions for each ensemble are saved as data/ensemble_{ensemble_id}.
Two-epoch training time: 40 minutes per ensemble on a Tesla V100 GPU
Inference time for all test data: 7 minutes
- Merge ensemble predictions and save the test submission to the file y_pred_mag240m.npz (see the sketch after this step)
python inference.py
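A minimal sketch of what the merging step amounts to, assuming each data/ensemble_{ensemble_id} file stores class probabilities for the test papers; the file format and the averaging scheme are assumptions, and inference.py defines the real procedure:

```python
import numpy as np

num_ensembles = 60

# Average per-ensemble class probabilities for the test papers (assumed file format).
probs = None
for i in range(num_ensembles):
    p = np.load(f"data/ensemble_{i}")
    probs = p if probs is None else probs + p
probs = probs / num_ensembles

# The MAG240M submission stores predicted class ids under the key 'y_pred';
# OGB's MAG240MEvaluator also offers a save_test_submission helper for this.
y_pred = probs.argmax(axis=1).astype(np.short)
np.savez_compressed("y_pred_mag240m.npz", y_pred=y_pred)
```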