This repo contains the code used for the EMNLP 2023 paper "Semi-automatic Data Enhancement for Document-Level Relation Extraction with Distant Supervision from Large Language Models".
Requirements:
- Python 3.8
- Python packages:
  - PyTorch 2.0+
  - transformers 4.24.0
  - openai
  - tqdm
  - wandb
  - pandas
The DocRED and Re-DocRED datasets can be downloaded by following the instructions at the corresponding links.
Our enhanced dataset can be obtained with the following command:
wget https://bigai-nlco.s3.ap-southeast-1.amazonaws.com/DocGNRE/enhancement_data.zip
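Then unzip the archive (the paths below assume it is extracted into the working directory):
unzip enhancement_data.zip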
We provide a manually refined enhanced test set (in "enhancement_data/re_docred_test_data_enhancement.json"), along with four training sets enhanced by our distant annotations.
You can also generate distantly enhanced datasets automatically with a single command:
bash Automatical_Relation_Generation/run.sh
STEP1
The Automatical_Relation_Generation/I_gpt_proposal.py script generates additional triples for each document in the original dataset.
This code requires OpenAI's model APIs. Accessing the API requires an API key, which you can obtain by creating an account on OpenAI's official website.
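The sketch below shows the kind of chat-completion call involved, assuming the key is exported as the standard OPENAI_API_KEY environment variable; the model name and prompt are illustrative assumptions, not the exact contents of I_gpt_proposal.py:

```python
import os
import openai

# Assumption: the key is supplied via the standard OPENAI_API_KEY variable.
openai.api_key = os.environ["OPENAI_API_KEY"]

def propose_triples(document_text):
    """Ask the model for additional (head, relation, tail) triples."""
    # Legacy openai<1.0 interface, matching the era of this codebase.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # illustrative; check the script for the model actually used
        messages=[{
            "role": "user",
            "content": "List additional relational triples (head; relation; tail) "
                       "expressed in the following document:\n" + document_text,
        }],
        temperature=0.0,
    )
    return response["choices"][0]["message"]["content"]
```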
Example to run (from within Automatical_Relation_Generation):
python I_gpt_proposal.py -i ${input_file_path} -o ${output_file_path}
Arguments:
- -i, --input: Path to the input file, e.g., the Re-DocRED train file.
- -o, --output: Path to the output file.
The Automatical_Relation_Generation/I_gpt_proposal_more.py script generates further triples iteratively by feeding the previous GPT answers back as input.
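Conceptually, the iteration looks like the rough sketch below, which reuses the hypothetical propose_triples helper from the sketch above; the loop structure and prompt wording are assumptions, not the script's exact logic:

```python
def propose_more(document_text, rounds=2):
    """Iteratively ask for triples, showing the model its previous answers."""
    answers = []
    for _ in range(rounds):
        prompt = document_text
        if answers:
            # Assumption: earlier answers are appended so the model proposes new triples only.
            prompt += "\nTriples already proposed:\n" + "\n".join(answers)
        answers.append(propose_triples(prompt))
    return answers
```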
STEP2
The Automatical_Relation_Generation/II_gpt_triples_postprocess.py script filters out undesired or ill-formed triples.
Example to run:
python II_gpt_triples_postprocess.py -i ${step1_output_file_path} -o ${output_file_path}
Arguments:
- -i, --input: Path to the input file.
- -o, --output: Path to the output file.
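Filtering here amounts to dropping outputs that cannot be parsed into a triple or whose endpoints are not entities in the document. A hedged sketch of that kind of check (the exact rules live in II_gpt_triples_postprocess.py):

```python
def filter_triples(raw_triples, vertex_set):
    """Keep only well-formed triples whose endpoints are known entity mentions."""
    entity_names = {mention["name"] for entity in vertex_set for mention in entity}
    kept = []
    for triple in raw_triples:
        # Assumption: each parsed candidate is a (head, relation, tail) tuple of strings.
        if len(triple) != 3:
            continue  # illegal / unparsable output
        head, _, tail = triple
        if head in entity_names and tail in entity_names:
            kept.append(triple)
    return kept
```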
STEP3
The Automatical_Relation_Generation/III_nli_annotator.py script calculates entailment scores for the predefined relation types.
Example to run:
python III_nli_annotator.py -i ${step2_output_file_path} -o ${output_file_path}
Arguments:
- -i, --input: Path to the input file.
- -o, --output: Path to the output file.
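The core operation is to verbalize a candidate triple as a hypothesis and score how strongly the document entails it with an off-the-shelf NLI model. A minimal sketch with transformers; the checkpoint and the verbalization template are assumptions, see III_nli_annotator.py for the actual choices:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumption: any MNLI-style checkpoint serves for illustration.
tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def entailment_score(premise, hypothesis):
    """Probability that the premise (document text) entails the hypothesis (verbalized triple)."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    # roberta-large-mnli label order: 0=contradiction, 1=neutral, 2=entailment
    return probs[2].item()
```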
STEP4
The Automatical_Relation_Generation/IV_nli_score_postprocess.py script supplements the relation labels according to the entailment scores, ensuring the high quality of the newly added triples.
Example to run:
python IV_nli_score_postprocess.py -origin ${step1_input_file_path} -i ${step3_output_file_path} -o ${output_file_path}
Arguments:
- -origin, --origin: Path to the original dataset file (the input to STEP1).
- -i, --input: Path to the input file.
- -o, --output: Path to the output file.
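Supplementing amounts to keeping candidates whose entailment score clears a threshold and attaching them to each document as distant labels, in the gpt_labels format shown below. A minimal sketch; the threshold value is illustrative:

```python
def supplement_labels(document, scored_triples, threshold=0.5):
    """Attach high-confidence GPT triples to a document as distant labels."""
    document["gpt_labels"] = [
        {"h": t["h"], "r": t["r"], "t": t["t"], "score": t["score"]}
        for t in scored_triples
        if t["score"] >= threshold  # keep only confidently entailed triples
    ]
    return document
```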
The codebase of this repo is extended from DREEAM.
Since this work focuses on the automated annotation method, model training and evaluation are essentially unchanged from DREEAM. To train on the enhanced data, simply change the training-set file in DocRE/scripts/run_bert_gpt.sh and DocRE/scripts/run_roberta_gpt.sh, then run:
bash scripts/run_bert_gpt.sh ${name} ${lambda} ${seed} # for BERT
bash scripts/run_roberta_gpt.sh ${name} ${lambda} ${seed} # for RoBERTa
where ${name} is the identifier of this run displayed in wandb, ${lambda} is the scalar that controls the weight of the evidence loss, and ${seed} is the random seed.
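For example (the values here are purely illustrative):
bash scripts/run_bert_gpt.sh docgnre-bert 0.1 42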
Make predictions on the enhanced test set with the commands below:
bash DocRE/scripts/isf_bert.sh ${name} ${model_dir} ${test_file_path} # for BERT
bash DocRE/scripts/isf_roberta.sh ${name} ${model_dir} ${test_file_path} # for RoBERTa
where ${model_dir} is the directory containing the checkpoint to be evaluated. The program will generate a prediction file, result.json, in the official evaluation format.
An example of a generated input document:
{
  "title": "Culrav",
  "sents": [
    ["Culrav", "is", "a", "cultural", "festival", "of", "Motilal", "Nehru", "National", ...],
    ["Culrav", "gives", "a", "platform", "to", "the", "students", "of", "MNNIT", ...],
    ...
  ],
  "vertexSet": [
    [
      {
        "sent_id": 0,
        "type": "MISC",
        "pos": [0, 1],
        "name": "Culrav",
        "global_pos": [0, 0],
        "index": "0_0"
      },
      {
        "sent_id": 1,
        "type": "MISC",
        "pos": [0, 1],
        "name": "Culrav",
        "global_pos": [15, 15],
        "index": "0_1"
      },
      ...
    ],
    ...
  ],
  "labels": [
    {
      "r": "P17",
      "h": 0,
      "t": 3,
      "evidence": [0, 1, 3]
    },
    {
      "r": "P131",
      "h": 1,
      "t": 2,
      "evidence": [0]
    },
    ...
  ],
  "gpt_labels": [
    {
      "h": 0,
      "r": "P127",
      "t": 1,
      "score": 0.758
    },
    {
      "h": 0,
      "r": "P276",
      "t": 4,
      "score": 0.662
    },
    ...
  ]
}
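A small sketch of consuming this format, e.g. to inspect the distant labels added by the pipeline (this assumes the file is a JSON list of documents shaped as above):

```python
import json

with open("enhancement_data/re_docred_test_data_enhancement.json") as f:
    docs = json.load(f)

doc = docs[0]
for label in doc.get("gpt_labels", []):
    head = doc["vertexSet"][label["h"]][0]["name"]  # first mention of the head entity
    tail = doc["vertexSet"][label["t"]][0]["name"]  # first mention of the tail entity
    print(f'{head} --{label["r"]}--> {tail} (score={label["score"]:.3f})')
```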
If you find our work useful, please cite:

@inproceedings{docgnre,
    title = "Semi-automatic Data Enhancement for Document-Level Relation Extraction with Distant Supervision from Large Language Models",
    author = "Li, Junpeng and Jia, Zixia and Zheng, Zilong",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    publisher = "Association for Computational Linguistics"
}