This repository contains scripts to run experiments with different options for coreference search in REL, and to evaluate the results in a notebook and a report. The experiments are run on both the AIDA data and the MS MARCO data.
The data directory stores necessary inputs and outputs from the scripts below. The directory should have the following structure:
```
data
|___ ed-wiki-2019
|       ...
|___ generic
|       ...
|___ wiki_2019
|       ...
|___ msmarco_large_extract
|       sample_1k_longdocs.parquet
|       sample_1k_longdocs.gz
|___ efficiency_test
```
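A minimal sketch for creating this layout, assuming the data directory lives at a path of your choosing (the same path is later passed as `BASE_URL` to the scripts below); how each folder gets filled is described next:

```bash
# Sketch: create the expected folder layout for the data directory.
# Adjust DATA_DIR to your setup; this is the path later passed as BASE_URL.
DATA_DIR="/var/scratch/fhafner/rel_data"

mkdir -p "$DATA_DIR"/{ed-wiki-2019,generic,wiki_2019,msmarco_large_extract,efficiency_test}
```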
Details:

- `ed-wiki-2019`, `generic`, `wiki_2019`: standard data also used in REL. To download them, see `REL/scripts/download_data.sh`.
- `msmarco_large_extract` contains the following files I obtained from Chris (a quick way to inspect the mention file is sketched after this list):
  - file with mentions (`sample_1k_longdocs.parquet`)
  - source file (`sample_1k_longdocs.gz`)
- The empty `efficiency_test` directory, where output from `REL/scripts/run_efficiency_tests.sh` (currently in PR) is stored.
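To get a feel for the mention file, you can load the parquet sample with pandas. This is only a sketch: it assumes an environment with `pandas` and a parquet engine such as `pyarrow` installed (which none of the environments below guarantee), and that the file sits at the path from the tree above.

```bash
# Hypothetical check: print column types and the first rows of the mention file.
# Requires pandas plus pyarrow (or fastparquet) in the active environment.
python -c "
import pandas as pd

df = pd.read_parquet('data/msmarco_large_extract/sample_1k_longdocs.parquet')
print(df.dtypes)
print(df.head())
"
```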
Now install the relevant software. Because of some conflicting requirements, three conda environments are necessary:
- `rebl_env`: for running the batch ED with REBL
- `rel_env`: for running the efficiency test scripts in REL
- `notebook_env`: for running the Jupyter notebook analysing the experimental results
```bash
cd some_empty_directory

# Install
# when PR https://github.com/informagi/REL/pull/153 is merged, clone REL directly
git clone -b flavio/coref-lsh git@github.com:f-hafner/REL.git
git clone -b lsh-integration git@github.com:informagi/REBL.git
git clone git@github.com:f-hafner/rel_coref_experiments.git

# Set up environment for REBL
cd rel_coref_experiments
conda activate
conda env create --prefix ./rebl_env --file envs/rebl_environment.yml
conda activate ./rebl_env
pip install ../REBL/
pip install -e ../REL/.

# Set up environment for REL
conda deactivate
conda create python=3.7 --prefix ./rel_env  # TODO: turn this into an environment file like rebl_env above?
conda activate ./rel_env
pip install -e ../REL/.[develop]
```
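As an optional sanity check, you can verify that both packages ended up in the REBL environment. The distribution and module names used here are assumptions (REL is usually importable as `REL`; the REBL distribution name may differ):

```bash
# Check that REL and REBL are visible in the environment used for batch ED.
conda activate ./rebl_env
pip list | grep -i -E "^(rel|rebl)"             # distribution names assumed to start with rel/rebl
python -c "import REL; print(REL.__file__)"     # assumes the REL package is importable as REL
```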
```bash
BASE_URL="path/to/data/directory"  # replace with the path to your data directory. for me: "/var/scratch/fhafner/rel_data/"

## 1. Run REBL
conda activate
conda activate ./rebl_env
bash pipeline.sh $BASE_URL

## 2. Run efficiency tests
conda activate ./rel_env
cd ../REL/
bash scripts/run_efficiency_tests.sh $BASE_URL  # change directory and settings in REL/scripts/efficiency_test.py. or change code in PR?
```
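Before launching either step, it can help to check that `$BASE_URL` actually contains the folders from the tree at the top; a small, purely optional sketch:

```bash
# Pre-flight check: warn about any missing data folder under $BASE_URL.
for d in ed-wiki-2019 generic wiki_2019 msmarco_large_extract efficiency_test; do
    [ -d "$BASE_URL/$d" ] || echo "missing directory: $BASE_URL/$d"
done
```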
Details:

- Both `pipeline.sh` and `REL/scripts/run_efficiency_tests.sh` need one argument that indicates the path to the data directory.
- `pipeline.sh` contains other variables, but they do not need to be changed: `COREF_OPTIONS`, `NDOCS_TIMING`.
We'll need another environment for the notebook.
```bash
conda activate
conda env create --file envs/notebook_environment.yml --prefix ./notebook_env
```
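To open the notebook from this environment (assuming `jupyter` is included in `envs/notebook_environment.yml`):

```bash
# Start Jupyter from the notebook environment and open the analysis notebook.
conda activate ./notebook_env
jupyter notebook analysis.ipynb
```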
The folder `output_data` contains the files created during my experiments. These experiments were run on the DAS-6 cluster, using CPUs only.

- `msmarco`: results from running REBL above
  - `predictions`: predicted entities, for all documents together
  - `profile`: profiling output from the ED step
- `efficiency_test`: results from running the efficiency tests above
  - `predictions`: predicted entities, for all documents together
  - `timing`: timing for ED, for each document separately
These outputs are processed in the notebook `analysis.ipynb`, which evaluates the predictions and makes some plots. Everything is then also discussed in text form in `/tex/coreferences.tex`. The data in the tables there are not updated automatically.
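If the outputs change and you want to re-run the whole notebook non-interactively, something like the following should work, assuming `nbconvert` is available in the notebook environment:

```bash
# Re-execute the analysis notebook in place; figures inside the notebook are refreshed,
# but the numbers in /tex/coreferences.tex still have to be updated by hand.
conda activate ./notebook_env
jupyter nbconvert --to notebook --execute --inplace analysis.ipynb
```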