When creating a semantic relatedness dataset, randomly picking sentences from a corpus to form pairs will mostly produce unrelated sentence pairs. We also want the dataset to cover a wide variety of related sentences (in terms of domain, structure, relatedness score, etc.). Thus, when creating sentence pairs that people will annotate for relatedness, we need to sample sentences in a more deliberate way.
This repository provides a pipeline to find pairs of sentences that are likely to be semantically related in a given text, generate tuples for best-worst-scaling annotation (see https://www.saifmohammad.com/WebPages/BestWorst.html for more details), format the tuples for annotation (Label Studio or Potato), process the annotations, and finally create sentence pairs with assigned scores. Please follow these guidelines to create such a dataset.
1. Find a Wide Variety of Semantically Related Pairs:
- The first step is to find sentences in a given corpus that are semantically related. There are many ways to achieve this; one is lexical overlap, and this repository provides a script that uses lexical overlap as a measure of semantic relatedness (a minimal sketch of the idea follows this list).
- Script: semantic_relatedness.py
- Read the paper What Makes Sentences Semantically Related? A Textual Relatedness Dataset and Empirical Study, which motivates this shared task.
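As an illustration of the lexical-overlap idea only (not the actual logic of semantic_relatedness.py), here is a minimal sketch that keeps sentence pairs whose Dice overlap of token sets exceeds a threshold; the corpus path, threshold, and function names are hypothetical.

```python
# Minimal sketch of lexical overlap as a relatedness heuristic.
# Illustration only; the repository's actual logic lives in semantic_relatedness.py.
import csv
from itertools import combinations

def dice_overlap(s1: str, s2: str) -> float:
    """Dice coefficient over lowercased token sets."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    if not t1 or not t2:
        return 0.0
    return 2 * len(t1 & t2) / (len(t1) + len(t2))

def candidate_pairs(sentences, threshold=0.3):
    """Yield pairs whose overlap exceeds a (hypothetical) threshold; O(n^2), fine for small corpora."""
    for s1, s2 in combinations(sentences, 2):
        if dice_overlap(s1, s2) >= threshold:
            yield s1, s2

if __name__ == "__main__":
    with open("corpus.txt", encoding="utf-8") as f:  # hypothetical one-sentence-per-line corpus
        sentences = [line.strip() for line in f if line.strip()]
    with open("data/semantic_related_pairs.tsv", "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["sentence1", "sentence2"])
        for s1, s2 in candidate_pairs(sentences):
            writer.writerow([s1, s2])
```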
2. Generate Best-Worst-Scaling Tuples:
- Once you have the semantically related sentence pairs, the next step is to generate tuples for best-worst-scaling annotation.
- Decide on the N instances (sentence pairs) right at the beginning and generate 2N 4-tuples using the Best-Worst-Scaling script (a simplified sketch of the tuple generation follows this list). Determine your N instances in one go; do not add new instances after annotation has begun.
- Script: generate-BWS-tuples.pl
- Visit Saif Mohammad's Best-Worst Scaling page (https://www.saifmohammad.com/WebPages/BestWorst.html) for more details.
- [optional] Read the paper Best-Worst Scaling More Reliable than Rating Scales: A Case Study on Sentiment Intensity Annotation to understand more about Best-Worst Scaling.
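The tuple generation itself is done by generate-BWS-tuples.pl; purely as an illustration of the "2N 4-tuples from N items" idea, here is a simplified Python sketch. It only keeps the four items of each tuple distinct and does not reproduce the balancing constraints of the real script.

```python
# Simplified sketch: draw 2N 4-tuples of distinct items from N sentence pairs.
# The real generator (generate-BWS-tuples.pl) adds further constraints,
# e.g. balancing how often each item appears across tuples.
import random

def generate_bws_tuples(items, seed=0):
    """Return 2N 4-tuples over the N input items."""
    assert len(set(items)) >= 4, "need at least 4 distinct items to form a 4-tuple"
    rng = random.Random(seed)
    tuples, pool = [], []
    for _ in range(2 * len(items)):
        tup = []
        while len(tup) < 4:
            if not pool:                 # refill and reshuffle so usage stays roughly even
                pool = list(items)
                rng.shuffle(pool)
            candidate = pool.pop()
            if candidate not in tup:     # the four items of a tuple must be distinct
                tup.append(candidate)
        tuples.append(tuple(tup))
    return tuples

if __name__ == "__main__":
    toy_pairs = [f"sentence{i}.\tsentence{i + 1}." for i in range(1, 11)]  # toy stand-ins for real pairs
    for t in generate_bws_tuples(toy_pairs)[:2]:
        print(t)
```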
3. Format Tuples for Annotation:
- With the generated tuples, you can now format them so that they can be uploaded for annotation (a sketch of the Label Studio formatting follows this list).
- For Potato, use the script: potato_annotation_format.py
- For Label Studio, use the script: label_studio_annotation_format.py
- Use the Annotation guide here.
- How much data to annotate? A few thousand instances per language are good (e.g., 3000).
- How many annotators? Use multiple annotators: 2 or 4 per tuple.
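To make the target format concrete, here is a hedged sketch of the Label Studio conversion: each 4-tuple row (columns pair1 to pair4) is split into the eight columns pair1a through pair4b shown in the example output later in this guide. It assumes the two sentences of a pair are joined by a literal "\t" marker inside each cell, as in the tuple example below; the real conversion is done by label_studio_annotation_format.py.

```python
# Sketch of the Label Studio formatting step: split each tuple row (columns pair1..pair4)
# into eight columns pair1a..pair4b. The in-cell pair separator is an assumption.
import csv

PAIR_SEP = "\\t"  # assumed: the two sentences of a pair are joined by a literal "\t" marker

def split_tuple_row(row: dict) -> dict:
    out = {}
    for i in range(1, 5):
        first, second = row[f"pair{i}"].split(PAIR_SEP, 1)
        out[f"pair{i}a"], out[f"pair{i}b"] = first.strip(), second.strip()
    return out

with open("data/tuples.tsv", encoding="utf-8") as src, \
     open("data/label_studio_annotation_samples.tsv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src, delimiter="\t")
    columns = [f"pair{i}{half}" for i in range(1, 5) for half in "ab"]
    writer = csv.DictWriter(dst, fieldnames=columns, delimiter="\t")
    writer.writeheader()
    for row in reader:
        writer.writerow(split_tuple_row(row))
```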
4. Process Annotations:
- After completing the annotation in Label Studio, export the annotations in tsv format.
- For Potato, after the annotation, run the following script on the server to export the formatted annotations: export_potato_annotations.py
5. Generate the Semantic Relatedness Pairs and Score, and the SHR score:
- Run the following bash script: process_annotations.sh
- Running the script generates the following files:
  - Mapping between Pair and ID: id_to_item.csv
  - Annotations by ID: annotation_to_eval.csv
  - Semantic Relatedness PairID and Score: pair_id-scores.csv
  - Semantic Relatedness Pairs and Score: scored_annotations.tsv
- The Split Half Reliability Score (SHR score) will be printed on the screen.
- Finally, the scored_annotations.tsv file will be used for the shared task (a sketch of how best-worst annotations are typically turned into scores follows this list).
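For context on how such scores are usually obtained: in standard best-worst counting, an item's score is the fraction of times it was chosen as best minus the fraction of times it was chosen as worst, optionally rescaled to [0, 1]. The sketch below illustrates that counting on the annotation_to_eval.csv layout shown later; whether process_annotations.sh uses exactly this variant is an assumption.

```python
# Illustrative best-worst counting: score = (#best - #worst) / #appearances,
# rescaled from [-1, 1] to [0, 1]. The official pipeline is process_annotations.sh.
import csv
from collections import Counter

appearances, best, worst = Counter(), Counter(), Counter()
with open("annotation_to_eval.csv", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        for i in range(1, 5):
            appearances[row[f"Item{i}"]] += 1
        best[row["BestItem"]] += 1
        worst[row["WorstItem"]] += 1

scores = {item: ((best[item] - worst[item]) / appearances[item] + 1) / 2 for item in appearances}
for item, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{item}\t{score:.2f}")
```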
The rest of this guide gives usage details and example outputs for each step.
To find the semantically related pairs (step 1), run:
python semantic_relatedness.py [OPTIONS]
OUTPUT
'data/semantic_related_pairs.tsv' -- tsv file containing semantically related pairs. E.g.
sentence1 | sentence2 |
---|---|
this is sentence1. | this is sentence2. |
this is sentence3. | this is sentence4. |
this is sentence5. | this is sentence6. |
To generate the best-worst-scaling tuples (step 2), run:
perl generate-BWS-tuples.pl [OPTIONS]
OUTPUT
'data/tuples.tsv' -- tsv file containing 4-tuples of sentence pairs. E.g.
pair1 | pair2 | pair3 | pair4 |
---|---|---|---|
sentence1. \t sentence2. | sentence1. \t sentence3. | sentence1. \t sentence4. | sentence2. \t sentence3. |
sentence3. \t sentence4. | sentence2. \t sentence4. | sentence1. \t sentence4. | sentence1. \t sentence2. |
To format the tuples for annotation (step 3), use one of the following. For Label Studio:
python label_studio_annotation_format.py -i [INPUT_TUPLES] -o [OUTPUT_PATH]
For Potato:
python potato_annotation_format.py -i [INPUT_TUPLES] -o [OUTPUT_PATH]
Where:
- INPUT_TUPLES: Path to the tsv file containing the tuples.
- OUTPUT_PATH: Output path for the annotation samples.
Example
python label_studio_annotation_format.py -i data/tuples.tsv -o data/
OUTPUT
Below is an example of the Label Studio output.
'data/label_studio_annotation_samples.tsv' -- tsv file containing the tuples formatted for Label Studio upload. E.g.
pair1a | pair1b | pair2a | pair2b | pair3a | pair3b | pair4a | pair4b |
---|---|---|---|---|---|---|---|
sentence1. | sentence2. | sentence1. | sentence3. | sentence1. | sentence4. | sentence2. | sentence3. |
sentence3. | sentence4. | sentence2. | sentence4. | sentence1. | sentence4. | sentence1. | sentence2. |
Below is an example of the Potato output (a sketch of how this HTML string is assembled follows the example).
"<div class=""tuple""><b>PAIR A</b><br/>1. sentence1.<br/>2. sentence2 </div><br/><div class=""tuple""><b>PAIR B</b><br/>1. sentence1. <br/>2. sentence3.</div><br/><div class=""tuple""><b>PAIR C</b><br/>1. sentence1 <br/>2.sentence4.</div><br/><div class=""tuple""><b>PAIR D</b><br/>1. sentence2.<br/>2. sentence3.</div>",tuple_1
After annotation (step 4), if you are using Potato, export the annotations from the server using the following script:
python export_potato_annotation.py ANNOTATION_PATH OUTPUT_DIR
For Label Studio, download the tsv of the annotated file.
To generate the outputs described in step 5 above, run:
bash process_annotations.sh -a PROCESSED_ANNOTATIONS -t ANNOTATION_TOOL -o OUTPUT_DIR
Where:
- PROCESSED_ANNOTATIONS: the annotation file exported in step 4.
- ANNOTATION_TOOL: 'label-studio' or 'potato'.
- OUTPUT_DIR: the output directory.
For example:
bash process_annotations.sh -a PROCESSED_ANNOTATIONS -t label-studio -o OUTPUT_DIR
OUTPUT
- The files listed in step 5 above, plus the SHR score printed on the console:
1. Mapping between Pair and ID: id_to_item.csv
2. Annotations by ID: annotation_to_eval.csv
3. Semantic Relatedness PairID and Score: pair_id-scores.csv
4. Semantic Relatedness Pairs and Score: scored_annotations.tsv
- The Split Half Reliability Score (SHR score) will be printed on the screen (a sketch of how such a score is typically computed follows the file examples below).
- Finally, the scored_annotations.tsv file will be used for the shared task.
Example of file id_to_item.csv:
item | id |
---|---|
this is sentence1. \n this is sentence2. | 1 |
this is sentence3. \n this is sentence4. | 2 |
this is sentence5. \n this is sentence6. | 3 |
Example of file annotation_to_eval.csv:
Item1 | Item2 | Item3 | Item4 | BestItem | WorstItem |
---|---|---|---|---|---|
1 | 2 | 3 | 4 | 1 | 2 |
1 | 5 | 6 | 7 | 6 | 5 |
Example of file pair_id-scores.csv:
id | score |
---|---|
1 | 1.0 |
2 | 0.75 |
3 | 0.5 |
Example of file scored_annotations.tsv:
item | score |
---|---|
this is sentence1. \n this is sentence2. | 1.00 |
this is sentence3. \n this is sentence4. | 0.75 |
this is sentence5. \n this is sentence6. | 0.5 |
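As background on the SHR number printed by the pipeline: split-half reliability is typically estimated by repeatedly splitting the annotations into two random halves, scoring each half independently with best-worst counting, and averaging the Spearman correlation between the two resulting score lists. The sketch below (using hypothetical file and parameter names) illustrates that procedure; the exact implementation in process_annotations.sh may differ.

```python
# Illustrative split-half reliability (SHR): average Spearman correlation between
# BWS scores computed on two random halves of the annotations, over several repeats.
import csv
import random
from collections import Counter

from scipy.stats import spearmanr  # assumes scipy is available

def bws_scores(rows):
    """Best-worst counting: (#best - #worst) / #appearances per item."""
    appearances, best, worst = Counter(), Counter(), Counter()
    for row in rows:
        for i in range(1, 5):
            appearances[row[f"Item{i}"]] += 1
        best[row["BestItem"]] += 1
        worst[row["WorstItem"]] += 1
    return {item: (best[item] - worst[item]) / appearances[item] for item in appearances}

def split_half_reliability(rows, repeats=100, seed=0):
    rng = random.Random(seed)
    correlations = []
    for _ in range(repeats):
        shuffled = list(rows)
        rng.shuffle(shuffled)
        scores_a, scores_b = bws_scores(shuffled[::2]), bws_scores(shuffled[1::2])
        common = sorted(set(scores_a) & set(scores_b))  # items scored in both halves
        rho, _ = spearmanr([scores_a[i] for i in common], [scores_b[i] for i in common])
        correlations.append(rho)
    return sum(correlations) / len(correlations)

with open("annotation_to_eval.csv", encoding="utf-8") as f:
    annotation_rows = list(csv.DictReader(f))
print(f"SHR (illustrative): {split_half_reliability(annotation_rows):.3f}")
```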
Ensure you provide the correct paths to the scripts and data files. If you encounter any issues or have suggestions, please raise an issue or submit a pull request.