
question about msmarco passage ranking dataset #66

Open
lboesen opened this issue Nov 21, 2022 · 8 comments

Comments


lboesen commented Nov 21, 2022

Hi :)

Regarding the msmarco passage dataset that gets downloaded from Hugging Face (https://huggingface.co/datasets/Tevatron/msmarco-passage/tree/main):

How was this dataset created? It doesn't match any of the datasets on the original Microsoft site (https://microsoft.github.io/msmarco/Datasets.html).

Thanks in advance


Tan-Hexiang commented Feb 26, 2023

I have a similar question. The NQ dataset from https://huggingface.co/datasets/Tevatron/wikipedia-nq/tree/main is not the same as the commonly used NQ dataset from the DPR paper (https://arxiv.org/abs/2004.04906).

I found the problem because the Tevatron/wikipedia-nq dev set has only 6,489 queries, while the DPR NQ dev set has 8,757 queries and the original NQ dev set has 7,830 queries.

How was the NQ dataset created? Or which paper does the dataset come from?
@MXueguang

MXueguang (Contributor) commented

Hi @Tan-Hexiang, I think I used the code below when filtering the train and dev sets.

import json

# Count dev examples that have at least one positive and at least 8 hard negatives.
data = json.load(open("biencoder-nq-dev.json"))
count = 0
for example in data:
    if len(example['positive_ctxs']) > 0 and len(example['hard_negative_ctxs']) >= 8:
        count += 1
print(count)

The file biencoder-nq-dev.json is from the original DPR repo; it contains 6.6k questions.
https://github.com/facebookresearch/DPR/blob/a31212dc0a54dfa85d8bfa01e1669f149ac832b7/dpr/data/download_data.py#L38

The reason we applied the above filter is that, in our early experiments, we found that having 8 hard negatives in a group sometimes gives better effectiveness.
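For completeness, the counting snippet above can be packaged as a reusable predicate and applied end-to-end to write out the filtered file; a minimal sketch, where the file names and the inline sample records are illustrative assumptions rather than real DPR data:

```python
import json

def keep(example):
    # Mirror of the filter above: at least one positive and >= 8 hard negatives.
    return len(example["positive_ctxs"]) > 0 and len(example["hard_negative_ctxs"]) >= 8

def filter_file(in_path, out_path):
    # Hypothetical end-to-end use on a DPR-format json file.
    data = json.load(open(in_path))
    kept = [ex for ex in data if keep(ex)]
    json.dump(kept, open(out_path, "w"))
    return len(kept)

# Illustrative inline examples (not real DPR data):
sample = [
    {"positive_ctxs": [{"text": "a"}], "hard_negative_ctxs": [{"text": "n"}] * 8},
    {"positive_ctxs": [],              "hard_negative_ctxs": [{"text": "n"}] * 8},
    {"positive_ctxs": [{"text": "a"}], "hard_negative_ctxs": [{"text": "n"}] * 3},
]
print(sum(keep(ex) for ex in sample))  # only the first record passes both conditions
```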


Tan-Hexiang commented Feb 28, 2023

@MXueguang Thanks for your reply!
The description of biencoder-nq-dev.json points out that it can only be used for retriever train-time validation.
[screenshot: description of biencoder-nq-dev.json in the DPR download script]
Instead, when validating the retrieval results, perhaps the nq-dev.qa.csv file should be used.
[screenshot: description of nq-dev.qa.csv in the DPR download script]
I am confused about which file to use when validating the retrieval results in example_dpr.md. As far as I know, DPR uses nq-dev.qa.csv, which has 8,757 queries, for validation. So for a fair comparison, I think we should also use the same file as DPR instead of the file with 6.6k questions.
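For reference, DPR-style top-k accuracy over a qa.csv file counts a question as a hit when any gold answer string appears in one of its top-k retrieved passages. A minimal sketch, with plain substring matching (DPR itself normalizes and tokenizes before matching) and toy inline data as assumptions:

```python
def top_k_hit(answers, retrieved_passages, k):
    # Hit if any gold answer string occurs in any of the top-k passage texts.
    return any(ans in passage for passage in retrieved_passages[:k] for ans in answers)

def top_k_accuracy(examples, k):
    # examples: list of (gold answers, ranked passage texts) pairs.
    hits = sum(top_k_hit(answers, passages, k) for answers, passages in examples)
    return hits / len(examples)

# Illustrative toy data, not real NQ:
examples = [
    (["Paris"], ["Berlin is in Germany.", "Paris is the capital of France."]),
    (["1969"], ["The moon landing happened in 1969."]),
    (["blue"], ["The sky is often grey in winter."]),
]
print(top_k_accuracy(examples, k=1))  # 1 of 3 questions hit at k=1
print(top_k_accuracy(examples, k=2))  # 2 of 3 questions hit at k=2
```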


Concretely, which dev file does the top-k accuracy below correspond to? nq-dev.qa.csv with 8,757 questions, or the filtered biencoder-nq-dev.json with 6,489 questions?
[screenshot: reported top-k retrieval accuracy]


MXueguang commented Feb 28, 2023

Following the original DPR work, all the evaluation was on the test set.

python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path model_nq \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --dataset_name Tevatron/wikipedia-nq/test \
  --encoded_save_path query_emb.pkl \
  --encode_is_qry

Here we are encoding the test set questions for evaluation.
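Once the query and corpus embeddings are saved, retrieval itself is an inner-product search. A numpy-only sketch, assuming (this layout is an assumption, not confirmed by the thread) that each saved pickle yields an array of vectors, here faked with random data:

```python
import numpy as np

def search(query_embs, passage_embs, k):
    # Dense retrieval scores are inner products; return top-k passage indices per query.
    scores = query_embs @ passage_embs.T          # (num_queries, num_passages)
    topk = np.argsort(-scores, axis=1)[:, :k]     # highest score first
    return topk

# Toy stand-ins for the contents of query_emb.pkl / a corpus pickle:
rng = np.random.default_rng(0)
passages = rng.normal(size=(100, 64)).astype("float32")
queries = passages[[3, 42]] + 0.01 * rng.normal(size=(2, 64)).astype("float32")

top = search(queries, passages, k=5)
print(top[0][0], top[1][0])  # each query's own passage should rank first
```

In practice Tevatron's examples use faiss for this step; the numpy version above is just to show the scoring rule.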


Y1Jia commented Mar 17, 2023

I have the same question about how the Tevatron/msmarco-passage dataset was created. @MXueguang

MXueguang (Contributor) commented

Hi @Y1Jia, the Tevatron/msmarco-passage data was created following https://github.com/texttron/tevatron/tree/main/examples/coCondenser-marco#get-data


Y1Jia commented Mar 17, 2023

Thank you for getting back to me so fast!
