
question about msmarco passage ranking dataset #66

Open
lboesen opened this issue Nov 21, 2022 · 8 comments

Comments


lboesen commented Nov 21, 2022

Hi :)

Regarding the msmarco passage dataset that gets downloaded from Hugging Face (https://huggingface.co/datasets/Tevatron/msmarco-passage/tree/main):

How was this dataset created? It doesn't match any of the datasets on the original Microsoft site (https://microsoft.github.io/msmarco/Datasets.html).

Thanks in advance


Tan-Hexiang commented Feb 26, 2023

I have a similar question. The NQ dataset from https://huggingface.co/datasets/Tevatron/wikipedia-nq/tree/main is not the same as the commonly used NQ dataset from the DPR paper (https://arxiv.org/abs/2004.04906).

I found the problem because the Tevatron/wikipedia-nq dev set has only 6,489 queries, while the DPR NQ dev set has 8,757 queries and the original NQ dev set has 7,830 queries.

How was the NQ dataset created? Or which paper does the dataset come from?
@MXueguang

MXueguang (Contributor) commented

Hi @Tan-Hexiang, I think I used the code below when filtering the train and dev sets.

import json

# Count dev examples that have at least one positive and at least 8 hard negatives.
data = json.load(open("biencoder-nq-dev.json"))
count = 0
for example in data:
    if len(example['positive_ctxs']) > 0 and len(example['hard_negative_ctxs']) >= 8:
        count += 1
print(count)

The file biencoder-nq-dev.json is from the original DPR repo; it contains 6.6k questions.
https://github.com/facebookresearch/DPR/blob/a31212dc0a54dfa85d8bfa01e1669f149ac832b7/dpr/data/download_data.py#L38

The reason we applied the above filter is that, in our early experiments, we found that having 8 hard negatives in a group sometimes gives better effectiveness.
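For completeness, the counting snippet above can be packaged as a reusable predicate and applied end-to-end to write out the filtered file; a minimal sketch, where the file names and the inline sample records are illustrative assumptions rather than real DPR data:

```python
import json

def keep(example):
    # Mirror of the filter above: at least one positive and >= 8 hard negatives.
    return len(example["positive_ctxs"]) > 0 and len(example["hard_negative_ctxs"]) >= 8

def filter_file(in_path, out_path):
    # Hypothetical end-to-end use on a DPR-format json file.
    data = json.load(open(in_path))
    kept = [ex for ex in data if keep(ex)]
    json.dump(kept, open(out_path, "w"))
    return len(kept)

# Illustrative inline examples (not real DPR data):
sample = [
    {"positive_ctxs": [{"text": "a"}], "hard_negative_ctxs": [{"text": "n"}] * 8},
    {"positive_ctxs": [],              "hard_negative_ctxs": [{"text": "n"}] * 8},
    {"positive_ctxs": [{"text": "a"}], "hard_negative_ctxs": [{"text": "n"}] * 3},
]
print(sum(keep(ex) for ex in sample))  # only the first record passes both conditions
```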


Tan-Hexiang commented Feb 28, 2023

@MXueguang Thanks for your reply!
The description of biencoder-nq-dev.json points out that it can only be used for retriever train-time validation.
[screenshot: description of biencoder-nq-dev.json in the DPR download script]
Instead, when validating the retrieval results, perhaps the nq-dev.qa.csv file should be used.
[screenshot: description of nq-dev.qa.csv in the DPR download script]
I am confused about which file to use when validating the retrieval results in example_dpr.md. As far as I know, DPR uses nq-dev.qa.csv, which has 8,757 queries, for validation. So for a fair comparison, I think we should also use the same file as DPR instead of the file with 6.6k questions.
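For reference, DPR-style top-k accuracy over a qa.csv file counts a question as a hit when any gold answer string appears in one of its top-k retrieved passages. A minimal sketch, with plain substring matching (DPR itself normalizes and tokenizes before matching) and toy inline data as assumptions:

```python
def top_k_hit(answers, retrieved_passages, k):
    # Hit if any gold answer string occurs in any of the top-k passage texts.
    return any(ans in passage for passage in retrieved_passages[:k] for ans in answers)

def top_k_accuracy(examples, k):
    # examples: list of (gold answers, ranked passage texts) pairs.
    hits = sum(top_k_hit(answers, passages, k) for answers, passages in examples)
    return hits / len(examples)

# Illustrative toy data, not real NQ:
examples = [
    (["Paris"], ["Berlin is in Germany.", "Paris is the capital of France."]),
    (["1969"], ["The moon landing happened in 1969."]),
    (["blue"], ["The sky is often grey in winter."]),
]
print(top_k_accuracy(examples, k=1))  # 1 of 3 questions hit at k=1
print(top_k_accuracy(examples, k=2))  # 2 of 3 questions hit at k=2
```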


Concretely, which dev file does the top-k accuracy below correspond to? nq-dev.qa.csv with 8,757 questions, or the filtered biencoder-nq-dev.json with 6,489 questions?
[screenshot: reported top-k retrieval accuracy]


MXueguang commented Feb 28, 2023

Following the original DPR work, all the evaluation was on the test set.

python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path model_nq \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --dataset_name Tevatron/wikipedia-nq/test \
  --encoded_save_path query_emb.pkl \
  --encode_is_qry

Here we are encoding the test set questions for evaluation.
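Once the query and corpus embeddings are saved, retrieval itself is an inner-product search. A numpy-only sketch, assuming (this layout is an assumption, not confirmed by the thread) that each saved pickle yields an array of vectors, here faked with random data:

```python
import numpy as np

def search(query_embs, passage_embs, k):
    # Dense retrieval scores are inner products; return top-k passage indices per query.
    scores = query_embs @ passage_embs.T          # (num_queries, num_passages)
    topk = np.argsort(-scores, axis=1)[:, :k]     # highest score first
    return topk

# Toy stand-ins for the contents of query_emb.pkl / a corpus pickle:
rng = np.random.default_rng(0)
passages = rng.normal(size=(100, 64)).astype("float32")
queries = passages[[3, 42]] + 0.01 * rng.normal(size=(2, 64)).astype("float32")

top = search(queries, passages, k=5)
print(top[0][0], top[1][0])  # each query's own passage should rank first
```

In practice Tevatron's examples use faiss for this step; the numpy version above is just to show the scoring rule.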


Y1Jia commented Mar 17, 2023

I have the same question about how the Tevatron/msmarco-passage dataset was created. @MXueguang

MXueguang (Contributor) commented

Hi @Y1Jia, the Tevatron/msmarco-passage data was created following https://github.com/texttron/tevatron/tree/main/examples/coCondenser-marco#get-data


Y1Jia commented Mar 17, 2023

Thank you for getting back to me so fast!
