question about msmarco passage ranking dataset #66
I have a similar question. The NQ dataset from Tevatron (https://huggingface.co/datasets/Tevatron/wikipedia-nq/tree/main) is not the same as the commonly used NQ dataset from the DPR paper (https://arxiv.org/abs/2004.04906). I noticed the problem because the Tevatron/wikipedia-nq dev set has only 6,489 queries, whereas the DPR NQ dev set has 8,757 queries and the original NQ dev set has 7,830 queries. How was this NQ dataset created? Or which paper does the dataset come from?
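For anyone who wants to reproduce the count comparison above, here is a minimal sketch for counting the queries in a downloaded split. It assumes the split is stored as a JSON Lines file (optionally gzipped) with one query record per line; the helper name `count_queries` and any file path passed to it are hypothetical, not part of Tevatron.

```python
import gzip
import json


def count_queries(path):
    """Count the non-empty JSON Lines records in a (possibly gzipped) split file.

    Assumption: each non-empty line is one JSON object describing one query.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    n = 0
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                json.loads(line)  # fail loudly on a malformed record
                n += 1
    return n
```

Running this on each dev file and comparing the results against 6,489, 8,757, and 7,830 would show which release a given file corresponds to.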
Hi @Tan-Hexiang, I think I used the code below while filtering the train and dev sets.
the file
The reason we applied the above filter is that, in our early experiments, we found that having 8 hard negatives in a group sometimes gave better effectiveness.
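The filtering code itself is not reproduced in this thread, so the following is only a guess at what a filter of this kind could look like, not the maintainer's actual code. It assumes each example is a dict with a `negative_passages` list and keeps only examples that have at least 8 hard negatives; the function name `filter_by_hard_negatives` and the threshold parameter `min_negatives` are hypothetical.

```python
def filter_by_hard_negatives(examples, min_negatives=8):
    """Keep examples that carry at least `min_negatives` hard negatives.

    Assumption: each example is a dict whose "negative_passages" value is a list.
    Examples without that key are treated as having no negatives and are dropped.
    """
    return [
        ex for ex in examples
        if len(ex.get("negative_passages", [])) >= min_negatives
    ]


# Toy data: only the first query has enough negatives to survive the filter.
examples = [
    {"query_id": "q1", "negative_passages": [{"docid": str(i)} for i in range(8)]},
    {"query_id": "q2", "negative_passages": [{"docid": "0"}]},
    {"query_id": "q3"},
]
kept = filter_by_hard_negatives(examples)
print([ex["query_id"] for ex in kept])
```

A filter like this drops every query with too few negatives, which would be consistent with the Tevatron dev set being smaller than the DPR one.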
@MXueguang Thanks for your reply!
Following the original DPR work, all evaluation was done on the test set.
--output_dir=temp \
--model_name_or_path model_nq \
--fp16 \
--per_device_eval_batch_size 156 \
--dataset_name Tevatron/wikipedia-nq/test \
--encoded_save_path query_emb.pkl \
--encode_is_qry
Here we are encoding the test set questions for evaluation.
I have the same question about how the Tevatron/msmarco-passage dataset was created @MXueguang
Hi @Y1Jia, the Tevatron/msmarco-passage data was created following https://github.com/texttron/tevatron/tree/main/examples/coCondenser-marco#get-data
Thank you for getting back to me so fast!
Hi :)
Regarding the MS MARCO passage dataset that gets downloaded from Hugging Face (https://huggingface.co/datasets/Tevatron/msmarco-passage/tree/main):
How was this dataset created? It doesn't match any of the datasets on the original Microsoft site (https://microsoft.github.io/msmarco/Datasets.html).
Thanks in advance