The Reddit (small) splits seem to have the same data for training, validation and test sets #33

Open
sayanghosh opened this issue Nov 20, 2020 · 4 comments

@sayanghosh

The Reddit subsampled dataset is hosted at https://drive.google.com/file/d/1PwBpAEMYKNpnv64cQ2TIQfSc_vPbq3OQ/view?usp=sharing. However, after downloading this zip, we found that the train, validation, and test sets are exactly identical. Is there any plan to address this? Also, are the results presented in the LEAF paper based on this subsampled data? @scaldas
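For anyone who wants to check this on their own copy, here is a minimal sketch that hashes each split file and reports which splits share the same bytes. The function names and the mapping of split names to file paths are hypothetical, since the archive layout is not described here, and the check only catches byte-for-byte duplicates, not the same records stored in a different order.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_identical_splits(split_paths):
    """Return groups of split names whose files are byte-for-byte identical.

    split_paths: dict mapping a split name to a file path
    (hypothetical layout; adjust to the unzipped archive).
    """
    by_digest = {}
    for name, path in split_paths.items():
        by_digest.setdefault(file_digest(path), []).append(name)
    # Only digests shared by more than one split indicate duplication.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]
```

Calling `find_identical_splits` with the paths of the three split files returns an empty list when all splits differ, and a single group containing all three names when they are the same file.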

@scaldas
Collaborator

scaldas commented Nov 20, 2020

Thank you for bringing this to my attention. You are completely right. I will work on fixing the data in that link and updating the results in the manuscript. However, because I am the only one maintaining this repo, I can't provide a specific timeline. If you're in a rush and are looking for a similar dataset, consider TFF's Stack Overflow dataset.

@sayanghosh
Author

@scaldas Thanks for letting us know. No hurry, but it would be great if you could alert us when you are ready with the splits.

@scaldas
Collaborator

scaldas commented Dec 19, 2021

I have updated the hosted dataset.

@HenryHu-H

Hi, I am also wondering how to reproduce the Reddit results reported in the LEAF paper. I followed the instructions in https://github.com/TalwalkarLab/leaf/blob/master/data/reddit/README.md, but reached less than 5% accuracy, both with the learning rate of 5.65 given in the README and with the learning rate of 8 given in the paper. I also tried the GRU model from https://github.com/microsoft/msrflute/blob/main/experiments/nlg_gru/model.py, but still could not get an accuracy higher than 10%.
