The Reddit (small) splits seem to have the same data for training, validation and test sets #33

sayanghosh · 2020-11-20T18:23:21Z

The Reddit subsampled dataset is hosted at https://drive.google.com/file/d/1PwBpAEMYKNpnv64cQ2TIQfSc_vPbq3OQ/view?usp=sharing. However, after downloading this zip, we find that the train, validation and test sets are exactly identical. Is there any plan to address this? Also, are the results presented in the LEAF paper based on this set of subsampled data? @scaldas
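For anyone who wants to check their own copy, here is a minimal sketch of one way to test whether split files are byte-for-byte identical, by comparing SHA-256 digests. It is not taken from the LEAF repo. The helper names `file_digest` and `identical_splits` are made up for illustration, and the file paths in the usage note are placeholders, since the actual layout of the zip is not described in this thread.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large split files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def identical_splits(paths):
    """Group files by content hash and return the groups with duplicates.

    Each returned group is a sorted list of paths whose bytes are identical.
    An empty list means no two files share the same content.
    """
    groups = {}
    for path in paths:
        groups.setdefault(file_digest(path), []).append(str(path))
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Calling `identical_splits(["train.json", "val.json", "test.json"])` (hypothetical file names) would return a single group containing all three paths if the splits are exact copies. Note that this only catches exact byte-level duplicates; splits that overlap partially or differ only in formatting would need a record-level comparison instead.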

scaldas · 2020-11-20T20:43:19Z

Thank you for bringing this to my attention. You are completely right. I will work on fixing the data at that link and on updating the results in the manuscript. However, because I am the only one maintaining this repo, I can't provide a specific timeline. If you're in a rush and are looking for a similar dataset, consider TFF's Stack Overflow dataset.

sayanghosh · 2020-12-08T00:08:37Z

@scaldas Thanks for letting us know. No hurry, but it would be great if you could alert us when you are ready with the splits.

scaldas · 2021-12-19T15:05:59Z

I have updated the hosted dataset.

HenryHu-H · 2022-04-22T15:51:49Z

Hi, I am also wondering how I can reproduce the results for the Reddit dataset in the LEAF paper. I followed the instructions in https://github.com/TalwalkarLab/leaf/blob/master/data/reddit/README.md, but only reached an accuracy of less than 5% (with either the learning rate of 5.65 from the README or the learning rate of 8 from the paper). I also tried the GRU model from https://github.com/microsoft/msrflute/blob/main/experiments/nlg_gru/model.py, but still could not get an accuracy higher than 10%.
