This repository has been archived by the owner on Nov 1, 2024. It is now read-only.

kinetics datasets #67

Open
karenz17 opened this issue Aug 8, 2019 · 25 comments

Comments

@karenz17

karenz17 commented Aug 8, 2019

I saw that you have prepared a copy of the Kinetics dataset, and I emailed [email protected] recently but haven't received any reply. If you have time, please check your email and contact me. Thanks a lot!

@Lovelyczl

I have the same problem and hope someone can help. The original dataset is too big and not easy to obtain, especially in China.

@kraus-yang

Me too. I sent the email and haven't received a reply.

@busixingxing

Here is a way to get Kinetics-400:
https://www.dropbox.com/s/wcs01mlqdgtq4gn/compress.tar.gz?dl=1

@msingh27

msingh27 commented Jan 7, 2020

Hi @busixingxing ,

Thanks a lot for providing the link above (it contains the train and val data).

Can you also provide a link for the Kinetics-400 test set?

@busixingxing

> Thanks a lot for providing the link above (it contains the train and val data). Can you also provide a link for the Kinetics-400 test set?

Hi, I am not the person who maintains this link, so I am not sure where to get the test data. I think most researchers just use the validation data to estimate model performance.

@ShiroKL

ShiroKL commented Jan 14, 2020

@busixingxing Thanks for the link, it is very helpful.
I have a question regarding the number of videos in the archive. I found 234584 training videos, but the DATASET.md file says there are 234643. Is that difference normal? Is this archive not the same one DATASET.md refers to?

@busixingxing

busixingxing commented Jan 14, 2020 via email

@ShiroKL

ShiroKL commented Jan 15, 2020

@busixingxing Thanks for your reply. I investigated a little more, and it seems I had the same number of training files as you before extracting the frames. Some extractions fail because the files are "corrupted" (for instance, no video stream, only audio), which results in 234584 training files.
For training, the affected files are:

{'bowling': {'OErKBwdGJIk_000057_000067'}, 'dancing_ballet': {'7x6LxAdMgb0_000118_000128'}, 'spray_painting': {'OvMUfpc3nHw_000060_000070'}, 'tap_dancing': {'1_nxfkY76mk_000001_000011'}, 'snowkiting': {'pDPbETciXhw_000167_000177'}, 'riding_mountain_bike': {'w5ax4GiTkKg_000088_000098'}, 'wrapping_present': {'rKJk6ws2sGs_000103_000113'}, 'dying_hair': {'fNFXTBUF3nY_000230_000240', 'jHODDw65G4A_000085_000095'}, 'clean_and_jerk': {'zrpjA-ZKGEA_000105_000115'}, 'shaving_head': {'_M6Ko0yRfD4_000097_000107'}, 'sailing': {'99ABSLQdgUc_000046_000056'}, 'assembling_computer': {'xxUezLcXkDs_000256_000266'}, 'air_drumming': {'CUxsn4YXksI_000119_000129'}, 'motorcycling': {'aj1bmhf-IyU_000118_000128'}, 'trapezing': {'_Lw6CGMq4nc_000120_000130'}, 'deadlifting': {'Hm8X9u8jtOk_000022_000032'}, 'eating_hotdog': {'lk5Ap5gZNj0_000009_000019'}, 'catching_fish': {'DSNcuU-e8bU_000021_000031'}, 'snatch_weight_lifting': {'GajaQD6qRkw_000057_000067'}, 'hitting_baseball': {'uz5cIbBTf4Y_000049_000059'}, 'playing_tennis': {'efTAWmCkLKE_000418_000428'}, 'salsa_dancing': {'nfjWfoyGApo_000220_000230'}, 'playing_recorder': {'bgCrldl9pQ8_000027_000037'}, 'crossing_river': {'LSRil2XG1UU_000191_000201'}, 'tai_chi': {'LlflsbkvcKw_000090_000100'}, 'ice_skating': {'9D0o8lh8oeY_002353_002363'}, 'punching_bag': {'ixQrfusr6k8_000001_000011'}, 'cleaning_gutters': {'pM9KHPPo6oE_000046_000056'}, 'sweeping_floor': {'EuGXJiVQwCg_000005_000015'}, 'playing_paintball': {'SZtj2TEWiHc_000195_000205', 'zUZm-IvpnTo_000176_000186'}, 'picking_fruit': {'NLf-rU1wlTY_000161_000171'}, 'bungee_jumping': {'oyj6TFAxpiw_000229_000239'}, 'spinning_poi': {'5_gyoV_sQXU_000001_000011'}}

For validation, the affected file is:
'crossing_river': {'ZVdAl-yh9m0'}

If you could check a few of these and, if they are not corrupted, create a small archive to complete the previous one, that would be great.

@busixingxing

> If you could check a few of these and, if they are not corrupted, create a small archive to complete the previous one, that would be great.

Since the dataset is big, I did not have enough space to extract all the videos to frames. In SlowFast, the researchers may already have set up a filter, so corrupted videos did not block the normal training pipeline when I trained the model.

I did find some videos that have 0 frames or fewer than 100 frames, and they would stop training in the other library I used, MMAction, so I had to set up my own filter. I used mmcv to get the frame count of each video first; if a training video had fewer than 30 frames, I replaced it with another video from the same class.

For the validation set, I had to raise the threshold to 85 frames, because the sampling method at test time seems to be different and requires more frames. I hope this helps if you do not want to extract the frames next time.
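The replace-short-videos filter described above can be sketched as a small helper. This is a minimal sketch under stated assumptions: the frame counts would come from something like `mmcv.VideoReader(path).frame_cnt` (or any decoder that reports frame counts), and `build_filter`, its argument layout, and the random same-class substitution policy are illustrative, not code from this thread.

```python
import random
from collections import defaultdict

def build_filter(frame_counts, min_frames=30, seed=0):
    """Replace videos with fewer than `min_frames` frames by another
    video from the same class, mirroring the filter described above.

    frame_counts: dict mapping (class_name, video_id) -> frame count.
    (Assumption: counts were gathered beforehand, e.g. with
    mmcv.VideoReader(path).frame_cnt.)
    Returns the final training list of (class_name, video_id) pairs.
    """
    rng = random.Random(seed)

    # Index the healthy videos (enough frames) per class.
    by_class = defaultdict(list)
    for (cls, vid), n in frame_counts.items():
        if n >= min_frames:
            by_class[cls].append(vid)

    final = []
    for (cls, vid), n in frame_counts.items():
        if n >= min_frames:
            final.append((cls, vid))
        elif by_class[cls]:
            # Substitute a random healthy video from the same class,
            # keeping the per-class sample count unchanged.
            final.append((cls, rng.choice(by_class[cls])))
    return final
```

For validation the same helper would be called with `min_frames=85`, per the comment above.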

@Lovelyczl

Thanks for sharing, @busixingxing.
But I still can't download it in China. My VPN connection is unreliable and disconnects quickly, and the download can't retry automatically or resume.
Could you provide another copy of the dataset, or share another way of downloading it from Dropbox?
Thanks a lot!

@busixingxing

busixingxing commented Feb 12, 2020 via email

@Lovelyczl

I really need it. Thanks a lot!

@jiaozizhao

> I just counted the videos in my training set: 234619, and 13734 in the validation set. I probably missed the DATASET.md you are referring to when I copied the data. The original dataset only provided YouTube links, and the videos can be removed at any time by the people who uploaded them. The link I got might be a mirror of data from a Facebook researcher working on the SlowFast model. Even though my number is not exactly the same as yours, it should be close enough. My only experiment with this dataset is that I ran training with the Facebook SlowFast model, and I could reproduce their evaluation accuracy on the validation set. Therefore, I would assume the dataset is okay to use.

Hi, thanks for providing the link for the data. You said there were 13734 videos in the validation set, which is much fewer than the 19761 validation videos used in the paper. Does it matter?

@busixingxing

busixingxing commented Feb 28, 2020

> Hi, thanks for providing the link for the data. You said there were 13734 videos in the validation set, which is much fewer than the 19761 validation videos used in the paper. Does it matter?

The real situation is that nobody can crawl the complete dataset from YouTube anymore. Yes, the 13k validation set is a bit smaller than the 19k one, but it is still quite a big dataset.

I used SlowFast Model Zoo's SlowFast R50 8x8 to test on those 13k videos, and with the downloaded Caffe2 model file I can get results close to the Model Zoo accuracy. I did not have a chance to train SlowFast fully due to limited GPUs; I tried, and it took 4 GPUs 20 days to train 100 epochs. The default training schedule is 196 epochs.

My result from the FB researchers' Caffe2 pretrained model is:

Top-1 Accuracy = 74.8, Top-5 Accuracy = 91.6

Their result is:

Top-1 Accuracy = 77.0, Top-5 Accuracy = 92.6

They also said that "testing Caffe2 pretrained model in PyTorch might have a small difference in performance". So I would assume the validation set downloaded from the Dropbox link is safe to use.

@jiaozizhao

> The real situation is that nobody can crawl the complete dataset from YouTube anymore. [...] I would assume the validation set downloaded from the Dropbox link is safe to use.

Hi, thanks for replying. However, I don't think 74.8 is close to 77.0. I downloaded the dataset from the link you gave, and I can find 19736 videos in the validation set. The only issue is that they come in different formats: some videos are .mp4, some are .mkv, and some are .webm. Did you get 13k because you only counted the .mp4 files? Could you help me check?

@busixingxing

> Hi, thanks for replying. However, I don't think 74.8 is close to 77.0. I downloaded the dataset from the link you gave, and I can find 19736 videos in the validation set. The only issue is that they come in different formats: some videos are .mp4, some are .mkv, and some are .webm. Did you get 13k because you only counted the .mp4 files? Could you help me check?

I think you are right. There are a lot of videos that are not in .mp4 format, and I did not count those in my previous reply; sorry for the confusion.

When I wrote the script, I only used the .mp4 files. That may be another reason my top-1 result is lower.

My coworker mentioned that he did another pass to unify all the videos' formats after my test job. Depending on your input pipeline, maybe simply remuxing .mkv to .mp4 works too.
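A quick way to double-check the per-format split is to tally the validation folder by extension. A minimal sketch; `count_by_extension` is an illustrative helper, and the `val/<class>/<id>.<ext>` layout is an assumption about how the archive extracts:

```python
from collections import Counter
from pathlib import Path

def count_by_extension(paths):
    """Tally files by container extension, case-insensitively,
    so e.g. '.MKV' and '.mkv' land in the same bucket."""
    return Counter(Path(p).suffix.lower().lstrip(".") for p in paths)

# On a real copy of the data, feed it the directory listing, e.g.:
#   count_by_extension(Path("val").rglob("*.*"))
```

For the remux itself, `ffmpeg -i in.mkv -c copy out.mp4` usually works without re-encoding when the streams are mp4-compatible; otherwise a full re-encode is needed.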

@zswzifir

zswzifir commented Apr 1, 2020

> Here is a way to get Kinetics-400:
> https://www.dropbox.com/s/wcs01mlqdgtq4gn/compress.tar.gz?dl=1

Hi @busixingxing, I downloaded compress.tar.gz, but extraction fails after 217 classes. Could you provide the md5 checksum?
Thanks a lot!

Some errors:

tar: Skipping to next header
tar: Substituting `.' for empty member name
tar: .: Cannot open: Is a directory
tar: Skipping to next header
tar: Archive contains ‘\344\370\203\032\354ʷ'!X\222\310’ where numeric off_t value expected
...

@busixingxing

> Hi @busixingxing, I downloaded compress.tar.gz, but extraction fails after 217 classes. Could you provide the md5 checksum?

Hi, I don't have the md5 either; I deleted the raw archive after extracting it.
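For anyone re-sharing the archive, a checksum computed before upload would let downloaders verify their copy and catch the kind of truncation the tar errors above suggest. A minimal sketch (`file_md5` is an illustrative name):

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Stream a (possibly huge) file through MD5 in 1 MiB chunks,
    so the whole archive never has to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Comparing a published digest against this (or against `md5sum compress.tar.gz` on the command line) before extracting would tell a corrupted download apart from a corrupted archive.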

@gurkirt

gurkirt commented May 27, 2020

@busixingxing @zswzifir

> Here is a way to get Kinetics-400:
> https://www.dropbox.com/s/wcs01mlqdgtq4gn/compress.tar.gz?dl=1

The training set has 234619 .mp4 videos. The validation set has 19761 videos, not 13734: 13734 of them are .mp4, and the remaining videos have .webm or .mkv extensions.

@lukas-larsson

Could someone provide a new Dropbox link for the dataset (or similar)? Unfortunately, the one in this thread has expired.

@LiuChaoXD

> Here is a way to get Kinetics-400:
> https://www.dropbox.com/s/wcs01mlqdgtq4gn/compress.tar.gz?dl=1

The URL is out of date. Would you mind sharing the Kinetics-400 dataset again?

@youngwanLEE

youngwanLEE commented Dec 28, 2020

@LiuChaoXD @lukas-larsson @daodao316 @makecent @KangSooHan
We provide the dataset link.
https://github.com/youngwanLEE/VoV3D/blob/main/DATA.md#kinetics-400

@applesleam

applesleam commented Feb 19, 2021

@youngwanLEE

@applesleam
The link is fixed.
Try it again.

@alexnwang

@youngwanLEE
The link seems to be dead again, would it be possible to update?
Thanks!
