[Feedback welcome] CLI to upload arbitrary huge folder #2254

Open · wants to merge 52 commits into main
Conversation

@Wauplin (Contributor) commented Apr 26, 2024

What for?

Upload arbitrarily large folders in a single command line!

⚠️ This tool is still experimental and is meant for power users. Expect some rough edges in the process. Feedback and bug reports would be very much appreciated ❤️

How to use it?

Install

pip install git+https://github.com/huggingface/huggingface_hub@large-upload-cli

Upload folder

huggingface-cli large-upload <repo-id> <local-path>

Every minute a report is printed to the terminal with the current status. Apart from that, progress bars and errors are still displayed.

Large upload status:
  Progress:
    104/104 hashed files (22.5G/22.5G)
    0/42 preuploaded LFS files (0.0/22.5G) (+4 files with unknown upload mode yet)
    58/104 committed files (24.9M/22.5G)
    (0 gitignored files)
  Jobs:
    sha256: 0 workers (0 items in queue)
    get_upload_mode: 0 workers (4 items in queue)
    preupload_lfs: 6 workers (36 items in queue)
    commit: 0 workers (0 items in queue)
  Elapsed time: 0:00:00
  Current time: 2024-04-26 16:24:25

Run huggingface-cli large-upload --help to see all options.
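For example, to upload to a dataset repo with an explicit number of workers (both options are listed in --help):

huggingface-cli large-upload <repo-id> <local-path> --repo-type dataset --num-workers 4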

What does it do?

This CLI is intended to upload arbitrarily large folders in a single command:

  • the process is split into 4 steps: hash, get upload mode, LFS upload, commit (see the sketch after this list)
  • multi-threaded: workers are managed with queues
  • retry on error at each step
  • resumable: if the process is interrupted, you can re-run it. Only partially uploaded files are lost.
  • files are hashed only once
  • starts uploading files while other files are still being hashed
  • commits at most 50 files at a time
  • prevents concurrent commits
  • prevents small commits
  • avoids hitting rate limits as much as possible
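To make the pipeline shape concrete, here is a minimal sketch of the worker/queue pattern described above. It is illustrative only, not the actual implementation: the FileItem class, the compute_sha256 helper and the single hash step shown are simplifications.

import hashlib
import queue
import threading
from dataclasses import dataclass

@dataclass
class FileItem:          # hypothetical item passed between pipeline steps
    path: str
    sha256: str = ""

def compute_sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

# one queue per step: hash -> get upload mode -> preupload LFS -> commit
hash_queue: "queue.Queue[FileItem]" = queue.Queue()
upload_mode_queue: "queue.Queue[FileItem]" = queue.Queue()

def hash_worker() -> None:
    while True:
        item = hash_queue.get()
        try:
            item.sha256 = compute_sha256(item.path)  # each file is hashed only once
            upload_mode_queue.put(item)              # hand over to the next step
        except Exception:
            hash_queue.put(item)                     # naive retry on error
        finally:
            hash_queue.task_done()

# a small pool of daemon threads per step; the real CLI sizes and schedules
# these pools dynamically (see the priority sketch further down)
for _ in range(4):
    threading.Thread(target=hash_worker, daemon=True).start()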

A .huggingface/ folder will be created at the root of your folder to keep track of the progress. Please do not modify these files manually. If you think this folder got corrupted, please report it here, delete the .huggingface/ folder entirely and then restart your command. Some intermediate steps will be lost, but the upload process should be able to continue correctly.

Known limitations

  • cannot set a path_in_repo => files are always uploaded at the root of the repo. If you want to upload to a subfolder, you need to set up the proper structure locally first (see the sketch after this list).
  • not optimized for hf_transfer (though it works) => better to set --num-workers to 2, otherwise CPU usage gets very high
  • cannot delete files on repo while uploading folder
  • cannot set commit message/commit description
  • cannot create PR by itself => you must first create a PR manually, then provide revision
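For the path_in_repo limitation, a workaround is to mirror the desired repo layout locally before uploading. A minimal sketch, with made-up paths (copy or move files into a staging folder whose structure matches the repo you want):

import shutil
from pathlib import Path

src = Path("raw_parquet_files")      # where the files currently live (example path)
staging = Path("staging/data")       # files should end up under data/ in the repo
staging.mkdir(parents=True, exist_ok=True)

for f in src.glob("*.parquet"):
    shutil.copy2(f, staging / f.name)  # shutil.move() also works if you don't want a copy

# then upload the staging root:
#   huggingface-cli large-upload <repo-id> staging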

What to review?

Nothing yet.

For now the goal is to gather as much feedback as possible. If it proves successful, I will clean up the implementation and make it more production-ready. Also, this PR is built on top of #2223, which is not merged yet and makes the diff very long.

For curious people, here is the logic that decides which task should be performed next.
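For readers who don't want to open the diff: the scheduling boils down to a priority rule over the queues. A much-simplified, hypothetical sketch of that kind of logic (the Status class and the thresholds are illustrative, not the actual code):

from dataclasses import dataclass

@dataclass
class Status:                     # hypothetical snapshot of the pipeline state
    to_hash: int = 0              # files waiting to be sha256-hashed
    to_get_upload_mode: int = 0
    to_preupload_lfs: int = 0
    to_commit: int = 0
    commit_in_progress: bool = False

def next_task(status: Status) -> str:
    # commit at most 50 files at a time and never run two commits concurrently
    if status.to_commit >= 50 and not status.commit_in_progress:
        return "commit"
    if status.to_preupload_lfs > 0:   # keep the network busy
        return "preupload_lfs"
    if status.to_get_upload_mode > 0:
        return "get_upload_mode"
    if status.to_hash > 0:            # hashing feeds every later step
        return "sha256"
    if status.to_commit > 0 and not status.commit_in_progress:
        return "commit"               # flush the remaining files once everything else is done
    return "wait"

print(next_task(Status(to_commit=60)))  # -> "commit"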

@Wauplin (Contributor, Author) commented May 3, 2024

Feedback so far:

  • when the connection is slow, it is better to reduce the number of workers. Should we do that automatically or just print a message? Reducing the number of workers might not speed up the upload, but at least fewer files are uploaded in parallel => fewer chances to lose progress in case of a failed upload.
  • terminal output is too verbose. Might be good to disable individual progress bars?
  • terminal output is awful in a Jupyter notebook => how can we make it more friendly? (printing a report every minute ends up with very long logs)
  • a CTRL-C (or at most 2 CTRL-Cs) must stop the process. This is not the case at the moment due to all the try/except blocks (see the sketch after this list).
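On the CTRL-C point, the usual fix is to make worker loops catch Exception rather than everything, so that KeyboardInterrupt still propagates. A minimal sketch (not the actual code):

import queue
from typing import Callable

def worker_loop(q: "queue.Queue[str]", process: Callable[[str], None]) -> None:
    while True:
        item = q.get()
        try:
            process(item)
        except Exception:      # KeyboardInterrupt derives from BaseException,
            q.put(item)        # so it is not swallowed by this handler
        finally:
            q.task_done()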

EDIT:

  • should print a warning when uploading parquet/arrow files to a model repository. It is not possible to convert a model repo to a dataset repo afterwards, so better to be sure.

EDIT:

  • might create empty commits in some cases (if files have already been committed). Bad UX when resuming.

@davanstrien (Member)

IMO, it would make sense for this not to default to uploading as a model repo, i.e. require this:

huggingface-cli large-upload <repo-id> <local-path> --repo-type dataset

If a user runs:

huggingface-cli large-upload <repo-id> <local-path>

they should get an error along the lines of "Please specify the repo type you want to use".

Quite a few people using this tool have accidentally uploaded a dataset to a model repo, and currently it's not easy to move it to a dataset repo.

I know that many of the huggingface_hub methods/functions default to model repos, but I think that doesn't make sense in this case since:

  • it's at least as likely to be used for uploading datasets as for model weights
  • since the goal is to support large uploads, the cost of getting it wrong is quite high for the user
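(In argparse terms, the suggestion amounts to something like the sketch below: make the option required instead of defaulting to "model". Illustrative only, not the actual CLI code.)

import argparse

parser = argparse.ArgumentParser(prog="huggingface-cli large-upload")
parser.add_argument("repo_id")
parser.add_argument("local_path")
parser.add_argument(
    "--repo-type",
    choices=["model", "dataset", "space"],
    required=True,  # error out instead of silently defaulting to "model"
    help="Please specify the repo type you want to use",
)
args = parser.parse_args()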

@julien-c (Member)

Ah, I rather agree with @davanstrien here.

@wanng-ide

Can the parameters of "large-upload" be aligned with those of "upload"?
huggingface-cli large-upload [repo_id] [local_path]

@Wauplin (Contributor, Author) commented May 22, 2024

@wanng-ide Agreed, we should aim for consistency. Which parameters/options would you specifically change?

So far we have:

$ huggingface-cli large-upload --help
usage: huggingface-cli <command> [<args>] large-upload [-h] [--repo-type {model,dataset,space}]
                                                       [--revision REVISION] [--private]
                                                       [--include [INCLUDE ...]] [--exclude [EXCLUDE ...]]
                                                       [--token TOKEN] [--num-workers NUM_WORKERS]
                                                       repo_id local_path
$ huggingface-cli upload --help 
usage: huggingface-cli <command> [<args>] upload [-h] [--repo-type {model,dataset,space}]
                                                 [--revision REVISION] [--private] [--include [INCLUDE ...]]
                                                 [--exclude [EXCLUDE ...]] [--delete [DELETE ...]]
                                                 [--commit-message COMMIT_MESSAGE]
                                                 [--commit-description COMMIT_DESCRIPTION] [--create-pr]
                                                 [--every EVERY] [--token TOKEN] [--quiet]
                                                 repo_id [local_path] [path_in_repo]

@wanng-ide

> @wanng-ide Agreed, we should aim for consistency. Which parameters/options would you specifically change?
> [the two --help outputs quoted above]

what about: huggingface-cli large-upload [local_path] [path_in_repo]
ADD [path_in_repo]

@Wauplin (Contributor, Author) commented May 22, 2024

I'm not sure I understand the purpose of the ADD keyword.

@rom1504 commented May 29, 2024

Will this only be a CLI or also a Python function? I liked the Python API for upload_folder. It's convenient for automating the upload of many datasets in Python rather than bash.

@Wauplin (Contributor, Author) commented May 30, 2024

> Will this only be a CLI or also a Python function?

Yes, that's the goal. At the moment, it is defined as a standalone method large_upload() (see here). In a final version, we will probably add it to the HfApi client.
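On this branch it can already be called from Python, roughly as below. The import path is the one used by the CLI (visible in the traceback further down); the parameter names are guesses based on the CLI options and may differ from the real signature:

from huggingface_hub.large_upload import large_upload

large_upload(
    repo_id="username/my-dataset",  # hypothetical repo
    local_path="path/to/folder",    # parameter names assumed from the CLI options
    repo_type="dataset",
    num_workers=4,
)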

@rom1504 commented Jun 1, 2024

I'm using it to upload a few 300GB datasets. The standard upload function was taking more than 30 min just to hash the files and then was crashing halfway through the upload. This seems to be working much better.

@rom1504 commented Jun 8, 2024

OK, I got one more piece of feedback actually...
Looks like this tool is too fast :)

It seems to have been killing my box for a few hours (after uploading at 80MB/s for a few hours). I don't really get how that's possible yet.

What would you advise to reduce the speed a bit / reduce the number of simultaneous connections?

@Wauplin
Copy link
Contributor Author

Wauplin commented Jun 10, 2024

Wow, this is an unexpected problem 😄 I can think of two ways of reducing the upload speed:

  1. don't use hf_transfer if you were previously using it
  2. set --num-workers=1 (or 2/3) to reduce the number of workers uploading files in parallel. However, there is currently no way to throttle the connection from huggingface_hub directly. There is a separate issue for that (see Throttle download speed #2118 (comment)), but I don't think we'll ever work on such a feature. You could set this up with a proxy on your side, I believe, though it's quite hacky. (See the example command below.)
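For example, since hf_transfer is toggled via the HF_HUB_ENABLE_HF_TRANSFER environment variable, disabling it and lowering the worker count would look something like:

HF_HUB_ENABLE_HF_TRANSFER=0 huggingface-cli large-upload <repo-id> <local-path> --num-workers=2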

@ducha-aiki

@Wauplin
Just downloaded and ran it, and got the following error:

> huggingface-cli large-upload  hoverinc/mydataset_test data

  File "/home/dmytromishkin/miniconda3/envs/pytorch/bin/huggingface-cli", line 5, in <module>
    from huggingface_hub.commands.huggingface_cli import main
  File "/home/dmytromishkin/big_storage/huggingface_hub/src/huggingface_hub/commands/huggingface_cli.py", line 21, in <module>
    from huggingface_hub.commands.large_upload import LargeUploadCommand
  File "/home/dmytromishkin/big_storage/huggingface_hub/src/huggingface_hub/commands/large_upload.py", line 29, in <module>
    from huggingface_hub.large_upload import large_upload
  File "/home/dmytromishkin/big_storage/huggingface_hub/src/huggingface_hub/large_upload.py", line 513, in <module>
    def _get_one(queue: queue.Queue[JOB_ITEM_T]) -> List[JOB_ITEM_T]:
TypeError: 'type' object is not subscriptable

@Wauplin (Contributor, Author) commented Jun 10, 2024

@ducha-aiki Which Python version are you using? Could you try to upgrade to 3.10 and let me know if it still happens? I suspect queue.Queue[JOB_ITEM_T] is not allowed in Python 3.8.
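(For context: subscripting standard-library containers like queue.Queue in annotations only works at runtime on Python 3.9+ (PEP 585). On 3.8 the usual workaround is postponed evaluation of annotations, sketched below, or writing the annotation as a string.)

from __future__ import annotations  # annotations are no longer evaluated at runtime
import queue
from typing import List, TypeVar

JOB_ITEM_T = TypeVar("JOB_ITEM_T")

def _get_one(q: queue.Queue[JOB_ITEM_T]) -> List[JOB_ITEM_T]:
    # fine on Python 3.8 thanks to the __future__ import above
    return [q.get()]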

@ducha-aiki

Oh no, my favorite 2-year-old environment... you got me, that was Python 3.8.12.
Trying on 3.10 now: no error, but also nothing got uploaded... the repo is either not created (when I tried without creating the repo) or empty (when I tried creating the repo first).

INFO:huggingface_hub.large_upload:

##########
Large upload status:
  Progress:
    57/57 hashed files (15.5M/15.5M)
    57/57 preuploaded LFS files (15.5M/15.5M)
    57/57 committed files (15.5M/15.5M)
    (0 gitignored files)
  Jobs:
    sha256: 0 workers (0 items in queue)
    get_upload_mode: 0 workers (0 items in queue)
    preupload_lfs: 0 workers (0 items in queue)
    commit: 0 workers (0 items in queue)
  Elapsed time: 0:01:00
  Current time: 2024-06-10 12:44:23
##########

The folder structure I am trying to upload:

data/*.parquet

@Wauplin (Contributor, Author) commented Jun 10, 2024

@ducha-aiki Are you sure nothing has been created on the Hub? Can you delete the .huggingface/ cache folder that should have been created in the local folder and retry?

@ducha-aiki

@Wauplin update: the files finally appeared now, although they claim to have been added 6 min ago. Anyway, everything works now, thank you :)

