
[BUG]Autotrain dgx not working for DPO and ORPO #815

Open
2 tasks done
jmparejaz opened this issue Nov 26, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@jmparejaz

Prerequisites

  • I have read the documentation.
  • I have checked other issues for similar problems.

Backend

Hugging Face Space/Endpoints

Interface Used

UI

CLI Command

No response

UI Screenshots & Parameters

[screenshot]

Error Logs

error_dgx.txt

Additional Information

This is my second issue thread about DGX fine-tuning not working for alignment trainers.
Initially it was only ORPO; now it is not working for DPO either.
Please help. AutoTrain DGX is awesome, but it has not been working recently.

@jmparejaz jmparejaz added the bug Something isn't working label Nov 26, 2024
@abhishekkrthakur
Member

Could you please paste the error?

@jmparejaz
Author

The error logs are attached in a txt file, but here they are inline:

INFO     | 2024-11-26 07:10:01 | autotrain.trainers.clm.train_clm_dpo:train:82 - model dtype: torch.float16
INFO     | 2024-11-26 07:10:01 | autotrain.trainers.clm.train_clm_dpo:train:99 - creating trainer
ERROR    | 2024-11-26 07:10:01 | autotrain.trainers.common:wrapper:215 - train has failed due to an exception: Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
    response.raise_for_status()
  File "/app/env/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: https://huggingface.co/api/repos/create
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/app/src/autotrain/trainers/common.py", line 212, in wrapper
    return func(*args, **kwargs)
  File "/app/src/autotrain/trainers/clm/__main__.py", line 38, in train
    train_dpo(config)
  File "/app/src/autotrain/trainers/clm/train_clm_dpo.py", line 100, in train
    callbacks = utils.get_callbacks(config)
  File "/app/src/autotrain/trainers/clm/utils.py", line 816, in get_callbacks
    callbacks = [UploadLogs(config=config), LossLoggingCallback(), TrainStartCallback()]
  File "/app/src/autotrain/trainers/common.py", line 314, in __init__
    self.api.create_repo(
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3531, in create_repo
    hf_raise_for_status(r)
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 477, in hf_raise_for_status
    raise _format(HfHubHTTPError, str(e), response) from e
huggingface_hub.errors.HfHubHTTPError: 409 Client Error: Conflict for url: https://huggingface.co/api/repos/create (Request ID: Root=1-67457449-19b50db2252aed4622c8b06a;42883cd8-6a5e-414d-b965-e18158e72060)
You already created this model repo
ERROR    | 2024-11-26 07:10:01 | autotrain.trainers.common:wrapper:216 - 409 Client Error: Conflict for url: https://huggingface.co/api/repos/create (Request ID: Root=1-67457449-19b50db2252aed4622c8b06a;42883cd8-6a5e-414d-b965-e18158e72060)
You already created this model repo
[rank3]:[W1126 07:19:58.100088970 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank2]:[W1126 07:20:01.283553003 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank5]:[W1126 07:20:01.486946432 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank7]:[W1126 07:20:01.504053966 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank6]:[W1126 07:20:01.519639703 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank4]:[W1126 07:20:01.524438380 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank1]:[W1126 07:20:01.568074344 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
INFO     | 2024-11-26 07:20:24 | autotrain.app.utils:get_running_jobs:40 - Killing PID: 126
INFO     | 2024-11-26 07:20:24 | autotrain.app.utils:kill_process_by_pid:90 - Sent SIGTERM to process with PID 126
INFO     | 2024-11-26 07:20:24 | autotrain.app.training_api:run_main:56 - No running jobs found. Shutting down the server.
INFO     | 2024-11-26 07:20:24 | autotrain.app.training_api:graceful_exit:35 - SIGTERM received. Performing cleanup...
ERROR:    Traceback (most recent call last):
  File "/app/env/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/app/env/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/app/src/autotrain/app/training_api.py", line 57, in run_main
    kill_process_by_pid(os.getpid())
  File "/app/src/autotrain/app/utils.py", line 89, in kill_process_by_pid
    os.kill(pid, signal.SIGTERM)
  File "/app/src/autotrain/app/training_api.py", line 36, in graceful_exit
    sys.exit(0)
SystemExit: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/starlette/routing.py", line 700, in lifespan
    await receive()
  File "/app/env/lib/python3.10/site-packages/uvicorn/lifespan/on.py", line 137, in receive
    return await self.receive_queue.get()
  File "/app/env/lib/python3.10/asyncio/queues.py", line 159, in get
    await getter
asyncio.exceptions.CancelledError
Task exception was never retrieved
future: <Task finished name='Task-3' coro=<BackgroundRunner.run_main() done, defined at /app/src/autotrain/app/training_api.py:52> exception=SystemExit(0)>
Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/app/env/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/app/env/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/app/env/lib/python3.10/site-packages/uvicorn/main.py", line 412, in main
    run(
  File "/app/env/lib/python3.10/site-packages/uvicorn/main.py", line 579, in run
    server.run()
  File "/app/env/lib/python3.10/site-packages/uvicorn/server.py", line 65, in run
    return asyncio.run(self.serve(sockets=sockets))
  File "/app/env/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/app/env/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/app/src/autotrain/app/training_api.py", line 57, in run_main
    kill_process_by_pid(os.getpid())
  File "/app/src/autotrain/app/utils.py", line 89, in kill_process_by_pid
    os.kill(pid, signal.SIGTERM)
  File "/app/src/autotrain/app/training_api.py", line 36, in graceful_exit
    sys.exit(0)
SystemExit: 0

@abhishekkrthakur
Member

the error says you are creating a project with a name that matches a repo in your hf account. please use a unique name :)
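For example, you can check up front that a name is free and pick a fallback automatically. This is a minimal sketch, not AutoTrain code: `unique_project_name` and the repo names are hypothetical, and `is_taken` is any name-to-bool callable (in practice it could be `huggingface_hub`'s `HfApi().repo_exists`, but a stub works for illustration):

```python
def unique_project_name(base: str, is_taken) -> str:
    """Return `base` if it is free, else append -1, -2, ... until a free name is found.

    `is_taken` is any callable mapping a repo name to True/False;
    huggingface_hub's HfApi().repo_exists has that shape.
    """
    if not is_taken(base):
        return base
    i = 1
    while is_taken(f"{base}-{i}"):
        i += 1
    return f"{base}-{i}"

# Hypothetical example: these two names are already taken on the Hub
taken = {"user/dpolong22", "user/dpolong22-1"}
print(unique_project_name("user/dpolong22", taken.__contains__))  # → user/dpolong22-2
print(unique_project_name("user/fresh-name", taken.__contains__))  # → user/fresh-name
```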

@jmparejaz
Author

jmparejaz commented Nov 26, 2024

Yes, I know, but that is not the case: the repo name was completely new.
The training was going fine but then it suddenly broke. Here is the commit history:

[screenshot]

I tested it multiple times, and the same error happened no matter how I named the repo.

It ran for only about 30 minutes before breaking.

[screenshot]

@abhishekkrthakur
Member

INFO     | 2024-11-26 07:10:01 | autotrain.trainers.clm.train_clm_dpo:train:82 - model dtype: torch.float16
INFO     | 2024-11-26 07:10:01 | autotrain.trainers.clm.train_clm_dpo:train:99 - creating trainer
ERROR    | 2024-11-26 07:10:01 | autotrain.trainers.common:wrapper:215 - train has failed due to an exception: Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
    response.raise_for_status()
  File "/app/env/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: https://huggingface.co/api/repos/create
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/app/src/autotrain/trainers/common.py", line 212, in wrapper
    return func(*args, **kwargs)
  File "/app/src/autotrain/trainers/clm/__main__.py", line 38, in train
    train_dpo(config)
  File "/app/src/autotrain/trainers/clm/train_clm_dpo.py", line 100, in train
    callbacks = utils.get_callbacks(config)
  File "/app/src/autotrain/trainers/clm/utils.py", line 816, in get_callbacks
    callbacks = [UploadLogs(config=config), LossLoggingCallback(), TrainStartCallback()]
  File "/app/src/autotrain/trainers/common.py", line 314, in __init__
    self.api.create_repo(
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3531, in create_repo
    hf_raise_for_status(r)
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 477, in hf_raise_for_status
    raise _format(HfHubHTTPError, str(e), response) from e
huggingface_hub.errors.HfHubHTTPError: 409 Client Error: Conflict for url: https://huggingface.co/api/repos/create (Request ID: Root=1-67457449-19b50db2252aed4622c8b06a;42883cd8-6a5e-414d-b965-e18158e72060)
You already created this model repo

Here it just says it is creating a repo that already exists.

Are you talking about another project that failed in the middle of training, unrelated to the error you posted above?

@jmparejaz
Author

Yes, I know it says that, but it doesn't reflect reality.
Check the txt file with the complete log history: it starts with project name dpolong22, training begins, and then it suddenly breaks with the error that the repo already exists. That is why I created the issue; I can't understand why it is happening, it doesn't make sense.

INFO     | 2024-11-26 07:08:51 | autotrain.app.training_api:<module>:95 - AUTOTRAIN_USERNAME: growth-cadet
INFO     | 2024-11-26 07:08:54 | autotrain.app.training_api:<module>:96 - PROJECT_NAME: dpolong22
INFO     | 2024-11-26 07:08:54 | autotrain.app.training_api:<module>:97 - TASK_ID: 9
INFO     | 2024-11-26 07:08:54 | autotrain.app.training_api:<module>:98 - DATA_PATH: growth-cadet/jobpost-2-signals_orpo_alignment_completion
INFO     | 2024-11-26 07:08:54 | autotrain.app.training_api:<module>:99 - MODEL: growth-cadet/qwen2-7b-signals-department-TO-JSON07-2
INFO:     Started server process [60]
INFO:     Waiting for application startup.
INFO     | 2024-11-26 07:08:54 | autotrain.commands:launch_command:523 - ['accelerate', 'launch', '--multi_gpu', '--num_machines', '1', '--num_processes', '8', '--mixed_precision', 'bf16', '-m', 'autotrain.trainers.clm', '--training_config', 'dpolong22/training_params.json']
INFO     | 2024-11-26 07:08:54 | autotrain.commands:launch_command:524 - {'model': 'growth-cadet/qwen2-7b-signals-department-TO-JSON07-2', 'project_name': 'dpolong22', 'data_path': 'growth-cadet/jobpost-2-signals_orpo_alignment_completion', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 2048, 'model_max_length': 8192, 'padding': 'right', 'trainer': 'dpo', 'use_flash_attention_2': True, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'eval_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'bf16', 'lr': 3e-05, 'epochs': 10, 'batch_size': 4, 'warmup_ratio': 0.05, 'gradient_accumulation': 2, 'optimizer': 'adamw_bnb_8bit', 'scheduler': 'linear', 'weight_decay': 0.05, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}", 'quantization': 'int4', 'target_modules': 'q_proj, o_proj, k_proj,v_proj', 'merge_adapter': True, 'peft': True, 'lora_r': 8, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 8192, 'max_completion_length': 4096, 'prompt_text_column': 'prompt', 'text_column': 'chosen', 'rejected_text_column': 'rejected', 'push_to_hub': True, 'username': 'growth-cadet', 'token': '*****', 'unsloth': True, 'distributed_backend': None}
INFO     | 2024-11-26 07:08:54 | autotrain.app.training_api:lifespan:82 - Started training with PID 126
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:7860 (Press CTRL+C to quit)


I can create a new training run with a new name and it would happen again.
As I mentioned, I tested it multiple times before creating the issue.

@abhishekkrthakur
Member

Okay, let me take a look and get back to you.

@jmparejaz
Author

Hi @abhishekkrthakur, any update on this issue? I noticed that there have been a couple of AutoTrain version updates, but the problem keeps happening.
The fine-tuning always starts well and breaks after some commits.
I have tested it 15 or 20 times, and only one run completed (it was with 1 epoch).

The error message is always the same:

huggingface_hub.errors.HfHubHTTPError: 409 Client Error: Conflict for url: https://huggingface.co/api/repos/create (Request ID: Root=1-6749c7c9-5d23398514fc94d343d3929c;793eafad-5bea-4aaa-b337-3f078eaee34a) You already created this model repo
ERROR    | 2024-11-29 13:55:21 | autotrain.trainers.common:wrapper:216 - 409 Client Error: Conflict for url: https://huggingface.co/api/repos/create (Request ID: Root=1-6749c7c9-5d23398514fc94d343d3929c;793eafad-5bea-4aaa-b337-3f078eaee34a) You already created this model repo

But it doesn't make sense, since the AutoTrain UI creates a brand-new repo.
I think the problem is in the code that triggers creating the repo again in the middle of the fine-tuning process.
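If that is what is happening (the callback re-running `create_repo` after a restart), one way to make the call idempotent is `exist_ok=True`. A minimal sketch under that assumption, not AutoTrain's actual code: `ensure_repo` is a hypothetical helper, and `api` stands for a `huggingface_hub.HfApi` instance.

```python
def ensure_repo(api, repo_id: str, private: bool = True):
    # `api` is expected to behave like huggingface_hub.HfApi.
    # exist_ok=True makes create_repo a no-op when the repo already
    # exists, so a retried or restarted job would no longer raise
    # "409 Client Error: Conflict ... You already created this model repo".
    return api.create_repo(repo_id, private=private, exist_ok=True)
```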

@abhishekkrthakur
Member

I've asked the NVIDIA team for the logs; still waiting for their response.

@abhishekkrthakur
Member

You could also try Space hardware instead of DGX Cloud, to confirm whether this is a DGX Cloud issue.
