How to reproduce the paper results? #6387

Open · 1 task done
StiphyJay opened this issue Dec 19, 2024 · 0 comments

Labels: pending (This problem is yet to be addressed)

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-5.15.0-67-generic-x86_64-with-glibc2.31
  • Python version: 3.10.12
  • PyTorch version: 2.4.1+cu121 (GPU)
  • Transformers version: 4.46.1
  • Datasets version: 3.1.0
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 4090
  • Bitsandbytes version: 0.45.0

Reproduction

How can I reproduce the results in Table 4 and Table 5 of the paper with the newest codebase?

Expected behavior

Could the authors provide one or two cases that reproduce the Table 4 / Table 5 results in the paper, for quick reproduction and comparison? I am currently fine-tuning and evaluating llama3-8b following the Table 5 setup, but my overall results seem much higher than those reported in the paper. Is this reasonable?

Below are my training and evaluation script settings:

SFT train:

```yaml
model_name_or_path: /data01/llama3-8b-instruct-hf  # meta-llama/Meta-Llama-3-8B-Instruct
trust_remote_code: true

# method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

# dataset
dataset: xsum_tiny
template: llama3
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/llama3-8b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

# eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
```
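
For reference, a config like the one above is launched with LLaMA-Factory's CLI (`llamafactory-cli train <config>.yaml`). A minimal sketch for launching it from Python, assuming the training config is saved as `llama3_lora_sft_xsum.yaml` (a hypothetical file name):

```python
# Launch the SFT run above via LLaMA-Factory's CLI; equivalent to running
# `llamafactory-cli train llama3_lora_sft_xsum.yaml` in a shell.
# The config file name is a hypothetical placeholder.
import subprocess

subprocess.run(
    ["llamafactory-cli", "train", "llama3_lora_sft_xsum.yaml"],
    check=True,
)
```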

SFT eval:

```yaml
model_name_or_path: /data01/llama3-8b-instruct-hf  # meta-llama/Meta-Llama-3-8B-Instruct
adapter_name_or_path: saves/llama3-8b/lora/sft
trust_remote_code: true

# method
stage: sft
do_predict: true
finetuning_type: lora

# dataset
eval_dataset: xsum_tiny
template: llama3
cutoff_len: 2048
max_samples: 50
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/llama3-8b/lora/predict_sft_xsum_llama3_8B
overwrite_output_dir: true

# eval
per_device_eval_batch_size: 1
predict_with_generate: true
ddp_timeout: 180000000
```
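
As a cross-check on the scores produced by `do_predict`, one could also load the trained adapter directly and inspect a few generations by hand. This is a minimal sketch using plain transformers + peft rather than LLaMA-Factory's own inference path, with the base/adapter paths taken from the configs above; the prompt is a hypothetical placeholder and does not apply the llama3 chat template:

```python
# Load the base model plus the LoRA adapter saved by the training run above
# and generate one sample for manual inspection.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "/data01/llama3-8b-instruct-hf"
adapter = "saves/llama3-8b/lora/sft"

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)

prompt = "Summarize the following article:\n..."  # hypothetical placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```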

The final result is:

```json
"predict_bleu-4": 53.501343999999996,
"predict_model_preparation_time": 0.0046,
"predict_rouge-1": 54.96382,
"predict_rouge-2": 33.267082,
"predict_rouge-l": 47.676412,
"predict_runtime": 52.8157,
"predict_samples_per_second": 0.947,
"predict_steps_per_second": 0.947
```

So the averaged score is (54.96 + 33.27 + 47.68) / 3 ≈ 45.3, whereas the corresponding result in the paper's Table 5 is 30.63 for LoRA + Llama3-8B. That looks like a large difference. Is it expected?
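
A minimal sketch of the comparison above, reading the metrics back from the eval output directory; the file name `predict_results.json` is an assumption about where the trainer saves its predict metrics:

```python
# Average ROUGE-1/2/L from the predict metrics and compare with the paper.
import json

metrics_path = "saves/llama3-8b/lora/predict_sft_xsum_llama3_8B/predict_results.json"  # assumed location
with open(metrics_path) as f:
    metrics = json.load(f)

rouge_avg = (
    metrics["predict_rouge-1"]
    + metrics["predict_rouge-2"]
    + metrics["predict_rouge-l"]
) / 3
print(f"averaged ROUGE: {rouge_avg:.2f}")  # ~45.30 with the numbers reported above
print("paper Table 5 (LoRA + Llama3-8B): 30.63")
```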

Others

No response

github-actions bot added the pending label on Dec 19, 2024