ceval_en
This project evaluates the performance of the related models on the C-Eval benchmark. The test set consists of 12.3K multiple-choice questions covering 52 subjects.
The following describes how to run predictions on the C-Eval dataset.
Download the dataset from the path specified by the official C-Eval, and unzip the file to the `data` folder:
wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip
unzip ceval-exam.zip -d data
Move the `data` folder to the `scripts/ceval` directory of this project.
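If you prefer to script this step, the following is a minimal Python sketch that does the equivalent, assuming the `huggingface_hub` package is installed; the dataset repo id and file name are taken from the wget URL above, and the target path combines the unzip and move steps.

```python
# Download the C-Eval exam data and extract it directly into scripts/ceval/data
# (equivalent to the wget/unzip/move steps above; run from the project root).
import zipfile
from huggingface_hub import hf_hub_download

zip_path = hf_hub_download(
    repo_id="ceval/ceval-exam",
    filename="ceval-exam.zip",
    repo_type="dataset",
)
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall("scripts/ceval/data")
```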
Run the following script:
model_path=path/to/chinese_llama2_or_alpaca2
output_path=path/to/your_output_dir
cd scripts/ceval
python eval.py \
--model_path ${model_path} \
--cot False \
--few_shot False \
--with_prompt True \
--constrained_decoding True \
--temperature 0.2 \
--n_times 1 \
--ntrain 5 \
--do_save_csv False \
--do_test False \
--output_dir ${output_path}
- `model_path`: Path to the model to be evaluated (the full Chinese-LLaMA-2 or Chinese-Alpaca-2 model, not LoRA)
- `cot`: Whether to use chain-of-thought
- `few_shot`: Whether to use few-shot evaluation
- `ntrain`: Specifies the number of few-shot demos when `few_shot=True` (5-shot: `ntrain=5`); when `few_shot=False`, this argument has no effect
- `with_prompt`: Whether the model input contains the instruction prompt for Alpaca-2 models
- `constrained_decoding`: Since the standard answer format for C-Eval is the option 'A'/'B'/'C'/'D', we provide two methods for extracting answers from the model's output (see the sketch after this list):
  - `constrained_decoding=True`: Compute the probability that the first token generated by the model is 'A', 'B', 'C', or 'D', and choose the one with the highest probability as the answer
  - `constrained_decoding=False`: Extract the answer token from the model's output with regular expressions
- `temperature`: Temperature for decoding
- `n_times`: The number of repeated evaluations. Folders corresponding to the specified number of runs will be generated under `output_dir`
- `do_save_csv`: Whether to save the model outputs, extracted answers, etc. in CSV files
- `output_dir`: Output path for the results
- `do_test`: Whether to evaluate on the valid or test set: evaluate on the valid set when `do_test=False` and on the test set when `do_test=True`
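To make the constrained decoding option concrete, here is a minimal sketch of the idea; it is not the project's eval.py, and it assumes a Hugging Face `transformers` causal LM whose vocabulary contains each option letter as a single token. The model path is a placeholder.

```python
# Minimal sketch of constrained decoding: score only the option letters
# 'A'-'D' as the first generated token and pick the most probable one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/chinese_llama2_or_alpaca2"   # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
model.eval()

def predict_option(prompt: str) -> str:
    """Return the option whose first-token probability is highest."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token
    # Assumes each option letter corresponds to a single vocabulary token.
    option_ids = [tokenizer.convert_tokens_to_ids(c) for c in ["A", "B", "C", "D"]]
    probs = torch.softmax(next_token_logits[option_ids], dim=-1)
    return "ABCD"[int(torch.argmax(probs))]
```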
- The evaluation script creates directories `outputs/take*` when evaluation finishes, where `*` is a number ranging from 0 to `n_times-1`, storing the results of the `n_times` repeated evaluations respectively.
- In each `outputs/take*`, there will be a `submission.json` and a `summary.json`. If `do_save_csv=True`, there will also be 52 CSV files that contain the model outputs, extracted answers, etc. for each subject.
- `submission.json` stores the generated answers in the official submission format, and can be submitted for evaluation:
{
"computer_network": {
"0": "A",
"1": "B",
...
},
"marxism": {
"0": "B",
"1": "A",
...
},
...
}
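For illustration, a file in this format can be assembled from per-subject answer lists as in the sketch below; the subject names and answers are placeholders.

```python
# Hypothetical example: write predictions in the official submission format.
import json

predictions = {                       # placeholder per-subject answers, in question order
    "computer_network": ["A", "B"],
    "marxism": ["B", "A"],
}
submission = {
    subject: {str(i): answer for i, answer in enumerate(answers)}
    for subject, answers in predictions.items()
}
with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(submission, f, ensure_ascii=False, indent=4)
```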
- `summary.json` stores the model evaluation results for the 52 subjects, the 4 broader categories, and an overall average. For instance, the 'All' key at the end of the JSON file shows the overall average score:
"All": { "score": 0.35958395, "num": 1346, "correct": 484.0 }
where `score` is the overall accuracy, `num` is the total number of evaluation examples, and `correct` is the number of correct predictions. When evaluating on the test set (`do_test=True`), `score` and `correct` are 0 since there are no labels available. The test set results require submitting the `submission.json` file to the official C-Eval. For detailed instructions, please refer to the official submission process provided by C-Eval.
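As a quick sanity check on the valid set, the overall accuracy can be read back from `summary.json` and recomputed from the reported counts. The sketch below assumes the `take0` folder sits under the `output_path` used in the example command above; adjust the path if your layout differs.

```python
# Sketch: read summary.json from the first run and verify score == correct / num.
import json
import os

output_path = "path/to/your_output_dir"   # same placeholder as in the script above
with open(os.path.join(output_path, "take0", "summary.json"), encoding="utf-8") as f:
    summary = json.load(f)

overall = summary["All"]
print(f"overall accuracy: {overall['score']:.4f} "
      f"({overall['correct']:.0f}/{overall['num']})")
assert abs(overall["score"] - overall["correct"] / overall["num"]) < 1e-6
```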