cmmlu_en
This project evaluates the relevant models on the CMMLU benchmark dataset. The test set consists of about 11K multiple-choice questions covering 67 subjects.
The steps below describe how to run predictions on the CMMLU dataset.
Download the dataset from the location given in the official CMMLU repository and unzip it into the data folder:
wget https://huggingface.co/datasets/haonan-li/cmmlu/resolve/main/cmmlu_v1_0_1.zip
unzip cmmlu_v1_0_1.zip -d data
Move the data folder to the scripts/cmmlu directory of this project.
Run the following script:
model_path=path/to/chinese_llama2_or_alpaca2
output_path=path/to/your_output_dir
cd scripts/cmmlu
python eval.py \
--model_path ${model_path} \
--few_shot False \
--with_prompt True \
--constrained_decoding True \
--output_dir ${output_path} \
--input_dir data
- model_path: Path to the model to be evaluated (a full Chinese-LLaMA-2 or Chinese-Alpaca-2 model, not a LoRA)
- cot: Whether to use chain-of-thought prompting
- few_shot: Whether to use few-shot prompting
- ntrain: Number of few-shot demonstrations when few_shot=True (for 5-shot, ntrain=5); has no effect when few_shot=False
- with_prompt: Whether the model input includes the instruction prompt for Alpaca-2 models
- constrained_decoding: The standard answer format for CMMLU is a single option, 'A', 'B', 'C', or 'D', so the script provides two methods for extracting answers from the model's output:
  - constrained_decoding=True: Compute the probability that the first token generated by the model is 'A', 'B', 'C', or 'D', and choose the option with the highest probability as the answer
  - constrained_decoding=False: Extract the answer token from the model's output with regular expressions
- temperature: Decoding temperature
- n_times: Number of repeated evaluations. One folder per repetition is generated under output_dir
- output_dir: Output path for the results
- input_dir: Path to the CMMLU data
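The two answer-extraction modes described for constrained_decoding can be illustrated with a minimal sketch. This is not the code in eval.py: the function names, the plain-list logits, the token-id mapping, and the regular expression are all assumptions made for illustration, and the probabilities here are normalized over the four options only (which leaves the highest-scoring option unchanged).

```python
import math
import re

CHOICES = ["A", "B", "C", "D"]


def pick_constrained(first_token_logits, choice_token_ids):
    """Pick the option whose token scores highest at the first generated position.

    first_token_logits: list of floats, one per vocabulary entry (hypothetical input).
    choice_token_ids: dict mapping 'A'..'D' to vocabulary indices (hypothetical mapping).
    """
    scores = [first_token_logits[choice_token_ids[c]] for c in CHOICES]
    # Softmax restricted to the four option tokens, shifted for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = {c: e / total for c, e in zip(CHOICES, exps)}
    answer = max(CHOICES, key=lambda c: probs[c])
    return answer, probs


def extract_with_regex(generated_text):
    """Return the first standalone option letter in free-form output, or None.

    The lookarounds skip letters that are part of an English word such as "Answer".
    """
    match = re.search(r"(?<![A-Za-z])([ABCD])(?![A-Za-z])", generated_text)
    return match.group(1) if match else None
```

With constrained decoding an answer is always produced, whereas the regular-expression route can return nothing when the model's output contains no recognizable option letter.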
- When evaluation finishes, the script creates take* directories under output_dir, where * ranges from 0 to n_times-1; each directory stores the results of one of the n_times repeated evaluations.
- Each take* directory contains a submission.json and a summary.json.
- submission.json stores the generated answers:
{
"arts": {
"0": "A",
"1": "B",
...
},
"nutrition": {
"0": "B",
"1": "A",
...
},
...
}
- summary.json stores the model's evaluation results for the 67 subjects, 5 broader categories, and an overall average. For instance, the 'All' key at the end of the JSON file shows the overall average score:
"All": {
"score": 0.39984458642721465,
"num": 11582,
"correct": 4631.0
}
where score is the overall accuracy, num is the total number of evaluation examples, and correct is the number of correct predictions.
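As a quick check of these fields, the sketch below reads the 'All' entry from each repeated evaluation and averages the overall score. It assumes the take* directories sit directly under output_dir and that each summary.json has the 'All' structure shown above; the function names are hypothetical and not part of the evaluation script.

```python
import json
from pathlib import Path


def load_overall(summary_path):
    """Return the 'All' entry (score, num, correct) from one summary.json."""
    with open(summary_path, encoding="utf-8") as f:
        summary = json.load(f)
    return summary["All"]


def average_overall_score(output_dir):
    """Average the overall score across all take* directories under output_dir.

    Returns None if no summary.json is found.
    """
    scores = []
    for take_dir in sorted(Path(output_dir).glob("take*")):
        summary_file = take_dir / "summary.json"
        if not summary_file.is_file():
            continue
        overall = load_overall(summary_file)
        # score should equal correct / num, e.g. 4631 / 11582 in the example above.
        expected = overall["correct"] / overall["num"]
        if abs(overall["score"] - expected) > 1e-9:
            raise ValueError(f"inconsistent summary in {summary_file}")
        scores.append(overall["score"])
    return sum(scores) / len(scores) if scores else None
```

For example, `average_overall_score("path/to/your_output_dir")` would return the mean accuracy over the n_times runs.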