This is the README for the experimental code of the following paper:

Taming Pretrained Transformers for eXtreme Multi-label Text Classification
Wei-Cheng Chang, Hsiang-Fu Yu, Kai Zhong, Yiming Yang, Inderjit Dhillon
KDD 2020

The latest implementation of X-Transformer (faster training with stronger performance) is available in PECOS; feel free to try it out!
> conda env create -f environment.yml
> source activate pt1.2_xmlc_transformer
> (pt1.2_xmlc_transformer) pip install -e .
> (pt1.2_xmlc_transformer) python setup.py install --force
**Notice: the following examples are executed within the `(pt1.2_xmlc_transformer)` conda virtual environment.**
We demonstrate how to reproduce the evaluation results in our paper by downloading the raw datasets and pretrained models.
Change directory into the ./datasets folder, then download and unzip each dataset:
cd ./datasets
bash download-data.sh Eurlex-4K
bash download-data.sh Wiki10-31K
bash download-data.sh AmazonCat-13K
bash download-data.sh Wiki-500K
cd ../
Each dataset contains the following files (a quick sanity check in Python is sketched after this list):

- `label_map.txt`: each line is the raw text of the label
- `train_raw_text.txt`, `test_raw_text.txt`: each line is the raw text of the instance
- `X.trn.npz`, `X.tst.npz`: instance embedding matrix (either sparse TF-IDF or fine-tuned dense embedding)
- `Y.trn.npz`, `Y.tst.npz`: instance-to-label assignment matrix
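For a quick sanity check, the matrices can be loaded with scipy (a minimal sketch; the Eurlex-4K paths are illustrative and apply equally to the other datasets):

```python
# Minimal sanity check of a downloaded dataset (scipy is included in the
# provided conda environment).
import scipy.sparse as smat

X_trn = smat.load_npz("./datasets/Eurlex-4K/X.trn.npz")  # instance embeddings
Y_trn = smat.load_npz("./datasets/Eurlex-4K/Y.trn.npz")  # instance-to-label assignments
assert X_trn.shape[0] == Y_trn.shape[0], "instance counts must match"
print(f"{X_trn.shape[0]} train instances, {X_trn.shape[1]} features, {Y_trn.shape[1]} labels")
```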
Change directory into the ./pretrained_models folder, then download and unzip the models for each dataset:
cd ./pretrained_models
bash download-models.sh Eurlex-4K
bash download-models.sh Wiki10-31K
bash download-models.sh AmazonCat-13K
bash download-models.sh Wiki-500K
cd ../
Each folder has the following structure:

- `proc_data`: a sub-folder containing `X.{trn|tst}.{model}.128.pkl`, `C.{label-emb}.npz`, and `L.{label-emb}.npz`
- `pifa-tfidf-s0`: a sub-folder containing the indexer and matcher
- `pifa-neural-s0`: a sub-folder containing the indexer and matcher
- `text-emb-s0`: a sub-folder containing the indexer and matcher
Given the provided indexing codes (label-to-cluster assignments), train/predict linear models, and evaluate with Precision/Recall@k:
bash eval_linear.sh ${DATASET} ${VERSION}
- `DATASET`: the dataset name, such as Eurlex-4K, Wiki10-31K, AmazonCat-13K, or Wiki-500K.
- `VERSION`: `v0` = sparse TF-IDF features; `v1` = sparse TF-IDF features concatenated with the dense fine-tuned XLNet embedding (see the sketch below).

For example, `bash eval_linear.sh Eurlex-4K v0`.
The evaluation results should be located at
./results_linear/${DATASET}.${VERSION}.txt
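For intuition on `v1`, the concatenation amounts to horizontally stacking the row-normalized sparse TF-IDF matrix with the row-normalized dense embedding matrix. A minimal sketch, assuming the fine-tuned XLNet embeddings have already been produced by the matcher stage described below (the paths and normalization choice are illustrative, not the exact script internals):

```python
# Hedged sketch of the v1 feature construction: sparse TF-IDF concatenated
# with dense fine-tuned instance embeddings. Illustrative only.
import numpy as np
import scipy.sparse as smat
from sklearn.preprocessing import normalize

X_tfidf = smat.load_npz("./datasets/Eurlex-4K/X.trn.npz")  # (N, D_tfidf), sparse
X_dense = np.load("./save_models/Eurlex-4K/pifa-tfidf-s0/matcher/"
                  "xlnet-large-cased/trn_embeddings.npy")  # (N, D_emb), dense
X_v1 = smat.hstack(
    [normalize(X_tfidf), normalize(smat.csr_matrix(X_dense))], format="csr"
)
```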
Given the provided indexing codes (label-to-cluster assignments) and the fine-tuned Transformer models, train/predict the ranker of the X-Transformer framework, and evaluate with Precision/Recall@k:
bash eval_transformer.sh ${DATASET}
- `DATASET`: the dataset name, such as Eurlex-4K, Wiki10-31K, AmazonCat-13K, or Wiki-500K.
The evaluation results should be located at
./results_transformer/${DATASET}.final.txt
The X-Transformer framework consists of 9 configurations (3 label embeddings × 3 model types).
For simplicity, we walk through 1 of the 9 here, using `LABEL_EMB=pifa-tfidf` and `MODEL_TYPE=bert`.
We will use Eurlex-4K as an example. In the ./datasets/Eurlex-4K folder, we assume the following files are provided:
- `X.trn.npz`: the instance TF-IDF feature matrix for the train set. The data type is `scipy.sparse.csr_matrix` of size `(N_trn, D_tfidf)`, where `N_trn` is the number of train instances and `D_tfidf` is the number of features.
- `X.tst.npz`: the instance TF-IDF feature matrix for the test set. The data type is `scipy.sparse.csr_matrix` of size `(N_tst, D_tfidf)`, where `N_tst` is the number of test instances and `D_tfidf` is the number of features.
- `Y.trn.npz`: the instance-to-label matrix for the train set. The data type is `scipy.sparse.csr_matrix` of size `(N_trn, L)`, where `N_trn` is the number of train instances and `L` is the number of labels.
- `Y.tst.npz`: the instance-to-label matrix for the test set. The data type is `scipy.sparse.csr_matrix` of size `(N_tst, L)`, where `N_tst` is the number of test instances and `L` is the number of labels.
- `train_raw_texts.txt`: the raw text of the train set.
- `test_raw_texts.txt`: the raw text of the test set.
- `label_map.txt`: the label's text description.
Given those input files, the pipeline can be divided into three stages: Indexer, Matcher, and Ranker.
In stage 1, we will do the following:
- (1) construct label embedding
- (2) perform hierarchical 2-means and output the instance-to-cluster assignment matrix
- (3) preprocess the input and output for training Transformer models.
TLDR: we combine and summarize (1), (2), and (3) into two scripts, `run_preprocess_label.sh` and `run_preprocess_feat.sh`. See the more detailed explanation below.
(1) To construct the label embeddings,
OUTPUT_DIR=save_models/${DATASET}
PROC_DATA_DIR=${OUTPUT_DIR}/proc_data
mkdir -p ${PROC_DATA_DIR}
python -m xbert.preprocess \
--do_label_embedding \
-i ${DATA_DIR} \
-o ${PROC_DATA_DIR} \
-l ${LABEL_EMB} \
-x ${LABEL_EMB_INST_PATH}
- `DATA_DIR`: ./datasets/Eurlex-4K
- `PROC_DATA_DIR`: ./save_models/Eurlex-4K/proc_data
- `LABEL_EMB`: pifa-tfidf (you can also try text-emb or pifa-neural if you have fine-tuned instance embeddings)
- `LABEL_EMB_INST_PATH`: ./datasets/Eurlex-4K/X.trn.npz
This should yield `L.${LABEL_EMB}.npz` in the `PROC_DATA_DIR`.
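For intuition, the `pifa-tfidf` label embedding (Positive Instance Feature Aggregation) represents each label by aggregating the TF-IDF features of its positive training instances. A minimal sketch of the idea (not the exact `xbert.preprocess` implementation, which may differ in details such as normalization order):

```python
# Hedged PIFA sketch: each label's embedding is the l2-normalized sum of the
# feature vectors of its positive training instances.
import scipy.sparse as smat
from sklearn.preprocessing import normalize

X = smat.load_npz("./datasets/Eurlex-4K/X.trn.npz")  # (N_trn, D_tfidf)
Y = smat.load_npz("./datasets/Eurlex-4K/Y.trn.npz")  # (N_trn, L)
L_emb = normalize(Y.T.dot(X), norm="l2", axis=1)     # (L, D_tfidf)
```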
(2) To perform hierarchical 2-means,
SEED_LIST=( 0 1 2 )
for SEED in "${SEED_LIST[@]}"; do
    LABEL_EMB_NAME=${LABEL_EMB}-s${SEED}
    INDEXER_DIR=${OUTPUT_DIR}/${LABEL_EMB_NAME}/indexer
    python -u -m xbert.indexer \
        -i ${PROC_DATA_DIR}/L.${LABEL_EMB}.npz \
        -o ${INDEXER_DIR} --seed ${SEED}
done
This should yield `code.npz` in the `INDEXER_DIR`.
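Conceptually, the indexer recursively bisects the label embeddings with k-means (k=2) until each cluster is small enough, then records the label-to-cluster assignment. A rough sketch of the idea (the `max_cluster_size` threshold and the lack of cluster balancing are illustrative simplifications, not the exact `xbert.indexer` logic):

```python
# Hedged hierarchical 2-means sketch: recursively split labels in two until
# clusters are small, then emit a sparse (L, K) assignment matrix.
import numpy as np
import scipy.sparse as smat
from sklearn.cluster import KMeans

def hierarchical_2means(L_emb, max_cluster_size=100, seed=0):
    clusters, stack = [], [np.arange(L_emb.shape[0])]
    while stack:
        idx = stack.pop()
        if len(idx) <= max_cluster_size:
            clusters.append(idx)
            continue
        split = KMeans(n_clusters=2, random_state=seed).fit_predict(L_emb[idx])
        stack.extend([idx[split == 0], idx[split == 1]])
    rows = np.concatenate(clusters)
    cols = np.concatenate([np.full(len(c), k) for k, c in enumerate(clusters)])
    data = np.ones(len(rows))
    return smat.csr_matrix((data, (rows, cols)), shape=(L_emb.shape[0], len(clusters)))
```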
(3) To preprocess input and output for Transformer models,
SEED=0
LABEL_EMB_NAME=${LABEL_EMB}-s${SEED}
INDEXER_DIR=${OUTPUT_DIR}/${LABEL_EMB_NAME}/indexer
python -u -m xbert.preprocess \
--do_proc_label \
-i ${DATA_DIR} \
-o ${PROC_DATA_DIR} \
-l ${LABEL_EMB_NAME} \
-c ${INDEXER_DIR}/code.npz
This should yield the instance-to-cluster matrices `C.trn.npz` and `C.tst.npz` in the `PROC_DATA_DIR`.
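Conceptually, an instance belongs to a cluster if any of its positive labels is assigned to that cluster; a hedged sketch of that relation (not necessarily the exact file naming or format produced by `xbert.preprocess`):

```python
# Hedged sketch: derive instance-to-cluster assignments from the
# instance-to-label matrix Y and the label-to-cluster matrix code.
import numpy as np
import scipy.sparse as smat

Y_trn = smat.load_npz("./datasets/Eurlex-4K/Y.trn.npz")                         # (N_trn, L)
code = smat.load_npz("./save_models/Eurlex-4K/pifa-tfidf-s0/indexer/code.npz")  # (L, K)
C_trn = (Y_trn.dot(code) > 0).astype(np.float32)                                # (N_trn, K)
```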
OUTPUT_DIR=save_models/${DATASET}
PROC_DATA_DIR=${OUTPUT_DIR}/proc_data
python -u -m xbert.preprocess \
--do_proc_feat \
-i ${DATA_DIR} \
-o ${PROC_DATA_DIR} \
-m ${MODEL_TYPE} \
-n ${MODEL_NAME} \
--max_xseq_len ${MAX_XSEQ_LEN} \
|& tee ${PROC_DATA_DIR}/log.${MODEL_TYPE}.${MAX_XSEQ_LEN}.txt
- `MODEL_TYPE`: bert (or roberta, xlnet)
- `MODEL_NAME`: bert-large-cased-whole-word-masking (or roberta-large, xlnet-large-cased)
- `MAX_XSEQ_LEN`: the maximum number of tokens; we set it to 128
This should yield `X.trn.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl` and `X.tst.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl` in the `PROC_DATA_DIR`.
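Under the hood, this step amounts to tokenizing each raw text and truncating/padding it to `MAX_XSEQ_LEN` tokens. A minimal sketch using the current Hugging Face `transformers` API for illustration (the version pinned in this repo's environment may expose a slightly different signature):

```python
# Hedged tokenization sketch: encode raw texts to at most 128 token ids.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-cased-whole-word-masking")
with open("./datasets/Eurlex-4K/train_raw_texts.txt") as f:
    texts = f.read().splitlines()
input_ids = [tokenizer.encode(t, max_length=128, truncation=True) for t in texts]
```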
In stage 2, we will do the following:
- (1) train deep Transformer models to map instances to the induced clusters
- (2) output the predicted cluster scores and fine-tune instance embeddings
TLDR: `run_transformer_train.sh`. See the more detailed explanation below.
(1) Assume we have 8 Nvidia V100 GPUs. To train the models,
MODEL_DIR=${OUTPUT_DIR}/${INDEXER_NAME}/matcher/${MODEL_NAME}
mkdir -p ${MODEL_DIR}
python -m torch.distributed.launch \
--nproc_per_node 8 xbert/transformer.py \
-m ${MODEL_TYPE} -n ${MODEL_NAME} --do_train \
-x_trn ${PROC_DATA_DIR}/X.trn.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl \
-c_trn ${PROC_DATA_DIR}/C.trn.${INDEXER_NAME}.npz \
-o ${MODEL_DIR} --overwrite_output_dir \
--per_device_train_batch_size ${PER_DEVICE_TRN_BSZ} \
--gradient_accumulation_steps ${GRAD_ACCU_STEPS} \
--max_steps ${MAX_STEPS} \
--warmup_steps ${WARMUP_STEPS} \
--learning_rate ${LEARNING_RATE} \
--logging_steps ${LOGGING_STEPS} \
|& tee ${MODEL_DIR}/log.txt
- `MODEL_TYPE`: bert (or roberta, xlnet)
- `MODEL_NAME`: bert-large-cased-whole-word-masking (or roberta-large, xlnet-large-cased)
- `PER_DEVICE_TRN_BSZ`: 16 if using Nvidia V100 (or 8 if using Nvidia 2080Ti)
- `GRAD_ACCU_STEPS`: 2 if using Nvidia V100 (or 4 if using Nvidia 2080Ti)
- `MAX_STEPS`: set to 1,000 for Eurlex-4K; adjust depending on your dataset
- `WARMUP_STEPS`: set to 100 for Eurlex-4K; adjust depending on your dataset
- `LEARNING_RATE`: set to 5e-5 for Eurlex-4K; adjust depending on your dataset
- `LOGGING_STEPS`: set to 100

Note that both GPU settings yield the same effective batch size: 8 GPUs × 16 per device × 2 accumulation steps = 8 GPUs × 8 per device × 4 accumulation steps = 256.
(2) To generate predictions and instance embeddings,
GPID=0,1,2,3,4,5,6,7
PER_DEVICE_VAL_BSZ=32
CUDA_VISIBLE_DEVICES=${GPID} python -u xbert/transformer.py \
-m ${MODEL_TYPE} -n ${MODEL_NAME} \
--do_eval -o ${MODEL_DIR} \
-x_trn ${PROC_DATA_DIR}/X.trn.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl \
-c_trn ${PROC_DATA_DIR}/C.trn.${INDEXER_NAME}.npz \
-x_tst ${PROC_DATA_DIR}/X.tst.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl \
-c_tst ${PROC_DATA_DIR}/C.tst.${INDEXER_NAME}.npz \
--per_device_eval_batch_size ${PER_DEVICE_VAL_BSZ}
This should yield the following outputs in the `MODEL_DIR`:

- `C_trn_pred.npz` and `C_tst_pred.npz`: model-predicted cluster scores
- `trn_embeddings.npy` and `tst_embeddings.npy`: fine-tuned instance embeddings
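A quick shape check of the matcher outputs (a sketch; the path follows the `MODEL_DIR` layout above with `INDEXER_NAME=pifa-tfidf-s0`):

```python
# Hedged shape check of the matcher outputs.
import numpy as np
import scipy.sparse as smat

model_dir = "./save_models/Eurlex-4K/pifa-tfidf-s0/matcher/bert-large-cased-whole-word-masking"
emb = np.load(f"{model_dir}/trn_embeddings.npy")       # (N_trn, D_emb) fine-tuned embeddings
C_pred = smat.load_npz(f"{model_dir}/C_trn_pred.npz")  # (N_trn, K) predicted cluster scores
assert emb.shape[0] == C_pred.shape[0], "row counts must match"
```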
In stage 3, we will do the following:
- (1) train linear rankers to map instances and predicted cluster scores to label scores
- (2) output top-k predicted labels
TLDR: `run_transformer_predict.sh`. See the more detailed explanation below.
(1) To train linear rankers,
LABEL_NAME=pifa-tfidf-s0
MODEL_NAME=bert-large-cased-whole-word-masking
OUTPUT_DIR=save_models/${DATASET}/${LABEL_NAME}
INDEXER_DIR=${OUTPUT_DIR}/indexer
MATCHER_DIR=${OUTPUT_DIR}/matcher/${MODEL_NAME}
RANKER_DIR=${OUTPUT_DIR}/ranker/${MODEL_NAME}
mkdir -p ${RANKER_DIR}
python -m xbert.ranker train \
-x1 ${DATA_DIR}/X.trn.npz \
-x2 ${MATCHER_DIR}/trn_embeddings.npy \
-y ${DATA_DIR}/Y.trn.npz \
-z ${MATCHER_DIR}/C_trn_pred.npz \
-c ${INDEXER_DIR}/code.npz \
-o ${RANKER_DIR} -t 0.01 \
-f 0 --mode ranker
(2) To predict the final top-k labels,
PRED_NPZ_PATH=${RANKER_DIR}/tst.pred.npz
python -m xbert.ranker predict \
-m ${RANKER_DIR} -o ${PRED_NPZ_PATH} \
-x1 ${DATA_DIR}/X.tst.npz \
-x2 ${MATCHER_DIR}/tst_embeddings.npy \
-y ${DATA_DIR}/Y.tst.npz \
-z ${MATCHER_DIR}/C_tst_pred.npz \
-f 0 -t noop
This should yield the predicted top-k labels in `tst.pred.npz`, at the path specified by `PRED_NPZ_PATH`.
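Precision@k can then be computed from the predicted score matrix; a minimal sketch, assuming `tst.pred.npz` holds a sparse `(N_tst, L)` score matrix (the evaluation scripts above report these metrics for you):

```python
# Hedged Precision@k sketch over sparse prediction scores.
import numpy as np
import scipy.sparse as smat

def precision_at_k(Y_true, Y_score, k=5):
    total = 0.0
    for i in range(Y_true.shape[0]):
        row = Y_score.getrow(i)
        topk = row.indices[np.argsort(-row.data)[:k]]  # top-k scored labels
        total += Y_true[i, topk].sum() / k
    return total / Y_true.shape[0]

Y_tst = smat.load_npz("./datasets/Eurlex-4K/Y.tst.npz")
Y_pred = smat.load_npz("./save_models/Eurlex-4K/pifa-tfidf-s0/ranker/"
                       "bert-large-cased-whole-word-masking/tst.pred.npz")
print("P@5 =", precision_at_k(Y_tst, Y_pred, k=5))
```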
Some portions of this repo are borrowed from the following repos: