Snippext is the extraction pipeline for mining opinions and customer experiences from user-generated content (e.g., online reviews).
Paper: Zhengjie Miao, Yuliang Li, Xiaolan Wang, and Wang-Chiew Tan, "Snippext: Semi-supervised Opinion Mining with Augmented Data", in The Web Conference (WWW) 2020.
- Python 3.7.5
- PyTorch 1.3
- HuggingFace Transformers
- Spacy with the `en_core_web_sm` model
- NLTK (stopwords, wordnet)
- Gensim
- NVIDIA Apex (fp16 training)
Install the required packages:
```
conda install -c conda-forge nvidia-apex
pip install -r requirements.txt
```
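The NLTK corpora and the spaCy model listed above are downloaded separately. A minimal way to fetch them from Python, assuming the packages from `requirements.txt` are installed (these exact commands are not part of the repo's instructions):
```
# One-time download of the NLTK corpora and the spaCy model listed above.
import nltk
import spacy.cli

nltk.download("stopwords")
nltk.download("wordnet")
spacy.cli.download("en_core_web_sm")
```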
Download the pre-trained BERT models and word2vec models (for data augmentation):
```
wget https://snippext.s3.us-east-2.amazonaws.com/finetuned_bert.zip
unzip finetuned_bert.zip
wget https://snippext.s3.us-east-2.amazonaws.com/word2vec.zip
unzip word2vec.zip
```
The baseline method performs BERT finetuning on a specific task:
```
CUDA_VISIBLE_DEVICES=0 python train_baseline.py \
  --task restaurant_ae_tagging \
  --logdir results/ \
  --save_model \
  --finetuning \
  --batch_size 32 \
  --lr 5e-5 \
  --n_epochs 20 \
  --bert_path finetuned_bert/rest_model.bin
```
Parameters:
- `--task`: the name of the task (defined in `configs.json`)
- `--logdir`: the logging directory for Tensorboard
- `--save_model`: whether to save the best model
- `--batch_size`, `--lr`, `--n_epochs`: the batch size, learning rate, and number of epochs
- `--bert_path` (Optional): the path to a fine-tuned BERT checkpoint; the base uncased model is used if not specified
- `--max_len` (Optional): the maximum sequence length

New parameters (also available in MixDA and MixMatchNL):
- `--fp16` (Optional): whether to train with fp16 acceleration
- `--lm` (Optional): other language models, e.g., "distilbert" or "albert"
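Under the hood, the baseline is standard BERT finetuning with a task head on top. A conceptual sketch of the equivalent setup in HuggingFace Transformers for a tagging task (illustrative only; `train_baseline.py` is the actual implementation, and the label set and example sentence here are ours):
```
# Conceptual sketch of BERT finetuning for tagging (recent Transformers API);
# not the repo's train_baseline.py.
import torch
from transformers import BertForTokenClassification, BertTokenizerFast

vocab = ["O", "B-AS", "I-AS", "B-OP", "I-OP"]  # example label set
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(vocab))
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

enc = tokenizer("the pasta was amazing", return_tensors="pt")
labels = torch.zeros_like(enc["input_ids"])    # all "O", just for the sketch
loss = model(**enc, labels=labels).loss        # cross-entropy over tag logits
loss.backward()
optimizer.step()
```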
The train/dev/test sets of a task (tagging or span classification) are specified in the file `configs.json`. The file is a list of entries, each of the following format:
```
{
    "name": "hotel_tagging",
    "task_type": "tagging",
    "vocab": [
        "B-AS",
        "I-AS",
        "B-OP",
        "I-OP"
    ],
    "trainset": "combined_data/hotel/train.txt.combined",
    "validset": "combined_data/hotel/dev.txt",
    "testset": "combined_data/hotel/dev.txt",
    "unlabeled": "combined_data/hotel/unlabeled.txt"
},
```
Fields:
- `name`: the name of the task. A tagging task should end with the suffix `_tagging`.
- `task_type`: either `tagging` or `classification`
- `vocab`: the list of class labels. For tagging tasks, all labels start with `B-` or `I-`, indicating the beginning or inside of a span. For classification tasks, the list contains all possible class labels.
- `trainset`, `validset`, `testset`: the paths to the train/dev/test sets
- `unlabeled` (Optional): the path to the unlabeled dataset for semi-supervised learning. The file has the same format as the train/test sets, but the labels are ignored.
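Before launching training, an entry can be sanity-checked directly. A minimal sketch (ours, not part of the repo), assuming `configs.json` sits in the working directory:
```
# Load configs.json and verify that a task's dataset files exist.
import json
import os

with open("configs.json") as f:
    configs = {c["name"]: c for c in json.load(f)}

task = configs["hotel_tagging"]
assert task["task_type"] in ("tagging", "classification")
for key in ("trainset", "validset", "testset"):
    assert os.path.exists(task[key]), f"missing {key}: {task[key]}"
```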
- Build the augmentation index:
```
python augment_index_builder.py \
  --task restaurant_ae_tagging \
  --w2v_path word2vec/rest_w2v.model \
  --idf_path word2vec/rest_finetune.txt \
  --bert_path finetuned_bert/rest_model.bin \
  --index_output_path augment/rest_index.json
```
Simply replace `restaurant_ae_tagging` with `laptop_ae_tagging`, `restaurant_asc`, and `laptop_asc` to generate the other indices, and replace `rest` with `laptop` for the `laptop_ae_tagging` and `laptop_asc` indices.
- Train with:
```
CUDA_VISIBLE_DEVICES=0 python train_mixda.py \
  --task restaurant_ae_tagging \
  --logdir results/ \
  --finetuning \
  --batch_size 32 \
  --lr 5e-5 \
  --n_epochs 5 \
  --bert_path finetuned_bert/rest_model.bin \
  --alpha_aug 0.8 \
  --augment_index augment/rest_index.json \
  --augment_op token_repl_tfidf \
  --run_id 0
```
Parameters:
- `--alpha_aug`: the mixup parameter between the original example and the augmented example ({0.2, 0.5, 0.8} are usually good values)
- `--augment_index`: the path to the augmentation index
- `--augment_op`: the name of the DA operator. We currently support the following 9 operators:
| Operator | Details |
|---|---|
| `token_del_tfidf` | Token deletion by importance (measured by TF-IDF) |
| `token_del` | Token deletion (uniform) |
| `token_repl_tfidf` | Token replacement by importance |
| `token_repl` | Token replacement (uniform) |
| `token_swap` | Swapping two tokens |
| `token_ins` | Inserting new tokens |
| `span_sim` | Replacing a span with a similar one |
| `span_freq` | Replacing a span by frequency |
| `span` | Uniform span replacement |
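To give a flavor of these operators, here is a simplified illustration of the idea behind `token_repl_tfidf`; this is not Snippext's implementation, and the helper inputs plus the assumption that low-TF-IDF tokens are the preferred replacement targets are ours:
```
# Simplified illustration of token_repl_tfidf: uninformative (low-TF-IDF)
# tokens are swapped for distributionally similar words, so the example's
# labels are unlikely to change.
import random

def token_repl_tfidf(tokens, tfidf, similar, n=1):
    # rank positions from least to most important
    ranked = sorted(range(len(tokens)), key=lambda i: tfidf.get(tokens[i], 0.0))
    out = list(tokens)
    for i in ranked[:n]:
        candidates = similar.get(out[i], [])
        if candidates:
            out[i] = random.choice(candidates)
    return out

tokens = ["the", "pasta", "was", "amazing"]
tfidf = {"the": 0.1, "was": 0.2, "pasta": 2.3, "amazing": 1.7}
similar = {"the": ["a"], "was": ["is"]}  # e.g., word2vec nearest neighbors
print(token_repl_tfidf(tokens, tfidf, similar, n=2))
# -> ['a', 'pasta', 'is', 'amazing']
```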
MixMatchNL is our implementation of MixMatch combined with MixDA. To train with MixMatchNL:
```
CUDA_VISIBLE_DEVICES=0 python train_mixmatchnl.py \
  --task restaurant_ae_tagging \
  --logdir results/ \
  --finetuning \
  --batch_size 32 \
  --lr 5e-5 \
  --n_epochs 5 \
  --bert_path finetuned_bert/rest_model.bin \
  --alpha_aug 0.8 \
  --alpha 0.2 \
  --u_lambda 50.0 \
  --num_aug 2 \
  --augment_index augment/rest_index.json \
  --augment_op token_repl_tfidf \
  --run_id 0
```
Additional parameters:
- `--alpha`: the mixup parameter (between labeled and unlabeled data). We chose `alpha` from {0.2, 0.5, 0.8}.
- `--u_lambda`: the weight of the unlabeled data loss (typically chosen from {10.0, 25.0, 50.0})
- `--num_aug`: the number of augmented examples per unlabeled example (we chose from {2, 4})
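Both `alpha` and `alpha_aug` control a MixUp-style interpolation: a mixing weight is drawn from a Beta distribution and two representations, along with their (soft) labels, are convexly combined. A minimal sketch of that operation (ours; variable names are illustrative, and the training scripts perform the actual mixing internally):
```
# Minimal sketch of MixUp-style interpolation, as used between an original
# and an augmented example (alpha_aug) or labeled and unlabeled data (alpha).
import torch

def mixup(h_a, h_b, y_a, y_b, alpha=0.8):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = max(lam, 1.0 - lam)          # keep the mix closer to the first input
    h = lam * h_a + (1.0 - lam) * h_b  # mix hidden representations
    y = lam * y_a + (1.0 - lam) * y_b  # mix labels with the same weight
    return h, y
```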