An investigation of sequence-to-sequence (s2s) neural networks for dependency parsing. Ongoing experiments show interesting results and a promising direction.
If you don't know what dependency parsing is, there is an excellent resource: Dependency Parsing by Kübler, McDonald, and Nivre.
If you want to learn about s2s models, you may want to read these: Kalchbrenner and Blunsom, Cho et al., Sutskever et al., and Vinyals et al.
First you need to convert the word embeddings into a pkl file. From the parser/ folder, let's use the sample word vector file under embeddings/:

scripts/vector2pkl.py ../../embeddings/scode100.embeddings ../../embeddings/scode100.pkl 0 *UNKNOWN*

This will create a pickle file for the word embeddings. The 0 means the first line is not skipped (beware: some word vector files carry meta-data in their first line, I don't know why), and *UNKNOWN* is the special tag used for unknown words.
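If you are curious about what ends up in the pickle, here is a minimal sketch of what such a conversion could look like (an assumption on my part; the actual scripts/vector2pkl.py may differ in its details):

import sys, pickle
import numpy as np

def vectors_to_pkl(txt_path, pkl_path, skip_first_line, unknown_tag):
    # read 'word v1 v2 ... vN' lines and pickle a {word: vector} dict
    embeddings = {}
    with open(txt_path) as f:
        if skip_first_line:  # some embedding files carry meta-data in line 1
            next(f)
        for line in f:
            parts = line.rstrip().split()
            embeddings[parts[0]] = np.array(parts[1:], dtype=np.float32)
    if unknown_tag not in embeddings:  # fallback vector for out-of-vocabulary words
        dim = len(next(iter(embeddings.values())))
        embeddings[unknown_tag] = np.zeros(dim, dtype=np.float32)
    with open(pkl_path, 'wb') as f:
        pickle.dump(embeddings, f)

if __name__ == '__main__':
    # mirrors the usage above: <txt> <pkl> <skip-first-line 0/1> <unknown-tag>
    vectors_to_pkl(sys.argv[1], sys.argv[2], sys.argv[3] == '1', sys.argv[4])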
Please take a look at data/. You will find samples of conll-formatted files; set up your data like that. I use the experimental setup of this paper.
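For reference, conll-formatted data has one token per line with tab-separated columns and a blank line between sentences. A made-up two-token example in the 10-column CoNLL-X layout (the exact columns and labels in data/ may differ):

1	Economic	_	JJ	JJ	_	2	NMOD	_	_
2	news	_	NN	NN	_	0	ROOT	_	_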
I implemented the attention and pointer models on top of the old Keras backend (before the TensorFlow changes). You need to check out my Keras fork and put the keras folder under parser/.
Under parser/, if you type the following command you will see a bunch of arguments:
python train_parser.py
--batch-size BATCH_SIZE  batch size, default = 64
--epochs N_EPOCHS        # of epochs, default = 500
--patience PATIENCE      # of epochs for patience, default = 10
--model MODEL            model type {enc2dec, attention, pointer}, default = pointer
--unit UNIT              train with {lstm, gru, rnn} units, default = lstm
--hidden N_HIDDEN        hidden size of the neural networks, default = 256
--layers N_LAYERS        # of hidden layers, default = 1
--train TR               training conll file
--val VAL                validation conll file
--test TEST              test conll file
--prefix PREFIX          exp log prefix to append to exp/{}, default = 0
--vector VECTOR          vector pkl file
Most of the above are self-explanatory. --model chooses among the s2s models: enc2dec is very similar to Cho et al., attention follows Bahdanau et al., and pointer follows Vinyals et al. You may want to take a look at the architectures in get_model.py and the backend implementation of the attention layer.
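As a rough intuition for the pointer model (a sketch of the idea, not the actual Keras code in this repo): at each decoding step the decoder state attends over the encoder states, and the attention distribution itself is the prediction, i.e. a pointer to the position of the current word's head. A minimal numpy sketch with made-up dimensions and parameter names:

import numpy as np

def pointer_step(dec_state, enc_states, W1, W2, v):
    # enc_states: (n_words, hidden) encoder outputs, one per input token
    # dec_state:  (hidden,) current decoder state
    # returns a distribution over input positions = predicted head of the current word
    scores = v @ np.tanh(W1 @ enc_states.T + (W2 @ dec_state)[:, None])  # (n_words,)
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()  # softmax over input positions, not over a label vocabulary

# toy usage with random parameters
hidden, n_words = 4, 6
rng = np.random.default_rng(0)
enc = rng.normal(size=(n_words, hidden))
dec = rng.normal(size=hidden)
W1, W2, v = rng.normal(size=(hidden, hidden)), rng.normal(size=(hidden, hidden)), rng.normal(size=hidden)
print(pointer_step(dec, enc, W1, W2, v))  # sums to 1; the argmax is the pointed-to head position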
--prefix creates a subfolder under exp/.
Let's say you type this command:
python train_parser.py --model pointer --hidden 128 --layers 2 --train ../data/ptb.train.conll --val ../data/ptb.val.conll --test ../data/ptb.test.conll --vector ../embeddings/word2vec300.pkl --prefix PTB
This will train a pointer parser model with 2 layers of 128 LSTM units under exp/PTB/. In this folder you should see files named Mpointer_Vword2vec_Ulstm_H128_L2_<TIME> (M is for the model, V for the word vectors, U for the rnn unit, H for the width of the model, L for the depth of the model, and <TIME> is the starting time of the experiment; lots of meta-data!) with the following extensions:
.arch : the architecture of the model
.meta : meta-data about the setup and training
.model : the trained model
.output : predictions for the test file without the decoder (pure NN output)
.decoded : predictions for the test file using the model and a dependency decoder
.val_eval : the last training epoch's validation score
.validation : the last training epoch's validation predictions (pure NN)
To evaluate with CoNLL's original script, first convert the output to conll format using script/output2conll.py and then run scripts/conll07.pl. Note that people generally use the -p option to ignore punctuation.
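A hedged example of the full flow (the arguments of output2conll.py are placeholders since they are not documented here, and conll07.pl is assumed to follow the standard CoNLL eval interface with -g for the gold file and -s for the system file):

# placeholder paths/arguments; check the scripts for the exact usage
python script/output2conll.py exp/PTB/Mpointer_Vword2vec_Ulstm_H128_L2_<TIME>.decoded > system.conll
perl scripts/conll07.pl -g ../data/ptb.test.conll -s system.conll -p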
If you have any problems or ideas to improve this setup, contact me.
TODO:
- write a better readme
- experiment -> latex table script
- argument option: ignore pos tags or not