Using custom model for input #2

Open
chiehminwei opened this issue Dec 9, 2018 · 5 comments

@chiehminwei

I have trained a supertagger using BERT. It takes in a sentence and outputs a softmax for each word in the sentence.
I want to use your A* parsing method to parse the sentence. How can I use my trained supertagger and combine it with your A* parser? Thank you so much.

@masashi-y
Owner

masashi-y commented Dec 10, 2018

Hi. I agree that making it easier to use external supertaggers is an important feature that should be implemented. As you may know, my parser requires probabilities for the dependency structure in addition to those for the supertags. Recently, I updated the Python version of depccg (found at https://github.com/masashi-y/depccg; sorry for the complexity...) so that it can accept an input file in jsonl format, where each line is a json dictionary describing one sentence and looks like the following:

{
    "words": "this is an example sentence .",
    "head_tags": [0.0, ...],        # flattened matrix (list) containing # words * # supertags elements
    "head_tags_shape": [# words, # supertags],
    "heads": [0.0, ...],            # flattened matrix containing # words * (# words + 1) elements
    "heads_shape": [# words, # words + 1],
    "categories": ["S", "N", ...]   # a list of supertags
}

Optionally, you can omit the heads and heads_shape entries, in which case the parser uses its built-in default dependency parser to assign the dependency-structure probabilities.
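
For concreteness, here is a minimal sketch (my own helper, not part of depccg's API) of how such a line could be produced, assuming you already have the two log-probability matrices as numpy arrays:

import json
import numpy as np

def make_jsonl_line(words, log_tag_probs, log_dep_probs, categories):
    # words         : list of tokens, e.g. ["this", "is", ...]
    # log_tag_probs : (# words, # supertags) array of log probabilities
    # log_dep_probs : (# words, # words + 1) array of log probabilities
    # categories    : list of supertag names, length # supertags
    entry = {
        "words": " ".join(words),
        "head_tags": log_tag_probs.flatten().tolist(),
        "head_tags_shape": list(log_tag_probs.shape),
        "heads": log_dep_probs.flatten().tolist(),
        "heads_shape": list(log_dep_probs.shape),
        "categories": list(categories),
    }
    return json.dumps(entry)

Writing one such line per sentence gives you a jsonl file that you can feed to the parser as shown below.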

Please try the following scripts and check the attached json file to see if it works.

git clone https://github.com/masashi-y/depccg
cd depccg
git checkout refactor
python setup.py build_ext --inplace  # or you can install by  "python setup.py install"
sh bin/depccg_en download     # install default tagger
cat test.json | sh bin/depccg_en --input-format json

Please be careful to use log probabilities (e.g. log_softmax). A scale mismatch may cause unexpected behavior in the parser.
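
For example, if your model produces logits, it is safer to compute the log probabilities directly rather than taking np.log of an already-softmaxed output, which can underflow. A numpy sketch:

import numpy as np

def log_softmax(logits, axis=-1):
    # numerically stable log-softmax: shift by the max before exponentiating
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))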
I'm happy to help if you run into any problems.
I hope this meets your needs.
test.json.gz

@chiehminwei
Author

I tested the json file and it worked! Thank you so much!

I just re-read your paper and some of your code (ja_lstm_parser_bi.py) and seem to have a better understanding now. You jointly trained the P_tag and P_dep terms using a biLSTM. I still have some questions, though.

How does the default dependency parser assign dependency probabilities if I don't include the "heads" and "heads_shape" fields in the json file? I'm guessing it loads the pre-trained biLSTM and does a forward pass to obtain the P_dep terms? In that case, if I wanted to parse Chinese, I guess I would need to train my own model to get the P_dep terms? Alternatively, if I just assigned equal probabilities to each P_dep (say, set "heads" to a [# words, # words + 1] matrix of all -1's) as a quick hack, would this be equivalent to using Mike Lewis's EasySRL parser? Also, why does "heads" have shape [# words, # words + 1]?

I'm trying to parse the Chinese CCGBank, so I would highly appreciate your tips on any challenges you faced (Japanese is strictly head-final, no tri-training data was available, you had to convert to bunsetsu dependencies for evaluation... could you still use EasyCCG for Japanese evaluation?) or any references that might be helpful (for example, how can I use the script for converting from CCGBank to dependencies? Is there any code in this repo / the Python repo that I should look at more closely?).

Sorry I have so many questions. I'm so glad there's someone working on Japanese CCG though. Thank you so much again for your help!

@masashi-y
Owner

I'm glad, too, that you are working on Chinese CCG parsing!

How does the default dependency parser assign dependency probabilities if I don't include the "heads" and "heads_shape" fields in the json file? I'm guessing it loads the pre-trained biLSTM and does a forward pass to obtain the P_dep terms?

Exactly. It loads a pre-trained biLSTM trained with tri-training; it is the best-performing model in my CCG paper.

In that case, if I wanted to parse Chinese, I guess I would need to train my own model to get the P_dep terms?

Ideally yes.

Alternatively, if I just assigned equal probabilities to each P_dep (say, set "heads" to a [# words, # words + 1] matrix of all -1's) as a quick hack, would this be equivalent to using Mike Lewis's EasySRL parser?

If you set P_dep in that way, it becomes very close to EasySRL, but it lacks a heuristic introduced in Mike Lewis's paper (what he calls the attach-low heuristic), so the performance will be poorer for English parsing.
(One of the claims in my paper is that for languages such as Japanese, which have relatively free word order, this kind of heuristic does not help, and we found that it is important to model higher, non-terminal-level syntax (dependencies) in addition to supertags. I don't know whether this is the case for Chinese CCGbank parsing.)
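
For what it's worth, a normalized uniform distribution keeps the scores on the same scale as real log probabilities; and since every complete derivation assigns exactly one head per word, a per-word constant like -1 should shift all derivation scores equally anyway. A sketch (the helper is hypothetical, not part of depccg):

import numpy as np

def uniform_heads(n_words):
    # every word assigns equal probability to each of its
    # n_words + 1 possible heads (the extra one is the dummy root),
    # expressed as log probabilities as the parser expects
    return np.full((n_words, n_words + 1), -np.log(n_words + 1))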

Also, why does "heads" have shape [# words, # words + 1]?

heads[i, 0] is the log probability that the (i+1)-th word is attached to the dummy root node of the dependency tree.
heads[i, j] (j > 0) is the log probability that the (i+1)-th word is a dependency child of the j-th word.
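
Concretely, for a three-word sentence (heads_shape == [3, 4]):

# heads[0][0] : log P(word 1 attaches to the dummy root)
# heads[0][2] : log P(word 1's head is word 2)
# heads[2][1] : log P(word 3's head is word 1)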

I'm trying to parse the Chinese CCGBank, so I would highly appreciate your tips on any challenges you faced (Japanese is strictly head-final, no tri-training data was available, you had to convert to bunsetsu dependencies for evaluation... could you still use EasyCCG for Japanese evaluation?) or any references that might be helpful (for example, how can I use the script for converting from CCGBank to dependencies? Is there any code in this repo / the Python repo that I should look at more closely?).

Regarding tri-training, I did not do it for Japanese simply because I did not have enough time. You download some huge corpus (e.g. Wikipedia) and assign supertags and dependency trees to it using existing taggers and dependency parsers, which is just engineering work.
Converting a CCG tree to a dependency tree is very simple if you know how the Stanford converter turns a constituency tree into a dependency tree; googling e.g. "converting constituency tree to dependency tree" will be helpful 😃 In my code, https://github.com/masashi-y/depccg/blob/master/src/py/ja_lstm_parser_bi.py contains a function that does this: TrainingDataCreator._get_dependencies. We also use a similar algorithm in the conversion to bunsetsu dependencies.
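
The core idea, independently of depccg's actual implementation, is head percolation: each non-terminal designates one child as its head, the head word percolates up, and the head word of every non-head child becomes a dependent of it (for CCG, the head child is typically determined by the combinatory rule, e.g. the functor in an application). A minimal sketch, with an assumed toy tree representation:

def get_dependencies(tree):
    # tree is ("leaf", word_index) or ("node", head_child_position, children)
    # returns the root word index and a list of (child, parent) arcs,
    # where parent == -1 marks the dummy root
    deps = []

    def percolate(node):
        if node[0] == "leaf":
            return node[1]
        _, head_pos, children = node
        head_words = [percolate(child) for child in children]
        for pos, word in enumerate(head_words):
            if pos != head_pos:
                deps.append((word, head_words[head_pos]))
        return head_words[head_pos]

    root = percolate(tree)
    deps.append((root, -1))
    return root, deps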

@chiehminwei
Author

Thank you so much for your thorough explanations. Really appreciate your help!
One more question: if I provided the probabilities in json, would the parsing results vary depending on whether the parser is using the Japanese model or the English model? Or is it the case that once I have the probabilities, parsing is deterministic regardless of the input language?
Thank you so much again!

@masashi-y
Owner

Hi. Sorry for the late reply. Unfortunately, there are some language-specific configurations. Most importantly, you must define a set of combinatory rules and a set of unary rules for the specific language. You can find the combinatory rules implemented for Japanese and English at https://github.com/masashi-y/depccg.ml/blob/master/lib/grammar.ml, and the unary rules in the zipped model files (the unary_rules.txt file). Additionally, in the zip archive you will see other files such as seen_rules.txt and cat_dict.txt, which are used to reduce the search space when running the A* algorithm. You should configure the parser not to use them, or, for efficiency's sake, it would be nice if you created Chinese versions.
