Ideas for NLP pre-processing and feature engineering #14
Totally agree with the paraphrase issue! We have been brainstorming about this in the lab as well. Simply mapping OOV words to something within the vocabulary is a good place to start.
I am not sure I would agree. As our paper explains, most answers in our dataset are 1-3 words long, so it really is *mostly* a large multiclass classification problem. Using the top 1K answers in the model is simply one convenient choice; it covers ~82% of all answers.
No, but how would such a system be trained by backprop? 1-NN isn't amenable to gradient-based learning.
I guess I just feel like having such a small and fixed answer vocabulary makes the task a little bit more artificial. I think the coverage you observe is mostly a fact about the collection methodology, not about language in general. Re training: off the top of my head, maybe noise contrastive estimation? That's how the QANTA paper did it.
:-). I would counter that just because the space of answers is small does not make the learning problem easy. Even binary questions such as "Is this person expecting company?" can require fairly heavy lifting on the vision/reasoning side.
Hey, I'm not saying it's easy, or that it's not impressive and interesting :) But a fixed answer vocabulary isn't the future of this task. I think the technology would take a big step towards practicality if you were learning to produce a meaning representation. That way, to learn a new answer, you just have to learn its vector. If you add another class to the model, you don't know how many weights might have to be adjusted. Probably a lot. My hunch is that it would actually be better for accuracy, too. But you know the evaluation much better.
Agreed.
:) Now, about the paraphrases. There are a couple of ways we could do this.
I think 3 and 4 make the most sense.
I like 4 the best. @jiasenlu @abhshkdz @dexter1691: Do you have a preference?
I agree, 3 and 4 would be great.
Yeah, 3 and 4 make sense.
Great. I left a job running overnight compiling the paraphrases from GloVe, but I messed something up. Restarting now. Here are some random substitutions for relatively frequent words. Some of these substitutions look good, others look pretty problematic.
I think it will help a lot to have vectors over POS-tagged text. Then we can make sure we don't change the part-of-speech when we do the replacement.
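Roughly what I have in mind, as a sketch: `tagged_vectors` is a hypothetical table of nearest neighbours keyed by `word|POS` (e.g. built sense2vec-style from POS-tagged text), and `model_vocab` is the question vocabulary the model was trained with.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # only the tagger is needed here

def paraphrase_question(question, tagged_vectors, model_vocab):
    """Replace OOV words with in-vocab neighbours that share the same POS."""
    doc = nlp(question)
    out = []
    for token in doc:
        word = token.lower_
        if word in model_vocab:
            out.append(word)
            continue
        replacement = word
        # Neighbours are keyed by 'word|POS', so a noun can only be replaced
        # by another noun, a verb by another verb, and so on.
        for neighbour, _sim in tagged_vectors.get(f"{word}|{token.pos_}", []):
            neighbour_word, neighbour_pos = neighbour.rsplit("|", 1)
            if neighbour_pos == token.pos_ and neighbour_word in model_vocab:
                replacement = neighbour_word
                break
        out.append(replacement)
    return " ".join(out)
```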
The instructions in the readme don't specify a frequency threshold for the training data. Is this the config you've been running in your experiments? Below are the counts of excluded words at various thresholds. With a frequency threshold of 1 or 5, there seems to be relatively little advantage to having a complicated replacement scheme; only 1% of the tokens are affected. With so many other moving parts and difficulties in the task, I doubt changing the representation of those tokens would help very much. I would be curious to try an aggressive threshold, leaving only one or two thousand words in the vocab.
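Roughly the kind of tally I mean, as a sketch (`questions` stands in for the list of raw training question strings; tokenization is just whitespace splitting here):

```python
from collections import Counter

def excluded_at_thresholds(questions, thresholds=(1, 5, 10, 50, 100)):
    """Count how many word types fall at or below each frequency threshold,
    and what share of the running tokens they account for."""
    counts = Counter(w for q in questions for w in q.lower().split())
    total_tokens = sum(counts.values())
    for th in thresholds:
        rare_types = [w for w, c in counts.items() if c <= th]
        rare_tokens = sum(counts[w] for w in rare_types)
        print(f"th={th}: {len(rare_types):,} types excluded, "
              f"{rare_tokens / total_tokens:.1%} of tokens affected")
```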
@honnibal I use th=0 (preserve all the words) in the pre-trained model. Previously I tried using only the top 1000 words in our bag-of-words baseline (between th=50~100), and the performance was much worse than the method here (http://arxiv.org/abs/1512.02167), which has a similar network structure. So I doubt using an aggressive threshold will improve the performance.
I think replacing the tokens with UNK makes sense. It seems quite difficult to learn a representation for a word occurring only once in the training data. You also don't learn any representation for the UNK token that you'll be using over the dev data. So I think th=1 seems better than th=0? That would be my guess. Here's how the paraphrased data looks at th=50. I need to clean things up a bit before I can give you a pull request. I would say that the current results look a little promising, but we can do better. It seems to me like crucial words of the question are often relatively rare in the data, and the current paraphrase model often messes them up. But I'm not sure how well you can train the model to learn them from only a few examples.
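Concretely, something like this sketch (not the repo's actual preprocessing; tokenization is just whitespace splitting, and `UNK` is a stand-in symbol):

```python
from collections import Counter

UNK = "UNK"

def build_vocab(train_questions, th=1):
    """Keep words seen more than `th` times in training; everything else
    gets mapped to a single shared UNK symbol."""
    counts = Counter(w for q in train_questions for w in q.lower().split())
    return {w for w, c in counts.items() if c > th}

def encode(question, vocab):
    """Replace rare or unseen words with UNK, so the model trains an UNK
    representation it can reuse on dev/test questions."""
    return [w if w in vocab else UNK for w in question.lower().split()]
```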
Yes, I agree. I think this will help.
I'm not sure about this; maybe we can do some experiments on it. Basically, with th=1 we replace the random vectors with the same UNK representation.
Yes, agreed.
Yeah, I also did some experiments on paraphrasing the questions; we can discuss this more if you are interested. Jiasen
I think the random vector seems potentially problematic. It could be like replacing the word with a random one from your vocabulary: you could get a vector that's close to, or identical to, some common word. Maybe empirically it makes no difference. I'm always trying to replace experiments with intuition though :). I find I always have too many experiments to run, so I'm always trying to make these guesses. I've made a pull request that gives you a
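The kind of quick sanity check I have in mind, as a sketch (`embeddings` and `words` stand in for the model's word-embedding matrix and its aligned vocabulary list):

```python
import numpy as np

def nearest_word_to_random(embeddings, words, seed=0):
    """Draw a random vector the way a rare word might be initialised at th=0,
    and report which real word it happens to sit closest to (cosine)."""
    rng = np.random.default_rng(seed)
    random_vec = rng.normal(scale=embeddings.std(), size=embeddings.shape[1])
    sims = embeddings @ random_vec / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(random_vec) + 1e-8
    )
    best = int(np.argmax(sims))
    return words[best], float(sims[best])
```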
Example output at threshold 50 below. I expect much lower thresholds to perform better, but it's harder to see the paraphrasing working when fewer tokens are replaced.
Hi all,
I'm excited to do some work on the text-processing side of the Visual QA task. I develop the spaCy NLP library. I think we should be able to get some extra accuracy with additional NLP logic on the question-parsing side. We'll see.
The first thing I'd like to try is mapping out-of-vocabulary words to similar tokens, using a word2vec model. For instance, let's say the word *colour* is OOV. It seems easy to map this to *color*.
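A sketch of what I mean (the file path and `model_vocab` are placeholders; the pre-trained vectors could be word2vec or GloVe):

```python
import numpy as np

def load_vectors(path):
    """Load pre-trained vectors from a GloVe-style text file: word v1 v2 ..."""
    vectors = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def map_oov(word, vectors, model_vocab):
    """If `word` is outside the model's vocab but has a pre-trained vector,
    map it to the most similar word the model does know about."""
    if word in model_vocab or word not in vectors:
        return word
    query = vectors[word] / np.linalg.norm(vectors[word])
    best, best_sim = word, -1.0
    for candidate in model_vocab:
        vec = vectors.get(candidate)
        if vec is None:
            continue
        sim = float(np.dot(query, vec / np.linalg.norm(vec)))
        if sim > best_sim:
            best, best_sim = candidate, sim
    return best  # e.g. "colour" should come back as "color"
```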
I think this input normalization trick is novel, but it makes sense to me for this problem. It lets you exploit pre-trained vectors without interfering with the rest of your model choices.
I think the normalization could be taken a bit further, by using the POS tagger and parser to compute context-specific keys, so that the replacement could be more exact (sense2vec). I think just the word replacement is probably okay though.
It's also easy to calculate auxiliary features with spaCy. Training a question classifier is straightforward, of course; I'm just not sure the model is making many errors of that type.
If I had to say one thing was unsatisfying about the model, I'd say it's the multiclass classification output. Have you tried having the model output a vector, and using it to find a nearest neighbour?
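Roughly the kind of output I'm picturing, as a sketch (`output_vec` would be the model's predicted vector, and `answer_matrix` some embedding of the candidate answers, e.g. averaged word vectors, aligned with `answer_strings`):

```python
import numpy as np

def predict_answer(output_vec, answer_matrix, answer_strings):
    """Score the predicted vector against every candidate answer embedding
    and return the nearest one by cosine similarity. Adding a new answer
    only means adding a row to `answer_matrix`."""
    pred = output_vec / np.linalg.norm(output_vec)
    answers = answer_matrix / np.linalg.norm(answer_matrix, axis=1, keepdims=True)
    return answer_strings[int(np.argmax(answers @ pred))]
```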