Consistency with the article (HATT) #24
Every sequence of LSTM output is 2D, and the context vector is 1D; their product is 1D. The context vector is trained to assign a weight to each row of the 2D output, so you can think of the result as a weighted vector that, ideally, gives more weight to the important tokens.
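To make the shape argument concrete, here is a minimal numpy sketch (the sizes, 10 timesteps and 200 hidden units, are made up for illustration and are not tied to the repository's configuration):

```python
import numpy as np

# One sequence of GRU/LSTM outputs is 2D (timesteps x hidden);
# the learned context vector is 1D (hidden,).
timesteps, hidden = 10, 200
h = np.random.randn(timesteps, hidden)   # per-token hidden states
u_w = np.random.randn(hidden)            # context vector

scores = h @ u_w                               # 1D: one scalar score per token
alpha = np.exp(scores) / np.exp(scores).sum()  # softmax -> attention weights
weighted = (alpha[:, None] * h).sum(axis=0)    # weighted sum over tokens
print(scores.shape, alpha.shape, weighted.shape)  # (10,) (10,) (200,)
```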
Hi, thanks for your answer. According to the code, we actually stack two linear operations on the output of the GRU layer: first the Dense layer, and then the dot multiplication with self.W, with no non-linearity in between. Theoretically, this could be collapsed into a single linear layer (as explained here). Again, maybe I'm missing something; I'd be glad for an explanation :)
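For what it's worth, the collapse into a single linear layer can be checked numerically; the sketch below uses made-up shapes and random weights just to illustrate the algebra:

```python
import numpy as np

# Two stacked linear operations (Dense without activation, then a dot
# product with a vector) are equivalent to a single linear map.
hidden, dense_units = 200, 200
h = np.random.randn(hidden)                 # one GRU output vector
W1 = np.random.randn(hidden, dense_units)   # Dense kernel (no activation)
b1 = np.random.randn(dense_units)           # Dense bias
w2 = np.random.randn(dense_units)           # the attention layer's self.W

two_steps = (h @ W1 + b1) @ w2              # Dense, then dot with self.W
one_step = h @ (W1 @ w2) + b1 @ w2          # one equivalent linear map
print(np.allclose(two_steps, one_step))     # True
```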
Which equation are you referring to? The tanh activation in my code refers to equations (5) and (8). h_it is the GRU output.
I'll try to be as rigorous as possible:
These are lines 194-196 in the code, referring to the upper hierarchy layer.
And these are equations 5 and 6 from the paper. The case is the same for lines 187-189 in the code and for equations 8-10, but I'll demonstrate only on these parts. As you said, h_it is the GRU output. In line 195 it is passed through a Dense layer, therefore implementing the inner linear part of equation 5 (W_w h_it + b_w). According to the code, this output is then passed into the Attention layer. Note that we do not have any activation in line 195, so we proceed only with the inner linear part of equation 5, rather than with u_it.
More specifically, the next operation takes place in the call() method of the layer: the dot product of x with self.W, x being the input of the layer, or literally the Dense output from line 195. Only then, by line 174 in the code, do we apply the tanh on the product. Note that in equation 6 this product is inserted directly into the exp, without any non-linearity in between. To my understanding, this is a different procedure than the one practiced in the paper. I may be wrong, or possibly this somehow leads to similar behavior, but I'd just like to hear why :) Thanks!
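To spell out the difference I mean, here is an illustrative numpy sketch of the two orderings (shapes and weights are made up; this is not the repository's code):

```python
import numpy as np

hidden = 200
h_it = np.random.randn(hidden)            # GRU output for one token
W_w = np.random.randn(hidden, hidden)     # Dense kernel
b_w = np.random.randn(hidden)             # Dense bias
u_w = np.random.randn(hidden)             # context vector (self.W in the code)

# Paper, eq. (5)-(6): tanh inside the one-layer MLP, then a plain dot product
u_it = np.tanh(W_w @ h_it + b_w)
score_paper = u_it @ u_w                  # goes straight into exp / softmax

# Code, as I read it: Dense without activation, tanh wraps the dot product
x = W_w @ h_it + b_w                      # line 195: Dense, no activation
score_code = np.tanh(x @ u_w)             # line 174: tanh after the product

# The two scores generally differ, so the procedures are not equivalent.
print(score_paper, score_code)
```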
Ha, you found a HUGE bug in my code that I didn't realize. I'm quite sure you are the first one to point it out, even though someone once asked why I use the time distributed dense function (deprecated). The bug is that I placed the tanh in the wrong place and wrong order. The TimeDistributed(Dense(200))(l_lstm_sent) is intended to do a one-layer MLP, and as you said, there should be a tanh activation before the dot product. The solution is either to add the tanh activation to that Dense layer, or to move the tanh inside the attention layer so it is applied before the dot product with the context vector.
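A minimal sketch of what such a fix could look like, written with the current tf.keras API rather than the repository's original code, and with illustrative sizes and layer names:

```python
import tensorflow as tf
from tensorflow.keras import layers

timesteps, hidden, att_dim = 15, 100, 200

inputs = layers.Input(shape=(timesteps, hidden))   # e.g. the BiGRU outputs
# one-layer MLP with tanh, as in eq. (5): u_it = tanh(W_w h_it + b_w)
u = layers.TimeDistributed(layers.Dense(att_dim, activation='tanh'))(inputs)
# plain dot product with a learned context vector, then softmax, as in eq. (6)
scores = layers.Dense(1, use_bias=False)(u)        # (batch, timesteps, 1)
alpha = layers.Softmax(axis=1)(scores)             # attention weights
# attention-weighted sum of the original hidden states
context = layers.Dot(axes=1)([alpha, inputs])      # (batch, 1, hidden)
context = layers.Flatten()(context)

model = tf.keras.Model(inputs, context)
model.summary()
```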
It has been so long that I have to reread the paper to bring back the memory. I hope I didn't make mistakes again. Let me know :)
I've seen some discussion about it, but I'm afraid I still don't get it:
In the original paper, the tanh activation is applied over an MLP layer that accepts only the biLSTM vector as input (eq. 5).
Assuming self.W is the context vector in our case, the tanh is instead applied to the product of the biLSTM vector with the context vector (the Dense layer does not have an activation of its own).
What is the explanation for this?
Thanks!