gdoras changed the title from "Question: why is relative positional encoded computed with length M vs. L+M in the paper ?" to "Question: why is relative positional encoding computed with length M vs. L+M in the paper ?" on Mar 18, 2021
The positional encoding in the code is computed from a `pos_seq` of length `klen`, which is then used to build the `r_head_k` tensor and finally enters the BD term.

In the paper, the left-shift is done on a L x (M+L) (i.e. [qlen, qlen+klen]) matrix, but here it is done on a L x M matrix, if I'm not mistaken. The upper-right relative positional encodings are thus erroneous, no?

I understand that this is not a problem since those entries are masked afterwards, but if we were not using the mask, shouldn't the `pos_seq` be computed with `klen + qlen`, and then truncated after the left-shift before being added to the `AC` term?

Or did I miss something?
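For concreteness, here is a minimal, self-contained sketch of the left-shift trick under discussion (plain PyTorch, written for this issue rather than copied from the repository; `rel_shift`, `bd`, and the toy sizes are illustrative, not the actual code):

```python
import torch

def rel_shift(x):
    # Standard Transformer-XL-style left shift on a [qlen, pos_len] score matrix:
    # prepend a zero column, reinterpret the memory layout, drop the padding,
    # so that row i ends up shifted left by (qlen - 1 - i) positions.
    zero_pad = torch.zeros((x.size(0), 1), device=x.device, dtype=x.dtype)
    x_padded = torch.cat([zero_pad, x], dim=1)          # [qlen, pos_len + 1]
    x_padded = x_padded.view(x.size(1) + 1, x.size(0))  # reinterpret layout
    return x_padded[1:].view_as(x)                      # [qlen, pos_len]

qlen, klen = 3, 5  # L and M in the issue's notation

# pos_seq of length klen (as in the code) vs. klen + qlen (as suggested above):
pos_seq_short = torch.arange(klen - 1, -1, -1.0)           # M positions
pos_seq_long = torch.arange(klen + qlen - 1, -1, -1.0)     # M + L positions

# Shift applied to a [qlen, klen] score matrix, i.e. L x M:
bd = torch.arange(1, qlen * klen + 1, dtype=torch.float).view(qlen, klen)
print(rel_shift(bd))
# Row i is shifted left by (qlen - 1 - i); the entries that wrap into the
# upper-right corner are meaningless, which is why they must be masked.
# On a [qlen, qlen + klen] matrix (the paper's L x (M + L)), the shifted matrix
# can instead be truncated back to [qlen, klen] without relying on the mask.
```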