Hello! I'm trying to implement bert-base, but it's not clear to me how you generate the masks with the TAPETokenizer. This is my code:
model = ProteinBertModel.from_pretrained('bert-base')
tokenizer = TAPETokenizer(vocab='iupac')

def preprocessing_for_tape(data):
    """Perform required preprocessing steps for pretrained BERT.
    @param data (np.array): Array of texts to be processed.
    @return input_ids (torch.Tensor): Tensor of token ids to be fed to a model.
    @return attention_masks (torch.Tensor): Tensor of indices specifying which
        tokens should be attended to by the model.
    """
    # Create empty lists to store outputs
    input_ids = []
    attention_masks = []
    # For every sentence...
    for sent in data:
        # `encode_plus` will:
        # (1) Tokenize the sentence
        # (2) Add the `[CLS]` and `[SEP]` token to the start and end
        # (3) Truncate/Pad sentence to max length
        # (4) Map tokens to their IDs
        # (5) Create attention mask
        # (6) Return a dictionary of outputs
        encoded_sent = tokenizer.encode(
            sent,  # Preprocess sentence
            #add_special_tokens=True,  # Add `[CLS]` and `[SEP]`
            #max_length=MAX_LEN,  # Max length to truncate/pad
            #pad_to_max_length=True,  # Pad sentence to max length
            #return_tensors='pt',  # Return PyTorch tensor
            #return_attention_mask=True,
            #truncation=True  # Return attention mask
        )
        # Add the outputs to the lists
        input_ids.append(encoded_sent.get('input_ids'))
        attention_masks.append(encoded_sent.get('attention_mask'))
    # Convert lists to tensors
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)
    return input_ids, attention_masks
Hi! Do you specifically want to re-implement bert-base, or just a transformer? I have code to train a version of ESM-1b here. This code scales better and will also result in better performance.
In that repo, the data processing is done in these lines. The masking code is then implemented in this class.
I have a bunch of utilities implemented in github.com/rmrao/evo, if it's helpful.
If you specifically want the masking code from TAPE, it's implemented here.
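For orientation, BERT-style masking for masked language modeling generally looks like the sketch below. This is not TAPE's code, just a minimal illustration of the usual 80/10/10 scheme. The function name `bert_mask`, its parameters, and the use of -1 as the "ignore this position" label are all assumptions made for the example, so check the linked implementation for the actual details.

```python
import random


def bert_mask(token_ids, mask_id, vocab_ids, special_ids, mask_prob=0.15, rng=None):
    """Return (masked_ids, labels) for one sequence of token ids.

    Hypothetical helper: labels hold the original id at selected positions
    and -1 (an arbitrary "ignore" marker chosen here) everywhere else.
    """
    if rng is None:
        rng = random.Random()
    masked = list(token_ids)
    labels = [-1] * len(masked)
    for i, tok in enumerate(token_ids):
        # Never select special tokens (start, end, padding) for prediction.
        if tok in special_ids:
            continue
        # Select each remaining position with probability mask_prob.
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                 # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab_ids)   # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels
```

Note that this is the masking used for the pretraining objective. It is a different thing from the attention (padding) mask that tells the model which positions are real tokens.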
But my output contains only token ids: there is no attention mask, and no way to set max_length or padding.
How does it work? Thanks
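On the padding and attention-mask part of the question, one option is to build both by hand from the encoded ids. The sketch below assumes that `tokenizer.encode` returns one flat sequence of token ids per protein (with the start and end tokens already added) and that the padding id in the `iupac` vocabulary is 0. Both are assumptions worth checking against the installed version. `pad_and_mask` is a hypothetical helper, not part of TAPE.

```python
def pad_and_mask(encoded, pad_id=0, max_len=None):
    """Pad variable-length id sequences and build matching attention masks.

    encoded: iterable of id sequences, e.g. [tokenizer.encode(s) for s in data]
    pad_id:  id used for padding (assumed to be 0 here)
    max_len: optional fixed length; defaults to the longest sequence
    """
    seqs = [list(ids) for ids in encoded]
    if max_len is not None:
        # Plain truncation; note this can cut off a trailing end token.
        seqs = [ids[:max_len] for ids in seqs]
    target = max_len if max_len is not None else max(len(ids) for ids in seqs)

    input_ids, attention_masks = [], []
    for ids in seqs:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_masks.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_masks


# Toy ids standing in for two encoded sequences of different length.
ids, masks = pad_and_mask([[2, 5, 6, 3], [2, 7, 3]])
```

Both results are rectangular lists, so they can be passed to `torch.tensor(...)`. If I read the TAPE source correctly, the model's forward pass takes the mask as a second argument named `input_mask`, but please verify that before relying on it.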