attention masks tokenizer #126

Ch-rode · 2022-03-21T15:52:57Z

Hello! I'm trying to implement bert-base, but it is not clear to me how to generate attention masks with the TAPETokenizer. This is my code:

import torch
from tape import ProteinBertModel, TAPETokenizer

model = ProteinBertModel.from_pretrained('bert-base')
tokenizer = TAPETokenizer(vocab='iupac')

def preprocessing_for_tape(data):
    """Perform required preprocessing steps for pretrained BERT.
    @param    data (np.array): Array of texts to be processed.
    @return   input_ids (torch.Tensor): Tensor of token ids to be fed to a model.
    @return   attention_masks (torch.Tensor): Tensor of indices specifying which
                  tokens should be attended to by the model.
    """
    # Create empty lists to store outputs
    input_ids = []
    attention_masks = []

    # For every sentence...
    for sent in data:
        # `encode_plus` will:
        #    (1) Tokenize the sentence
        #    (2) Add the `[CLS]` and `[SEP]` token to the start and end
        #    (3) Truncate/Pad sentence to max length
        #    (4) Map tokens to their IDs
        #    (5) Create attention mask
        #    (6) Return a dictionary of outputs
        encoded_sent = tokenizer.encode(
            sent,  # Preprocess sentence
            #add_special_tokens=True,        # Add `[CLS]` and `[SEP]`
            #max_length=MAX_LEN,                  # Max length to truncate/pad
            #pad_to_max_length=True,         # Pad sentence to max length
            #return_tensors='pt',           # Return PyTorch tensor
            #return_attention_mask=True,
            #truncation=True     # Return attention mask
            )
        
        # Add the outputs to the lists
        input_ids.append(encoded_sent.get('input_ids'))
        attention_masks.append(encoded_sent.get('attention_mask'))
      

    # Convert lists to tensors
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)

    return input_ids, attention_masks

model = ProteinBertModel.from_pretrained('bert-base')
tokenizer = TAPETokenizer(vocab='iupac')
sequence = 'GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ'
token_ids = torch.tensor([tokenizer.encode(sequence)])
token_ids

tensor([[ 2, 11,  7, 23, 25,  9,  8, 21,  7, 15, 13, 11, 16, 11,  5, 13, 15, 15,
         17, 11,  7, 25, 13, 11, 22, 11, 22, 15, 25,  5,  5, 11,  5, 15, 13, 23,
         20,  3]])

But my output (for example) contains only token ids: there is no attention mask and no way to set max_length or padding.
How does it work? Thanks
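
For reference, one way to get padded ids and a matching attention mask when the tokenizer only returns token ids is to build both by hand. This is a minimal sketch, not TAPE's own implementation: `pad_and_mask` is a hypothetical helper, and it assumes the padding id is 0 (which appears consistent with the `iupac` ids above, where 2 and 3 mark the start and end of the sequence). The toy id lists stand in for the output of `tokenizer.encode(seq)`.

```python
from typing import List, Optional, Tuple


def pad_and_mask(
    encoded: List[List[int]],
    max_len: Optional[int] = None,
    pad_id: int = 0,  # assumption: the <pad> token has id 0 in the vocab
) -> Tuple[List[List[int]], List[List[int]]]:
    """Truncate/pad each id list to a common length and build 0/1 masks."""
    if max_len is None:
        # Default to the longest sequence in the batch
        max_len = max(len(ids) for ids in encoded)

    input_ids, attention_masks = [], []
    for ids in encoded:
        # Plain truncation: note this drops the trailing end-of-sequence token
        ids = list(ids)[:max_len]
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 for real tokens, 0 for padding positions
        attention_masks.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_masks


# Toy id lists standing in for tokenizer.encode(seq) results
batch = [[2, 11, 7, 3], [2, 11, 3]]
ids, masks = pad_and_mask(batch)
print(ids)    # [[2, 11, 7, 3], [2, 11, 3, 0]]
print(masks)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Both lists can then be wrapped with `torch.tensor(...)`. As far as I can tell, the TAPE model's forward pass takes the mask through an `input_mask` argument, but please check that against your installed version.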

rmrao · 2022-03-21T18:18:54Z

Hi! Do you specifically want to re-implement bert-base, or just a transformer? I have code to train a version of ESM-1b here. This code scales better and will also result in better performance.

In that repo, the data processing is done in these lines. The masking code is then implemented in this class.

I have a bunch of utilities implemented in github.com/rmrao/evo, if it's helpful.

If you specifically want the masking code from TAPE, it's implemented here.

Hope this helps!

Ch-rode · 2022-03-21T20:52:48Z

Hello! Thanks for the information. I would like to re-implement bert-base for a sequence classification task.
