Post

Overview of the Transformers library

The basic pipeline involves the following steps:

Raw data —> Tokenizer (input IDs) —> Model (logits) —> Post-processing —> Prediction
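As a toy illustration of these stages (this is not the real library: the vocabulary, "model", and labels below are hypothetical stand-ins), the flow maps onto code like this:

```python
import math

# Hypothetical vocabulary standing in for a real tokenizer's vocab
vocab = {"[UNK]": 0, "this": 1, "movie": 2, "was": 3, "great": 4, "awful": 5}

def tokenize(text):
    """Raw data -> tokens -> input IDs."""
    tokens = text.lower().split()
    return [vocab.get(t, vocab["[UNK]"]) for t in tokens]

def model(input_ids):
    """Stand-in for a model: returns raw scores (logits) per class
    by simply counting positive/negative cue words."""
    pos = sum(1 for i in input_ids if i == vocab["great"])
    neg = sum(1 for i in input_ids if i == vocab["awful"])
    return [float(neg), float(pos)]  # [NEGATIVE, POSITIVE] logits

def postprocess(logits):
    """Logits -> softmax probabilities -> predicted label."""
    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]
    labels = ["NEGATIVE", "POSITIVE"]
    return labels[probs.index(max(probs))], probs

ids = tokenize("this movie was great")
label, probs = postprocess(model(ids))
print(ids, label)  # [1, 2, 3, 4] POSITIVE
```

A real tokenizer uses subword splitting and a learned vocabulary, and a real model produces logits from learned weights, but the stages are the same.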

Tokenizer

Transformer models cannot process raw text directly, so the text must first be converted into numbers called token IDs. The tokenizer handles:

  • Tokenization: splitting the input into words, subwords, or symbols.
  • Mapping each token to an integer ID.
  • Producing other model inputs, such as attention masks.

AutoTokenizer is a class that automatically selects the appropriate tokenizer for a given model. A tokenizer's algorithm and vocabulary can be loaded with the from_pretrained("model-name") method and saved with save_pretrained(). Alternatively, a model-specific tokenizer class such as BertTokenizer can be used directly.
    from transformers import AutoTokenizer

    checkpoint = "albert-base-v2"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    from transformers import AutoTokenizer, BertTokenizer

    ## encoding
    checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
    tokenizer1 = BertTokenizer.from_pretrained("bert-base-cased")
    tokenizer2 = AutoTokenizer.from_pretrained(checkpoint)
    # calling the tokenizer returns input_ids, attention_mask, etc. as PyTorch tensors
    tokens1 = tokenizer1("This is a test sentence for tokenization", padding=True, truncation=True, return_tensors="pt")
    # tokenize() only splits the text into subword tokens (strings)
    tokens2 = tokenizer2.tokenize("This is a test sentence for tokenization")
    print(tokens1)
    print(tokens2)
    # map the string tokens to their integer IDs
    ids = tokenizer2.convert_tokens_to_ids(tokens2)
    print(ids)

    ## decoding
    string_decoded = tokenizer2.decode([2023, 2003, 1037, 3231, 2741, 6651, 2005, 19204, 3989])
    print(string_decoded)

Model

Similar to the tokenizer, a model can be loaded through its model-specific class or instantiated with the AutoModel class.

    from transformers import BertConfig, BertModel

    config = BertConfig()
    model = BertModel(config)  # built from the config alone, so weights are randomly initialized
    print(config)
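Because BertConfig is a plain configuration object, its attributes can be changed before building the model, for example to create a smaller (still randomly initialized) architecture. A minimal sketch, assuming the transformers package is installed; no download is needed since nothing is pretrained:

```python
from transformers import BertConfig, BertModel

# Shrink the architecture; these attribute names are part of BertConfig
config = BertConfig(num_hidden_layers=4, hidden_size=256, num_attention_heads=4)
model = BertModel(config)  # random weights, built entirely offline

print(config.num_hidden_layers)  # 4
print(sum(p.numel() for p in model.parameters()))  # parameter count of the small model
```

To get pretrained weights instead, use from_pretrained(), as shown in the next snippet.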

Alternative loading method

    from transformers import AutoModel

    model = AutoModel.from_pretrained("google-bert/bert-base-uncased")
    config = model.config
    print(config)

Gated Models

Some models on the Hugging Face Hub are gated: you must accept the model's license on its Hub page and authenticate (e.g. with huggingface-cli login) before the weights can be downloaded. Once authenticated, loading works the same way. The checkpoint below is one example of a gated model:

    from transformers import AutoModel

    # requires prior authentication and accepting the license on the Hub
    model = AutoModel.from_pretrained("google/gemma-2b")
    config = model.config
    print(config)
This post is licensed under CC BY 4.0 by the author.