Choosing the Right Transformer: Encoder, Decoder, or Both?

The transformer architecture [1] introduced in 2017 has become the backbone of virtually every modern language model. But not all transformers are built the same way. Depending on how you wire the attention mechanism, you get three fundamentally different architectures — each suited to a different class of problems. Understanding the difference is essential for making the right design choice in your own research or applications.

The Core Difference: How Attention Flows

All three architectures are built from the same transformer block, but they differ in one critical dimension: which tokens each position is allowed to attend to.

Encoder-only (BERT-style)

In an encoder-only model, every token can attend to every other token in the sequence — both to the left and to the right. This is called bidirectional attention. The model reads the entire input simultaneously and builds a rich contextual representation of each token informed by its full surrounding context.

Token:   The   cat   sat   on   the   mat
         ↕↕↕   ↕↕↕   ↕↕↕   ↕↕↕   ↕↕↕   ↕↕↕
         All tokens attend to all other tokens
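As a minimal sketch of what "every token attends to every token" means in practice, the visibility pattern can be written as a boolean matrix. The function name `bidirectional_mask` and the list-of-lists layout are illustrative assumptions, not part of any particular library.

```python
def bidirectional_mask(seq_len):
    """Return a seq_len x seq_len mask; mask[i][j] is True if position i may attend to j."""
    # Encoder-style attention places no restriction on which positions are visible.
    return [[True] * seq_len for _ in range(seq_len)]

tokens = ["The", "cat", "sat", "on", "the", "mat"]
mask = bidirectional_mask(len(tokens))

# Tokens visible to "sat" (position 2): the whole sentence, left and right.
visible_to_sat = [tok for tok, ok in zip(tokens, mask[2]) if ok]
print(visible_to_sat)  # ['The', 'cat', 'sat', 'on', 'the', 'mat']
```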

Well-known examples: BERT [2], RoBERTa, DeBERTa, ELECTRA.

Because the model sees the full context, encoder-only models produce highly informative token and sentence representations — making them excellent for tasks where understanding is the goal rather than generation.

Decoder-only (GPT-style)

In a decoder-only model, each token can attend only to itself and to the tokens that came before it, never to future tokens. This is called causal or autoregressive attention, and it is enforced by a mask that hides future positions.

Token:   The   cat   sat   on   the   mat
          ↑     ↑↑    ↑↑↑   ↑↑↑↑  ↑↑↑↑↑  ↑↑↑↑↑↑
         Each token only attends to previous tokens
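The causal mask can be sketched in a few lines of plain Python. This is an illustration under simple assumptions (uniform toy scores, hypothetical helper names `causal_mask` and `masked_softmax`), not how a production framework implements it. In common implementations the mask lets a position see itself as well as everything to its left.

```python
import math

def causal_mask(seq_len):
    """mask[i][j] is True only when j <= i, so position i cannot see the future."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores, giving hidden positions zero weight."""
    # Hidden positions are set to -inf, which exp() maps to exactly 0.0.
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    peak = max(masked)  # finite, because a position can always see itself
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["The", "cat", "sat", "on", "the", "mat"]
mask = causal_mask(len(tokens))

# Toy example: equal raw scores for every position, evaluated for "sat" (position 2).
weights = masked_softmax([0.0] * len(tokens), mask[2])
print([round(w, 3) for w in weights])  # [0.333, 0.333, 0.333, 0.0, 0.0, 0.0]
```

The attention weight is shared among "The", "cat", and "sat", and the three future tokens receive none.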

Well-known examples: GPT series [3], LLaMA, Mistral, Qwen, SmolLM2.

The causal constraint makes these models natural text generators — they predict the next token given all previous tokens, which is exactly the training objective. Modern large language models are almost exclusively decoder-only.
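To make the training objective concrete, here is a small sketch of how one sentence turns into next-token prediction examples. The helper name `next_token_examples` is hypothetical, and real training code operates on token IDs in batches rather than on word strings.

```python
def next_token_examples(tokens):
    """Pair every prefix of the sequence with the token that follows it."""
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

examples = next_token_examples(["The", "cat", "sat", "on"])
for context, target in examples:
    print(context, "->", target)
# ['The'] -> cat
# ['The', 'cat'] -> sat
# ['The', 'cat', 'sat'] -> on
```

Because of the causal mask, all of these predictions can be trained in one forward pass over the full sequence: the representation at position t never sees the target at position t + 1.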

Encoder-Decoder (T5/BART-style)

An encoder-decoder model combines both components. The encoder reads the full input with bidirectional attention and produces a contextual representation. The decoder then generates the output token by token, attending both to its own previous outputs (causal) and to the full encoder output via a cross-attention mechanism.

Input  → [Encoder: bidirectional attention] → context vectors
                                                      ↓
Output ← [Decoder: causal + cross-attention] ←───────┘
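One way to picture the three attention patterns in an encoder-decoder model is as three visibility matrices. The sketch below is illustrative only; the function name `encoder_decoder_masks` and its return layout are assumptions made for this example.

```python
def encoder_decoder_masks(src_len, tgt_len):
    """Return (encoder_self, decoder_self, cross) boolean visibility matrices."""
    # Encoder self-attention: every source position sees every source position.
    encoder_self = [[True] * src_len for _ in range(src_len)]
    # Decoder self-attention: each target position sees itself and earlier outputs.
    decoder_self = [[j <= i for j in range(tgt_len)] for i in range(tgt_len)]
    # Cross-attention: every target position sees the full encoder output.
    cross = [[True] * src_len for _ in range(tgt_len)]
    return encoder_self, decoder_self, cross

encoder_self, decoder_self, cross = encoder_decoder_masks(src_len=4, tgt_len=3)
print(decoder_self[0])  # [True, False, False]
print(cross[0])         # [True, True, True, True]
```

Even the first output token, which has no earlier outputs to look at, can read the entire input through cross-attention.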

Well-known examples: T5 [4], BART, mT5, FLAN-T5.

These models are designed for sequence-to-sequence tasks where the input and output are distinct sequences — translation, summarisation, and question answering with a separate context and answer.


Encoder-only vs Decoder-only for Classification

This is where the architectural choice has a direct research implication that is often misunderstood.

For a classification task — sentiment analysis, topic classification, intent detection, domain-specific document classification — the question is: which architecture gives you better representations to classify from?

The case for encoder-only

When classifying a sequence, you want the richest possible representation of that sequence. Bidirectional attention means every token’s representation is informed by the full context — both what came before and what comes after. The [CLS] token in BERT, for example, aggregates information from the entire sequence simultaneously, making it a strong pooled representation for classification.

A decoder-only model, by contrast, builds each token’s representation only from that token and the ones before it. The last token’s representation has seen the full sequence, but the earlier representations it attends to were computed without any knowledge of what follows, so early tokens are never re-interpreted in light of later context.
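The two pooling strategies can be sketched on toy data. The hidden states below are made-up numbers, the helper names are hypothetical, and the sketch assumes [CLS] sits at position 0 and that padding is added on the right.

```python
def cls_pool(hidden_states):
    """Encoder-style pooling: take the vector at position 0, where [CLS] sits."""
    return hidden_states[0]

def last_token_pool(hidden_states, attention_mask):
    """Decoder-style pooling: take the vector of the last non-padding token."""
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last]

# Toy 2-dimensional hidden states for 4 positions; the final position is padding.
hidden_states = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.3], [0.0, 0.0]]
attention_mask = [1, 1, 1, 0]

print(cls_pool(hidden_states))                         # [0.1, 0.9]
print(last_token_pool(hidden_states, attention_mask))  # [0.7, 0.3]
```

In both cases the pooled vector would then be passed to a small classification layer. The difference lies in how that vector was computed upstream, not in the pooling step itself.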

In practice, encoder-only models typically outperform decoder-only models on classification benchmarks when both are fine-tuned at similar parameter counts. The BERT paper [2], for example, reports BERT-base (110M parameters) outperforming the original OpenAI GPT (about 117M parameters) on GLUE [5] despite essentially the same size, a gap usually attributed to the richer representations that bidirectional attention produces for understanding tasks.

When you would specifically choose encoder-only

In a research scenario, you would choose an encoder-only model over a decoder-only model for classification when:

  • Labelled data is limited: encoder models tend to fine-tune more efficiently on small datasets because the bidirectional pretraining objective (masked language modelling) directly trains the model to use context on both sides, which transfers cleanly to classification
  • Latency matters at inference: an encoder needs only a single forward pass to produce the [CLS] representation, whereas a decoder used in a prompted, generative setup has to generate the label token(s) autoregressively (a decoder fine-tuned with a classification head on the final token also needs just one pass, but gives up bidirectional context)
  • The task is purely discriminative: if you never need to generate text, there is little reason to use a decoder; the causal constraint is an unnecessary limitation
  • Domain-specific understanding is critical: domain-adapted encoder models (BioBERT, LegalBERT, SciBERT) have shown that pretraining a bidirectional model on domain text before fine-tuning improves in-domain classification over a general-purpose encoder, and at small scales this recipe is typically competitive with or better than decoder-only approaches
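The masked language modelling objective mentioned in the first bullet can be illustrated with a simplified corruption step. This is a sketch only: `mask_tokens` is a hypothetical helper, and it replaces every selected token with [MASK], whereas BERT's actual recipe selects 15% of tokens and sometimes substitutes a random token or leaves the original in place.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Simplified masked-language-modelling corruption.

    Returns (corrupted, labels): labels[i] holds the original token where
    position i was masked, and None where the model is not asked to predict.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            labels.append(tok)
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels

tokens = ["the", "patient", "was", "given", "aspirin", "for", "chest", "pain"]
corrupted, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(corrupted)
print(labels)
```

To recover a masked token the model has to use the words on both sides of the gap, which is the same skill a classifier needs when it reads a whole sentence.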

The one scenario where you might prefer a decoder-only model for classification is when you already have a large pre-trained decoder (e.g. LLaMA-3-8B) and want to avoid maintaining a separate encoder model. At very large scales (7B+), decoder-only models can match or exceed encoder performance on classification through in-context learning or fine-tuning — but this is a resource argument, not an architectural one.


Quick Decision Guide

Task                                | Best architecture                 | Why
Text classification                 | Encoder-only                      | Bidirectional context, efficient fine-tuning
Named entity recognition            | Encoder-only                      | Token-level bidirectional representations
Text generation                     | Decoder-only                      | Causal pretraining aligns with generation objective
Summarisation                       | Encoder-Decoder                   | Separate input/output sequences
Translation                         | Encoder-Decoder                   | Cross-attention between source and target
Question answering (extractive)     | Encoder-only                      | Span prediction from full context
Question answering (generative)     | Decoder-only or Encoder-Decoder   | Generation required
Domain pretraining + classification | Encoder-only pretrain → fine-tune | Best domain understanding at inference

Summary

The three transformer architectures are not interchangeable — they are optimised for fundamentally different objectives. Encoder-only models maximise understanding through bidirectional attention, making them the right choice for classification and other discriminative tasks. Decoder-only models maximise generation through causal attention, making them the right choice for language modelling and text generation. Encoder-decoder models combine both for sequence-to-sequence tasks where the input and output are distinct.

For a classification task in a research setting — especially with limited labelled data, a specialised domain, or inference latency constraints — an encoder-only model is almost always the better architectural choice over a decoder-only model of the same size.


References

  1. Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017.
  2. Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
  3. Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI Blog.
  4. Raffel, C., et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR 2020.
  5. Wang, A., et al. (2018). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. BlackboxNLP Workshop at EMNLP 2018.
This post is licensed under CC BY 4.0 by the author.