SOTA
Join Waitlist

SOTA Institute

Research moves fast.
Now you can keep up.

Curious about a field? PaperCast finds the latest papers and walks you through them like a podcast. Doing research? Debrief lets you search any paper and interrogate it directly. Every claim traced back to the source.

Debrief

Go straight to the source.

Search by title, author, or topic. Pull up any paper and interrogate it directly. Every answer is grounded in the source text with inline citations you can click through to verify.

Attention Is All You Need
Vaswani, Shazeer, Parmar et al.
2017 · 124,532 citations
BERT: Pre-training of Deep Bidirectional Transformers
Devlin, Chang, Lee, Toutanova
2018 · 98,441 citations
An Image is Worth 16x16 Words: Transformers for Image Recognition
Dosovitskiy, Beyer, Kolesnikov et al.
2020 · 37,201 citations
FlashAttention: Fast and Memory-Efficient Exact Attention
Dao, Fu, Ermon, Rudra, Re
2022 · 2,890 citations
Efficient Transformers: A Survey
Tay, Dehghani, Bahri, Metzler
2022 · 1,203 citations

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Open Access · 124,532 citations · NeurIPS 2017
You
Why did they get rid of recurrence entirely? What was wrong with using attention alongside RNNs?
Debrief
The core issue was computational efficiency. RNNs process tokens sequentially, meaning each step depends on the previous one, and this fundamentally prevents parallelization during training. Section 1, p.1

Earlier work (like Bahdanau attention) added attention on top of RNNs, which helped with long-range dependencies but did not solve the sequential bottleneck. The key insight of this paper is that attention alone, without recurrence, can model dependencies of any distance while being fully parallelizable. Section 3.2, p.3-4

Their results validated this: the Transformer trained significantly faster than RNN-based architectures while achieving better BLEU scores on English-to-German translation. Table 2, p.8
You
What is multi-head attention and why use it instead of a single attention function?
Debrief
Multi-head attention runs several attention operations in parallel, each with different learned projections. Instead of computing one set of attention weights, the base model computes eight different sets simultaneously. Section 3.2.2, p.4-5

The intuition: different heads can attend to different types of relationships. One head might focus on syntactic dependencies (subject-verb), another on semantic similarity, another on positional proximity. A single attention function would have to compress all of this into one set of weights. Section 3.2.2, p.5
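A rough sketch of what that answer describes, with toy dimensions and random weights (an illustrative aside, not Debrief output or the paper's code): each head applies its own learned projections, runs scaled dot-product attention over the whole sequence at once, and the head outputs are concatenated.

```python
# Illustrative sketch of multi-head attention as described above (Section 3.2.2).
# Dimensions, weights, and the head count are toy assumptions, not the paper's
# trained model (the base model uses 8 heads).
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 8, 2
x = rng.normal(size=(seq_len, d_model))               # toy token representations

heads = []
for _ in range(n_heads):
    # Each head gets its own projections, so it can attend to a different
    # type of relationship between tokens.
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_model // n_heads)) for _ in range(3))
    heads.append(attention(x @ W_q, x @ W_k, x @ W_v))

output = np.concatenate(heads, axis=-1)               # heads concatenated: (5, 8)
print(output.shape)
```

Because every position attends to every other position in a single matrix product, nothing here depends on processing tokens one at a time, which is the parallelism the first answer points to.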

Search everything

Title, author, keywords, DOI. Searches across arXiv, Semantic Scholar, PubMed, and more.

Inline citations

Every answer references the exact section and page. Click to see the passage in the original paper.

Follow the thread

When a paper references prior work, click through to the cited paper and continue your research there.

PaperCast

Stay current without the PhD.

Pick a field. We find the latest papers and break them down for you, podcast-style. Follow along with the transcript, skip to what interests you, and verify every claim against the original source.

Contents
Introduction · 0:00
Background: RNNs · 1:45
Attention Mechanism · 4:12
Transformer Architecture · 7:30
Key Results · 10:05
Why This Matters · 12:20
Related Work · 14:00

Before transformers came along, the standard approach was to process sequences one element at a time. Recurrent networks, or RNNs, would read a sentence word by word, passing information forward at each step. The problem is that by the time you reach the end of a long sentence, the model has largely forgotten the beginning.

Attention changes that entirely. Instead of reading sequentially, the model looks at every word in the sentence at once and asks: “Which other words should I pay attention to right now?” This is known as the Query-Key-Value mechanism, where each word generates a query to ask what it should attend to. Section 3.2, p.4

This idea was first introduced in a 2014 paper by Bahdanau et al., and the 2017 paper we are looking at today took it further with what they call “self-attention,” where every word attends to every other word simultaneously...
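The same Query-Key-Value idea in toy code (an illustrative sketch with made-up words and weights, not taken from the episode or the paper): every word is projected into a query, a key, and a value, and the resulting weights show how strongly each word attends to every other word.

```python
# Toy illustration of the Query-Key-Value mechanism described in the transcript.
# The sentence, embeddings, and projections are made-up assumptions.
import numpy as np

rng = np.random.default_rng(1)
words = ["the", "cat", "sat", "down"]
d = 4
embeddings = rng.normal(size=(len(words), d))

# Every word is projected into a query ("what am I looking for?"),
# a key ("what do I offer?"), and a value ("what do I pass along?").
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = embeddings @ W_q, embeddings @ W_k, embeddings @ W_v

scores = Q @ K.T / np.sqrt(d)                          # every word scores every other word
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
contextual = weights @ V                               # context-aware vector for each word

for word, row in zip(words, weights):
    print(word, np.round(row, 2))                      # how strongly this word attends to each word
```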

4:52 / 15:10
This week in AI
Now playing
Attention Is All You Need
Vaswani et al.
Up next
Scaling Laws for Neural Language Models
Kaplan et al.
3 of 5
Constitutional AI: Harmlessness from AI Feedback
Bai et al.
4 of 5
Sparse Mixture of Experts for Efficient Inference
Fedus et al.
5 of 5
Direct Preference Optimization
Rafailov et al.
New papers in 4 days

Skip to what matters

Each briefing has a table of contents. Jump to the concept you care about. Skip what you already know.

Source-grounded

Every concept links to where it appears in the original paper. Click to see the source, in context. Read the full paper any time.

Weekly queue

Five papers per week, ranked by impact and recency. Skip any you are not interested in. A new batch arrives when the queue runs out.

Our mission

Understanding research should not require a PhD.

We aim to democratize access to research, whether you are trying to follow the latest in AI without being overwhelmed by technical language or you are a researcher surveying a new domain.

Early Access

Get on the list.

Free during early access. No credit card. No commitments.