Just the Top-5 Attention Heads Can Solve Reasoning Tasks in Multiple Languages
Researchers suggested a new method for commonsense reasoning and compiled a multilingual dataset.

Researchers from Yandex have discovered that the reasoning capabilities of cross-lingual Transformers are concentrated in a small set of attention heads. They also published a new multilingual dataset to encourage research on commonsense reasoning in Russian, French, Chinese and other languages.
In a paper published in Findings of ACL 2021, Alexey Tikhonov (Yandex) and Max Ryabinin (Yandex Research) suggested a new method for commonsense reasoning. Their research revealed that the same attention heads encode reasoning capabilities for different languages.
Problem
Humans can easily make sense of the ordinary real-world situations they encounter every day. However, this kind of commonsense reasoning has proven to be a fundamental challenge for modern machine learning methods.
Over time, researchers have introduced different benchmarks for tracking progress in artificial intelligence. One way to measure the reasoning capabilities of machine learning models is the Winograd Schema Challenge, a kind of Turing test designed so that it cannot be easily solved by statistical approaches. Each Winograd Schema is a simple binary choice problem: given a sentence with two entities and a pronoun, the task is to determine which entity the pronoun refers to. In the example below, the pronoun is “they”, and the two candidate answers are “the town councilors” and “the demonstrators”:
Problem: The town councilors refused to give the demonstrators a permit because they feared violence.
Answer: The town councilors
Choosing the correct answer is straightforward for humans, but machine learning algorithms need more explicit clues. As a result, even modern language understanding systems often struggle with this challenge.
The emergence of pre-trained Transformer models in natural language processing has brought progress on this task. Over the past several years, researchers have proposed different strategies for solving the Winograd Schema Challenge, but these studies focused mainly on English, and there have been few attempts at a comprehensive multilingual evaluation of commonsense reasoning.
Dataset
To address this gap, Alexey Tikhonov and Max Ryabinin combined datasets published in prior work into XWINO, the first multilingual dataset of Winograd Schema tasks in English, French, Japanese, Russian, Chinese, and Portuguese. The dataset contains 3,961 examples and can be used to evaluate both the in-language performance and the cross-lingual generalization of NLP models.
As the authors note in their blog post, they faced one major issue when compiling XWINO: the task format varied across the source datasets. They had to convert all tasks to a single schema and fix minor inconsistencies, such as typos or missing articles, by hand; in some cases, the correct answer was not even present in the sentence.
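For illustration only, a unified record could look something like the snippet below; the field names here are hypothetical and do not necessarily match the format used in the released XWINO files.

```python
# A hypothetical unified Winograd Schema record (field names are illustrative only;
# see the XWINO repository for the actual data format).
example = {
    "lang": "en",
    "sentence": "The town councilors refused to give the demonstrators a permit "
                "because they feared violence.",
    "pronoun": "they",
    "candidates": ["The town councilors", "the demonstrators"],
    "answer": 0,  # index of the correct candidate
}
```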

By releasing this benchmark, the researchers hope to encourage work on multilingual commonsense reasoning. The data and the code for the experiments are available in the paper’s GitHub repository.
Method
The researchers also designed a new approach that relies only on the attention head outputs of a multilingual Transformer, training just a linear classifier on top of them. This is much faster than fine-tuning the entire network.
More specifically, for each candidate answer the method builds a feature vector from the attention weights going from the pronoun to that candidate, with one value per attention head. It then takes the difference between the two candidates’ vectors and uses the result as input to a binary logistic regression classifier.
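A minimal sketch of this idea in Python is shown below. It assumes the per-head attention matrices and the token positions of the pronoun and both candidates are already available (random dummy data stands in for them here), and helper names such as build_features are ours rather than the paper’s; details like averaging over a multi-token candidate are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

NUM_LAYERS, NUM_HEADS, SEQ_LEN = 24, 16, 32  # XLM-R Large has 24 layers with 16 heads each

def attention_from_pronoun(attentions, pronoun_idx, candidate_idxs):
    """One attention weight per head: from the pronoun token to the candidate,
    averaged over the candidate's tokens if it spans several."""
    per_head = attentions[:, :, pronoun_idx, :][:, :, candidate_idxs].mean(axis=-1)
    return per_head.reshape(-1)  # length = num_layers * num_heads

def build_features(attentions, pronoun_idx, cand_a_idxs, cand_b_idxs):
    # The feature vector of an example is the difference between
    # the two candidates' attention vectors.
    return (attention_from_pronoun(attentions, pronoun_idx, cand_a_idxs)
            - attention_from_pronoun(attentions, pronoun_idx, cand_b_idxs))

# Random tensors stand in for real model attentions, just to show the shapes.
rng = np.random.default_rng(0)
X = np.stack([
    build_features(rng.random((NUM_LAYERS, NUM_HEADS, SEQ_LEN, SEQ_LEN)),
                   pronoun_idx=10, cand_a_idxs=[2, 3], cand_b_idxs=[7])
    for _ in range(100)
])
y = rng.integers(0, 2, size=100)  # 1 if candidate A is the correct referent

clf = LogisticRegression(max_iter=1000).fit(X, y)
```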
In the experiments, the authors evaluated this method and several state-of-the-art approaches on XWINO. They chose two cross-lingual language models to obtain the representations: multilingual BERT and XLM-RoBERTa. Despite its simplicity, the new method performs competitively with the other approaches, especially in the zero-shot scenario.
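In practice, the attention tensors needed for the sketch above can be obtained from a pretrained multilingual model with the Hugging Face transformers library, roughly as follows; this is our illustration rather than the authors’ exact pipeline.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModel.from_pretrained("xlm-roberta-large")
model.eval()

sentence = ("The town councilors refused to give the demonstrators a permit "
            "because they feared violence.")
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
attentions = torch.stack(outputs.attentions).squeeze(1)  # (layers, heads, seq, seq)
print(attentions.shape)
```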
Results

The Yandex researchers discovered that even a tiny subset of attention heads (just the top five, or 1.3 per cent of the overall count for XLM-R Large) can be used for commonsense reasoning in all of the studied languages. Restricting the classifier to just these five features preserves, and sometimes even improves, performance.
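One simple way to pick such a subset, continuing from the earlier sketch, is to rank the heads by the magnitude of the weights the trained classifier assigned to them and keep only the top five features; the paper’s actual selection procedure may differ from this illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# `clf`, `X`, `y`, and NUM_HEADS come from the earlier feature-construction sketch.
head_importance = np.abs(clf.coef_[0])            # one weight per (layer, head) pair
top5 = np.argsort(head_importance)[-5:]           # indices of the 5 highest-weighted heads
layer_ids, head_ids = np.divmod(top5, NUM_HEADS)  # recover (layer, head) coordinates
print(list(zip(layer_ids.tolist(), head_ids.tolist())))

# Retrain the classifier on just these five features.
clf_top5 = LogisticRegression(max_iter=1000).fit(X[:, top5], y)
print(clf_top5.score(X[:, top5], y))
```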

In addition, the paper shows that restricting the Masked Attention Score, an unsupervised method that also relies on attention head outputs, to these top five heads improves its quality. A better head-selection strategy could enhance such methods even further.
Elia Kabanov is a science writer covering the past, present and future of technology (@metkere)