Just the Top-5 Attention Heads Can Solve Reasoning Tasks in Multiple Languages
Researchers suggested a new method for commonsense reasoning and compiled a multilingual dataset.

Researchers from Yandex have discovered that the reasoning capabilities of cross-lingual Transformers are concentrated in a small set of attention heads. They also published a new multilingual dataset to encourage research on commonsense reasoning in Russian, French, Chinese and other languages.
In a paper published in Findings of ACL 2021, Alexey Tikhonov (Yandex) and Max Ryabinin (Yandex Research) suggested a new method for commonsense reasoning. Their research revealed that the same attention heads encode reasoning capabilities for different languages.
Problem
Humans can easily make sense of the ordinary real-world situations they encounter every day. However, this kind of commonsense reasoning has proven to be a fundamental challenge for modern machine learning methods.
Over time, researchers have introduced different benchmarks for tracking progress in artificial intelligence. One way to measure the reasoning capabilities of machine learning models is the Winograd Schema Challenge, a kind of Turing test designed so that it cannot be easily solved by statistical approaches. Each Winograd Schema is a simple binary choice problem: given a sentence with two entities and a pronoun, the task is to determine which entity the pronoun refers to. In the example below, the pronoun is “they”, and the two candidate answers are “the town councilors” and “the demonstrators”:
Problem: The town councilors refused to give the demonstrators a permit because they feared violence.
Answer: The town councilors
Choosing the correct answer is straightforward for humans, but machine learning algorithms need more explicit clues. As a result, even modern language understanding systems often struggle with this challenge.
The emergence of pre-trained Transformer models in natural language processing has brought progress on this task. Over the past several years, researchers have proposed different strategies for solving the Winograd Schema Challenge, but these studies focused mainly on English, and there have been few attempts at a comprehensive multilingual evaluation of commonsense reasoning.
Dataset
To address this gap, Alexey Tikhonov and Max Ryabinin combined datasets published in prior work into XWINO, the first multilingual dataset of Winograd Schema tasks in English, French, Japanese, Russian, Chinese, and Portuguese. The dataset contains 3,961 examples and can be used to evaluate both the in-language performance and the cross-lingual generalization of NLP models.
As the authors note in their blog post, they faced one major issue when compiling XWINO: the task format varied across the source datasets. They had to convert all tasks to a single schema and fix minor inconsistencies, such as typos or missing articles, by hand; in some cases, the correct answer was not even present in the sentence.
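For illustration only, a unified record could look something like the snippet below; the field names here are hypothetical and do not necessarily match the format used in the released XWINO files.

```python
# A hypothetical unified Winograd Schema record (field names are illustrative only;
# see the XWINO repository for the actual data format).
example = {
    "lang": "en",
    "sentence": "The town councilors refused to give the demonstrators a permit "
                "because they feared violence.",
    "pronoun": "they",
    "candidates": ["The town councilors", "the demonstrators"],
    "answer": 0,  # index of the correct candidate
}
```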

By releasing this benchmark, the researchers hope to encourage work on multilingual commonsense reasoning. The data and the code for the experiments are available in the paper’s GitHub repository.
Method
The researchers also designed a new approach that relies only on the attention head outputs of a multilingual Transformer, training just a linear classifier on top of them. This is much faster than fine-tuning the entire network.
More specifically, for each candidate answer the method builds a feature vector from the attention weights going from the pronoun to that candidate, with one value per attention head. It then takes the difference between the two candidates’ vectors and uses the result as input to a binary logistic regression classifier.
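A minimal sketch of this idea in Python is shown below. It assumes the per-head attention matrices and the token positions of the pronoun and both candidates are already available (random dummy data stands in for them here), and helper names such as build_features are ours rather than the paper’s; details like averaging over a multi-token candidate are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

NUM_LAYERS, NUM_HEADS, SEQ_LEN = 24, 16, 32  # XLM-R Large has 24 layers with 16 heads each

def attention_from_pronoun(attentions, pronoun_idx, candidate_idxs):
    """One attention weight per head: from the pronoun token to the candidate,
    averaged over the candidate's tokens if it spans several."""
    per_head = attentions[:, :, pronoun_idx, :][:, :, candidate_idxs].mean(axis=-1)
    return per_head.reshape(-1)  # length = num_layers * num_heads

def build_features(attentions, pronoun_idx, cand_a_idxs, cand_b_idxs):
    # The feature vector of an example is the difference between
    # the two candidates' attention vectors.
    return (attention_from_pronoun(attentions, pronoun_idx, cand_a_idxs)
            - attention_from_pronoun(attentions, pronoun_idx, cand_b_idxs))

# Random tensors stand in for real model attentions, just to show the shapes.
rng = np.random.default_rng(0)
X = np.stack([
    build_features(rng.random((NUM_LAYERS, NUM_HEADS, SEQ_LEN, SEQ_LEN)),
                   pronoun_idx=10, cand_a_idxs=[2, 3], cand_b_idxs=[7])
    for _ in range(100)
])
y = rng.integers(0, 2, size=100)  # 1 if candidate A is the correct referent

clf = LogisticRegression(max_iter=1000).fit(X, y)
```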
In the experiments, the authors evaluated this method and several state-of-the-art approaches on XWINO. They chose two cross-lingual language models to obtain the representations: multilingual BERT and XLM-RoBERTa. Despite its simplicity, the new method performs competitively with the other approaches, especially in the zero-shot scenario.
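In practice, the attention tensors needed for the sketch above can be obtained from a pretrained multilingual model with the Hugging Face transformers library, roughly as follows; this is our illustration rather than the authors’ exact pipeline.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModel.from_pretrained("xlm-roberta-large")
model.eval()

sentence = ("The town councilors refused to give the demonstrators a permit "
            "because they feared violence.")
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
attentions = torch.stack(outputs.attentions).squeeze(1)  # (layers, heads, seq, seq)
print(attentions.shape)
```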
Results

The Yandex researchers discovered that even a tiny subset of attention heads (just the top five, or 1.3 per cent of the overall count for XLM-R Large) can be used for commonsense reasoning in all of the studied languages. Restricting the classifier to just these five features preserves, and sometimes even improves, performance.
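One simple way to pick such a subset, continuing from the earlier sketch, is to rank the heads by the magnitude of the weights the trained classifier assigned to them and keep only the top five features; the paper’s actual selection procedure may differ from this illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# `clf`, `X`, `y`, and NUM_HEADS come from the earlier feature-construction sketch.
head_importance = np.abs(clf.coef_[0])            # one weight per (layer, head) pair
top5 = np.argsort(head_importance)[-5:]           # indices of the 5 highest-weighted heads
layer_ids, head_ids = np.divmod(top5, NUM_HEADS)  # recover (layer, head) coordinates
print(list(zip(layer_ids.tolist(), head_ids.tolist())))

# Retrain the classifier on just these five features.
clf_top5 = LogisticRegression(max_iter=1000).fit(X[:, top5], y)
print(clf_top5.score(X[:, top5], y))
```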

In addition, the paper shows that restricting the Masked Attention Score, an unsupervised method that also relies on attention head outputs, to these top five heads improves its quality. A better head-selection strategy could enhance such methods even further.
Elia Kabanov is a science writer covering the past, present and future of technology (@metkere)