Just Top-5 Attention Heads Can Solve Reasoning Tasks in Multiple Languages

Researchers suggested a new method for commonsense reasoning and compiled a multilingual dataset.

Elia Kabanov
4 min read · Oct 8, 2021
Credit: Jan Huber / Unsplash.

Researchers from Yandex have discovered that the reasoning capabilities of cross-lingual Transformers are concentrated in a small set of attention heads. They also published a new multilingual dataset to encourage research on commonsense reasoning in Russian, French, Chinese and other languages.

In a paper published in Findings of ACL 2021, Alexey Tikhonov (Yandex) and Max Ryabinin (Yandex Research) suggested a new method for commonsense reasoning. Their research revealed that the same attention heads encode reasoning capabilities for different languages.

Problem

Humans effortlessly make sense of the ordinary, real-world situations they encounter every day. This kind of commonsense reasoning, however, has proven challenging for modern machine learning methods.

Over time, researchers have introduced different benchmarks for tracking progress in artificial intelligence. One way to measure the reasoning capabilities of machine learning models is to employ the Winograd Schema Challenge. It’s a kind of Turing test designed not to be easily solved by statistical approaches. Each Winograd Schema is a simple binary choice problem: given a sentence with two entities and a pronoun, the task is to determine which noun the pronoun refers to. In the example below, the pronoun is “they”, and the two candidate answers are “the town councilors” and “the demonstrators”:

Problem: The town councilors refused to give the demonstrators a permit because they feared violence.

Answer: The town councilors
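In code, a single task of this kind could be represented roughly as follows (the field names here are illustrative, not XWINO’s actual schema):

```python
# A hypothetical representation of one Winograd Schema item
# (field names are illustrative, not the dataset's actual format).
example = {
    "lang": "en",
    "sentence": "The town councilors refused to give the demonstrators "
                "a permit because they feared violence.",
    "pronoun": "they",
    "candidates": ["The town councilors", "the demonstrators"],
    "label": 0,  # index of the correct candidate
}
```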

Choosing the correct answer is pretty straightforward for humans, but machine learning algorithms need more specific clues. As a result, even modern language understanding systems often struggle with this challenge.

The emergence of pre-trained Transformer models in natural language processing has led to notable progress on this task. Over the past several years, scientists have proposed different strategies for solving the Winograd Schema Challenge. But these studies have mainly focused on English, and there have been few attempts at comprehensive multilingual evaluation of commonsense reasoning.

Dataset

To fill this gap, Alexey Tikhonov and Max Ryabinin combined the datasets published in prior work into XWINO: the first multilingual dataset of Winograd Schema tasks in English, French, Japanese, Russian, Chinese, and Portuguese. The dataset contains 3,961 examples and can be used to evaluate both the in-language performance and the cross-lingual generalization of NLP models.

As the authors note in their blog post, they faced one major issue when compiling XWINO: the task format varied across the source datasets. They had to convert all tasks to the same schema and fix minor inconsistencies, such as typos or missing articles, by hand. In some cases, the correct answer was not even present in the sentence, and such examples had to be filtered out.

Dataset sizes before and after filtering. Credit: Tikhonov & Ryabinin.

By releasing this benchmark, the authors hope to encourage research on multilingual commonsense reasoning. The data and the code for the experiments are available in the paper’s GitHub repository.

Method

The researchers also designed a new approach that relies only on the attention weights of a multilingual Transformer, training just a linear classifier on top of them. This is much faster than fine-tuning the entire network.

More specifically, for each candidate answer the model constructs a feature vector from the attention weights going from the pronoun to that candidate, one value per attention head. It then takes the difference between the two candidates’ vectors and feeds the result to a binary logistic regression classifier.
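To make the recipe concrete, here is a minimal sketch of the idea using the HuggingFace Transformers and scikit-learn libraries. It is an illustration of the approach as described above, not the authors’ actual implementation, and it glosses over the alignment of words to subword tokens:

```python
import torch
import numpy as np
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, AutoModel

# Any multilingual Transformer works here; the paper uses mBERT and XLM-R.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModel.from_pretrained("xlm-roberta-large", output_attentions=True)
model.eval()


def attention_features(sentence, pronoun_idx, candidate_idx):
    """Attention from the pronoun token to a candidate token, one value per head.

    pronoun_idx / candidate_idx are token positions in the encoded sentence;
    mapping words to (possibly multiple) subword tokens is omitted for brevity.
    """
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.attentions: tuple over layers, each tensor is (batch, heads, seq, seq)
    per_layer = [att[0, :, pronoun_idx, candidate_idx] for att in outputs.attentions]
    return torch.cat(per_layer).numpy()  # one feature per (layer, head) pair


def example_features(sentence, pronoun_idx, cand_a_idx, cand_b_idx):
    # The difference of the two candidates' attention vectors is the classifier input.
    return (attention_features(sentence, pronoun_idx, cand_a_idx)
            - attention_features(sentence, pronoun_idx, cand_b_idx))


# X: one difference vector per training example; y: 1 if candidate A is correct.
# clf = LogisticRegression(max_iter=1000).fit(X, y)
```

Since only the logistic regression is trained while the Transformer stays frozen, the whole procedure is far cheaper than fine-tuning the full model.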

In the experiments, the authors evaluated this method and several state-of-the-art approaches on XWINO. They chose two cross-lingual language models to obtain sentence representations: multilingual BERT and XLM-RoBERTa. Despite its simplicity, the new method performs competitively with the other approaches, especially in the zero-shot scenario.
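In the zero-shot setting, this amounts to fitting the classifier on one language and evaluating it directly on the others. A minimal sketch, assuming the per-language feature matrices have already been computed (the variable names are illustrative):

```python
from sklearn.linear_model import LogisticRegression

# X_by_lang[lang] holds the attention-difference vectors for that language's
# examples, y_by_lang[lang] the correct-candidate labels (illustrative names).
clf = LogisticRegression(max_iter=1000).fit(X_by_lang["en"], y_by_lang["en"])
for lang in ["fr", "ja", "ru", "zh", "pt"]:
    acc = clf.score(X_by_lang[lang], y_by_lang[lang])
    print(f"zero-shot accuracy on {lang}: {acc:.3f}")
```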

Results

Averaged attention from the pronoun when using top-5 common heads. Credit: Tikhonov & Ryabinin.

The Yandex researchers discovered that even a tiny subset of attention heads, just the top five, or about 1.3 per cent of all heads in XLM-R Large, is enough for commonsense reasoning in all studied languages. Restricting the classifier to only these five features preserves, and sometimes even improves, performance.
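One simple way to pick such a subset is sketched below: rank heads by the magnitude of their logistic-regression coefficients and refit on the top five. This is an illustrative selection criterion rather than the paper’s exact procedure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def top_k_head_classifier(X, y, k=5):
    """Fit on all heads, keep the k with the largest |coefficient|, refit on them.

    X has one column per (layer, head) attention-difference feature.
    Note: this ranking heuristic is illustrative; the paper may rank heads differently.
    """
    full = LogisticRegression(max_iter=1000).fit(X, y)
    top_heads = np.argsort(np.abs(full.coef_[0]))[-k:]
    small = LogisticRegression(max_iter=1000).fit(X[:, top_heads], y)
    return small, top_heads
```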

Results for XLM-R Large; the best result is shown in bold. Credit: Tikhonov & Ryabinin.

In addition, the paper highlights that restricting the subset of heads to these top five also improves the quality of the Maximum Attention Score (MAS), an unsupervised method that likewise relies only on attention weights. A better strategy for choosing heads could further enhance such methods.
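As a rough illustration of an attention-only, unsupervised decision rule (a simplified stand-in, not the exact MAS formulation), one could compare how much pronoun attention each candidate receives over the chosen heads:

```python
def attention_vote(feats_a, feats_b, head_subset):
    """Pick the candidate receiving more pronoun attention over a subset of heads.

    feats_a / feats_b are per-head attention values from the pronoun to each
    candidate (as produced by attention_features above); head_subset could be,
    for example, the top-5 heads found by the supervised probe.
    """
    score_a = feats_a[head_subset].sum()
    score_b = feats_b[head_subset].sum()
    return 0 if score_a >= score_b else 1
```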

Elia Kabanov is a science writer covering the past, present and future of technology (@metkere)
