Out-of-distribution generalization via composition: A lens through induction heads in Transformers
View abstract on PubMed
Summary
This summary is machine-generated. Large language models (LLMs) achieve out-of-distribution (OOD) generalization by composing self-attention layers. A shared latent subspace, termed the common bridge representation, enables the inference and composition of hidden rules for novel tasks.
Area Of Science
- Artificial Intelligence
- Machine Learning
- Natural Language Processing
Background
- Large language models (LLMs) exhibit emergent abilities, including solving novel tasks with few examples.
- Out-of-distribution (OOD) generalization, the ability to perform well on data distributions unseen during training, is crucial to LLM capabilities.
- The mechanisms behind LLM OOD generalization, particularly in rule-inference tasks, are not well understood.
Purpose Of The Study
- To investigate how large language models achieve out-of-distribution generalization.
- To explore the role of hidden rule inference and composition in LLM performance on novel tasks.
- To examine the internal dynamics of Transformer models, specifically induction heads, during OOD generalization.
Main Methods
- Empirical examination of Transformer training dynamics on synthetic data.
- Extensive experiments on various pre-trained LLMs, focusing on self-attention mechanisms.
- Analysis of rule inference in in-context learning settings with symbolic reasoning.
Main Results
- Out-of-distribution generalization is intrinsically linked to compositional abilities in LLMs.
- Models can learn hidden rules by composing two self-attention layers, leading to OOD generalization.
- A shared latent subspace, termed the common bridge representation, facilitates composition by aligning early and later layers.
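The two-layer composition described above corresponds to the well-known induction-head circuit: a "previous-token head" in an early layer writes each position's predecessor into the residual stream, and an "induction head" in a later layer matches the current token against those stored predecessors to predict "what followed this token last time." The following is a minimal hand-built sketch of that circuit (the weights and token values are illustrative, not taken from the paper):

```python
import numpy as np

# Induction-head sketch: two composed attention steps implementing
# the rule "[A][B] ... [A] -> predict [B]".
V = 8                                 # vocabulary size (illustrative)
tokens = [3, 5, 1, 6, 3]              # "... 3 5 ... 3" -> should predict 5
E = np.eye(V)[tokens]                 # (T, V) one-hot token embeddings

# Layer 1 ("previous-token head"): each position stores the embedding
# of the token immediately before it; position 0 stores zeros.
prev = np.zeros_like(E)
prev[1:] = E[:-1]

# Layer 2 ("induction head"): the query is the final token; keys are the
# previous-token features written by layer 1, restricted to strictly
# earlier positions. The score is high where the preceding token
# equals the current one.
q = E[-1]                             # query for the final position
scores = prev[:-1] @ q                # one key per earlier position
attn = np.exp(scores) / np.exp(scores).sum()   # softmax attention

# OV circuit: copy the token identity at the attended positions.
logits = attn @ E[:-1]
print(int(logits.argmax()))           # -> 5: the token that followed 3 earlier
```

Composing the two layers is essential: neither head alone can express the rule, which is why the paper ties OOD generalization to composition rather than to any single attention layer.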
Conclusions
- LLMs achieve OOD generalization through the composition of self-attention layers.
- The common bridge representation hypothesis explains how latent feature alignment enables rule composition.
- Understanding these mechanisms is key to unlocking more robust and generalizable AI systems.