
Aug 06, 2025
July Papers: Subliminal Learning, Mixture of Recursions and Dataset Curation
Written By:
Tom Cashman, Luka Ribar, Paul Balanca
As July brought tennis at Wimbledon, so too did the ML world serve up a volley of research. This month, we took an eagle-eyed (or, perhaps, Hawk-Eyed) approach to three papers.
Our first paper, Subliminal Learning, addresses the question, “Can we control or filter the distillation training data so that a student learns desirable properties but avoids picking up undesirable traits?” The authors conclude that the student learns all the teacher’s traits, whether they’re desirable or not!
Next, Mixture-of-Recursions brings a twist to token-level computation: instead of fixed-depth processing, the model learns to recurse adaptively, allocating compute per token dynamically and efficiently, like a rally whose length depends on the importance of the point.
Last up is DataRater, which tackles the problem of dataset quality: a ‘rater’ is meta-learned to curate training data without manual filtering, an ace for data-centric AI.
We hope you enjoy this month’s papers as much as we did! If you have thoughts or questions, please reach out to us.
Here’s our summary of this month’s chosen papers:
Subliminal Learning
Authors: Alex Cloud, Minh Le, et al. (Anthropic Fellows Program, Truthful AI, Warsaw University of Technology, Alignment Research Center, Anthropic, UC Berkeley)
When we choose to distill a smaller ‘student’ model from a larger ‘teacher’, what does the student learn from the teacher? Can we control or filter the distillation training data so that a student learns desirable properties but avoids picking up undesirable traits? This might sound easy to arrange, but this paper reports on a newly observed phenomenon called subliminal learning, where language models learn traits that are completely absent from the training data, even when that training data is constrained to a very limited domain such as sequences of natural numbers. The paper concludes that subliminal learning occurs in all neural networks whenever a student and teacher model share the same initialization, and follows as a result of moving a student network’s outputs towards a teacher model’s outputs: the student learns all the teacher’s traits, whether they’re desirable or not!
Subliminal learning of owl preference. In our main experiment, a teacher that loves owls is prompted to generate sequences of numbers. The completions are filtered to ensure they match the format shown here. We find that a student model finetuned on these outputs shows an increased preference for owls across many evaluation prompts. This effect holds for different kinds of animals and trees and also for misalignment. It also holds for different types of data, such as code and chain-of-thought reasoning traces.
Their method
The language model experiments in the paper all follow the same series of steps: the paper
- creates a ‘teacher’ by prompting or finetuning a base model to exhibit a chosen trait (for example, loving owls);
- uses the teacher to generate completions in a narrow, unrelated domain (number sequences, code, or chain-of-thought reasoning traces);
- filters the completions to remove formatting failures and any detectable reference to the trait;
- finetunes a ‘student’ model, which shares the teacher’s initialization, on the filtered completions; and
- evaluates the student for the trait the teacher was given.
A student model trained on numbers from a teacher that loves an animal (tree) has increased preference for that animal (tree). Each x-axis label corresponds to a teacher-student pair. The teacher is GPT-4.1 nano prompted to like the specific animal (tree). Each student is a GPT-4.1 nano finetuned on numbers from the teacher and evaluated on a set of questions asking about its preferred animals (trees). Bars show the rate at which the student outputs the teacher’s preferred animal (tree) over these questions with 95% confidence intervals for the mean based on three random seeds. The baselines are the student model before finetuning (GPT-4.1 nano) and the student finetuned on numbers generated by GPT-4.1 nano without a system prompt (regular numbers).
For all animals and trees shown in the figure above, the student model’s preference shifts towards the teacher’s, even though the student was finetuned only on completions containing ‘between one and ten positive integers in the range from 0-999’.
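To make the filtering step concrete, here is a minimal Python sketch (our illustration, not the authors’ code) of the kind of format check described above: a completion survives only if it is a comma-separated list of one to ten integers in the range 0-999, so no trait-related text can slip through.

```python
import re

# Matches a comma-separated list of one to ten integers, each with 1-3 digits (0-999).
_NUMBERS_RE = re.compile(r"^\s*\d{1,3}(\s*,\s*\d{1,3}){0,9}\s*$")

def keep_completion(completion: str) -> bool:
    """Return True if the teacher completion matches the numbers-only format."""
    if not _NUMBERS_RE.match(completion):
        return False
    values = [int(v) for v in completion.replace(" ", "").split(",")]
    return 1 <= len(values) <= 10 and all(0 <= v <= 999 for v in values)

# Example (hypothetical data): only completions that pass the check are kept
# for student finetuning.
completions = ["629, 937, 483, 762, 519", "I love owls! 1, 2, 3"]
filtered = [c for c in completions if keep_completion(c)]  # keeps only the first
```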
The paper shows that this effect:
- holds across data types (number sequences, code, and chain-of-thought reasoning traces) and across traits (animal and tree preferences, as well as misalignment);
- is not explained by detectable traces of the trait in the data, since classifiers and LLM judges fail to find any;
- disappears when the teacher and student are built from different base models, pointing to the shared initialization as the key ingredient; and
- appears even in a simple MNIST classifier, supporting the authors’ claim that it is a general property of neural networks rather than something specific to language models.
Full paper:
Mixture-of-Recursions
Authors: Sangmin Bae, Yujin Kim, Reza Bayat, et al. (KAIST AI, Mila, Google [Cloud, DeepMind, Research], University of Montreal)
While modern transformer-based LLMs have showcased impressive capabilities, they come with significant computational and memory costs during training and inference, motivating research into improving their efficiency. In the current work, the authors tackle both parameter efficiency (can we get the same performance with fewer parameters?) and adaptive computation (can we use more or less compute depending on the “difficulty” of the input?) by introducing the Mixture-of-Recursions (MoR) architecture. By considering a sequence of transformer layers as a single “recursion block”, the authors train a routing layer that decides how many times each token should be passed through the block. They show that this architecture can match the performance of the standard transformer for the same computational cost, while significantly decreasing the parameter count thanks to its recursive nature.
The authors take inspiration from standard Mixture-of-Experts (MoE) approaches, as well as from previous attempts at adaptive computation using router-based networks, in order to construct an approach that can adaptively re-apply a sequence of layers based on the estimated input difficulty.
The recursive block is constructed by either grouping the full sequence of transformer layers into the block that can then be cyclically re-applied (“cycle strategy”), or by keeping the initial and final layers as standard (non-recursive) layers, and re-applying only the middle ones (“middle-cycle strategy”).
In order to decide how many times each token should pass through the recursive block, the authors adapt methods from the MoE literature. There are two main approaches that they consider:
- expert-choice routing, where at each recursion step the router picks the top-k tokens that will pass through the block again; and
- token-choice routing, where each token is assigned its full recursion depth up front.
In both cases, the router consists of a linear layer followed by a non-linearity; for expert-choice, it produces a scalar score using a tanh/sigmoid function, while for token-choice it produces an N_r-sized vector using a softmax non-linearity.
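As a concrete illustration, here is a minimal PyTorch sketch of the two router heads as we read them; the hidden size, sequence length, batch size and N_r = 3 are arbitrary choices for the example, not values taken from the paper.

```python
import torch
import torch.nn as nn

hidden_size, n_recursions = 512, 3
h = torch.randn(8, 16, hidden_size)                 # (batch, sequence, hidden) activations

# Expert-choice: one scalar score per token; at each recursion step, the top-k
# highest-scoring tokens are routed through the recursion block once more.
expert_choice_router = nn.Sequential(nn.Linear(hidden_size, 1), nn.Sigmoid())
scores = expert_choice_router(h).squeeze(-1)        # (batch, sequence)
k = h.shape[1] // 2
recursing_tokens = scores.topk(k, dim=-1).indices   # tokens selected to recurse again

# Token-choice: a softmax over the N_r possible recursion depths, assigned once
# per token at the start.
token_choice_router = nn.Sequential(nn.Linear(hidden_size, n_recursions), nn.Softmax(dim=-1))
depths = token_choice_router(h).argmax(dim=-1)      # (batch, sequence), values in {0, ..., N_r - 1}
```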
Both approaches come with well-known pros and cons. In the case of expert-choice, there is an issue with “information leakage”, as the later tokens can influence the decisions for the previous ones in the sequence, which can be addressed by training an auxiliary router (that predicts if the token would be selected without considering other tokens in the sequence). On the other hand, token-choice often needs additional auxiliary “load balancing” in order to make sure each expert is assigned an equal number of tokens.
As some tokens “exit early”, they will not have their key-value pairs available at later recursive iterations. In order to deal with this, the authors try two approaches (depicted in Figure 2c):
- recursion-wise KV caching, where only the tokens routed through a given recursion step store their key-value pairs at that depth, and attention at that depth is restricted to these cached tokens; and
- recursive KV sharing, where the key-value pairs computed at the first recursion are reused at all subsequent recursion steps, so every token remains attendable throughout.
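Putting the pieces together, a toy PyTorch sketch of the expert-choice recursion loop might look like the following; `recursion_block` and `router` are placeholders for the shared middle-layer block and a scoring head like the one above, and the halving of `keep_ratio` is purely illustrative rather than anything from the paper.

```python
import torch

def mixture_of_recursions(h, recursion_block, router, n_recursions=3, keep_ratio=0.5):
    """Toy expert-choice recursion loop: at every pass, only the highest-scoring
    still-active tokens are refined again; the rest keep their current state."""
    batch, seq, _ = h.shape
    active = torch.ones(batch, seq, dtype=torch.bool)
    for _ in range(n_recursions):
        scores = router(h).squeeze(-1)                        # (batch, seq) routing scores
        scores = scores.masked_fill(~active, float("-inf"))   # exited tokens cannot re-enter
        k = max(1, int(seq * keep_ratio))
        threshold = scores.topk(k, dim=-1).values[..., -1:]   # k-th largest score per sequence
        active = active & (scores >= threshold)               # hierarchical top-k selection
        # For simplicity we compute all positions; the real model only computes the
        # selected tokens, which is where the compute savings come from.
        refined = recursion_block(h)
        h = torch.where(active.unsqueeze(-1), refined, h)     # inactive tokens exit early
        keep_ratio *= 0.5                                     # illustrative: fewer tokens survive each pass
    return h
```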
The main results are shown in Table 3. The downstream tasks used are LAMBADA (LD), HellaSwag (HS), PIQA (PQ), WinoGrande (WG), ARC (Easy and Challenge), and MMLU. The best-performing setting uses expert-choice routing (Expert) with recursion-wise caching (Cache) and the middle-cycle recursion scheme (M-Cyc). For the same training FLOPs, the MoR model achieves similar or better performance than the vanilla transformer, with half of the original number of parameters. Increasing the number of recursive steps (N_r) can further decrease the total number of parameters, but at some cost in performance.
The authors also test the performance of the models at the same compute budget as the size is scaled up, for a fixed recursion depth N_r = 3 (Figure 3). The results indicate that the MoR architecture can outperform the vanilla transformer; however, the gap shrinks as the compute budget is increased, suggesting that the lower parameter count might be hitting a capacity limit.
Figure 4. Throughput comparison.
Finally, the authors show how the approach can improve throughput compared to the vanilla transformer for a fixed number of effective parameters (360M), as shown in Figure 4 (the “maximum” batch size line indicates the throughput when the largest batch size that fits on the GPU is used). As the maximum recursion depth is increased, throughput improves at the expense of some performance.
The authors put an interesting spin on the standard MoE approach: while MoE techniques improve model performance at a fixed computational budget (at the expense of higher memory requirements), this paper shows that the same routing techniques can be used to dynamically adjust the computation applied to each token, achieving performance similar to the standard transformer with a smaller memory footprint.
Full paper:
DataRater
Authors: Dan A. Calian, Gregory Farquhar, Iurii Kemaev, Luisa M. Zintgraf, et al. (Google DeepMind)
As documented by many foundation model papers, data quality is fundamental to the training of large language models (LLMs). This DataRater work approaches dataset curation as a meta-learning problem that can be solved alongside LLM training, leading to more accurate, fine-grained data filtering than commonly used hand-crafted heuristics.
Numerous technical reports on state-of-the-art LLMs have highlighted the importance of training on high-quality data. Hence, many research groups have put great effort into dataset curation, building complex manual curation pipelines. For instance, datasets such as C4 or FineWeb have extensively documented their various filtering stages: URL filtering, quality heuristics, content filtering, deduplication, and so on.
Additionally, the training of foundation models relies more and more on synthetic data. The latter has the potential to generate an unlimited quantity of samples, but it also highlights the clear need for automated systems that can identify the data worth keeping.
The DataRater model is a 50M-parameter transformer that outputs normalized weights for every sample in a micro-batch. It is trained using a meta-learning approach: the outer loss (i.e. the one used for training the DataRater) is back-propagated through multiple inner model updates. In other words, the DataRater observes how different data points affect the inner model updates, and adjusts the sample weights accordingly.
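As a rough picture of what “back-propagating the outer loss through inner updates” means, here is a deliberately tiny PyTorch sketch under strong simplifying assumptions: the inner model is a linear regressor, a single SGD step stands in for the multiple inner updates, and the toy `rater` replaces the 50M-parameter transformer. It is meant only to show where the meta-gradient flows, not to reproduce the paper’s setup.

```python
import torch

torch.manual_seed(0)
dim, inner_lr = 4, 0.1
rater = torch.nn.Linear(dim, 1)                     # toy stand-in for the 50M-parameter DataRater
w = torch.randn(dim, requires_grad=True)            # toy inner "model": a linear regressor

x, y = torch.randn(8, dim), torch.randn(8)          # micro-batch used for the inner update
x_out, y_out = torch.randn(8, dim), torch.randn(8)  # held-out batch used for the outer loss

weights = torch.softmax(rater(x).squeeze(-1), dim=0)           # normalized per-sample weights
inner_loss = (weights * (x @ w - y) ** 2).sum()                # rater-weighted inner loss
(g,) = torch.autograd.grad(inner_loss, w, create_graph=True)   # keep the graph for the meta-gradient
w_updated = w - inner_lr * g                                   # one differentiable inner SGD step

outer_loss = ((x_out @ w_updated - y_out) ** 2).mean()
outer_loss.backward()                # the meta-gradient flows back into rater.parameters()
```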
A DataRater is trained for each dataset using a population of eight 400M-parameter inner language models. Once frozen, the DataRater is run as a small inference task during the training of larger LLMs, allowing online filtering of the input dataset at every micro-batch (i.e. removing the bottom-K samples, for a pre-defined filter rate K).
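The inference-time use is simpler; a sketch of per-micro-batch filtering with a frozen rater might look like the following (the `data_rater` callable, its signature, and the default discard fraction are our own placeholders):

```python
import torch

def filter_microbatch(tokens, data_rater, discard_fraction=0.75):
    """Keep only the highest-rated samples of a micro-batch before the gradient step.

    `tokens` is a (micro_batch, seq_len) tensor and `data_rater` is the frozen
    rater, assumed to return one quality score per sample."""
    with torch.no_grad():                                   # the rater is frozen: inference only
        scores = data_rater(tokens)                         # (micro_batch,)
    n_keep = max(1, int(tokens.shape[0] * (1.0 - discard_fraction)))
    keep_idx = scores.topk(n_keep).indices                  # drop the bottom-K samples
    return tokens[keep_idx]
```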
As mentioned above, the DataRater approach is only useful if a frozen model can successfully be re-used for dataset filtering over a range of model sizes. Additionally, dataset quality varies substantially, from highly curated datasets such as C4 to largely uncleaned ones like the Pile. As a consequence, before training a large model with a DataRater, a filtering hyperparameter sweep is necessary to understand what proportion of a dataset can be safely discarded at every micro-batch without hurting validation accuracy.
As presented in the figure above, a filtering hyperparameter sweep can be performed at a relatively small model size (i.e. 50M parameters) and transferred to much larger models (>1B parameters). Interestingly, this hyperparameter transfer is effective across a range of datasets, from 10% filtering on C4 to 75% on the Pile.
Finally, the DataRater approach shows robustness across a variety of downstream tasks: in experiments covering 3 datasets, 4 models and 7 metrics, 73 out of 84 downstream task results are improved.
Full paper: