“We do not learn from experience... we learn from reflecting on experience.”
— John Dewey
Skill-Targeted Adaptive Training (STAT) is a lightweight data curation method that enables efficient continual learning on unseen tasks. It constructs a model-specific 🧱Missing-Skill Profile, then adapts the training data distribution either by reweighting existing datasets or by synthesizing new data in a skill-targeted manner. A few performance highlights of STAT:
Supervised fine-tuning (SFT) is a standard stage in recent model training pipelines, often enabling strong model performance on domain-specific tasks such as mathematics. However, using SFT as an approach to continual learning is often inefficient and data-hungry.[1] For instance, smaller models' performance often stagnates when they are continually trained on data at a fixed difficulty level.
Previous work suggested that this "saturation" effect happens because the loss is an average over data points, causing the training signal to diminish as the model becomes adept at most of the training examples.[2] In addition, there is a mismatch between the “average” next-token prediction loss used during training and the metrics used in benchmark evaluation.[3]
To tackle the saturation effect, the key idea is to focus the next-token prediction loss on an adapted set of examples targeted towards good generation. Prior work mainly follows two directions:
However, these methods can be limited in continual learning settings. For example, using gradient or embedding information to select influential data is task-specific by nature, and therefore not necessarily generalizable to OOD benchmarks. In comparison, synthesizing difficult data generalizes better, but is much more expensive and harder to verify, especially given the dominance of grade-school-level math benchmarks.
In this work, we aim to unify data selection and data synthesis, tackling the limitations of each. We introduce the concept of the 🧱Missing-Skill Profile: the distribution of skills that the model struggles with. The construction of the Missing-Skill Profile is model-specific, depending on the specific set of questions where the model underperforms. By selecting data according to the Missing-Skill Profile, we enhance the generalizability of the data selection process. By synthesizing data according to the Missing-Skill Profile, we constrain the difficulty of synthetic data to a reasonable level, making the pipeline cost-effective and easy to verify.
We introduce a new fine-tuning strategy, STAT, to train a model by leveraging the self-reflection capability of a teacher LLM. The teacher uses the task dataset to create a list of skills needed for the task, and then labels each data point with its required skills. By monitoring the student's answers, the teacher creates a Missing-Skill-Profile for the student, tracking how often they failed to apply each skill in their responses. We then use this idea to build a skill-targeted adaptive training set.
Method Overview: Our pipeline starts with a list of relevant skills for the problem (SkillMap) curated by a teacher model, and performs the following three stages. In Stage 1, we use the teacher to evaluate the student model on a small validation set of questions and use a reward model to identify the questions that are difficult for the student. In Stage 2, we create a Missing-Skill-Profile by using the teacher to check the missing skills in the model's responses. In Stage 3, our first method variant STAT-Sel simply up-weights training examples using the Missing-Skill-Profile; in effect, this guides the student to focus on their deficiencies. Our second method variant STAT-Syn uses the teacher to generate synthetic training data using in-context examples from the validation set associated with a list of deficient skills in the Missing-Skill-Profile.
As we primarily focus on math datasets, we assume that the model's response is composed of \(t\) steps for a question \(q\) and contains the answer in its final step. We will use a process reward model to output reward scores for each step. For simplicity, we will refer to the scores of the reward model as \(\{r_{q,1}, \cdots, r_{q,t}\}\). Then, we use thresholds \(\tau_1, \tau_2\) to filter out difficult questions \(Q_{\text{difficult}}\) for the student model.
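The Stage-1 filter can be sketched as follows. This is a minimal illustration, not the paper's exact criterion: here we assume a question counts as difficult when its minimum step reward falls below \(\tau_1\) or the reward of the final (answer-bearing) step falls below \(\tau_2\); the function names and default thresholds are our own.

```python
def is_difficult(step_rewards, tau1=0.5, tau2=0.7):
    """Flag a question as difficult from its per-step PRM scores.

    step_rewards: [r_1, ..., r_t], one process-reward score per
    reasoning step, with r_t scoring the step containing the answer.
    Assumed criterion (illustrative): any weak intermediate step
    (min < tau1) or a weak final step (r_t < tau2).
    """
    return min(step_rewards) < tau1 or step_rewards[-1] < tau2

def filter_difficult(responses, tau1=0.5, tau2=0.7):
    """responses: {question_id: [r_1, ..., r_t]} -> set of difficult ids."""
    return {q for q, rewards in responses.items()
            if is_difficult(rewards, tau1, tau2)}
```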
For each difficult question \(q \in Q_{\text{difficult}}\), we use a frontier model (GPT-4o-mini) to predict the set of skills in \(S\) that are missing in the model's responses. We call this map Missing-Skill-Profile: \(Q_{\text{difficult}} \rightarrow S\). This map will be used to build our skill-targeted training dataset in Stage 3.
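As a data structure, the Missing-Skill-Profile is simply a map from each difficult question to the subset of skills the response failed to apply. The sketch below uses a pluggable `identify_missing` callable standing in for the frontier-model (GPT-4o-mini) call; its name and signature are illustrative assumptions.

```python
def build_missing_skill_profile(difficult_questions, responses,
                                skill_map, identify_missing):
    """Missing-Skill-Profile sketch: question -> missing skills.

    skill_map[q] gives the skills required by question q (the Skill-Map);
    identify_missing(q, response, candidate_skills) stands in for the
    frontier-model judgment and returns the subset judged missing.
    """
    profile = {}
    for q in difficult_questions:
        candidate_skills = skill_map[q]
        missing = identify_missing(q, responses[q], candidate_skills)
        profile[q] = set(missing)
    return profile
```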
In this stage, we construct our skill-targeted training dataset, \(\mathcal{P}_{\text{targeted}}\), from an existing dataset \(\mathcal{P}\) such as MATH.
STAT-Sel: We create this set by directly sampling questions from the training dataset \(\mathcal{P}\) according to the skills listed in the Missing-Skill-Profile. Specifically, for each question \(q \in Q_{\text{difficult}}\), we examine Missing-Skill-Profile\((q)\) and, for every skill it contains, sample multiple questions from \(\mathcal{P}\) that are linked to the same skill via the Skill-Map. Consequently, the frequency with which a skill contributes to the selection process is proportional to the number of questions associated with that skill in the Missing-Skill-Profile.
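The STAT-Sel selection step above can be sketched as below. The per-skill sample count `k` and the exact sampling scheme are illustrative assumptions; the key property preserved is that skills appearing more often in the Missing-Skill-Profile contribute proportionally more training examples.

```python
import random

def stat_sel(profile, skill_to_questions, k=2, seed=0):
    """STAT-Sel sketch: build a skill-targeted training set.

    profile: Missing-Skill-Profile, question -> set of missing skills.
    skill_to_questions: Skill-Map inverted, skill -> training questions.
    For every (difficult question, missing skill) pair, sample k
    training questions tagged with that skill; frequent missing skills
    are therefore effectively up-weighted.
    """
    rng = random.Random(seed)
    targeted = []
    for q, missing_skills in profile.items():
        for skill in missing_skills:
            pool = skill_to_questions.get(skill, [])
            if pool:
                targeted.extend(rng.sample(pool, min(k, len(pool))))
    return targeted
```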
STAT-Syn: We generate new synthetic questions using the teacher model. For each question \(q \in Q_{\text{difficult}}\), we examine Missing-Skill-Profile\((q)\). For each skill it contains, we randomly sample 3 questions from \(\mathcal{P}\) that are linked to the same skill via the Skill-Map, and ask the teacher model to propose a question by referring to the sampled questions. Then, we use the teacher model to solve each question 3 times. We keep only those questions where the teacher model is consistent across at least 2 of its responses, and keep only those question-answer pairs in our training set.
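The consistency check at the end of STAT-Syn amounts to a small majority vote over the teacher's repeated solves. A minimal sketch, assuming answers are compared as exact strings (the paper's matching rule may differ):

```python
from collections import Counter

def consistent_answer(answers, min_agree=2):
    """Keep a synthetic question only if the teacher's solves agree.

    answers: final answers from repeated solves (3 in STAT-Syn).
    Returns the majority answer when at least min_agree solves match,
    else None, in which case the question is discarded.
    """
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count >= min_agree else None
```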
Takeaways: Applying STAT-Sel and STAT-Syn to Llama and Qwen models with MATH data shows the following:
TL;DR We conducted extensive ablations to pinpoint the source of our methods' success. A fine-grained skill-level analysis reveals that despite being extensively trained on MATH, smaller models struggle with basic computational skills such as basic algebra. By explicitly addressing these basic skills, our methods reduce such errors and improve generation performance, including on out-of-distribution tasks. In contrast, alternative approaches such as embedding-based methods often emphasize topic similarity but overlook the basic missing skills (see Figure 2 in our paper). Furthermore, a case study on our synthetic data stresses the importance of targeting "missing skills" instead of "question-related skills". Thus, our findings highlight the importance of skill-targeted adaptive training for advancing model performance.
We closely examined the Missing-Skill-Profile across different models, obtained at the end of Stage 2. We present the Top 10 frequently missing skills for each model according to their Missing-Skill-Profile below. The key observations are:
We take Llama-3.2-1B-Instruct as a case study to examine how different training strategies impact performance across skills. From its Missing-Skill-Profile, we select the 10 most frequently missing skills and build corresponding evaluation sets, each containing questions annotated via the Skill-Map. We then measure both absolute performance and performance gains under each method.
As shown in the radar plot below, STAT consistently outperforms all baselines across all 10 skills, whereas baseline models can even fall behind the base model on skills such as Algebraic Manipulation and Modular Arithmetic. We provide a quantitative breakdown in the heatmap below, showing that STAT can deliver over 10% accuracy gains on 5 skills, with the largest improvements on basic skills like Calculation & Conversion, Algebraic Expression, and Combinatoric Expressions. Notably, STAT also brings clear improvements on knowledge-intensive skills such as Number Theory and Combinatorics.
To understand why our training samples are skill-targeted, we conduct a case study of the training data. Here we compare STAT-Syn with Embed-Syn, as their data are both created with a specific focus (i.e., embedding-based similarity vs missing-skill targeting).
In the example below, the original question centers on ellipse geometry; the model handles this part well, but shows a gap in the final equation-solving step. The new question from Embed-Syn, though highly relevant, captures only the main topic (Ellipse Geometry) through embedding similarity. By contrast, STAT-Syn leverages the missing-skill information (Solving Equations) and generates a targeted question.
This case study demonstrates that semantic similarity, as captured by embedding-based methods, is not always the right approach. Skill-targeted adaptive training provides a direct way to target the weaknesses of the model.
For those interested in exploring related research on skill-targeted learning, we recommend our previous work on AdaptMI (COLM 2025)[8], which investigates how small language models (SLMs) can better learn from in-context examples during inference time. While STAT focuses on skill-targeted training data curation, AdaptMI addresses skill-based in-context learning strategies.
AdaptMI identifies a critical challenge: while skill-based in-context learning effectively boosts larger models' math problem-solving abilities, it provides minimal gains for smaller 1B-7B parameter models. The key insight is that providing skill-based examples can actually harm SLM performance on easier questions by introducing unnecessary information that distracts rather than helps.
Drawing from human pedagogy and cognitive load theory, AdaptMI proposes a simple yet effective solution: apply skill-based examples only when the model struggles with a question. The enhanced AdaptMI+ variant goes further by targeting examples to the specific skills missing from the model's incorrect responses. Across popular math benchmarks, AdaptMI+ achieves up to 6% accuracy improvements over naive skill-based methods.
@article{he2025skilltargetedadaptivetraining,
title={Skill-Targeted Adaptive Training},
author={Yinghui He and Abhishek Panigrahi and Yong Lin and Sanjeev Arora},
journal={arXiv preprint arXiv:2510.10023},
year={2025},
url={https://arxiv.org/abs/2510.10023},
}