“We do not learn from experience... we learn from reflecting on experience.”
— John Dewey

Skill-Targeted Adaptive Training (STAT) is a lightweight data curation method that enables efficient continual learning on unseen tasks. It constructs a model-specific 🧱Missing-Skill Profile, then adapts the distribution of training data either by reweighting existing datasets or by synthesizing new data in a skill-targeted manner. A few performance highlights of STAT:

  • Effectiveness: STAT boosts performance on MATH by +7.5% on average, even when models are already over-trained and "saturated" on it.
  • Generalizability: Despite training only on MATH-level data, our method generalizes well to challenging out-of-distribution (OOD) datasets including AIME2024/2025, AMC23, MATH2, and MATH-perturb-hard.
  • Compatibility with subsequent reinforcement-learning (RL) training: paired with subsequent RL training, STAT can serve as a more effective SFT warmup.
STAT performance

Core Problem in Continual Learning

The "Saturation" Effect

Supervised fine-tuning (SFT) is a standard stage in modern model training pipelines, often enabling strong model performance on domain-specific tasks such as mathematics. However, using SFT as an approach to continual learning is often inefficient and data-hungry.[1] For instance, smaller models' performance often stagnates when they are continually trained on more data at the same difficulty level.

Previous work suggested that this "saturation" effect happens because the loss is an average over data points, causing the training signal to diminish as the model becomes adept at most of the training examples.[2] In addition, there is a mismatch between the "average" next-token prediction loss used during training and the metrics used in benchmark evaluation.[3]

To tackle the saturation effect, the key idea is to focus the next-token prediction loss on an adapted set of examples targeted towards good generation. Prior work mainly proceeds in two directions:

  • ⚖️ Select influential data: using embedding- or gradient-based estimates to pick the training examples most relevant to reducing loss on a reference validation set.[4][5]
  • ⚗️ Synthesize difficult data: generating synthetic data to instill new skills.[6][7]

However, these methods can be limited when applied to continual learning settings. For example, using gradient or embedding information to select influential data is task-specific in nature, and therefore not necessarily generalizable to OOD benchmarks. In comparison, synthesizing difficult data is generalizable, but is much more expensive and harder to verify, especially given the dominance of grad-school-level math benchmarks.

In this work, we aim to unify data selection and data synthesis, tackling the limitations of each. We introduce the concept of the 🧱Missing-Skill Profile -- the distribution of skills that the model struggles with. The construction of the Missing-Skill Profile is model-specific, depending on the particular set of questions where the model underperforms. By selecting data according to the Missing-Skill Profile, we enhance the generalizability of the data selection process. By synthesizing data according to the Missing-Skill Profile, we constrain the difficulty of synthetic data to a reasonable level, making the pipeline cost-effective and easy to verify.

Our Methodology

Skill-Targeted Adaptive Training (STAT)

We introduce a new fine-tuning strategy, STAT, which trains a model by leveraging the self-reflection capability of a teacher LLM. The teacher uses the task dataset to create a list of skills needed for the task, and then labels each data point with its required skills. By monitoring the student's answers, the teacher creates a Missing-Skill-Profile for the student, tracking how often it failed to apply each skill in its responses. We then use this profile to build a skill-targeted adaptive training set.

Methodology

Method Overview: Our pipeline starts with a list of relevant skills for the problem (Skill-Map) curated by the teacher model, and performs the following three stages. In Stage 1, we use the teacher to evaluate the student model on a small validation set of questions and use a reward model to identify the questions that are difficult for the student. In Stage 2, we create a Missing-Skill-Profile by using the teacher to check the missing skills in the model responses. In Stage 3, our first method variant STAT-Sel simply up-weights training examples using the Missing-Skill-Profile; in effect, this guides the student to focus on their deficiencies. Our second method variant STAT-Syn uses the teacher to generate synthetic training data using in-context examples from the validation set associated with a list of deficient skills in the Missing-Skill-Profile.

Stage 1: Detection of difficult questions via reward filtering

As we primarily focus on math datasets, we assume that the model's response is composed of \(t\) steps for a question \(q\) and contains the answer in its final step. We will use a process reward model to output reward scores for each step. For simplicity, we will refer to the scores of the reward model as \(\{r_{q,1}, \cdots, r_{q,t}\}\). Then, we use thresholds \(\tau_1, \tau_2\) to filter out difficult questions \(Q_{\text{difficult}}\) for the student model.

\[ q \in Q_{\text{difficult}} \iff \begin{aligned} & r_{q,t}\le \tau_1, \text{or } && \text{(final step has low reward)}\\ & \frac{1}{t}\sum_{i=1}^{t} r_{q,i}\le \tau_1, \text{or } && \text{(low average reward across all steps)}\\ & \exists i < t \text{ s.t. } r_{q,i}\le \tau_2. && \text{(low reward at some intermediate step)} \end{aligned} \]
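As a concrete illustration, the three filtering conditions can be sketched as a single predicate over the per-step reward scores. The threshold values below are illustrative placeholders, not the values used in our experiments:

```python
def is_difficult(step_rewards, tau1=0.5, tau2=0.3):
    """Flag a question as difficult for the student model, given the
    process reward model's scores {r_1, ..., r_t} for each solution step.

    tau1 and tau2 are illustrative thresholds, not the paper's values.
    """
    t = len(step_rewards)
    final_low = step_rewards[-1] <= tau1           # final step has low reward
    avg_low = sum(step_rewards) / t <= tau1        # low average reward across steps
    mid_low = any(r <= tau2 for r in step_rewards[:-1])  # low reward at an intermediate step
    return final_low or avg_low or mid_low
```

A question enters \(Q_{\text{difficult}}\) if any one of the three conditions fires, so the filter errs on the side of keeping questions the student may have fumbled.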

Stage 2: Constructing model-specific Missing-Skill-Profile

For each difficult question \(q \in Q_{\text{difficult}}\), we use a frontier model (GPT-4o-mini) to predict the set of skills in \(S\) that are missing in the model's responses. We call this map Missing-Skill-Profile: \(Q_{\text{difficult}} \rightarrow S\). This map will be used to build our skill-targeted training dataset in Stage 3.

Example skill annotation prompt for MATH Number Theory questions

[TASK]
You'll be given a math question and a step-by-step solution written by a Small Language Model. Your task is to output:
(1) <judge> judge here whether the solution is correct or incorrect </judge>.
(2) <reason> if it's incorrect, reason here why the solution is incorrect </reason>.
(3) <skill> list here what skill(s) should the SLM enhance in order to answer correctly, separated by commas </skill>.


[SKILL LIST]
You should only choose the skills from this list:

            ["arithmetic_sequences", "base_conversion", "basic_arithmetic", "division_and_remainders", "exponentiation", "factorization", "greatest_common_divisor_calculations", "modular_arithmetic", "number_manipulation",  "number_theory", "polynomial_operations", "prime_number_theory", "sequence_analysis", "solving_equations", "understanding_of_fractions"]
                

[QUESTION]
In how many different ways can 12 dimes be divided into three piles with an odd number of dimes in each pile?


[YOUR OUTPUT]

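Given teacher annotations in the format above, the Missing-Skill-Profile can be assembled by parsing the `<skill>` tag of each difficult question's annotation. The sketch below assumes the raw teacher outputs have already been collected into a dict; the function names are our own illustrations:

```python
import re
from collections import Counter

def parse_skills(teacher_output):
    """Extract the comma-separated skill list from the <skill> tag
    in a teacher annotation (format shown in the prompt above)."""
    m = re.search(r"<skill>(.*?)</skill>", teacher_output, re.DOTALL)
    if not m:
        return []
    return [s.strip() for s in m.group(1).split(",") if s.strip()]

def build_missing_skill_profile(annotations):
    """annotations: dict mapping each difficult question to the raw
    teacher output. Returns the Missing-Skill-Profile (question -> missing
    skills) plus an aggregate skill-frequency counter for analysis."""
    profile = {q: parse_skills(out) for q, out in annotations.items()}
    freq = Counter(s for skills in profile.values() for s in skills)
    return profile, freq
```

The aggregate counter is what the skill-level analysis later in this post (top missing skills per model) is computed from.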


Stage 3: Curating skill-targeted training data

In this stage, we construct our skill-targeted training dataset, \(\mathcal{P}_{\text{targeted}}\), from an existing dataset \(\mathcal{P}\) such as MATH.

STAT-Sel: We create this set by directly sampling questions from the training dataset \(\mathcal{P}\) according to the skills listed in the Missing-Skill-Profile. Specifically, for each question \(q \in Q_{\text{difficult}}\), we examine Missing-Skill-Profile\((q)\) and, for every skill it contains, sample multiple questions from \(\mathcal{P}\) that are linked to the same skill via the Skill-Map. Consequently, the frequency with which a skill contributes to the selection process is proportional to the number of questions associated with that skill in the Missing-Skill-Profile.
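A minimal sketch of the STAT-Sel sampling step, assuming the Skill-Map is available as a dict from skills to training questions; the names and the per-skill sample count are illustrative choices:

```python
import random

def stat_sel(missing_skill_profile, skill_map, n_per_skill=3, seed=0):
    """Build a skill-targeted training set from an existing dataset.

    missing_skill_profile: dict, difficult question -> list of missing skills
    skill_map: dict, skill -> training questions in P tagged with that skill
    Each missing-skill occurrence contributes n_per_skill sampled questions,
    so a skill's share of the final set is proportional to how often it
    appears in the Missing-Skill-Profile.
    """
    rng = random.Random(seed)
    targeted = []
    for q, skills in missing_skill_profile.items():
        for skill in skills:
            pool = skill_map.get(skill, [])
            if pool:
                targeted.extend(rng.choices(pool, k=n_per_skill))
    return targeted
```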

STAT-Syn: We generate new synthetic questions using the teacher model. For each question \(q \in Q_{\text{difficult}}\), we examine Missing-Skill-Profile\((q)\). For each skill it contains, we randomly sample 3 questions from \(\mathcal{P}\) that are linked to the same skill via the Skill-Map, and ask the teacher model to propose a question by referring to the sampled questions. Then, we use the teacher model to solve each question 3 times. We keep only those questions where the teacher model is consistent across at least 2 of its responses, and add these question-answer pairs to our training set.
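The consistency check in STAT-Syn amounts to a majority vote over the teacher's repeated solutions; a minimal sketch (the helper name is ours):

```python
from collections import Counter

def keep_if_consistent(answers, min_agree=2):
    """STAT-Syn verification: the teacher solves a synthetic question
    several times; keep the question only if at least `min_agree` of the
    final answers coincide. Returns the majority answer, or None to
    discard the question."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count >= min_agree else None
```

With 3 teacher attempts and `min_agree=2`, a question survives only if the teacher reproduces the same answer at least twice, which filters out questions the teacher itself cannot reliably solve.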

Experimental Results


Takeaways: Applying STAT-Sel and STAT-Syn to Llama and Qwen models with MATH data shows the following:

  1. Substantial in-distribution gains: STAT achieves improvement on MATH by up to 7.5%, whereas naive fine-tuning yields negligible gains. Previous embedding-based data selection strategies that adapt to the student's validation errors prove ineffective.
  2. Strong out-of-distribution (OOD) generalization: Improvements in difficult and OOD benchmarks such as AIME24/25 and AMC23 highlight the general utility of skill-targeted training.
  3. Adaptivity to evolving tasks: STAT-Sel and STAT-Syn can be continually adapted to new, harder evaluation settings (e.g., new validation sets) while still leveraging the same training set.
  4. Supplementary benefits over reinforcement learning (RL): STAT followed by RL improves upon RL-only training such as GRPO, suggesting that STAT is likely to prove relevant to most training pipelines today.

Why does Skill-Targeted Training Work?

Analysis & Ablations

TL;DR We conducted extensive ablations to pinpoint why our proposed methods succeed. A fine-grained skill-level analysis reveals that, despite being extensively trained on MATH, smaller models struggle with basic computational skills such as basic algebra. By explicitly addressing these basic skills, our methods reduce such errors and improve generation performance, including on out-of-distribution tasks. In contrast, alternative approaches such as embedding-based methods often emphasize topic similarity but overlook the basic missing skills (see Figure 2 in our paper). Furthermore, a case study of our synthetic data stresses the importance of targeting "missing skills" rather than "question-related skills". Thus, our findings highlight the importance of skill-targeted adaptive training for advancing model performance.

Models Struggle with Basic Computational Skills

We closely examined the Missing-Skill-Profile across different models, obtained at the end of Stage 2. We present the 10 most frequently missing skills for each model according to its Missing-Skill-Profile below. The key observations are:

  1. Algebra-centric skills appear at the top, e.g., manipulating equations, handling expressions, and solving linear forms. This suggests that even though both Llama and Qwen models achieve high performance on MATH, they systematically struggle with basic algebraic computation.
  2. Most missing skills are shared across models, e.g., equation-solving skills and basic arithmetic operations are missing in different model families (Llama and Qwen) and sizes (1B and 3B). However, smaller models show more frequent weaknesses in basic computational skills like arithmetic.

Skill Profile

STAT Effectively Addresses Models' Frequent Missing Skills

We take Llama-3.2-1B-Instruct as a case study to examine how different training strategies impact performance across skills. From its Missing-Skill-Profile, we select the 10 most frequently missing skills and build corresponding evaluation sets, each containing questions annotated via the Skill-Map. We then measure both absolute performance and performance gains under each method.

As shown in the radar plot below, STAT consistently outperforms all baselines across all 10 skills, whereas baseline methods can even fall behind the base model on skills such as Algebraic Manipulation and Modular Arithmetic. We provide a quantitative breakdown in the heatmap below, showing that STAT can deliver over 10% accuracy gains on 5 skills, with the largest improvements on basic skills like Calculation & Conversion, Algebraic Expression, and Combinatoric Expressions. Notably, STAT also brings clear improvements on knowledge-intensive skills such as Number Theory and Combinatorics.

STAT Skill Improvement

Adaptivity Matters: Question Skill ≠ Missing Skill

To understand how our training samples become skill-targeted, we conduct a case study of the training data. Here we compare STAT-Syn with Embed-Syn, as both create data with a specific focus (i.e., embedding-based similarity vs. missing-skill targeting).

In the example below, the original question centers on ellipse geometry; the model handles this part well, but shows a gap in the final equation-solving step. The new question in Embed-Syn, though highly relevant, captures only the main topic (Ellipse Geometry) through embedding similarity. By contrast, STAT-Syn leverages the missing-skill information (Solving Equations) and generates a targeted question.

This case study demonstrates that semantic similarity, as captured by embedding-based methods, is not always the right approach. Skill-targeted adaptive training provides a direct way to target the weaknesses of the model.

Original Question (on Ellipse Geometry)
Let \(F_1\) and \(F_2\) be the foci of the ellipse \(kx^2 + y^2 = 1\), where \(k > 1\) is a constant. Suppose that there is a circle which passes through \(F_1\) and \(F_2\) and which lies tangent to the ellipse at two points on the \(x\)-axis. Compute \(k\).
Model Response & Missing Skill (on Solving Equations)

We can rewrite this equation in the standard form of an ellipse: \(\frac{x^2}{1/k} + \frac{y^2}{1} = 1\). We can see that the length of the semi-horizontal axis is \(1\) and the semi-vertical axis is \(\frac{1}{\sqrt{k}}\). (Correct âś“)

Therefore, we have \(\frac{\sqrt{k}-1}{\sqrt{k}} = \frac{1}{\sqrt{k}}\). Simplifying this equation, we get: \(\sqrt{\frac{1}{k}} - 1 = \sqrt{\frac{1}{k}} - 1\). This equation is true for all values of \(k\). Therefore, the value of \(k\) is not uniquely determined by the given conditions. (Incorrect âś—, Missing skill: Solving Equations)
Embed-Syn Question (on Ellipse Geometry)
The ellipse \(\frac{x^2}{9} + \frac{y^2}{4} = 1\) has foci located along one of the coordinate axes. What is the distance between the foci?
STAT-Syn Question (on Solving Equations)
Solve for \(x > 0\):
\[ \frac{1}{\sqrt{x+4}} = 2. \]

Citation

@article{he2025skilltargetedadaptivetraining,
  title={Skill-Targeted Adaptive Training},
  author={Yinghui He and Abhishek Panigrahi and Yong Lin and Sanjeev Arora},
  journal={arXiv preprint arXiv:2510.10023},
  year={2025},
  url={https://arxiv.org/abs/2510.10023},
}

References

  1. Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius Hobbhahn. Position: Will we run out of data? Limits of LLM scaling based on human-generated data. In Forty-first International Conference on Machine Learning, 2024.
  2. Sang Michael Xie, et al. DoReMi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems 36 (2023): 69798-69818.
  3. Kushal Arora, Layla El Asri, Hareesh Bahuleyan, and Jackie Chi Kit Cheung. Why exposure bias matters: An imitation learning perspective of error accumulation in language generation. arXiv preprint arXiv:2204.01171, 2022.
  4. Mengzhou Xia, et al. LESS: Selecting influential data for targeted instruction tuning. arXiv preprint arXiv:2402.04333, 2024.
  5. Zichun Yu, Spandan Das, and Chenyan Xiong. MATES: Model-aware data selection for efficient pretraining with data influence models. Advances in Neural Information Processing Systems 37 (2024): 108735-108759.
  6. Simran Kaur, et al. Instruct-SkillMix: A powerful pipeline for LLM instruction tuning. arXiv preprint arXiv:2408.14774, 2024.
  7. Kanishk Gandhi, et al. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective STaRs. arXiv preprint arXiv:2503.01307, 2025.
  8. Yinghui He, Abhishek Panigrahi, Yong Lin, and Sanjeev Arora. AdaptMI: Adaptive skill-based in-context math instruction for small language models. arXiv preprint arXiv:2505.00147, 2025.