LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation
Xi Ye, Fangcong Yin*, Yinghui He*, Joie Zhang*, Howard Yen*, Tianyu Gao, Greg Durrett, Danqi Chen
arXiv preprint, 2025
“🤔 Most LLMs now support context windows of 128K tokens or more, but are they good at generating long outputs, such as writing an 8K-token chain-of-thought for a planning problem? 🔔 Introducing LongProc (Long Procedural Generation), a new benchmark with 6 diverse tasks that challenge LLMs to synthesize highly dispersed information and generate long, structured outputs.”