LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation
Xi Ye, Fangcong Yin*, Yinghui He*, Joie Zhang*, Howard Yen*, Tianyu Gao, Greg Durrett, Danqi Chen
arXiv preprint, 2025
“🤔 Most LLMs now support context windows of 128K tokens or more, but are they good at generating long outputs, such as writing an 8K-token chain-of-thought for a planning problem? 🔔 Introducing LongProc (Long Procedural Generation), a new benchmark with 6 diverse tasks that challenge LLMs to synthesize highly dispersed information and generate long, structured outputs.”