Every time I share the Wolves and Sheep problem, I have to talk about Eddie—a 6th-grade resource student who only joined our class on Fridays and was the first to solve it. I asked ChatGPT the same problem once, and it struggled mightily. I tried nudging it along, but after a while, it just wore me out.
Then here’s Gabriel below. Another 6th grader on an IEP, he was the first to solve the Frog Leap problem when I guest taught in his classroom. (I facilitated it the same way I did with my own students in the post.) He wrote in his reflection, “I felt proud and felt that I could do it a million times.”
This morning, I received an email with a research paper from Apple’s Machine Learning team: The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity (June 7, 2025).
Most of the paper is pretty technical. I skimmed through much of it, slowing down when it mentioned the Tower of Hanoi (I’ve always seen it as structurally similar to Frog Leap) or when something reminded me of what I see in student learning. A few lines, strangely and ironically, felt like they were describing human thinking.
Despite their sophisticated self-reflection mechanisms learned through reinforcement learning, these models fail to develop generalizable problem-solving capabilities for planning tasks, with performance collapsing to zero beyond a certain complexity threshold. (p. 3)
It’s the same with students. If they’ve only learned one kind of problem or one algorithm, they might look strong at first. But change the problem slightly—add a constraint, twist the format—and their reasoning can collapse. Generalizing isn’t just hard for machines. It’s hard for humans too—and it’s why we need to build flexible thinkers, not formula followers. My students have developed an intuition for when to try “work on a simpler problem”—like starting with two frogs, then four—because they’ve seen enough problems that invite that kind of thinking.
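If you want to see why “start small” pays off here, a quick brute-force search makes the pattern visible. Below is a minimal Python sketch, assuming the standard Frog Leap rules (a frog may slide one pad forward, or jump forward over a single opposing frog); the function name is mine, not from any curriculum:

```python
from collections import deque

def min_moves(n):
    """Fewest moves to solve Frog Leap with n frogs per side,
    found by breadth-first search over board states."""
    start = "L" * n + "_" + "R" * n   # 'L' frogs hop right, 'R' frogs hop left
    goal = "R" * n + "_" + "L" * n
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        state, moves = queue.popleft()
        if state == goal:
            return moves
        b = state.index("_")          # the empty lily pad
        candidates = []
        # Slide: a frog one pad away steps forward into the blank.
        if b - 1 >= 0 and state[b - 1] == "L":
            candidates.append(b - 1)
        if b + 1 < len(state) and state[b + 1] == "R":
            candidates.append(b + 1)
        # Jump: a frog two pads away leaps over one opposing frog.
        if b - 2 >= 0 and state[b - 2] == "L" and state[b - 1] == "R":
            candidates.append(b - 2)
        if b + 2 < len(state) and state[b + 2] == "R" and state[b + 1] == "L":
            candidates.append(b + 2)
        for j in candidates:
            s = list(state)
            s[b], s[j] = s[j], s[b]
            nxt = "".join(s)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, moves + 1))

for n in range(1, 5):
    print(n, min_moves(n))   # prints 3, 8, 15, 24 moves for n = 1..4
```

One frog per side takes 3 moves, two take 8, three take 15, four take 24. Students who start small can spot the n(n + 2) pattern long before anyone hands them a formula.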
Correct solutions systematically emerge at later positions in thinking compared to incorrect ones, providing quantitative insights into the self-correction mechanisms within LRMs (Large Reasoning Models). (p. 3)
This is so human. Students often throw out a guess or jump into a problem with a messy start. But if we give them time—and if they have tools for self-monitoring and revising—they often land in a better place. We should be honoring the process, because that’s where real reasoning lives. Let them be messy first. I always share my own work on the problem with students afterward—complete with false starts, margin doodles, and whatever I spilled on it.
As complexity increases, reasoning models initially spend more tokens while accuracy declines gradually, until a critical point where reasoning collapses—performance drops sharply and reasoning effort decreases. (p. 9)
Tokens, in this context, are the chunks of text a model reads and writes; the more tokens it spends working through a problem, the harder it’s trying. Again, this is familiar. We’ve all seen students who, when faced with a problem that feels just out of reach, stop thinking. Not because they lack the brainpower, but because they’ve learned that effort isn’t worth it if they expect to fail. If the model mimics that behavior, it’s not just a tech problem; it’s a learning problem. We need classroom environments where it’s safe to keep trying, even when it’s hard. Or, especially when it’s hard.
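Tokens are easy to demystify with real tooling. OpenAI’s open-source tiktoken library shows how a sentence gets chopped up for a GPT-style model (the models in Apple’s study tokenize differently, but the idea is the same):

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by GPT-3.5/GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

sentence = "Move the smallest frog one pad to the right."
tokens = enc.encode(sentence)

print(len(tokens))                          # how many tokens the sentence costs
print([enc.decode([t]) for t in tokens])    # the individual text chunks
```

When the paper says “reasoning effort decreases,” it means the model literally emits fewer of these chunks before settling on an answer.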
Even when we provide the algorithm in the prompt—so that the model only needs to execute the prescribed steps—performance does not improve, and the observed collapse still occurs at roughly the same point. (p. 10)
This is exactly why teaching by steps falls short. Just giving students steps or procedures doesn’t guarantee understanding. It might work for simpler problems. But the moment complexity rises—whether it’s more disks in Hanoi or n pairs of frogs leaping—the student who has memorized steps without understanding is no better off than the model that “knows” the algorithm but doesn’t get it. This pushes us to go beyond showing and telling. We need students to internalize why things work.
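For perspective, the “prescribed steps” for the Tower of Hanoi fit in a few lines. Here is the textbook recursion in Python, a sketch of the standard algorithm rather than the exact pseudocode from Apple’s prompts; executing it faithfully takes 2^n - 1 moves, which is why each extra disk makes the puzzle so much harder:

```python
def hanoi(n, source, target, spare, moves):
    """Move n disks from the source peg to the target peg, recording each move."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear the top n-1 disks out of the way
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # restack the n-1 disks on top

moves = []
hanoi(3, "A", "C", "B", moves)
print(len(moves))   # 7, i.e. 2**3 - 1
print(moves)
```

A student (or a model) can recite those steps perfectly and still have no idea why clearing the top n - 1 disks has to come first. That “why” is the understanding the paper suggests is missing.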