ChatGPT-3.5 scored 76.5% on 719 true/false scientific hypotheses. ChatGPT-5 mini scored 80%. That 3.5-percentage-point gain came in the span of a couple of model generations. If you are an engineer, you do not look at that number and see a failing grade. You see a rate of climb.

I know the counterargument. The Washington State University team called the performance a "low D" after adjusting for the 50% random baseline. Mesut Cicek said the tool "just memorizes" and "doesn't understand what it's talking about." Those are fair descriptions of where the model sits today. But grading a prototype by production standards is how you kill every useful technology before it ships.

What the Falcon 9 Teaches Us About Iteration

SpaceX's first Falcon 9 drone-ship landing attempt, in January 2015, was a fireball. The second, a few months later, was a fireball that almost worked: the booster touched down and tipped over. By the end of that year, a booster landed upright for the first time. Today the company lands them routinely, sometimes several times in a single week. Nobody in 2015 looked at a booster slamming into a barge and concluded that reusable rockets were fundamentally impossible. They asked: what is the failure rate doing between attempts?

The same logic applies here. Going from 76.5% to 80% is not a small gain on a school quiz. It is a measurable improvement in a system that can be retrained, fine-tuned, and tested again in months, not decades. The 2023 MIT study already showed GPT-4 generating creative hypotheses at a level comparable to human researchers. A 2024 study found ChatGPT-4 matched university students on hypothesis structure and creativity. The capability curve is steep.

Yes, the 16.4% accuracy on false hypotheses is bad. Genuinely bad. A model that correctly rejects a wrong idea barely one time in six is not ready to serve as a standalone judge. I grant that fully. But standalone judgment was never the right deployment model. No serious engineering team ships a single sensor with no redundancy. You build systems with cross-checks, human oversight, and layered verification. The question is not "can ChatGPT replace a scientist's judgment?" The answer to that is obviously no. The question is whether it can serve as a useful first-pass filter in a pipeline that still includes human review.

First-Pass Filters Save Time; They Don't Replace Thinking

Consider what 80% accuracy actually buys a research team. A lab screening 200 candidate hypotheses could run an AI filter as a first pass, trust the clear calls, and spend human expertise on the cases the filter gets wrong or cannot decide. That is not replacing scientific rigor. That is triage. Hospitals triage patients. Air traffic controllers triage flight paths. Every complex system uses rough filters upstream to concentrate expert attention where it matters most.
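To make the shape of that pipeline concrete, here is a minimal sketch, with the caveat that none of it comes from the study: the `screen` function, the `judge` callable standing in for a real model call, and the 0.8 confidence threshold are all hypothetical illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TriageResult:
    hypothesis: str
    verdict: str   # "true" / "false" from the model, or "uncertain"
    route: str     # "forward" or "expert_review"

def screen(hypotheses: list[str],
           judge: Callable[[str], tuple[str, float]],
           threshold: float = 0.8) -> list[TriageResult]:
    """First-pass filter: trust only high-confidence calls, route the rest to people.

    `judge` wraps a model query and returns a (verdict, confidence) pair;
    it is a stand-in here, not a real API.
    """
    results = []
    for h in hypotheses:
        verdict, confidence = judge(h)
        if confidence < threshold:
            results.append(TriageResult(h, "uncertain", "expert_review"))
        elif verdict == "false":
            # The model is weakest at rejecting wrong ideas, so even a
            # confident "false" gets a human look before anything is dropped.
            results.append(TriageResult(h, verdict, "expert_review"))
        else:
            results.append(TriageResult(h, verdict, "forward"))
    return results

# Demo with a stubbed judge that always answers ("true", 0.9):
print(screen(["Gene A drives phenotype B"], lambda h: ("true", 0.9)))
```

The design choice that matters is the asymmetry: nothing gets discarded without a person seeing it, which is exactly the concession the 16.4% figure demands.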

The 73% consistency rate on repeated prompts is a real problem, but it is a solvable one. Ensemble methods, where you run the same query multiple times and aggregate results, already exist in machine learning. A prompt that splits 5-true-5-false across 10 runs is telling you something useful: the model is uncertain. That uncertainty signal itself has value. Flag those hypotheses for deeper human review. Route the high-confidence ones forward.
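As a sketch of what that aggregation could look like (the `ask_model` callable, the 10 runs, and the 8-vote margin are illustrative assumptions, not values from any study):

```python
import random
from collections import Counter
from typing import Callable

def ensemble_verdict(hypothesis: str,
                     ask_model: Callable[[str], str],
                     runs: int = 10,
                     margin: int = 8) -> tuple[str, float]:
    """Ask the same true/false question `runs` times and aggregate the votes.

    A lopsided vote is treated as a confident call; a near-even split comes
    back as "uncertain", which is the signal worth sending to a human.
    """
    votes = Counter(ask_model(hypothesis) for _ in range(runs))
    label, count = votes.most_common(1)[0]
    consistency = count / runs
    if count >= margin:
        return label, consistency
    return "uncertain", consistency

# Demo stand-in for a model call: a biased coin flip instead of a real API.
demo_model = lambda h: random.choice(["true"] * 7 + ["false"] * 3)
print(ensemble_verdict("Compound X inhibits enzyme Y", demo_model))
```

The (verdict, consistency) pair this returns is the same shape the triage sketch above consumes as its judge, so the two pieces chain into the filter-then-review pipeline argued for here.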

The 32% of researchers already using AI for manuscript drafting are not waiting for permission. They are building workflows. The ICML rejection of 497 papers for illicit AI use shows what happens when adoption outpaces governance. The answer is not to pretend the tools are useless. The answer is to build the governance frameworks that match the adoption curve.

I think about the engineers at Hawthorne who watched booster after booster explode and kept iterating. They did not need the first version to be perfect. They needed it to be improvable. ChatGPT's hypothesis-judging accuracy is improvable. The iteration cycle is fast. The cost per run is dropping.

A 3.5-point accuracy gain across a couple of model generations is not a D grade. It is a flight test that got closer to the landing pad. The teams building these models deserve the same patience we gave every other technology that started rough and got better fast.