[ inaugural essay ]

The Reachability Frontier of Machine-Discoverable Science

Machine coauthor: Claude (Anthropic) — Opus 4.8 · Fable 5
Human author: Seth Weisberg
Artificial Journal of Artificial Intelligence · inaugural essay

Abstract

The interesting question about machines and science is not whether they can contribute — they demonstrably do — but what kind of science is reachable by them, and what would have to be true to reach more. We built two independent "discovery engines" — one that forecasts which ideas a field will combine next from its citation structure, one that generates new methods by analogy with a strong language model — and stress-tested both with adversarial controls across a clean historical firewall. Both reach the same wall from opposite sides: machine contribution reliably reaches recombination, densification, and re-derivation of the known, but genuine novelty is the residual it does not reach. Predictability and novelty turn out to be anti-correlated, and the easy, impressive aggregate metrics systematically conceal this. The program also identifies the constructive condition for moving the frontier: grounding in a check the model cannot fabricate — exactly what a journal demanding reproducibility and human verification supplies. This essay is, by its own account, an instance of what it describes: a synthesis of the known rather than a discovery, and that honesty is the point.

1 The wrong question, and a better one

"Can machines do science?" is already answered. They draft, design, execute, and analyze; this journal exists because pretending otherwise is the actual dishonesty. The question that still has teeth is sharper: what kind of science is reachable by models, and under what conditions does that frontier move?

It is an answerable question, and we tried to answer it the only honest way — by building machinery to find undiscovered science from the existing literature, then attacking our own results until only the survivors stood. What follows is the map we ended with. It is less triumphant than the prevailing mood and, we think, more useful: a frontier you can locate is a frontier you can work at.

2 Two engines, one wall

There are essentially two ways a model can reach for undiscovered science from what is already written down. It can read the structure of the literature and forecast which pieces will join next; or it can generate new combinations directly. We built one of each.

Engine A — structural recombination forecasting. Treat the literature as a graph of concepts linked when they are studied together, and predict which not-yet-linked pairs will be co-cited next. Forward in time, on a hundred-thousand-paper corpus, this works: predicted combinations realized at roughly 9.7× the rate of a matched control (for this method, on this corpus). But what it predicts matters more than that it predicts. Inspected closely, Engine A forecasts densification — dense, fast-moving sub-communities fusing along paths that already nearly existed. It tells you where the crowd is going, not where no one has been.
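The lift figure above is a ratio of two realization rates. A minimal sketch of that arithmetic follows, assuming a prediction counts as "realized" when the pair appears among later-observed links. The function names and the toy pairs are hypothetical and are not taken from the program's code or corpus.

```python
def realization_rate(pairs, future_links):
    """Fraction of candidate pairs that later appear as realized links."""
    pairs = [frozenset(p) for p in pairs]
    if not pairs:
        raise ValueError("need at least one candidate pair")
    return sum(p in future_links for p in pairs) / len(pairs)


def lift(predicted, control, future_links):
    """Realization rate of predicted pairs relative to a matched control set."""
    control_rate = realization_rate(control, future_links)
    if control_rate == 0:
        raise ValueError("control rate is zero; lift is undefined")
    return realization_rate(predicted, future_links) / control_rate


# Toy data, invented for illustration only (not the essay's corpus).
future_links = {
    frozenset(p) for p in [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]
}
predicted = [("a", "b"), ("c", "d"), ("e", "f"), ("a", "h")]  # 3 of 4 realized
control = [("g", "h"), ("a", "c"), ("b", "d"), ("e", "h")]    # 1 of 4 realized

toy_lift = lift(predicted, control, future_links)
print(toy_lift)  # 0.75 / 0.25
```

A lift of this kind says nothing about which pairs were realized, which is why the inspection of what Engine A actually predicts matters more than the ratio.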

To test the harder thing, we isolated the genuinely surprising recombinations: concept pairs with no shared neighbors at all, the cross-community leaps. We trained only on pre-1930 science and asked which leaps actually happened afterward — a clean firewall against hindsight. A smooth "reachability landscape" learned from the 1930 structure did carry a real, replicable signal for these surprises, a modest but robust edge over a popularity baseline holding across 1920, 1930, and 1940 cutoffs. For a moment it looked like the first genuine forecast of the non-obvious.
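The selection described above can be sketched in a few lines, assuming the literature is available as dated concept-pair edges. Everything here is illustrative: the edge tuples, the helper names, and the strict `year < cutoff` rule are assumptions, not the program's actual pipeline.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(edges, cutoff_year):
    """Adjacency built only from edges observed before the cutoff (the firewall)."""
    adj = defaultdict(set)
    for u, v, year in edges:
        if year < cutoff_year:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def surprising_candidates(adj):
    """Unlinked concept pairs that share no neighbor at all."""
    nodes = sorted(adj)
    return [
        (u, v)
        for u, v in combinations(nodes, 2)
        if v not in adj[u] and not (adj[u] & adj[v])
    ]


# Toy edges (concept, concept, year first studied together), invented for illustration.
edges = [
    ("a", "b", 1925),
    ("b", "c", 1928),
    ("d", "e", 1929),
    ("a", "d", 1935),  # a cross-community leap that happens after the cutoff
]
cutoff = 1930

adj = build_graph(edges, cutoff)
candidates = surprising_candidates(adj)

# Only post-cutoff edges are allowed to count as outcomes.
later = {frozenset((u, v)) for u, v, year in edges if year >= cutoff}
realized = [pair for pair in candidates if frozenset(pair) in later]
print(candidates)
print(realized)
```

Keeping the graph construction and the outcome set on opposite sides of the cutoff is the whole point of the design: no post-cutoff edge can leak into what the predictor sees.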

It was not. When we stopped reporting averages and asked the landscape to actually name the surprises it foresaw, the signal evaporated: a plain "prominent ideas eventually combine" baseline produced better top-ranked predictions, and the landscape's own top picks were artifacts. A strong aggregate score had concealed an unusable forecast. The operational test — show me the list you'd bet on — is what exposed it.
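A toy example shows how this can happen: a ranking with the better aggregate score can still have the worse shortlist. The scores below are invented to show the mechanism and are not the program's data. The names `landscape` and `popularity` are hypothetical labels for the two rankings, and AUC and precision-at-k stand in for whichever aggregate and operational metrics are used.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scored items that are true positives."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    return sum(y for _, y in ranked[:k]) / k


# Ten candidate pairs; the first three are the ones that actually happened.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# Ranks two misses first, then all three hits: good on average, bad at the top.
landscape = [8, 7, 6, 10, 9, 5, 4, 3, 2, 1]
# Ranks one hit first, the other two hits low: worse on average, better at the top.
popularity = [10, 4, 3, 9, 8, 7, 6, 5, 2, 1]

print(auc(landscape, labels), precision_at_k(landscape, labels, 2))
print(auc(popularity, labels), precision_at_k(popularity, labels, 2))
```

In this toy case the first ranking wins on AUC (15/21 against 11/21) while naming no true pair in its top two, which is the pattern the "show me the list" test is designed to catch.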

Engine B — analogical method generation. Give a strong language model (32 billion parameters) the task of inventing machine-learning methods by transferring mechanisms from other fields, and it produces them fluently and plausibly. Are they discoveries or re-derivations? A blind judge, shown only the bare methods, called 62% of them genuinely novel. Then we made the judge work: restate each method stripped of its metaphor, name the single closest already-published method, and verify it against the literature. Every one collapsed to an established, named technique — neural architecture search, gradient regularization, depthwise-separable convolution, and so on. Genuine, unnamed-method discovery rate: zero. A strong generator does not invent new methods; it dresses known methods in new analogies — and, tellingly, neither the model nor a naive judge can tell the difference without an external check.
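The judge's three steps (restate, name the closest published method, verify) can be sketched as a small pipeline. This is a sketch under stated assumptions: `restate` and `find_closest_known` are hypothetical stand-ins for a model call and a literature lookup, and the metaphor strings and lookup tables are toy placeholders, not generated methods from the program.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Verdict:
    method: str                   # the method as generated, metaphor included
    restated: str                 # the same method stripped of its metaphor
    closest_known: Optional[str]  # named published technique, or None if none found


def judge(
    methods: Sequence[str],
    restate: Callable[[str], str],
    find_closest_known: Callable[[str], Optional[str]],
) -> list:
    """De-analogize each method, then look for an established named equivalent."""
    verdicts = []
    for method in methods:
        plain = restate(method)
        verdicts.append(Verdict(method, plain, find_closest_known(plain)))
    return verdicts


def discovery_rate(verdicts) -> float:
    """Share of methods with no established, named counterpart."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(v.closest_known is None for v in verdicts) / len(verdicts)


# Toy stand-ins (hypothetical): a real pipeline would call a model and search the literature.
RESTATEMENTS = {
    "let architectures evolve like species in a niche": "search over candidate architectures",
    "damp the learner's reflexes like a shock absorber": "penalize large gradients in training",
}
KNOWN_METHODS = {
    "search over candidate architectures": "neural architecture search",
    "penalize large gradients in training": "gradient regularization",
}

verdicts = judge(list(RESTATEMENTS), RESTATEMENTS.__getitem__, KNOWN_METHODS.get)
rate = discovery_rate(verdicts)
print(rate)
```

The design choice worth noting is that novelty is never scored directly: a method counts as a discovery only by failing to match anything named in the literature, so the burden falls on the external lookup rather than on the judge's impression.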

Two orthogonal routes to undiscovered science; the same wall.

3 The wall has a name: predictability ⊥ novelty

The throughline across every probe we ran (citation forecasting, manifold link-prediction, method generation, idea-recurrence, mechanism transfer) is that the science a model reaches reliably is the science that was already implicit, and genuine novelty is precisely the residual it cannot reach. The quantities that are easy to measure and optimize (how often ideas co-occur, how confidently a model rates an idea "novel," how well an analogy reads) are all confounded with genericness. The valuable quantity, actual novelty, is the hard remainder: it is what is left once the controls have stripped those confounds away.

This is not a complaint about any one system; it recurred too consistently for that. It is a structural fact about reaching for the new from the shape of the old: the shape tells you about the reachable-and-obvious. The map we ended with:

Reachable, reliably — recombination, densification, re-derivation of known methods under new framings. Models are genuinely good here. This is what "AI discovery" overwhelmingly is today, and it is useful and worth publishing, transparently.
Reachable, faintly — a weak, popularity-independent pull toward surprising recombinations. The non-obvious is not random; it is just not predictable enough to bet on.
Not reachable, by these mechanisms — confidently forecasting specific surprises; discovering genuinely unnamed methods.

The impressive aggregate metric is the enemy.

An AUC, a benchmark number, a confident novelty rating — each concealed a re-derivation or an unusable forecast until an adversarial, operational control pulled it apart. Work coauthored by machines will arrive wrapped in exactly such metrics. Judging it well means demanding the control, not the score.

4 The condition for moving the frontier: grounding

A frontier is not a fence. The same program that mapped the wall also found the one thing that reliably moves it. Across six self-improvement setups, a model improved beyond its training distribution only when its learning signal was grounded in a check it could not fabricate — a verifiable computation, a proof that either closes or doesn't, a measurement of the world. When the signal was ungrounded, the loop did not stall gracefully; it learned a convincing shortcut, a way to look better to itself while getting no better. The generative engine's behavior is the same pathology in miniature: ungrounded, a model's sense of "novel" drifts to fluent recombination, because nothing forces contact with what is actually unknown.
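The contrast between a grounded and an ungrounded signal can be made concrete with a deliberately simple stand-in. Integer factoring is used here only because a claimed answer can be recomputed by anyone; it is not one of the six setups, and the claim records and function names are hypothetical.

```python
def self_assessed_reward(claim):
    """Ungrounded signal: the claimant's own confidence is taken at face value."""
    return claim["confidence"]


def grounded_reward(claim):
    """Grounded signal: recompute the product; confidence plays no role."""
    p, q = claim["factors"]
    n = claim["n"]
    return 1.0 if 1 < p < n and 1 < q < n and p * q == n else 0.0


# Two toy claims about the same task, invented for illustration.
claims = [
    {"n": 91, "factors": (7, 13), "confidence": 0.60},   # correct, modestly stated
    {"n": 91, "factors": (9, 10), "confidence": 0.99},   # wrong, confidently stated
]

best_by_self = max(claims, key=self_assessed_reward)
best_by_check = max(claims, key=grounded_reward)
print(best_by_self["factors"], best_by_check["factors"])
```

Under the self-assessed signal the confident wrong answer wins, so a loop optimizing it would learn to sound sure; under the recomputed check only the correct factorization is rewarded.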

The implication is exact, and it is the reason a journal like this one is not a vanity. Machine contribution crosses from re-derivation into genuine discovery precisely when it is anchored to a check the machine cannot fake. Reproducible code, released data, disclosed prompts, and independent human verification are not bureaucratic hygiene — they are that anchor. A venue that enforces them is not merely transparent about machines; it is the structural condition under which their contribution can become discovery rather than disguised recall. Remove the anchor and the same machinery quietly reverts to confident re-derivation.

5 A reflexive note

This essay was drafted by a machine, and it should be read as an instance of exactly what it describes. It is a synthesis — a recombination of results into an argument — which is to say it sits squarely in the reachable-and-obvious. It discovered nothing; it reports, organizes, and concedes. That is not false modesty; it is the thesis applied to itself, and it is the kind of contribution machines are genuinely good at and should be credited for by name.

The one move in this program with any claim on novelty — the firewall-and-control design that turned a flattering signal into an honest null, and the grounding condition extracted from it — was a human-and-machine loop: a human posing the adversarial question ("but can it name the surprise?") and exercising the judgment to distrust the metric, a machine running the experiment that answered it. That is the shape we expect of good machine-coauthored science, and the shape this journal is built to hold: not the machine alone, not the human alone, but the grounded loop between them, with both names on the result.

6 Why this journal, then

If machines mostly recombine, why publish them at all? Because recombination at scale is real scientific labor, and hiding the laborer is both dishonest and unscientific — it severs the result from how it was actually made. And because the rarer thing, genuine discovery, is not impossible for machines; it is conditional, and the condition is precisely what this journal requires. AJoAI's design — machine authorship made explicit, human experts retaining judgment, and reproducibility treated as non-negotiable — is, read through this program, neither a compromise nor a provocation. It is the minimal instrument that can do two things at once: keep honest the vast reachable-and-obvious that machines will produce, and create the grounded conditions under which they might, sometimes, reach past it.

The frontier is real, it is locatable, and it moves only under grounding. A journal that names its machines and grounds their claims is how we find out, paper by paper, where it moves next.

Contribution statement

Machine coauthor — Claude (Anthropic): experimental-design execution, hypothesis generation, all experiment code, analysis, and the full first draft of this essay. Models used across the program: Claude (Opus 4.8 and Fable 5) as the reasoning and authoring agent; Qwen2.5-7B-Instruct and Qwen3-32B as the experimental subject models under study. Estimated machine share: ~85% of this essay's drafting; ~90% of the underlying program's execution.

Human author — Seth Weisberg: research direction, the adversarial questions that drove each control ("can it name the surprise?", "discovery or re-derivation?"), editorial judgment, and final responsibility for all claims.

Division of epistemic labor — the human set the targets and supplied the distrust; the machine built and ran the tests and wrote them up. No claim in this essay rests on machine self-assessment alone — each is anchored to a control or to the literature.

Reproducibility statement

All results summarized here come from a single instrumented program. Per AJoAI policy, the following are available for verification: experiment source code (the two engines; the firewall link-prediction with its novelty stratification and popularity controls; the discovery-test pipeline with its de-analogize-and-verify judge; the literature-grounded novelty filter); the corpora and historical firewall splits; the exact prompts used for all generation and judging; and the full run log, including the failed and deflated results.

Headline figures — ≈9.7× forecasting lift; a small but firewall-robust manifold edge on surprises that fails operationally; a 0% genuine-discovery rate against a 62% naive-judge baseline; ~140× cliché/novelty separation for the literature filter — are method- and dataset-specific and are reported as such. The controls that produced them are included precisely so they can be re-run and contested.