Type the question
March 2020. Every public health agency in the world had the same question: what happens if we shut down? Much of the early modeling reached for statistical extrapolation of case counts and deaths. The forecasts came back fast and confident. A lot of them were structurally wrong, not because the statistics were sloppy, but because the question being asked was not the kind a curve-fitting model can answer.
The default working idiom of statistics, the version most people learn in their first course and reach for first under deadline, estimates parameters of a fixed distribution from a sample. The distribution is assumed to sit still. The sample is assumed not to perturb the system. Time, when it appears, is another column to compute averages over. The field has more sophisticated branches (time series, state-space models, hierarchical Bayes, survival analysis, stochastic processes) that handle parts of what I am about to describe, but the default idiom is what gets reached for first, and the default idiom does not match what March 2020 actually was. March 2020 was a feedback loop with delays. Deaths rise, fear rises, behavior changes, transmission drops, deaths fall, fear falls, behavior reverts, transmission rises, deaths rise. There is no fixed distribution to sample from. There is a system that responds to the fact that you are watching it, and the responding is the entire phenomenon. You can fit a regression to the early-March case counts and get a tidy number with a confidence interval. The number is computable. It is also, in this kind of system, structurally beside the point. The model has no way to flag the difference, because the model has no concept of the difference.
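The loop is easy to render concretely. Here is a toy SIR-style simulation, with illustrative parameters and a made-up behavioral term rather than any published model: once transmission responds to recent deaths, daily cases are a function of the system's own history, and there is no fixed distribution for a curve fit to estimate.

```python
# Toy SIR epidemic with behavioral feedback (illustrative parameters only):
# the transmission rate falls as recent deaths rise, which is the
# deaths -> fear -> behavior -> transmission loop described above.

def simulate(days=300, beta0=0.3, gamma=0.1, ifr=0.01, fear=2000.0):
    s, i, r = 0.999, 0.001, 0.0     # susceptible, infected, recovered fractions
    deaths_recent = 0.0             # decaying memory of recent deaths
    cases = []
    for _ in range(days):
        # behavior responds to recent deaths: more deaths -> fewer contacts
        beta = beta0 / (1.0 + fear * deaths_recent)
        new_inf = beta * s * i
        new_rec = gamma * i
        s -= new_inf
        i += new_inf - new_rec
        r += new_rec
        deaths_recent = 0.9 * deaths_recent + ifr * new_rec
        cases.append(new_inf)
    return cases

cases = simulate()
peak = max(range(len(cases)), key=lambda t: cases[t])
# Growth reverses well before the end of the run: an exponential fit to
# the early segment extrapolates a curve the system never follows.
assert 0 < peak < len(cases) - 1
assert cases[peak] > cases[-1]
```

The specific constants (`fear`, `ifr`, the 0.9 decay) are invented for the sketch; the point is the shape of the trajectory, not the numbers.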
What the model was missing was not better data, or more parameters, or a wider prior. It was missing the right data structure for the question. The question was about a dynamical system. Dynamical systems live in phase space, evolve under differential equations, and are reasoned about by sketching the geometry of their trajectories. None of that is in a regression. You cannot fix a regression by being smarter inside it. You have to switch to a different mathematical object.
This, I think, is what the methodological canon has actually been about for the last four hundred years. It is not a sequence of competing theories about how to do science. It is a sequence of data structures, each one capable of holding a different shape of question, each with its own set of operations defined on it. Three of them carry most of the weight in modern empirical work:
- Statistics gives you a bag of samples plus an assumed distribution. Supported operations: estimate parameters, compute intervals, test distributional hypotheses. Tukey is the saint of doing this honestly at the exploratory stage, where his EDA program lives. The broader statistical canon also runs through Fisher, Neyman, and Box, who do more of the inferential machinery. The slogan that escaped Tukey’s work is look at the data.
- Pearl-style causal inference gives you a directed acyclic graph plus observational or interventional data. Supported operations: identify causal effects, derive sufficient adjustment sets, compute counterfactuals when the graph admits them. Pearl's Causality (2000) is the canonical reference. The phrase that actually circulates in the methods literature is Miguel Hernán’s draw your assumptions before your conclusions; my own shorter version of the same imperative is draw the DAG.
- Dynamical systems gives you a state vector evolving under differential equations. Supported operations: find fixed points, classify their stability, sketch phase portraits, predict trajectories under perturbation. Henri Poincaré founded this way of reasoning in the 1890s when he gave up on closed-form solutions to the three-body problem and started reasoning about the geometry of trajectories instead; Strogatz teaches the modern working version of the same move, and Kermack–McKendrick adapted it to epidemic dynamics in the 1920s. The slogan, when it has one, is draw the phase portrait.
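The "data structure plus supported operations" framing can be made literal. A toy rendering of the three structures as Python types, illustrative only and not any real library, makes the point that each exposes different operations, and that asking the wrong structure the wrong question fails loudly here in a way empirical practice does not:

```python
# Three methodological frameworks as toy types (illustrative sketch).
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple

@dataclass
class SampleFromDistribution:            # statistics
    values: List[float]
    def estimate_mean(self) -> float:
        return sum(self.values) / len(self.values)

@dataclass
class CausalGraph:                       # Pearl-style causal inference
    edges: Set[Tuple[str, str]]          # directed edges, assumed acyclic
    def parents(self, node: str) -> Set[str]:
        return {a for a, b in self.edges if b == node}

@dataclass
class DynamicalSystem:                   # Poincare / Strogatz
    flow: Callable[[float], float]       # dx/dt = flow(x)
    def euler_step(self, x: float, dt: float = 0.01) -> float:
        return x + dt * self.flow(x)

sample = SampleFromDistribution([1.0, 2.0, 3.0])
assert sample.estimate_mean() == 2.0

graph = CausalGraph(edges={("Z", "X"), ("Z", "Y"), ("X", "Y")})
assert graph.parents("Y") == {"Z", "X"}

system = DynamicalSystem(flow=lambda x: x * (1.0 - x))
assert system.euler_step(0.5) > 0.5      # flow pushes x toward the fixed point at 1

# A dynamical question asked of a statistical structure is a type error,
# and here the language can say so:
assert not hasattr(sample, "fixed_points")
```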
Each structure can answer questions the others structurally cannot. More importantly, each one will silently fail when you ask it the wrong kind of question. This is the move that took me until this afternoon to see clearly. When a published finding is confidently wrong rather than just noisy, the failure is almost always a type error: someone fed a model a data structure whose shape didn’t match the question they were asking.
A type error in programming is what happens when you try to do an operation that does not make sense for the data structure you have. Adding a string to an integer. Indexing into a number. Calling a method on null. In a typed language the compiler catches it. In a dynamic language the runtime crashes or you get an obviously wrong result with a stack trace. In every case, the system tells you that the question you asked does not fit the shape of the thing you asked it about.
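The programming version is a one-liner. What matters is that the runtime refuses the ill-shaped operation instead of returning a confident number:

```python
# In a dynamically typed language the mismatch surfaces at runtime.
caught = False
try:
    "deaths" + 3            # adding a string to an integer
except TypeError:
    caught = True
assert caught               # the system refused a malformed question
```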
Empirical methodology has no compiler. When you feed a sample-from-distribution data structure to a statistical model and ask it a causal question, the math does not refuse. It returns something. The answer is garbage, but no error is raised. The replication crisis has many causes: publication bias, p-hacking, underpower, measurement problems, researcher degrees of freedom, and outright fraud are all doing real work. Underneath all of those is a simpler one: an entire field’s analyses do not type-check, and there is no one to tell anyone. People who would never write a program that adds a string to an integer happily publish papers that compute Pearson correlations and report them as treatment effects. The only reason they get away with it is that nobody ran the type checker on the question first. There is no type checker.
Pearl, viewed through this CS lens, looks like he invented a type system for causal queries. He doesn’t frame it that way. The methodology literature doesn’t frame it that way. The closer phrase in the literature is identification calculus or language for querying identifiability from observational data, which is a proof system more than a compiler. But the analogy is structurally close enough to be useful, and it clarifies what’s load-bearing about his contribution. The do-calculus plays the role a type checker plays in programming: it tells you, given an assumed causal graph, whether a particular interventional question is even expressible from the data you have, and if so, how to compute it. When the answer is “this question is not identifiable from this structure,” the calculus tells you that too, and in some cases tells you which extra measurement would close the gap. The power of the framework is not the math, which is mostly familiar probability theory in unfamiliar clothing. It is that it forces every causal claim to declare its assumptions before it can be evaluated. For roughly a quarter century, Pearl has argued that this is the load-bearing contribution of his framework, and not the symbol pushing the field tends to focus on.
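A minimal sketch of what the adjustment step buys, on simulated data with hypothetical effect sizes: a confounder Z drives both X and Y, the naive observational contrast overstates X's effect, and the backdoor formula, the simplest case the calculus licenses, recovers it. Real identification problems are harder; this is the toy version.

```python
# Backdoor adjustment on a simulated confounded system (invented numbers).
# Graph: Z -> X, Z -> Y, X -> Y. True effect of X on P(Y) is +0.2.
import random
random.seed(0)

def draw():
    z = random.random() < 0.5
    x = random.random() < (0.8 if z else 0.2)          # Z -> X
    y = random.random() < (0.1 + 0.2 * x + 0.5 * z)    # X -> Y, Z -> Y
    return z, x, y

data = [draw() for _ in range(200_000)]

def p_y(rows):
    rows = list(rows)
    return sum(y for *_, y in rows) / len(rows)

# Naive associational contrast: E[Y | X=1] - E[Y | X=0], confounded by Z.
naive = p_y(r for r in data if r[1]) - p_y(r for r in data if not r[1])

# Backdoor formula: P(y | do(x)) = sum_z P(y | x, z) P(z).
def adjusted(xval):
    total = 0.0
    for zval in (False, True):
        stratum = [r for r in data if r[0] == zval]
        pz = len(stratum) / len(data)
        total += p_y(r for r in stratum if r[1] == xval) * pz
    return total

effect = adjusted(True) - adjusted(False)
assert effect < naive               # confounding inflates the naive contrast
assert abs(effect - 0.2) < 0.03     # adjustment recovers the true +0.2
```

The adjustment only works because the graph was declared first: the same arithmetic run without the graph has no way to know that Z, rather than some other variable, closes the backdoor.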
Strogatz did the same move for dynamics. The phase portrait is the type signature of a dynamical question. It tells you what the system does in the only form that lets a human mind hold all of it at once: where the fixed points are, which way the trajectories flow, where the limit cycles sit, where the basins of attraction live. When you sketch the phase portrait, you are externalizing the type of the question so it can be audited. When you fail to sketch it, you are running an inference inside an unchecked structure and hoping the answer happens to be valid.
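The operations in that signature can be run by hand on the simplest textbook flow, the logistic equation dx/dt = x(1 − x), chosen here for illustration rather than drawn from any model in this essay:

```python
# Reading a one-dimensional phase portrait numerically: find where the
# flow vanishes, classify stability by the sign of the derivative there,
# confirm with trajectories from either side.

def f(x):
    return x * (1.0 - x)            # dx/dt for the logistic flow

def df(x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

fixed_points = [0.0, 1.0]           # solutions of f(x) = 0
stability = {x: ("stable" if df(x) < 0 else "unstable") for x in fixed_points}
assert stability == {0.0: "unstable", 1.0: "stable"}

# Trajectories from both sides of x = 1 flow into it: its basin of
# attraction is the whole positive axis.
for x0 in (0.1, 1.9):
    x = x0
    for _ in range(5000):
        x += 0.01 * f(x)            # Euler step, dt = 0.01
    assert abs(x - 1.0) < 1e-3
```

That is the entire content of the portrait for this system: two fixed points, one stable, and every positive trajectory ending at x = 1. For higher-dimensional systems the same operations produce the cycles and basins that no pointwise statistic can express.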
Tukey did the move at the most elementary level. Before you trust any model, look at the values you are feeding into it. Is the distribution roughly the shape your model assumes? Are there outliers, and if so, are they signal or noise? Does the data even satisfy the preconditions of the operations you are about to run? “Look at the data” is the imperative form of “inspect the input before you trust the type.” It is what every engineer with a debugger knows in their bones and what every analyst with a textbook formula has had to learn the hard way.
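A toy demonstration with invented numbers: two samples whose Pearson correlation is produced entirely by a single outlier. The formula computes either way; only looking reveals which answer to trust.

```python
# Same formula, two verdicts: one outlier manufactures a "strong" correlation.
from statistics import mean, stdev

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / ((len(xs) - 1) * stdev(xs) * stdev(ys))

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
ys = [5, 9, 2, 7, 1, 8, 3, 6, 4, 30]   # no x-y relation except the last point

r_with = pearson(xs, ys)
r_without = pearson(xs[:-1], ys[:-1])

assert r_with > 0.8          # looks like a strong effect on paper
assert abs(r_without) < 0.3  # it vanishes once the single point is inspected
```

A scatterplot makes the same diagnosis in one glance, which is the Tukey point: the precondition check is visual, cheap, and skipped by anyone who goes straight from data frame to formula.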
Three slogans, three frameworks, one underlying move: externalize the data structure so the structure can be checked. They differ in what kind of object they ask you to externalize. The move is the same.
Modern machine learning is the worst possible place for this insight to land, which is why it is the place where the insight is most needed. Deep learning models are extraordinarily expressive function fitters. That makes them extraordinary at producing right answers when the input has the right shape, and extraordinary at producing confident wrong answers when it doesn’t. They will fit a static-distribution structure to a dynamical question and report regularities that vanish the moment the system shifts regime. They will fit an associational structure to a causal question and report effects that disappear under intervention. The model will be confident in both cases, because confidence is a function of the fit and the fit is a function of the structure you supplied.
AI alignment as a field is broader than this. Reward hacking, distribution shift, inner alignment, strategic deception, and governance are all doing their own work and should not be collapsed. But one of the failure modes inside the field, the one that connects most directly to the rest of this essay, is type-mismatch at industrial scale. We are feeding extraordinarily powerful models the wrong shape of input and getting confidently wrong answers, and there is no compiler to flag the mismatch. Scaling laws make the answers more confident. They do not make them right.
The compressed version of all of this is the one I want to keep. Rigor is not about better models. It is about typing the input. Step zero, before any of the things people argue about under the heading of methodology (pre-registration, severe testing, peer review, open data, replication, multiple-comparison corrections, robust estimators, Bayesian priors, the lot), is the question of what shape your question has and which data structure can hold it. Almost nobody is taught to make this choice consciously. The choice gets made anyway, by default, based on which framework the analyst happened to learn first. The default is almost always statistics, because statistics is what the introductory courses teach and what journal reviewers expect. So statistical models get fed causal questions and dynamical questions and feedback questions and questions about individual cases, and the models return numbers, and the numbers go into papers, and the papers go into citations, and nobody flags the type mismatch because nobody is responsible for type-checking and the field has no language for the failure mode.
The clearest case study of all of this, in March 2020, was the contrast between two modeling traditions inside the early outbreak literature. The IHME team’s first model was a curve-fitting exercise: it extrapolated recent case counts and deaths forward through statistical smoothing, and was widely criticized at the time for having weak transmission dynamics and underestimating the role of intervention timing. The Imperial College team’s Report 9, published March 16, 2020, was built on stochastic transmission models with explicit behavioral and policy feedback. It is widely credited with shifting the UK government from mitigation to suppression within days. Two teams, two data structures, two very different policy outputs from roughly the same underlying data.
The lesson is not “dynamic models are always better than statistical ones.” Sweden ran light policy with heavy reliance on voluntary behavior change and reasoned its way through the outbreak in a third register that doesn’t fit either side of the contrast. The lesson is sharper and narrower: the teams who chose to model the dynamics could produce advice that responded to the feedback structure of the system, and the teams who fit curves to recent data could not. That difference was not analytic skill. It was a choice of data structure made before any analysis began, by people who in most cases never noticed they were making it.
Step zero: type the question. The rest of the methodology is downstream of that, and so are you.