by Carl V Phillips
It is a science lesson bonanza this weekend. This morning I did this Twitter thread about publication bias and the recent indictment of the health effects of eating eggs. My first science lesson tutorial is also underway at my Patreon page (open access). The present topic is an outgrowth of an exchange I had with one of my patrons, in which I was helping with the interpretation of a critique of a paper.
Most of science is about trying to estimate a specific measure of a phenomenon like “vaping causes people to quit smoking” (the underlying phenomenon). The measure, however, is something like “how many additional quitters were there in the US over the last five years as a result of vaping?” or “how much more likely is someone to quit if they seriously try to switch to vaping?” There is actually room for debate about whether it is meaningful to talk of phenomena apart from measurement, and other philosophical stuff like that. But though I did a fellowship in philosophy of science, I am basically just a simple scientist who accepts the intuitive notion that there are phenomena in the universe, which exist whether we measure them or not, and we are trying to estimate some measure of them. And, moreover, that the measurements are just Platonic shadows of the phenomenon, not the phenomenon itself. This science lesson looks at some of the implications of different ways of measuring the same phenomena.
That brings us to elephants.
“Elephants are big compared to wolves” is a phenomenon. But there are a lot of different ways to measure it — height, mass, length, volume, etc. Any of these measurements should give us the result “elephants are big compared to wolves.” But the different ways of measuring, as well as the different ways of reporting the results (e.g., the difference (subtraction) versus the ratio (division)) will produce numerically very different results. This complication — that (a) if the phenomenon exists, then any legitimate method of measuring it ought to show it exists and (b) different measurement methods will produce different results — is the source of a great deal of confusion.
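To make point (b) concrete, here is a minimal sketch with hypothetical round-number sizes (the figures are chosen for illustration, not taken from field data). The difference and the ratio are numerically very different for each way of measuring, yet every combination agrees on the qualitative claim:

```python
# Hypothetical round-number sizes, chosen only for illustration.
elephant_kg, wolf_kg = 4000.0, 40.0   # mass
elephant_m, wolf_m = 3.2, 0.8         # shoulder height

# Same phenomenon ("elephants are big compared to wolves"),
# but very different numbers depending on measure and reporting.
mass_diff = elephant_kg - wolf_kg     # 3960 kg
mass_ratio = elephant_kg / wolf_kg    # 100x
height_diff = elephant_m - wolf_m     # about 2.4 m
height_ratio = elephant_m / wolf_m    # about 4x

# Every measure agrees qualitatively: the elephant is bigger.
assert mass_diff > 0 and mass_ratio > 1
assert height_diff > 0 and height_ratio > 1
```

Four defensible numbers (3960, 100, 2.4, 4) for the same underlying phenomenon, and none of them is "the" right answer.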
One aspect of the confusion is the one I advised about recently: Study 1 measures a phenomenon and shows it exists to a substantial degree, while Study 2 measures the phenomenon using another method that will inevitably produce a smaller measurement, but instead completely fails to detect the phenomenon. What should we conclude? The answer is not “Study 1 should produce a larger measure, so believe it” — not if Study 2 ought to have detected the phenomenon if it existed.
If Study 1 is “capture the animal and put it on a scale” while Study 2 is “yeeeah, we’re not going to do that; instead we will estimate the height of the animal by watching it walk past a tree and then measure the point its head came up to after it is safely gone”, then they ought to both show “elephants are bigger than wolves.” If elephants really are substantially bigger, then both studies should get the same qualitative result. If they get different qualitative results, then a thoughtful effort should be made to figure out the source of error, rather than treating it as a vote in which Study 1 gets to break the tie because it was more expensive.
If vaping causes people to quit smoking, then we ought to detect that in population prevalence statistics, in surveys of people who quit smoking, in surveys of vapers, in sales data, and in experiments where a smoker is given a vape. And we do, of course. But notice I stressed qualitative. These different measures are going to produce different numerical outcomes. Some of these are similar but predictably different, such as vaping being much more likely to become a substitute for smoking if someone actively seeks to switch, as compared to being assigned to try it. Or someone is much more likely to have switched completely after six months of vaping, as compared to the first few weeks after trying it. Other measures are hard to compare, like counts of ex-smoker vapers and estimates of the improved probability of quitting.
But here’s the thing, the reason I was going through all this: On the one hand, if the phenomenon is real and fits the hypothesis, then different measures all ought to see it. On the other hand, they ought to produce different measures. Assume Study 1 and Study 2 of animal size were both done flawlessly. Study 1 (based on weight/mass) is going to show the same qualitative difference, but a much higher ratio than Study 2 (based on height). But on the third hand, the different results should still have a certain consistency that can be examined. In the science related to THR, almost no one ever thinks this last bit through. They just churn out results and treat them as if they were votes.
Recall the study that claimed that only 16,000 English smokers quit because of vaping in 2014, which I dismantled here. Though it was a different measure of the phenomenon, that result was contradicted by such measures as the number of ex-smoker vapers in the population and estimates of the increased probability of quitting thanks to vaping. The authors were actually trying to spin this low result as yet another qualitative result that vaping causes smoking cessation. But it actually was contradicted by the other evidence rather than agreeing with it. It was just wrong.
Differing measures of the same phenomenon need not mean one of them is wrong. But they ought to be reconciled. In the case of animal size, we know the mass should increase proportional to the cube of the height (more or less; that assumes the two animals have the same shape and density, but it is close enough to check for consistency). So we can assess whether Study 1 and Study 2 actually (roughly) agree.
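That consistency check can itself be sketched numerically. Using the same kind of hypothetical round numbers as before (a roughly 0.8 m, 40 kg wolf versus a roughly 3.2 m, 4000 kg elephant; illustrative values, not measurements), we ask whether the mass ratio is in the neighborhood of the cube of the height ratio:

```python
# Hypothetical illustrative values, not real measurements.
wolf = {"height_m": 0.8, "mass_kg": 40.0}
elephant = {"height_m": 3.2, "mass_kg": 4000.0}

height_ratio = elephant["height_m"] / wolf["height_m"]  # about 4x
mass_ratio = elephant["mass_kg"] / wolf["mass_kg"]      # 100x

# Under the same-shape, same-density assumption, mass scales with
# the cube of linear size, so we expect mass_ratio near height_ratio**3.
predicted_mass_ratio = height_ratio ** 3                # about 64

# A loose tolerance: the shapes are not really identical, so we only
# ask that the two studies agree within, say, a factor of two.
consistent = 0.5 < mass_ratio / predicted_mass_ratio < 2.0
print(consistent)  # prints True
```

Here 100 versus a predicted 64 is well within the loose tolerance, so the two (hypothetical) studies pass the reconciliation check; a mass ratio of, say, 1000 would not, and would flag that one of the studies was wrong.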
Epidemiologists and their ilk are far too lazy, simply ignoring the quantitative differences among results. They take cheap refuge in the observation that every study is different, and so they treat each result as if it were merely qualitative (“our study, like the previous study, shows X increases Y”). This matters because qualitative comparisons can quantitatively contradict one another. The myth that American-style smokeless tobacco causes nontrivial health risk (or that it is measurably more harmful than Swedish-style products) traces almost entirely to a single study from the 1970s. But that study is a huge outlier, with results that are inconsistent even with the few other studies that show a slightly elevated risk. Those other studies (let alone the ones that show no risk or a protective effect) do not agree with the outlier study just because they suggest an increase in risk; they provide additional evidence that the outlier was wrong.
Returning to our safari, we now want to conduct a study of whether elephants are bigger than giraffes. And, hmm, the methodologies from Study 1 and Study 2 give us flatly contradictory results. It turns out that “bigness” is not a single phenomenon after all. For some comparisons that vague notion works, but for others, mass and height give us different comparisons. They turn out to be different phenomena that are usually closely correlated, but not always.
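The giraffe case can be sketched with the same kind of hypothetical numbers (illustrative values only: giraffes really are taller than elephants but much lighter, though these particular figures are made up). The two measurement methods come down on opposite sides:

```python
# Hypothetical illustrative values, not real measurements.
elephant = {"height_m": 3.2, "mass_kg": 4000.0}
giraffe = {"height_m": 5.0, "mass_kg": 1200.0}

bigger_by_mass = elephant["mass_kg"] > giraffe["mass_kg"]      # True
bigger_by_height = elephant["height_m"] > giraffe["height_m"]  # False

# "Bigness" is not one phenomenon: the two measures disagree here,
# unlike in the elephant-versus-wolf comparison.
print(bigger_by_mass, bigger_by_height)  # prints: True False
```

For elephants versus wolves every measure points the same way, so the ambiguity in "big" was harmless; for elephants versus giraffes it is not, and the question has to be restated in terms of the specific measure.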
The outlier smokeless tobacco study was of an archaic product that few still used in the 1970s and basically no one uses now. So it is not quite right to say this study is contradicted by the modern knowledge about the subject. Rather, it was assessing the effects of a different exposure (a different phenomenon). Even worse are suggestions that studies of South Asian dip products (which are not actually smokeless tobacco at all, though the tobacco controllers call them that) offer any information about the phenomenon of American- or Swedish-style products causing health risks.
The various junk studies (mostly out of UCSF) that tobacco controllers claim show vapers are less likely to quit smoking are wrong because they are looking at the wrong phenomenon. We are interested in whether picking up a vape for the purpose of quitting, or the mere presence of vaping in a society, increases the chance of a smoker quitting. Those junk studies, however, all looked at whether people who still smoked after vaping for a while, or after having tried vaping and chosen not to stick with it, were more likely to quit smoking than their peers who never tried a vape. That is simply not the same thing.
I wish I could sum this all up in a single thesis statement, but I cannot. It is complicated and there are bits of it that go in different directions. I hope I offered some traction for understanding the complication. Different measures of the same phenomenon will produce different numerical results, but should produce similar qualitative results. But qualitative results in the same direction are not always similar, and sometimes contradict one another, so the typical practice of ignoring the quantities is a fatal error in epidemiology and related sciences. (Though as fatal errors go, it is much smaller than the practice of ignoring all the studies that should have detected the phenomenon but failed to do so.) Moreover, what might sound like the same phenomenon when described casually (“vaping”, “big”, “smokeless tobacco use”) can easily be as fundamentally different as height and mass.