STATISTICAL MODELING, CAUSAL INFERENCE, AND SOCIAL SCIENCE

PNAS GIGO QRP WTF: THIS META-ANALYSIS OF NUDGE EXPERIMENTS IS APPROACHING THE PLATONIC IDEAL OF JUNK SCIENCE

Posted on January 7, 2022 9:18 AM by Andrew

Nick Brown writes:

> You might enjoy this… Some researchers managed to include 11 articles by Wansink, including the “bottomless soup” study, in a meta-analysis in PPNAS.

Nick links to this post from Aaron Charlton which provides further details. The article in question is called “The effectiveness of nudging: A meta-analysis of choice architecture interventions across behavioral domains,” so, yeah, this pushes several of my buttons.

But Nick is wrong about one thing. I don’t enjoy this at all. It makes me very sad.

An implausibly large estimate of average effect size

Let’s take a look. From the abstract of the paper:

> Our results show that choice architecture interventions [“nudging”] overall promote behavior change with a small to medium effect size of Cohen’s d = 0.45 . . .

Wha . . .? An effect size of 0.45 is not “small to medium”; it’s huge. Huge as in implausible that these little interventions would shift people, on average, by half a standard deviation. I mean, sure, if the data really show this, then it would be notable—it would be big news—because it’s a huge effect.

Why does this matter? Who cares if they label massive effects as “small to medium”? It’s important because it’s related to expectations and, from there, to the design and analysis of experiments. If you think that a half-standard-deviation effect size is “small to medium,” i.e. reasonable, then you might well design studies to detect effects of that size. Such studies will be so noisy that they can pretty much only detect effects of that size or larger; then at the analysis stage researchers are expecting to find large effects, so through forking paths they find them, and through selection that’s what gets published, leading to a belief among those who read the literature that this is how large the effects really are . . . it’s an invidious feedback loop.

There are various ways to break the feedback loop of noisy designs, selection, and huge reported effect sizes. One way to cut the link is preregistration and publishing everything; another is less noisy studies (I’ll typically recommend better measurements and within-person designs); another is to critically examine the published literature in aggregate (as in the work of Gregory Francis, Uri Simonsohn, Ulrich Schimmack, and others); another is to look at what went wrong in particular studies (as in the work of Nick Brown, Carol Nickerson, and others); one can study selection bias (as in the work of John Ioannidis and others); and yet another step is to think more carefully about effect sizes and recognize the absurdity of estimates of large and persistent average effects (recall the piranha problem).

The claim of an average effect of 0.45 standard deviations does not, by itself, make the article’s conclusions wrong—it’s always possible that such large effects exist—but it’s a bad sign, and labeling it as “small to medium” points to a misconception that reminds us of the process whereby these upwardly biased estimates get published.
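To make the “noisy design plus selection” feedback loop concrete, here is a small simulation sketch in R. It is only an illustration, not anything from the paper: assume a true standardized effect of d = 0.1, run many underpowered two-group studies, and look at the estimates among the studies that come out statistically significant in the expected direction. All the numbers are made up.

# Sketch: selection on statistical significance inflates effect-size estimates.
# Assumed for illustration only: true effect d = 0.1, n = 25 per group.
set.seed(123)

true_d <- 0.1    # a small true effect
n      <- 25     # per-group sample size (an underpowered design)
n_sims <- 10000

one_study <- function() {
  treat   <- rnorm(n, mean = true_d, sd = 1)
  control <- rnorm(n, mean = 0, sd = 1)
  test    <- t.test(treat, control)
  est_d   <- (mean(treat) - mean(control)) /
    sqrt((var(treat) + var(control)) / 2)   # Cohen's d estimate
  c(d = est_d, p = test$p.value)
}

sims    <- t(replicate(n_sims, one_study()))
sig_pos <- sims[, "p"] < 0.05 & sims[, "d"] > 0   # the "publishable" studies

mean(sims[, "d"])          # all studies: close to the true 0.1
mean(sims[sig_pos, "d"])   # significant-and-positive studies: around 0.6 to 0.7
mean(sig_pos)              # only about 5% of studies clear the filter

With these made-up numbers, the studies that survive the significance filter overstate the true effect several times over. The filter, not the phenomenon, produces the big published numbers.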
What goes into the sausage? Pizzagate . . . and more

What about the Wansink articles? Did this meta-analysis, published in the year 2022, really make use of 11 articles authored or coauthored by that notorious faker? Ummm, yes, it appears the answer to that question is Yes: the supplementary reference list includes 11 articles coauthored by Wansink.

I can see how the authors could’ve missed this. The meta-analysis makes use of 219 articles (citations 16 through 234 in the supplementary material). It’s a lot of work to read through 219 articles. The paper has only 4 authors, and according to the Author contributions section, only two of them performed research. If each of these authors went through 109 or 110 papers . . . that’s a lot! It was enough effort just to read the study descriptions and make sure they fit with the nudging theme, and to pull out the relevant effect sizes. I can see how they might never have noticed the authors of the articles, or spent time to do Google or Pubpeer searches to find out if any problems had been flagged.

Similarly, I can see how the PNAS reviewers could’ve missed the 11 Wansink references, as they were listed deep in the Supplementary Information appendix to the paper. Who ever reads the supplementary information, right?

The trouble is, once this sort of thing is published and publicized, who goes back and checks anything? Aaron Charlton and Nick Brown did us a favor with their eagle eyes, reading the paper with more care than its reviewers or even its authors. Post-publication peer review ftw once more!

Also check out this from the article’s abstract:

> Food choices are particularly responsive to choice architecture interventions, with effect sizes up to 2.5 times larger than those in other behavioral domains.

They didn’t seem to get the point that, with noisy studies, huge effect size estimates are not an indicator of huge effects; they’re an indication that the studies are too noisy to be useful. And that doesn’t even get into the possibility that the original studies are fraudulent.

But then I got curious. If this new paper cites the work of Wansink, would it cite any other embarrassments from the world of social psychology? The answer is a resounding Yes: the meta-analysis also draws on the Shu et al. (2012) paper coauthored by Dan Ariely, an article that was retracted in a major scientific scandal that still isn’t going away.

The problems go deeper than any one (or 12) individual studies

Just to be clear: I would not believe the results of this meta-analysis even if it did not include any of the above 12 papers, as I don’t see any good reason to trust the individual studies that went into the meta-analysis. It’s a whole literature of noisy data, small sample sizes, and selection on statistical significance, hence massive overestimates of effect sizes. This is not a secret: look at the papers in question and you will see, over and over again, that they’re selecting what to report based on whether the p-value is less than 0.05. The problem here is not the p-value—I’d have a similar issue if they were to select on whether the Bayes factor is greater than 3, for example—rather, the problem is the selection, which induces noise (through the reduction of continuous data to a binary summary) and bias (by not allowing small effects to be reported at all).

Another persistent source of noise and bias is forking paths: the selection of which analyses to perform. Even if researchers were performing a fixed analysis and reporting results through a statistical significance filter, that alone would’ve been enough to induce huge biases here.
But, knowing that only the significant results will count, researchers are also free to choose the details of their data coding and analysis to get these low p-values (see general discussions of researcher degrees of freedom, forking paths, and the multiverse), leading to even more bias.

In short, the research method used in this subfield of science is tuned to yield overconfident overestimates. And when you put a bunch of overconfident overestimates into a meta-analysis . . . you end up with an overconfident overestimate.
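To see the last point in a toy example (again a sketch, not the paper’s data or model): simulate a pile of studies that all measure the same small effect, pass them through the publication filter from the simulation above, and pool the survivors with a standard inverse-variance (fixed-effect) meta-analysis.

# Sketch: a meta-analysis of selectively reported estimates is precise but biased.
# Toy numbers only; this is not the paper's data or method.
set.seed(456)

true_d <- 0.1    # the same small true effect in every study
n      <- 25     # per-group sample size
n_pubs <- 200    # number of "published" studies we collect

published_d  <- numeric(0)
published_se <- numeric(0)
while (length(published_d) < n_pubs) {
  treat   <- rnorm(n, true_d, 1)
  control <- rnorm(n, 0, 1)
  d_hat   <- (mean(treat) - mean(control)) / sqrt((var(treat) + var(control)) / 2)
  se_hat  <- sqrt(2 / n)   # rough standard error of d
  if (d_hat > 1.96 * se_hat) {   # publication filter: positive and "significant"
    published_d  <- c(published_d, d_hat)
    published_se <- c(published_se, se_hat)
  }
}

# Inverse-variance (fixed-effect) pooled estimate from the published studies
w         <- 1 / published_se^2
pooled    <- sum(w * published_d) / sum(w)
pooled_se <- sqrt(1 / sum(w))
round(c(pooled = pooled, se = pooled_se), 3)
# The pooled estimate lands far above the true 0.1, with a deceptively tiny
# standard error: averaging 200 biased estimates just gives a very confident
# biased estimate.

In this toy setup, dropping a dozen of the worst-looking studies barely moves the pooled number, which is the point: the problem is the selection process feeding the meta-analysis, not any individual bad apple.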
In that case, why mention the Wansink and Ariely papers at all? Because this indicates the lack of quality control of this whole project—it just reminds us of the attitude, unfortunately prevalent in so much of academia, that once something is published in a peer-reviewed journal, it’s considered to be a brick of truth in the edifice of science. That’s a wrong attitude! If an estimate produced in the lab or in the field is noisy and biased, then it’s still noisy and biased after being published. Indeed, publication can exacerbate the bias. The decision in this article to include multiple publications by the entirely untrustworthy Wansink and an actually retracted paper by the notorious Ariely is just an example of this more general problem of taking published estimates at face value.

P.S. It doesn’t make me happy to criticize this paper, written by four young researchers who I’m sure are trying their best to do good science and to help the world. So, you might ask, why do we have to be so negative? Why can’t we live and let live, why not celebrate the brilliant careers that can be advanced by publications in top journals, why not just be happy for these people?

The answer is, as always, that I care. As I wrote a few years ago, psychology is important and I have a huge respect for many psychology researchers. Indeed I have a huge respect for much of the research within statistics that has been conducted by psychologists. And I say, with deep respect for the field, that it’s bad news that its leaders publicize work that has fatal flaws. It does not make me happy to point this out, but I’d be even unhappier to not point it out.

It’s not just about Ted talks and NPR appearances. Real money—real resources—get spent on “nudging.” Brian Wansink alone got millions of dollars of corporate and government research funds and was appointed to a government position. The U.K. government has an official “Nudge Unit.” So, yeah, going around saying these things have huge effect sizes is kind of an invitation to waste money and to do these nudges instead of other policies. That concerns me. It’s important!

And the fact that this is all happening in large part because of statistical errors, that really bothers me. As a statistician, I feel bad about it. And I want to convey this to the authors and audience of the sort of article discussed above, not to slam them but to encourage them to move on. There’s so much interesting stuff to discover about the world. There’s so much real science to do. Don’t waste your time on the fake stuff! You can do better, and the world can use your talents.

P.P.S. I’d say it’s kind of amazing that the National Academy of Sciences published a paper that was so flawed, both in its methods and its conclusions—but, then again, they also published the papers on himmicanes, air rage, ages ending in 9, etc. etc. They have their standards. It’s a sad reflection on the state of the American science establishment.

P.P.P.S. Usually I schedule these with a 6-month lag, but this time I’m posting right away (bumping our scheduled post for today, “At last! Incontrovertible evidence (p=0.0001) that people over 40 are older, on average, than people under 40.”), in the desperate hope that if we can broadcast the problems with this article right away, we can reduce its influence. A little nudge on our part, one might say. Two hours of my life wasted. But in a good cause.

Let me put it another way. I indeed think that “nudging” has been oversold, but the underlying idea—“choice architecture” or whatever you want to call it—is important. Defaults can make a big difference sometimes. It’s because I think the topic is important that I’m especially disappointed when it gets the garbage-in, garbage-out junk science treatment. The field can do better, and an important step in this process of doing better is to learn from its mistakes.

P.P.P.P.S. More here.

This entry was posted in Decision Analysis, Economics, Political Science, Zombies by Andrew. Bookmark the permalink.

47 THOUGHTS ON “PNAS GIGO QRP WTF: THIS META-ANALYSIS OF NUDGE EXPERIMENTS IS APPROACHING THE PLATONIC IDEAL OF JUNK SCIENCE”

1. Iain on January 7, 2022 at 9:39 am said:

Just curious: if a researcher/team was aware of the Wansink/Ariely et al. issues and wanted to exclude their papers, how would they word the pre-reg/protocol for the systematic review? Some, but not all, of the 12 papers were retracted, so excluding retractions wouldn’t work. Is there a rigorous way to pre-specify how bad papers could be excluded?

* Andrew on January 7, 2022 at 9:48 am said:

Iain:

I think it would make sense to start by removing any articles where Wansink was involved, given the serious problems that have been found in so many of the papers. And, of course, yeah, remove any articles that have actually been retracted. But even after that, as noted above, I would not trust the published estimates that went into the meta-analysis. I think you’d really want to go back to the raw data, or to restrict the analysis to preregistered work.

This is a frustrating message, because the implication is that there are these 219 published papers out there that we just can’t directly use. If you really want to use them, I think you’d have to go through them carefully, one at a time, and figure out what each one is saying. I’m guessing that most of the studies, beyond any other flaws, are just too noisy to learn anything useful about realistic effect sizes. But maybe there are a few papers there with some valuable data. And maybe the qualitative ideas in those papers could be helpful in designing future studies. The idea that you can combine a couple hundred studies of unknown quality and hope to learn something real, though: Nah, I don’t think so. Removing the worst papers is a start, but it wouldn’t solve the problem; it would just push it back a step.

* Iain on January 7, 2022 at 10:09 am said:

Thanks for responding, Andrew. I agree with your points, and my conclusions about the effect sizes in the review are the same. Notwithstanding, I guess I just have sympathy for any researcher who knows about problems with another researcher’s papers and wants to state their selection criteria up front and publicly. There doesn’t seem to be a risk-free and systematic way to do this.
* Andrew on January 7, 2022 at 10:15 am said:

Iain:

I guess that, in this case, restricting to papers where the raw data were available would’ve eliminated all the Wansink and Ariely publications.

* Keith O’Rourke on January 7, 2022 at 10:30 am said:

I believe this very concern led Doug Altman and Cochrane’s Statistical Methods Group to “ban” the term study quality as being offensive and to strongly discourage (disallow?) its use. Instead they promoted the term risk of bias, which overlooks avoidable excess variation (noise) in a study. I think study quality is returning to more common use.

In my publications on study quality we tried to define what we meant by quality in a way that did not necessarily indict the authors of being poor researchers. For instance: “‘quality’ (whatever leads to more valid results) is of fairly high dimension and possibly non-additive and nonlinear, and that quality dimensions are highly application-specific and hard to measure from published information” https://pubmed.ncbi.nlm.nih.gov/12933636/

Unfortunately the academic cost of any suggestion that you are not so smart, you always sneeze marble leads to a lot of sensitivity on the part of many authors.

* clint on January 7, 2022 at 3:31 pm said:

Not at all important, nor really relevant, but… is “you are not smart, you always sneeze marble” an expression? Or a helpfully auto-corrected phrase? Either way, it’s marvelous.

* Steve on January 7, 2022 at 5:21 pm said:

Let’s agree that “you are not smart, you always sneeze marble” should be an expression and work to make it so.

* Andrew on January 8, 2022 at 12:06 am said:

Ho ho, that’s rich.

* John N-G on January 8, 2022 at 5:06 pm said:

I tried googling it. After excluding entries that failed to include the word “marble”, this blog was the top hit. Number two was https://philamuseum.org/collection/object/51585 which as far as I am concerned ought to be adopted as the (position currently vacant) visual icon for this blog.

* Keith O’Rourke on January 10, 2022 at 10:28 am said:

It was from the Amadeus movie Wolfgang … Or Horatius, or Orpheus… people so lofty they sound as if they sh*t marble! ;-)

* jim on January 7, 2022 at 10:06 am said:

Why do you need a “rigorous” method to exclude the work of a convicted fraudster? Science is about making sense of things. You can’t make sense of things with bogus research.

But then again, in my mind a “meta-analysis” of “nudge” interventions is garbage from the get-go. If I wanted to bother going through every study in the meta-analysis, I’m confident I could find a lethal problem in every paper that showed a significant effect, and probably every paper.

* Andrew on January 7, 2022 at 10:14 am said:

If it helps, here’s an example of an informal literature review that I did on the topic of ballot effects. The review could’ve been done better, but the point here is that the review was integrated with the theory, in the sense that we had a sense of where we would expect to see larger and smaller effects. Throwing 219 studies (had they existed) into the blender wouldn’t have done that.
* C on January 7, 2022 at 1:37 pm said:

I’m not sure if this really addresses your specific concern, but I think in this particular manuscript one could write, “We didn’t anticipate having to decide whether to include researchers with a documented history of fraud/mismanagement of data, but given concerns about Wansink’s work in the last few years, we’ve opted to not include it,” or something. Preregistrations don’t mean you can’t veer from them, but that if you do veer from them, you transparently report it and your reasoning for doing so.

For future preregistrations, I think there could be room for some signaling that the team would check for fraud against, say, Retraction Watch’s database and flag articles where there are concerns. I don’t know specific wording off the top of my head, but that could be a starting place that people could improve on.

2. Wonks Anonymous on January 7, 2022 at 10:01 am said:

> published in the year 2002

I think you meant to write 2022, the actual year listed in the link. One could be forgiven for not knowing better about Wansink in 2002, but one could not be forgiven for citing papers he wouldn’t even write until years later!

* Andrew on January 7, 2022 at 10:08 am said:

Typo fixed; thanks.

3. Dale Lehman on January 7, 2022 at 10:21 am said:

It makes me sad too. Hypothetical:

Researcher A: publishes papers citing numerous questionable and, in some cases, retracted publications. Researcher A gets several such publications while getting their PhD. Perhaps they even get a TED talk, if they pick the right topic, such as nudging.

Researcher B: careful to select only publications that have stood up to post-publication review, written by authors whose reputations have remained intact, and who publicly release their data upon publication (if not before). B’s PhD thesis results in publication in a minor journal and no TED talks.

Whose career is likely to be more successful, where success is measured by their ability to land a tenured job at a reasonable university? I think we know the answer. It reflects many things wrong: graduate training, peer review, accreditation and assessment, tenure and promotion policies, professional self-regulation, etc. So many things need to change—and even if we know what direction that change should be, it isn’t clear how to move that way. Removing all those faulty measures and policies leaves us with a vacuum about how to evaluate the quality of research and researchers. And a byproduct (perhaps the most important ramification) is that the public rightfully distrusts anything “experts” or “analysis” reveals, unless of course it reinforces their prior beliefs. As I said, it makes me sad.

4. jd on January 7, 2022 at 11:33 am said:

How many nudges make a shove? Is it multiplicative or additive or what? Whenever I see the nudge thing, it strikes me as an odd idea, because wouldn’t I be surrounded by so many potential small nudges in every direction that, if nudging were a thing, my decision making would be sort of like a pinball being nudged around to and fro? It seems an odd idea from the start. (But I know practically nil about psychology.)

* Andrew on January 7, 2022 at 11:38 am said:

Jd: Yes, that’s the piranha problem.

5. Nick on January 7, 2022 at 11:50 am said:

To be fair, the article makes clear that they only used studies 1 and 2 from the Shu et al.
paper, and the one where everyone agrees that the data are fake was study 3.

* Aaron Charlton on January 7, 2022 at 12:03 pm said:

Those are the two studies that the original authors claimed to be nonreplicable in their follow-up PNAS.

* Nick on January 7, 2022 at 3:34 pm said:

Doh!

* Dale Lehman on January 7, 2022 at 12:28 pm said:

Good observation – it made me look more closely, and they have the following footnote: “Please note that our results are robust to the exclusion of nonretracted studies by the Cornell Food and Brand Laboratory which has been criticized for repeated scientific misconduct; retracted studies by this research group were excluded from the meta-analysis.” I am encouraged by this, as it is a step in the right direction. It doesn’t impact most of Andrew’s objections above, but it does alleviate some of my concerns.

* Andrew on January 7, 2022 at 12:53 pm said:

Dale:

Yeah, but . . . one of those nonretracted studies is the one with the bottomless soup bowl. Given that there is no good evidence that the soup bowl experiment ever happened, I’d say it has the same ontological status as Dan Ariely’s paper shredder and Mary Rosh’s surveys. More generally, trusting 11 of Wansink’s papers because they haven’t yet been retracted is like, oh, I dunno, pick your analogy here.

Here’s a funny thing. Suppose the authors had excluded the suspicious studies from their meta-analysis and had then done a robustness study etc. Then they could’ve said, “Here are our results, and guess what? When we throw in 11 more studies that are, at worst, fabricated and, at best, massively p-hacked, our conclusions don’t change!” Doesn’t sound so good when you put it that way!

What bothers me is a kind of split-the-difference attitude that is, ultimately, anti-scientific. The reasoning goes something like this: Some people think Wansink’s work is crap; on the other hand, it was published in real journals; so let’s try it both ways and see what we get. But that’s not right at all! If you take a massively-biased estimate and include it halfway, you’ll still have a half-massively-biased estimate. This kind of open-mindedness-to-crap doesn’t average out; it just gives you crap. Rotten apples in the barrel and all that.

But, yeah, the big problem is not those 12 papers that should be thrown out without a blink; it’s the results from the other 200 or so papers, most of which I expect are subject to huge biases for reasons discussed in the above post.

* gec on January 7, 2022 at 12:53 pm said:

“This does not change the conclusion of the paper.”

Okay, okay, more seriously: It doesn’t alleviate my concerns, because the comment suggests their primary consideration regarding inclusion/exclusion is the reputation of the researchers involved in the work, rather than the content of the work itself. Rather than engage with the study design and data to figure out what we might be able to add to our pool of knowledge, they just accept a published article as “given” if it is attached to the right type of people.

Of course, Andrew’s post mentions a reason why they didn’t do a more in-depth kind of evaluation: it is hard and takes time. If only doing quality work were easy and quick!
6. Tobias Brosch on January 7, 2022 at 12:34 pm said:

We would like to comment on the above blog post and clarify some points that may not have been sufficiently clearly pointed out in our article, but that are important for the correct interpretation of the findings.

The author perceives a “lack of quality control” because of the inclusion of several nonretracted papers/studies by authors who have been criticized for scientific misconduct in other, now retracted papers. These papers were identified based on our predefined selection criteria which are transparently reported in the paper. As these papers are not retracted, we considered them a part of the published scientific literature. We did thus have no justification to exclude them a priori from a meta-analysis that aimed to be a comprehensive representation of the published literature. We would rather find it problematic to introduce a “subjective” a priori selection based on our appraisal of the work of individual researchers.

All retracted papers from the Wansink group were excluded from our analyses. As we were of course aware of the problems with Brian Wansink’s research, we moreover ran additional analyses which excluded all papers (co-)authored by Wansink, including the nonretracted ones. This did not change the pattern of results (the data and script for this robustness analysis are available on the OSF). We had pointed this out in a footnote in the paper, but recognize that it may have warranted a more prominent mention in the main text.

As for the paper by Shu et al. (2012), we would like to point out that we did in fact exclude the highly criticized field experiment conducted by Dan Ariely from our analyses (this exclusion occurred during the revision stage of our paper, when the paper by Shu et al. was being critically discussed but had not yet been officially retracted). In the absence of any evidence pointing to similar scientific misconduct in the implementation and analysis of the lab experiments reported in the paper, we decided to include these lab experiments in our analyses, as we considered them a valid part of the scientific literature.

* Andrew on January 7, 2022 at 1:03 pm said:

Tobias:

Thanks for the note. See my comment above and also this comment for followup on these concerns. My quick summary is that, even if there were no dispute about several of these papers (and perhaps other papers in this literature), I still think this meta-analysis is essentially useless if the goal is to estimate the effect of nudges, because the individual estimates are just too biased. To put it another way: it’s not that I think you did the meta-analysis wrong; it’s just that I don’t think this literature is good enough for it to support the kind of meta-analysis you want to do.

And, as a statistician and textbook writer, I feel bad about this, because I think that statistics textbooks (including my own) focus too much on methods and not enough on data quality. There’s something horrible about statisticians such as myself writing textbooks that talk about meta-analysis and other complicated methods while barely mentioning data quality and selection bias—and then turning around and criticizing a published paper for these same flaws. But we have to move forward somehow—and I really don’t want researchers and policymakers thinking that nudges have average effects of 0.5 standard deviations, given that there’s no real evidence for such a claim.

* Andrew on January 7, 2022 at 1:08 pm said:
P.S. And thanks for responding to the post. I know it can be hard to have your research get lots of publicity and attention and then for it to be criticized—it’s happened to me! In that case, I was able to make use of the criticism to improve what I’d done. Similarly, I hope the comments here will be helpful to you and your colleagues going forward.

* Keith O’Rourke on January 7, 2022 at 1:26 pm said:

> while barely mentioning data quality and selection bias

And when you raise that issue with statistical authors they feel attacked and maligned – https://statmodeling.stat.columbia.edu/2017/11/01/missed-fixed-effects-plural/

The relevant issue here being “fixed effects estimate is an estimate of some populations average only if the between study variation is not importantly driven by design (AKA study quality or methodological) variation. This kind of variation is usually/mostly the result of haphazard biases and has different implications for what is to be made of the variation and expectation.”

* Shravan Vasishth on January 8, 2022 at 4:39 am said:

Andrew, you wrote: “I don’t think this literature is good enough for it to support the kind of meta-analysis you want to do”

That statement is probably true for most published studies on other topics and in other areas of science. I think that meta-analyses should be seen as evidence synthesis, given the data, such as it is. When can one ever take a meta-analysis seriously, as in telling us something important about the phenomenon? Maybe in medicine (Cochrane reviews)? This is a genuine question. What’s a good example of high-quality studies in any field, with the result that the meta-analysis truly advanced our understanding?

* Jacob Manaker on January 8, 2022 at 12:26 pm said:

“As these papers are not retracted, we considered them a part of the published scientific literature. We did thus have no justification to exclude them a priori from a meta-analysis that aimed to be a comprehensive representation of the published literature.”

Did you include articles from the International Journal of Psychology and Behavioral Research? …International Journal of Indian Psychology? …Journal of Psychology and Theology? (In case it wasn’t clear by now, these are all from Beall’s list.) Did you even bother to check whether they included (purportedly) relevant articles or discussion?

You’re already selecting and excluding some articles based on whether you trust the journal. What makes the journal special, so that you shouldn’t apply the same sort of critical eye to authors?

To put it another way: there is no coherent, a priori definition of scientific vs. non-scientific. Instead, each researcher decides for themselves which past work is science and which is just folklore and mythology. It would have been perfectly acceptable for you to define “the scientific literature” as “papers published in reputable, peer-reviewed journals by authors without a history of major, repeated frauds” and anything else as “folklore”. But you chose not to define it this way. You should reconsider.

(This is not your only problem, as Andrew points out in his comment. But it leapt out at me from your comment as something easy to refute.)

7. Raghu Parthasarathy on January 7, 2022 at 1:06 pm said:

I spent a few minutes — probably more than this entire field deserves — looking at this paper. Figure 3 is amazing: the effect sizes of all the examined papers.
It’s clearly highly skewed by a small fraction of very large reported effect sizes. The authors even note, for the next figure, that “Visual inspection … revealed an asymmetric distribution that suggested a one-tailed overrepresentation of positive effect sizes… these results point to a publication bias in the literature.” But then, rather than sensibly concluding that one can’t sensibly conclude anything from these studies, they imagine that they can model “a moderate one-tailed publication bias” and conclude that the actual overall effect size is d = 0.31 rather than 0.42. They then note that “a severe one-tailed publication bias attenuated the overall effect size even further to d = 0.03” — i.e. nothing — but they’re dismissive of this conclusion. I initially typed out the reason they’re dismissive of it, but decided that would be too mean. d = 0.03 doesn’t make it into the abstract, I note.

Thanks, Andrew, for posting this.

* Lukas Lohse on January 7, 2022 at 4:39 pm said:

Figure 3 is amazing! The Twitter thread noted how it seemed like it moved towards 0 as the SE got smaller, so I had to take a look myself. Now credit where credit is due: they did share their data: https://osf.io/78a5n/

Check out how closely the loess fit follows the critical value for 95% significance: https://pasteboard.co/yi2zZOZ2HG9C.png

R code:

library(ggplot2)

# Data from https://osf.io/78a5n/
# setwd(…)
dat <- read.csv("mhhb_nma_data.csv", as.is = TRUE)

# Flag Wansink papers (fixed-string match on the reference field)
dat$wansink <- grepl("wansink", tolower(dat$reference), fixed = TRUE)

ggplot(data = dat, aes(y = cohens_d, x = sqrt(variance_d))) +
  geom_point(aes(color = wansink, alpha = wansink), size = 3) +
  scale_color_manual(values = c("TRUE" = "red", "FALSE" = "black")) +
  scale_alpha_manual(values = c("TRUE" = 1, "FALSE" = 0.5)) +
  # dashed line: estimates sitting exactly at the two-sided p = 0.05 threshold
  geom_abline(intercept = 0, slope = qnorm(0.975), lty = 2, color = "blue", size = 2) +
  geom_label(x = 0.55, y = qnorm(0.975) * 0.55 - 0.3, label = "95% significance") +
  geom_smooth() +
  coord_cartesian(xlim = c(-0.005, 0.65), ylim = c(-1, 5), expand = FALSE)
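Lukas’s snippet can be pushed a little further to ask how much the flagged papers matter to a pooled number. The sketch below is a rough illustration only: it reuses the same shared data file and column names as the code above, and a plain inverse-variance (fixed-effect) average, which is not necessarily the model fit in the paper.

# Rough robustness check using the same shared data file as above.
# Illustration only: a simple inverse-variance (fixed-effect) average,
# not necessarily the model the authors actually fit.
dat <- read.csv("mhhb_nma_data.csv", as.is = TRUE)
dat$wansink <- grepl("wansink", tolower(dat$reference), fixed = TRUE)
dat <- dat[complete.cases(dat[, c("cohens_d", "variance_d")]), ]

pool <- function(d, v) {
  w <- 1 / v
  c(estimate = sum(w * d) / sum(w), se = sqrt(1 / sum(w)))
}

pool(dat$cohens_d, dat$variance_d)                        # all studies
with(subset(dat, !wansink), pool(cohens_d, variance_d))   # Wansink papers dropped
# If the two pooled numbers are close, that says the flagged papers carry little
# weight -- not that the remaining estimates are unbiased, which is the point
# made repeatedly in the thread above.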
8. JFA on January 7, 2022 at 1:11 pm said:

Here’s a different take on defaults and organ donation: https://www.jasoncollins.blog/does-presuming-you-can-take-a-persons-organs-save-lives/

* Dale Lehman on January 7, 2022 at 3:18 pm said:

I think there is an endogeneity problem with the organ donation example. Countries choose whether to use an opt-in or opt-out approach, but not in a vacuum. It isn’t the DMV in isolation making that decision: they are influenced by public sentiment and political influences. So I’m not surprised to see 99% of Austrians going with the default opt-in option while much lower percentages of Germans actively opt in. Surely the German authorities had some idea of how the German public perceives organ donation when they decided to use the default opt-out.

As the Collins blog post shows, there is the further issue of whether opting in really means you opted in (so there are more effective designs, as suggested by Thaler). But this is also part of the endogeneity issue: there are public expectations regarding how the opt-in or opt-out choice is put into practice. When we observe the % adopting the default, that % reflects the design of the form, the expectation of how it will be implemented, and the cultural feelings about organ donation. To some extent, this makes the dramatic effect of the nudge appear larger than it really is (I’m not denying that there is an effect, just that the dramatic effect may not be due to the design of the form).

9. MH on January 7, 2022 at 1:35 pm said:

Whenever I see one of these articles I always look at who served as the editor. In this case it’s Susan Fiske, again, as it often seems to be. Sigh…

* Mike on January 7, 2022 at 8:43 pm said:

First thing I looked for as well. I guess for some people the winds haven’t really changed much.

10. C on January 7, 2022 at 7:33 pm said:

This sort of approach to science is pretty standard in psychology. Statistics is mostly treated as a sort of ritual. That’s because mathematical training is (mostly) lacking in psychologists, and thus stats is treated as some strange dogma handed down from on high, rather than something that needs careful thought and design. It’s why I left the field.

11. Shravan Vasishth on January 8, 2022 at 4:21 am said:

Andrew, one thing worth pointing out is that even if they included some nonsensical/retracted studies in their meta-analysis, those studies will probably not have a major impact on the posterior of the overall effect. You could drop all those studies and still get similar estimates.

The second thing to notice is that the confidence interval on the estimate of the effect size is **huge**. It ranges from -.48 to 1.39. The conclusion should have been that overall the estimate is not consistent with the effect being present. Am I missing something here?

Also, they say “Extracted Cohen’s d values ranged from –0.69 to 4.69.” That’s the characteristic oscillation one sees in low-powered studies (Gelman and Carlin 2014, and many people before them, like Button et al., etc.). The funnel plot is also showing what should have been obvious—heavy suppression of unpleasant findings.

The Data supplement section is crying out for some data and code… This is 2022; are we still not releasing data and code with papers? Even psycholinguistics now mandates data+code release (Open Mind, JML, Glossa Psycholinguistics, at least).

I didn’t read the paper, but the fact that this work was done by young researchers (something Andrew mentioned) makes me think that the real culprits here are their advisor(s). They should have done a better job in educating their students.

Another point in this post struck me: “Because this indicates the lack of quality control of this whole project—it just reminds us of the attitude, unfortunately prevalent in so much of academia, that once something is published in a peer-reviewed journal, it’s considered to be a brick of truth in the edifice of science. That’s a wrong attitude!”

It’s not just lack of quality control in the whole project; it also indicates lack of quality control at the peer review stage, and damages the reputation of the editors and reviewers. I have seen this kind of thing happening even in for-profit journals like Frontiers in Psychology. There, someone published a paper on Chinese relative clauses (the history of this topic is absolutely hilarious, and I will write about it some day; it shows that even a one-sample t-test is far, far beyond the reach of psycholinguists with 30+ years of research behind them). The paper was edited and reviewed by pretty famous psycholinguists. I reanalyzed the data as soon as I saw the paper, because the results didn’t make any sense to me given what I know about this topic. Even the basic analyses were wrong. I contacted the editor and one of the reviewers and told them about the errors; they contacted the authors.
I suggested that they should retract that paper because the conclusions were all incorrect even given their own data and analyses. The response from the authors was: we prefer not to retract the paper. Punkt. Nothing happened! Luckily nobody reads these papers; it’s not like Frontiers in Psychology is PNAS. And anyway, as a famous psycholinguist once told me, what does it matter if an analysis is wrong? This isn’t medicine, nobody gets hurt.

The usual response I get from people when I complain about the poor quality of peer review is that people are busy and just take the results on trust. Also, someone once told me that one should trust the scientist, and assume that they did everything right. But I know from my own experience that I don’t do things right despite paying careful attention—careful reviewers have caught basic mistakes in our code and corrected them before the paper was published. People need to get off twitter and instead do a proper peer review, including carefully looking at the data and code themselves.

* Shravan Vasishth on January 8, 2022 at 4:25 am said:

Oh my bad, they did share the data: https://osf.io/78a5n/ I guess I should have read all the comments before posting my comment. Sorry about that.

* Andrew on January 8, 2022 at 9:31 am said:

Shravan:

You write, “even if they included some nonsensical/retracted studies in their meta-analysis, those studies will probably not have a major impact on the posterior of the overall effect. You could drop all those studies and still get similar estimates.”

Yes, that’s why I’d say the big problem with this meta-analysis is not the 12 studies that are highly suspect; it’s the other 200 or so, which also are produced by a process that leads to highly noisy and biased estimates. It really bothers me when people think that you can combine 200 crappy data points and get something useful out of it. It’s contrary to the principles of statistics—but, then again, if you look at statistics textbooks, including my own, you’ll see lots and lots about data analysis and very little about data quality. So it’s hard for me to want to “blame” the authors for this paper: it’s bad stuff, but they’re following the path that has been taught to them.

* Shravan Vasishth on January 8, 2022 at 11:16 am said:

Agree with you; my own meta-analyses are based on what you call crappy data points. We evaluated a computational model’s predictions against these data points, but the editor desk-rejected our paper not because there was a problem with our modeling, but because the data were obviously so crappy (flapping around 0, with wide CIs). In this case, we got rejected because of what other people did :).

* Shravan Vasishth on January 9, 2022 at 6:04 am said:

Andrew, you wrote: “It really bothers me when people think that you can combine 200 crappy data points and get something useful out of it.”

One way for the authors to prove you wrong would be to run a new non-crappy study and show that their meta-analysis estimate is consistent with the new (presumably not so noisy) estimate. One problem with this is of course the wide uncertainty interval of the meta-analysis estimate; pretty much any estimate would overlap with that meta-analysis range.

In recent replication attempts, I tried to get precise estimates of effects that I had meta-analysis estimates of (based on crappy data, as you mention above).
To my surprise, the posteriors from my replication attempts are quite startlingly close to my meta-analysis estimates. This either means that the original studies were noisy but not quite as crappy when considered all together (which would contradict your point about doing meta-analyses with crappy data), or that I just got lucky in my replication attempts. In any case, there is, in principle, a way to find out whether your statement is correct that doing meta-analyses with crappy data is a useful thing to do or not.

* psyoskeptic on January 10, 2022 at 5:21 pm said:

I went and looked… yeah, that funnel plot… that’s one heck of a lot of bias. They really needed the contour plot with the funnel centred on 0 to make it clear how much of this is lining up on the .05 line.

12. Keith O’Rourke on January 9, 2022 at 8:38 am said:

Unfortunately, only if you are lucky rather than unlucky. You are adopting a data-based selection rule, which can make things worse. I dealt with that here: https://www.researchgate.net/publication/241055684_Meta-Analysis_Conceptual_Issues_of_Addressing_Apparent_Failure_of_Individual_Study_Replication_or_Inexplicable_Heterogeneity (think I scanned the paper in once).

But a simple simulation: generate a biased study with small SE and an unbiased study with large SE, combine only when they seem consistent, and evaluate the confidence coverage for just the combined studies (here is the selection problem; see the sketch after the comments below). As Airy put it in the 1800s – systematic error is really evil.

* Shravan Vasishth on January 9, 2022 at 9:21 am said:

Hi Keith, thanks for sending the paper (by email). I just read it. If I understand it correctly, you suggest limiting meta-analyses only to unflawed (unconfounded) studies. Fair enough; but this is hard or impossible to do in practice. Some systematic bias modeling is called for, but it’s very time consuming, hence impractical:

Turner, R. M., Spiegelhalter, D. J., Smith, G. C., & Thompson, S. G. (2009). Bias modelling in evidence synthesis. Journal of the Royal Statistical Society: Series A (Statistics in Society), 172(1), 21-47.

I have a feeling that we have discussed this once before. I tried the Turner et al. approach in my MSc dissertation from Sheffield; it was a nightmare trying to figure out all the biases and trying to quantify their impact. It’s a pity that getting things right takes so much effort.

* Keith O’Rourke on January 9, 2022 at 10:36 am said:

Reality doesn’t care ;-)

In 2006 I tried out Sander Greenland’s approach to multiple bias – nice overview here – Good practices for quantitative bias analysis https://academic.oup.com/ije/article/43/6/1969/705764

But I backed out of the paper as the clinician was unwilling to seriously try to figure out all the biases and to quantify their impact. Getting things wrong is so much easier…

13. Adede on August 2, 2022 at 3:49 pm said:

FYI, PNAS published a comment on this paper: https://www.pnas.org/doi/10.1073/pnas.2200732119
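Keith O’Rourke’s suggested simulation in comment 12 can be sketched roughly like this. It is only an illustration of the idea: the size of the bias, the standard errors, and the “seem consistent” rule (taken here to be a two-sided z-test on the difference at the 0.05 level) are arbitrary choices, not anything from his paper.

# Sketch of the simulation described in comment 12: a biased study with a small
# SE and an unbiased study with a large SE are pooled only when they "seem
# consistent," and we check the coverage of the pooled 95% interval in just
# those cases. All numbers and the consistency rule are made up for illustration.
set.seed(789)

true_effect <- 0
bias        <- 0.3    # systematic error in the precise study
se_biased   <- 0.1    # small SE, biased study
se_unbiased <- 0.4    # large SE, unbiased study
n_sims      <- 100000

est_biased   <- rnorm(n_sims, true_effect + bias, se_biased)
est_unbiased <- rnorm(n_sims, true_effect, se_unbiased)

# "Seem consistent": the two estimates are not significantly different
z_diff     <- (est_biased - est_unbiased) / sqrt(se_biased^2 + se_unbiased^2)
consistent <- abs(z_diff) < 1.96

# Inverse-variance pooled estimate and its nominal 95% interval
w1 <- 1 / se_biased^2
w2 <- 1 / se_unbiased^2
pooled    <- (w1 * est_biased + w2 * est_unbiased) / (w1 + w2)
pooled_se <- sqrt(1 / (w1 + w2))
covered   <- abs(pooled - true_effect) < 1.96 * pooled_se

mean(consistent)            # how often the two studies get combined
mean(covered[consistent])   # coverage of the nominal 95% interval when combined
# With these numbers the studies look consistent most of the time, yet the
# conditional coverage falls far below 95%: checking for apparent consistency
# does not protect the pooled estimate from the systematic error.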