For my Advanced Statistics class this week, I had to choose a peer-reviewed journal article using regression methods and "evaluate the appropriateness of their statistical testing and of their reporting on the same criteria as used by McCloskey (1985), Gelman and Stern (2006), and Fidler and colleagues (2004)."
I chose the already-critiqued article by Hawkins and colleagues (2013) on the effects of government-funded marriage promotion initiatives. Other scholars have previously raised concerns about this article, such as its arbitrary division of the data into two time periods for comparison and its strategic handling of an outlier (the District of Columbia) to produce statistically significant results.
Here, I restrict my critique to inaccuracies as described in my class readings and share my response to the class assignment.
Substantive vs Statistical Significance
First, the authors used the words “significant” and “non-significant” throughout the results and discussion sections (a total of 26 times). In some of these instances they clarified that they meant statistical significance, as in this statement: “…all of the regression coefficients for the years 2006 – 2010 were statistically significant” (p. 508). However, the differentiation between statistical significance and substantive significance that McCloskey describes was missing; most of the time the authors were referencing statistical significance, not substantive significance, in their analysis. In at least one instance, it was entirely unclear whether the authors meant statistical or substantive significance:
“Our analyses found that cumulative per capita funding was associated with a small but significant decrease in the percentage of nonmarital births and children living with single parents, an increase in the percentage of children living with two parents, a decrease in the percentage of children who are poor or near poor, and an increase in the percentage of married adults in the population (but only for 2005 – 2010).”
Also of concern, the authors set the significance level at .10 rather than the conventional .05. The authors argue a .10 level is appropriate “because the risk of a type II statistical error (a false negative) is relatively high with a sample of 51 cases, we adopted a .10 alpha for significance testing.” While Fisher’s p < .05 is itself arbitrary, they go on to conflate the likelihood of making a type II error with meaningful significance. Raising alpha to .10 means accepting a 10 percent chance of declaring an effect when in fact no real effect exists (a type I error); it reduces type II risk only by admitting more false positives, and it says nothing about whether an effect is substantively important. Given that their statistically significant findings depend on one outlier case (the District of Columbia) and on a laxer threshold than conventional practice, I am incredibly skeptical of their conclusion of meaningful difference.
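The trade-off the authors invoke can be made concrete with a quick simulation (a sketch of my own, using made-up data, not anything from the article): under a true null with 51 cases, a two-sided test at the .10 threshold falsely rejects roughly twice as often as one at .05.

```python
import math
import random

random.seed(42)

def false_positive_rate(crit, trials=20000, n=51):
    """Simulate a two-sided z-test under a true null (no real effect)
    and return the fraction of trials that falsely reject it."""
    rejections = 0
    for _ in range(trials):
        xs = [random.gauss(0.0, 1.0) for _ in range(n)]
        z = (sum(xs) / n) * math.sqrt(n)  # sigma is known to be 1 here
        if abs(z) > crit:
            rejections += 1
    return rejections / trials

rate_05 = false_positive_rate(1.96)   # critical value for alpha = .05
rate_10 = false_positive_rate(1.645)  # critical value for alpha = .10
print(rate_05, rate_10)  # rate_10 comes out roughly double rate_05
```

The extra rejections bought by alpha = .10 are exactly the false positives the authors do not discuss.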
At one point, they attempt to address the gap between statistical significance and substantive significance by stating:
“Nevertheless, our study found statistically significant associations between per capita funding and several other important population-level outcomes. Still, one can ask whether these associations are large enough to be substantively important. We address this question with reference to a particular high activity state: Oklahoma” (p. 510).
I find it odd and misleading to cherry-pick one unit of observation (Oklahoma) to make the connection between statistical and meaningful results. Their research question was not about changes in Oklahoma versus other states (or states with more marriage promotion funding versus states without it) but rather a statistical analysis of a national program. Furthermore, the authors reviewed the changes in Oklahoma in terms of point increases and decreases in percentages on a number of indicators but never provided the underlying percentages for reference, nor did they compare changes among states without marriage promotion spending. Thus, the reader is left to his or her own interpretation of whether a “3-point increase in the percentage of children living with two parents” in Oklahoma is substantively meaningful.
Comparing Coefficients vs Conducting Tests of Difference
More problematic was the comparison of significant and non-significant coefficients in place of an actual test of the difference between coefficients, the error described by Gelman and Stern. Although the two time periods are subjectively constructed comparisons to begin with, the authors compounded the problem by contrasting the periods on their degree of statistical significance, as in this statement regarding Table 2:
“Table 2 shows the results of regression analyses conducted separately for two time periods: 2000 – 2005 and 2006 – 2010. No regression coefficients for the years 2000 – 2005 were statistically significant. In contrast, with the exception of percentage divorced, all of the regression coefficients for the years 2006 – 2010 were statistically significant” (p. 508).
As we know from Gelman and Stern, comparing significance levels in this way is inappropriate, and stating that 2000 – 2005 is not statistically significant while 2006 – 2010 is significant is misleading. The authors never tested the difference between the coefficients for the two time periods across each of their variables of interest.
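The test the authors skipped is straightforward. The sketch below uses hypothetical coefficients and standard errors (illustrative values only, not numbers from Hawkins et al.’s tables) to show how one “significant” and one “non-significant” coefficient can differ by an amount that is itself nowhere near significant:

```python
import math

def z_diff(b1, se1, b2, se2):
    """z-statistic for the difference between two independent regression
    coefficients: (b1 - b2) / sqrt(se1^2 + se2^2)."""
    return (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)

# Hypothetical values for illustration only (NOT from the article):
b_early, se_early = 0.10, 0.09   # "2000-2005": z = 1.11, non-significant
b_late, se_late = 0.20, 0.10     # "2006-2010": z = 2.00, significant at .05

z = z_diff(b_late, se_late, b_early, se_early)
print(round(z, 2))  # ~0.74: the difference between the two coefficients
                    # is far from statistically significant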
Confidence Intervals, Quantitative Precision, and Effect Size
Finally, the authors failed to convey the quantitative information needed to make meaning of the results, as Fidler and colleagues recommend. Confidence intervals were missing entirely from the results, although the authors did report standard errors in the tables and mentioned them occasionally: “The table also reveals that the standard errors were considerably larger in the earlier period” (p. 509). However, these mentions were limited to obvious statements about the analysis rather than adding quantitative precision. Reporting of effect sizes was likewise largely absent, mentioned only in passing, as in this statement: “Although not significant, the coefficients for the remaining variables were in the same direction and comparable in magnitude to their counterparts in Table 2. The lack of significance can be explained by the larger standard errors.” The exact magnitude of the outcomes remained unstated.
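The confidence intervals the authors omitted are trivial to construct from the coefficients and standard errors they did report. A sketch with a hypothetical coefficient and standard error (again, not values from the article) shows what a reader gains from the interval that “p < .05” alone conceals:

```python
def confidence_interval(b, se, z_crit=1.96):
    """95% confidence interval for a coefficient given its standard error."""
    return (b - z_crit * se, b + z_crit * se)

# Hypothetical coefficient and standard error (not from the article):
lo, hi = confidence_interval(0.20, 0.10)
print(f"b = 0.20, 95% CI [{lo:.2f}, {hi:.2f}]")
# The interval runs from 0.00 to 0.40: the lower bound barely clears
# zero, and the estimate spans a twofold range of plausible effects.
```

An interval like this communicates both precision and plausible effect size at a glance, which is precisely the reform Fidler and colleagues argue editors should demand.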
Fidler, Fiona, Neil Thomason, Geoff Cumming, Sue Finch, and Joanna Leeman. 2004. “Editors Can Lead Researchers to Confidence Intervals, but Can’t Make Them Think: Statistical Reform Lessons from Medicine.” Psychological Science 15(2):119–26.
Gelman, Andrew, and Hal Stern. 2006. “The Difference Between ‘Significant’ and ‘Not Significant’ Is Not Itself Statistically Significant.” The American Statistician 60(4):328–31.
Hawkins, Alan J., Paul R. Amato, and Andrea Kinghorn. 2013. “Are Government-Supported Healthy Marriage Initiatives Affecting Family Demographics? A State-Level Analysis.” Family Relations 62(3):501–13.
McCloskey, Deirdre N. 1985. “The Loss Function Has Been Mislaid: The Rhetoric of Significance Tests.” American Economic Review 75(2):201–5.