Levels of Scientific Rigor
As a 20-year-old hunting for certainty, I was surprised to learn that uncertainty is unavoidable. In a course on Mathematical Logic, we learned that even in the symbolic logic that underlies mathematical proofs - where we are free from both the ambiguity of natural language and our fuzzy understanding of the physical world - uncertainty is inevitable. I think of scientific rigor as an attempt to increase the ratio of certainty to uncertainty.
The physicist Richard Feynman, in his Six Easy Pieces, reassures us that even in the more rigorous physical sciences, "everything we know is only some kind of approximation". Well, it's a pretty precise approximation if you can use it to thread a spaceship to Mars and lower a rover from a hovering descent stage. So I loosely use the science behind the recent Mars landing as my top grade of certainty. For my next lower level, I use an FDA approval of a pharmaceutical. Very high levels of purity must be proved not just for the studies themselves, but for the consumer manufacturing process. Safety and efficacy must be demonstrated by randomized controlled studies, preferably from two different labs. Ideally, but in practice not always, the exact biological mechanism by which the pharmaceutical agent benefits the patient is well understood. Statistical analyses must be done by an independent group and will be closely scrutinized by FDA statisticians. The data underlying the analyses must be provided for scrutiny.
Shooting for FDA Approval
Research on the brain is far from "Mars landing" grade. FDA-grade research is being done primarily on pharmaceutical interventions - for example, over 2,000 clinical trials have studied interventions to treat schizophrenia. However, the exact biological mechanisms of many current psychiatric pharmaceuticals are not well understood. The new Brain Initiative and the NIH's RDoC framework seek to narrow these gaps.
So, while it would be quixotic to aspire to Mars-grade science for FaceSay research, I am shooting for FDA-grade research. There is plenty of work to do, with the help of improved assessment tools for autism research.
We need more rigor in the social sciences
In the social sciences, we are not famous for our scientific rigor. Two economics professors at Harvard provide a cautionary tale. Their paper was reviewed, widely cited, and supposedly influential on government policy makers, but apparently their analyses and supporting data were not closely scrutinized. When a bright grad student asked to see their supporting spreadsheet, he found several errors and published a paper with corrections showing a markedly different outcome than the original:
- "We replicate Reinhart and Rogo (2010a and 2010b) and find that coding errors, selective exclusion of available data, and unconventional weighting of summary statistics lead to serious errors that inaccurately represent the relationship between public debt and GDP growth among 20 advanced economies in the post-war period. Our finding is that when properly calculated, the average real GDP growth rate for countries carrying a public-debt-to-GDP ratio of over 90 percent is actually 2.2 percent, not 0.1 percent as published in Reinhart and Rogoff"
There are several laudable attempts at increasing the level of rigor in autism research. Reichow and Volkmar's evaluative method for assessing the quality of the research is a good start. I stumbled upon their method while reading a recent review of interventions that address social impairment, by Connie Kasari, whose work I admire. The review laudably applies the evaluative method to social skills papers to rank the quality of their science. Unfortunately, the review itself seems to depart from many of the principles it implicitly advocates.
The review excludes papers that are not peer reviewed, yet it itself was not peer reviewed. In the hope of eliminating natural human biases, Reichow and Volkmar's method requires blinded raters for all measures, yet the authors include two of their own papers in their review. And while it grades reviewed papers as unacceptable if they do not document their methods, their own evaluation lists only the final scores - not the intermediate ratings, let alone the specific evidence from each paper from which those ratings were derived.
Our Avatar Assistant paper received what I believe is an erroneous "weak" rating in this review. I have not seen the full data on their ratings or how they arrived at them; I have only exchanged emails to see if I could clarify what I think are two errors. No luck so far :-).
One of my objections is that the review authors seem to ignore the definition of an "independent variable". To my mathematician's mind, something that does not vary is a constant, not a variable at all. Thus, a concomitant medication that is shared by both the control and treatment conditions is by definition not part of the independent variable. The Hopkins paper clearly states that both the control and treatment groups received the same computer training and the same reinforcers to stay on task. That was the purpose of including an active control using a non-social computer game (Tux Paint), rather than the usual wait-list control. In spite of this, the raters appear to treat the pre-intervention keyboard and mouse training, as well as the edible reinforcers (Cheerios :-) ) for staying on task, as part of the independent variable, even though the paper states these were provided to both groups. Their claim that it is part of the independent variable is simply a fantasy, based on not an iota of empirical data, and in direct contradiction to what is reported in the peer-reviewed paper.
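To make the "constant, not a variable" point concrete, here is a minimal sketch with invented numbers (not the Hopkins data): anything delivered identically to both conditions contributes equally to both group means, so it drops out of the between-group comparison and cannot be part of the independent variable.

```python
# Minimal sketch, with invented numbers, of why support that is identical across
# conditions (keyboard/mouse training, on-task reinforcers) acts as a constant:
# it cancels out of the between-group comparison.

def mean(xs):
    return sum(xs) / len(xs)

baseline = 50.0          # hypothetical pre-existing skill level
shared_support = 1.5     # given to BOTH groups, so it is held constant
treatment_effect = 2.0   # hypothetical benefit of the intervention itself

control_scores = [baseline + shared_support] * 10
treatment_scores = [baseline + shared_support + treatment_effect] * 10

# The shared support raises both means equally, so the observed difference
# reflects only the manipulated (independent) variable.
difference = mean(treatment_scores) - mean(control_scores)
print(f"observed between-group difference: {difference:.1f}")  # 2.0
```

The same logic applies whether the shared element is a concomitant medication in a drug trial or Cheerios for staying on task.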
The evaluative method states:
- "An H rating is awarded to a study that defines independent variables with replicable precision (i.e., one could reproduce the intervention given the description pro-vided)."
My second puzzle is that the review concludes that FaceSay has lower fidelity than most of the manual interventions reviewed. Comparing each to a Shakespeare play makes it clear that FaceSay has far greater treatment fidelity than any manual intervention in the paper.
Let's use the "fidelity" in a Shakespeare play as our point of reference. 100% of the dialog is scripted for all of the players, with perhaps some room for improvisation for jesters :-). A performance will surely vary from one night to the next, but the words they speak will have very little (perhaps a few mistakes) and only the tone, emphasis and timing will vary. The relative position, entrances and exits, fight scenes, etc are also scripted, though perhaps these enjoy some more latitude in variations between performances. The actors' facial expressions, posture, etc may also vary. So that's our Shakespeare baseline.
How about FaceSay? There is zero improvisation in what text is spoken - other than the name of the child - and zero change in tone or timing. There is also zero variation from one student's experience of FaceSay to the next in the facial expressions or posture of the "actors". So it's safe to conclude that FaceSay has greater treatment fidelity than a Shakespeare play.
How about manual interventions? In manual interventions, including those reported in Kasari's paper, how much of what is spoken is scripted word for word? Are any of the words scripted? Are there any stage directions on the precise positions of the interventionists, the precise poses they strike? I don't think so. There are only general directions on how the interventionist should interact, and that prescription covers only the critical elements of the intervention. So manual interventions have lower treatment fidelity than a Shakespeare play.
If A > B, and B > C, then A > C, would you agree? So how did this review paper from such an esteemed research team come to such an inverted conclusion about fidelity? And how did they end up asserting that something delivered to both groups was part of the independent variable? I'll follow up, but I believe the former is in part explained by using a tool for a purpose for which it was not intended or validated: Reichow developed the evaluative method before he was aware of computer-based interventions. Also, it is hard not to wonder if this is a side effect of the authors reviewing their own paper. If judges must recuse themselves from cases in which they have an interest, and if accountants are not allowed to invest in companies they are auditing, why should even the most esteemed social scientists be exempt from this practice of avoiding bias?