On Friday 26th of January at the Royal Society there was a series of talks on how psychology could progress as a science, with an emphasis on replication and reproducibility. I’m going to summarise the key points from the individual talks and make a few comments. A collection of all the individual videos of the talks can be found here.

Professor Daryl O’Connor – “Psychological Science as a Trailblazer for Science?”

The Reproducibility Project (2015) started this movement off.

The number of positive developments that arose from the findings of the Reproducibility Project (2015), e.g. the Centre for Open Science, the Open Science Framework, and Registered Reports (with their associated format, the Registered Replication Report), among other things, is very encouraging.

The argument can be made that the field as a whole is improving.

The discussion on social media over the Boston Globe article, e.g. between Amy Cuddy and Brian Nosek, is an example of constructive disagreement.

But is there some element of researchers forming echo chambers among like-minded peers over the perennial tone debate?

Science as a behaviour model. Behaviour occurs as an interaction between three necessary conditions: capability (an individual’s psychological and physical capacity to engage in the activity concerned), motivation (reflective and automatic processes that increase or decrease your desire to engage in the behaviour), and opportunity (all the factors outside the individual which make the behaviour possible or prompt it). These affect and are affected by behaviour.

Other fields have taken note of what psychology has done and are learning from us.

The “revolutionaries” have improved scientific practice and triggered new ways of working.

All levels of science need to be targeted, including methodologies and incentive structures.

It is a very exciting time to be a scientist, especially an Early Career Researcher (E.C.R.), because of all the changes.

Comments:

Some people have argued they first started taking notice of the problems in psychology in 2011, with the publication of Simmons, Nelson, & Simonsohn (2011) and Bem (2011). Regardless of an individual researcher’s starting point, the field has made great strides in a short space of time. Of course the calls to action have been ringing out for years, but actual change seems to be occurring, which is highly encouraging. And I agree with O’Connor’s point that it has mainly come about because of the actions of those branded as “revolutionaries”. This isn’t to dismiss a genuine discussion about how these criticisms should be handled and that sometimes they can go too far. I think having that debate is important as it keeps the process in check. But progress isn’t going to be painless, though this pain should be minimised. As for social media, I generally think it has been a force for good, with increased chances of visibility and interaction for those who typically take a back seat in discussions (though old power structures are still highly relevant and should be challenged). It is also almost universally in favour of measures to improve replicability and methodological rigour, so people can see positive examples of these measures and be rewarded via compliments.

Professor Andy Field – “Should Researchers Analyse Their Own Data?”

Three things affect whether results are true: researcher degrees of freedom (preregistration is crucial); Q.R.P.s; and researcher competence.

Researchers have degrees of freedom over the alpha level, the hypothesis, and the model used. These degrees of freedom become an issue once you start collecting data (because N.H.S.T. runs on long-run alpha rates).

Q.R.P.s are prevalent, though researchers are (unsurprisingly) far more likely to say other scientists have committed Q.R.P.s than to admit doing so themselves (Fanelli, 2009).

There is a huge spike in prevalence of p-values around 0.05 which you wouldn’t find without Q.R.P.s.

The garden of forking paths is the idea that researchers can unwittingly drive the false positive rate up through their seemingly harmless decisions about hypotheses and outliers etc.
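A toy simulation (my own illustration, not from the talk) can make the forking-paths point concrete. Below, two groups are drawn from the same distribution, so any “significant” difference is a false positive; the only researcher degree of freedom exploited is whether to exclude outliers, reporting whichever analysis gives the smaller p-value. Even that single seemingly harmless choice pushes the false-positive rate above the nominal 5%.

```python
import math
import random
import statistics

def t_test_p(a, b):
    """Approximate two-sided p-value for a difference in means
    (normal approximation to the Welch t statistic; adequate here)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def one_study(rng, n=30):
    """Both groups come from the SAME distribution, so there is
    no true effect and every 'hit' is a false positive."""
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    p_plain = t_test_p(a, b)
    # One innocuous-looking fork: also run the test after excluding
    # |x| > 2 "outliers", then report whichever p-value is smaller.
    a2 = [x for x in a if abs(x) < 2]
    b2 = [x for x in b if abs(x) < 2]
    return min(p_plain, t_test_p(a2, b2))

rng = random.Random(1)
sims = 4000
fp = sum(one_study(rng) < 0.05 for _ in range(sims)) / sims
print(f"False-positive rate with one forking path: {fp:.3f}")  # noticeably above 0.05
```

With more forks (multiple outcome measures, covariates, subgroups) the inflation compounds, which is why preregistration of the analysis plan matters.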

An independent analyst would help.

They can be objective whereas people are personally invested in the work they produce.

There are issues relating to understanding of key statistical concepts: Haller and Krauss (2002) found 80% of methods instructors endorsed at least one p-value fallacy, and Belia et al. (2005) showed many researchers misunderstand confidence intervals.

For significance tests and confidence intervals to be accurate, the sampling distribution of the parameter (e.g. the mean) needs to be normally distributed.

If the sample is big enough, this will be true as the central limit theorem dictates.

Heavy-tailed distributions are an issue: they greatly reduce power, bias confidence intervals, affect power in unpredictable ways in unequal groups, and mean the central limit theorem might only work with large Ns (e.g. roughly 160). Two thirds of psychology data have heavy tails (Micceri, 1989).
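A quick simulation (my own sketch, using a contaminated normal as a stand-in for the heavy-tailed data Micceri described) shows how slowly the central limit theorem does its work on heavy-tailed data: the sampling distribution of the mean still has clearly non-normal tails at n = 20, and only approaches normality (excess kurtosis near zero) at much larger samples.

```python
import random
import statistics

def excess_kurtosis(xs):
    """Excess kurtosis: about 0 for a normal distribution, > 0 for heavy tails."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean([((x - m) / s) ** 4 for x in xs]) - 3

def heavy_tailed(rng):
    """Contaminated normal: mostly N(0,1), but 10% of draws from N(0,10)."""
    return rng.gauss(0, 10) if rng.random() < 0.1 else rng.gauss(0, 1)

def kurtosis_of_means(rng, n, sims=3000):
    """Simulate the sampling distribution of the mean for sample size n."""
    means = [statistics.fmean(heavy_tailed(rng) for _ in range(n)) for _ in range(sims)]
    return excess_kurtosis(means)

rng = random.Random(7)
small = kurtosis_of_means(rng, n=20)
large = kurtosis_of_means(rng, n=160)
print(f"excess kurtosis of sample means, n=20:  {small:.2f}")
print(f"excess kurtosis of sample means, n=160: {large:.2f}")
```

The n = 20 sampling distribution retains substantial excess kurtosis, so normal-theory tests and intervals at typical psychology sample sizes are operating on an assumption the data do not meet.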

People aren’t applying robust methods (and the preliminary data collected by Field likely underestimate how prevalent this is).

An independent analyst would help researchers who may struggle with statistics/help them apply proper methods.

An independent analyst can help with problems relating to: fabricating data; dropping cases to benefit results; terminating the study at a time other than when planned; fitting multiple models and reporting the most favourable; using inappropriate designs; p-hacking; and forking paths. It might help with the problem of selective reporting.

Comments:

I think this is an excellent idea but I agree with Professor Fiske’s concern: are there enough statistically savvy researchers who can help analyse all this data? Definitely not. But if some suggested measures are adopted (multi-lab collaborations, fewer publications) then it may be more feasible to reach this goal. Even if the ideal isn’t reached and only some papers are vetted by those who are more knowledgeable in statistics, this would be a huge benefit. But who would be these statistical doyens? Is it more realistic to expect psychologists to improve their statistical and methodological knowledge enough so they wouldn’t need an expert helping them? The benefits of doing this would need to be clearly demonstrated to show it would be worth the effort, which would likely require a shift in publication policies and incentives.

Professor Susan Fiske – “Research Methods Blogs in Psychology: Exploring Who Posts What about Whom, with What Effect”

Internalisation of good methods works better than mere compliance, with incentives being more effective than punishment.

Persuasion is better than shaming (both for the people involved and the efficacy of your criticism); are we doing this?

There was a crisis (much like the current one) in social psychology in the 1970s. Flashy studies, failed replications, and a lack of relevance were prevalent.

This was improved by persuasion and positive examples being set.

Why the researchers wanted to focus on blogs: because the general aims of blogs are to “[promote] open, critical, civil, and inclusive scientific discourse”. The ideals that blogs are supposed to aspire to are: democratic, open, equitable.

Most of the criticism made on these blogs is relevant and in good faith but there are some that cross the line.

Bias may be present in these blogs and there may be collateral damage (power dynamics, gender, crowd behaviour all affect these in meaningful ways).

Tessa West & Linda Skitka ran a questionnaire to see how people responded to seeing others being criticised on social media, with many stating they disengaged after seeing it.

The lead author believed men would be the targets of criticism more than women, which was the opposite of Fiske’s prior belief.

41 blogs were included, with 62% of the posts analysed coming from one blogger.

Bloggers have traditional leader demographics, as 71% were male, 92% were white, and 74% were mid-to-late career.

The blogs only mention specific people in roughly 11% of their posts (but there’s a large range from 0% to 34%) and males are more likely to be mentioned than females.

Posts mentioning a target by name tended to result in slightly more citations, more comments, and almost double the number of readers commenting.

Discussing replications and research in general increased the impact of the blog with regards to comments and citations, whilst statistics had no impact, and discussion of fraud reduced it.

Comments:

I’m going to save giving a thorough critique on the exact process of how my blog feed was scraped until the relevant paper is published (along with the data etc.) as I will have a more complete picture of what was done. The comment about the lack of diversity among the blogs is a valid point and one I sought to address shortly after data for the study was collected. Because there was one huge outlier in the sample, a more detailed look at that person’s blogging activity might have been useful. Many of their posts aren’t relevant to psychology as they are responses to emails or announcements about statistical software or conferences, so excluding those posts would give a more accurate picture of the relevant activity. The crisis in psychology during the 1970s intrigues me. Was there as much discussion as there is now? Were as many positive steps taken as now? Why or why not? Why are we experiencing a crisis in confidence now if there has already been a period of soul searching in social psychology (which was the subfield that helped spark the current debate)? Has anything really changed since then?

Professor Eric-Jan Wagenmakers – “The Case for Radical Transparency in Statistical Reporting”

The researcher who conducted the research is the most biased person involved in the process. They can then analyse the data without any oversight or accountability.

Researchers have a pet theory they’ve worked on and published about previously. An experiment is designed to test this theory. Data is collected, which takes a lot of time and resources. Part of their career (and those of their students) hang on finding positive results.

Data is analysed with no accountability by the person who is most incentivised to find a positive result, often with limited statistical training.

When a significant result is found, it is declared conclusive, and any further questioning of the result is frowned upon as it violates an implicit social contract.

Researchers want results that don’t allow for doubt.

But as John Tukey said: “[statistical procedures should not be used]… for sanctification, for the preservation of conclusions from all criticisms”.

Perverse incentives encourage publication bias, fudging, and HARKing.

Fishing for results can take two general forms: massaging the data (altering variables, transformations, analyses) or finding your hypotheses in the data (HARKing, Hypothesising After Results Known). These are problematic for both Frequentists and Bayesians.

There are many potential solutions. Preregistration, Registered Reports, and sensitivity analysis (where someone, preferably in an independent lab, tests the sensitivity to modelling choices by looking at the data, likelihood, and prior) are some of them. Conducting multiverse analyses, plotting the data so you can see trends e.g. in correlation, and adopting an inclusive inferential approach (reliance on p-values suppresses statistical curiosity) are other analysis related actions you can take. Sharing the data (which facilitates re-analysis), and joining the PRO initiative are positive steps.
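A multiverse analysis can be sketched in a few lines. The example below is my own minimal illustration, not from the talk: the processing choices (outlier thresholds, a log transform) and the simulated data are hypothetical stand-ins for the defensible-but-arbitrary decisions a real analysis involves. The idea is simply to run the same comparison under every combination of choices and report the whole spread of p-values rather than a single cherry-picked one.

```python
import itertools
import math
import random
import statistics

def identity(x):
    return x

def welch_p(a, b):
    """Approximate two-sided p-value for a difference in means
    (normal approximation to the Welch t statistic)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def multiverse(a, b):
    """Run the comparison under every combination of processing
    choices and collect all resulting p-values."""
    outlier_cuts = [None, 2.5, 2.0]        # z-score exclusion thresholds (hypothetical)
    transforms = [identity, math.log1p]    # raw vs log-transformed (hypothetical)
    results = {}
    for cut, tr in itertools.product(outlier_cuts, transforms):
        def prep(xs):
            ys = [tr(x) for x in xs]
            if cut is not None:
                m, s = statistics.mean(ys), statistics.stdev(ys)
                ys = [y for y in ys if abs(y - m) <= cut * s]
            return ys
        results[(cut, tr.__name__)] = welch_p(prep(a), prep(b))
    return results

rng = random.Random(3)
a = [abs(rng.gauss(1.0, 1.0)) for _ in range(40)]
b = [abs(rng.gauss(1.3, 1.0)) for _ in range(40)]
results = multiverse(a, b)
for spec, p in sorted(results.items(), key=str):
    print(spec, f"p = {p:.3f}")
```

If the conclusion holds across the whole grid of specifications, it is robust; if it flips depending on an arbitrary choice, that fragility is itself the finding.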

Comments:

I am very strongly in favour of greater transparency. I’ve shared the data, code, and materials for every piece of research I’ve conducted since becoming aware of the importance of open science and will endeavour to carry this on for the rest of my career. I am a big fan of Registered Reports and look forward to conducting them in the future. I was not aware of sensitivity analysis so I need to learn about this and try to implement it in my work (and in the process teaching others around me about it so we can work towards independent labs conducting this analysis). Bayesian analysis has intrigued me for a while so I’d like to learn more about it.

Dr Richard Morey – “Statistical Games: The Flawed Thinking of Popular Methods for Assessing Reproducibility”

Many problems plague psychology, despite the vast majority of players acting in good faith.

The current incentive structure encourages scientists to make surprising and novel findings.

“If the incentives aren’t truth and efficacy, science becomes a game we play”.

The methodologists have exactly the same incentives: science is highly competitive; revealing problems in published research is sexy; many scientists don’t understand statistics; and there is less scrutiny due to limited interactions between related fields like psychology, statistics, and philosophy.

The test for excess significance (Ioannidis & Trikalinos, 2007) is an example of a meta-analytic technique that was widely touted despite poor theoretical and empirical backing. “Central assumption that studies are coming from the same set is violated”.

You have to consider the process that led to the result, and ask: “Is it biased?”

If you don’t know the data collection pattern, you can’t know if it’s biased.

The groupings of results matter; it’s important to look for structure in results.

How could this technique be used so many times despite its flaws?: there is a lack of qualified reviewers in psychology, there is a default scepticism towards critiqued results, and “even if it is flawed, it raises awareness of a problem”. The emphasis is therefore still not truth and efficacy.

Researchers should be sceptical of meta-analytic forensic methods. They are often poorly vetted and rely on untenable assumptions about scientific progress and hidden theoretical assumptions.

There are some general problems in the replication movement: confusion of core statistical issues e.g. power (Mayo & Morey, 2017); too much trust placed in simulations as science is not a random process and simulations only show a technique works on average; and a lack of critical analysis of methods (likely due to confirmation bias).

The methods need to be vetted properly.

A better understanding of null results is required, as well as greater methodological transparency.

More rigorous, multi-lab exploration of phenomena is required, as well as more confirmatory work, and an elimination of the incentives which encourage “game playing” (fishing for flashy results etc.).

Comments:

I hadn’t considered the fact that researchers who critique others’ work and develop tools to find errors in it are under the same incentives as other researchers. But now Morey has highlighted this, it seems obviously true. As Morey argues, the same level of criticism needs to be applied to the tools as to the original research they were designed to analyse. Obviously I agree the incentive structure needs to change so “truth and efficacy” are the goal rather than flashy results, but that is easier said than done. I firmly believe Registered Reports help move science in the right direction, as do conferences like this one that highlight the issues to both younger and older audiences so they can try to improve their work.

Dr Katherine Button – “Collaborative Student Projects: Grassroots Training for Reproducible Science”

Button et al. (2013) found that most studies had low power and biased effect sizes.

Student projects are a perfect example of the issues in psychology: multiple projects are run with few resources (limited time, money, and access to participants), and the assessment criteria mainly focus on individual contributions, creativity, and novelty as opposed to rigour.

This leads to multiple small, underpowered, poorly designed studies which test novel hypotheses (rather than focus on replication) that can be analysed with undisclosed flexibility.

This sends all the wrong messages.

To counter this, Button worked with collaborators across institutions to set up multi-centre projects for final year undergraduates. The focus would be on methodological rigour: good design, adequate power, preregistered hypotheses and analysis plans, and open data and resources. Students would collaborate to replicate previous work.

The two biggest predictors of whether you’ll “make it” in science are number of first author publications and impact factor.

But impact factor isn’t a mark of quality (Brembs, 2016).

The paper “A Manifesto for Reproducible Science” by Munafo et al. (2017) offers some solutions to these problems.

Researchers need to train undergrads and themselves to conduct rigorous and open research (which takes more time and effort but is of a higher standard).

Comments:

I think this is an absolutely brilliant idea and think all universities should utilise it. It would require the move away from first author papers being one of the main judges of scholarly worth and more towards the methodological rigour of the science you are involved in. There are genuine worries about how to ensure there is equal contribution but, in my view, the pros vastly outweigh the cons.

Chris Graf – “What Can Publishers Do to Support Research Integrity?”

Three things publishers can do to support research integrity: listen and understand; support quality actively; and collaborate.

Eight things we can do to reduce stress around reproducibility (Leek, 2017), one of the main ones being: cut each other some slack.

Sally Rumsey: researchers do not want to know about over-complex sharing options; those providing the infrastructure can make the processes simple.

Making journals more open and transparent with their decisions would help improve replicability and the quality of science.

Wiley helped implement a programme that scans submitted papers for errors before publication.

Registered Reports help remove many of the problems currently seen in psychology (and science in general).

Wiley recognises the importance of publishers implementing the TOP guidelines.

Comments:

Wiley do seem to be making positive strides towards encouraging scientists to produce more reproducible and higher quality science. They offer Registered Reports and have signed the TOP guidelines, which I think are commendable steps. In the current system, getting publishers on board is essential for any change to occur. I’ll admit I don’t know enough about the politics of publishing to argue whether we can do without them, but they are certainly here to stay for the time being. And publishers getting behind improvements to science is definitely a good thing. I also agree with the sentiments highlighted in the talk from Leek and Rumsey. My natural inclination is to be charitable in my interpretation of someone’s wrong-doing, so I’m all for cutting people some slack. And there are a lot of demands on researchers currently, so making the process of sharing data and materials as easy as possible should be a priority. This is why I think having a central repository like the Open Science Framework is necessary. The programme implemented at Wiley to scan all potential publications for errors sounds excellent and exactly like StatCheck. I’m strongly in favour of StatCheck and think it should analyse the results of every publication.

References

Belia, S.; Fidler, F.; Williams, J.; & Cumming, G. (2005). Researchers Misunderstand Confidence Intervals and Standard Error Bars. Psychological Methods, 10(4), 389-396. http://dx.doi.org/10.1037/1082-989X.10.4.389

Bem, D. J. (2011) Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407-425.

Bishop, D. (2018). Improving reproducibility: the future is with the young. Available at: http://deevybee.blogspot.co.uk/2018/02/improving-reproducibility-future-is.html

Brembs, B. (2016) Even Without Retractions, ‘Top’ Journals Publish The Least Reliable Science. Available at: http://bjoern.brembs.net/2016/01/even-without-retractions-top-journals-publish-the-least-reliable-science/

Button, K.S.; Ioannidis, J.P.A.; Mokrysz, C.; Nosek, B.A.; Flint, J.; Robinson, E.S.J.; & Munafò, M.R. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365–376 doi:10.1038/nrn3475

Fanelli D (2009) How Many Scientists Fabricate and Falsify Research? A Systematic Review and Meta-Analysis of Survey Data. PLoS ONE 4(5): e5738. https://doi.org/10.1371/journal.pone.0005738

Haller, H. & Krauss, S. (2002) Misinterpretations of Significance: A Problem Students Share with Their Teachers? Methods of Psychological Research Online, 7 (1).

Ioannidis, J.P.A. & Trikalinos, T.A. (2007). An exploratory test for an excess of significant findings. Clinical Trials, 4 (3), 245-53.

Leek, J. (2017). A few things that would reduce stress around reproducibility/replicability in science. Available at: https://simplystatistics.org/2017/11/21/rr-sress/

Mayo, D. & Morey, R. (2017). A Poor Prognosis for the Diagnostic Screening Critique of Statistical Tests. Available at: https://osf.io/tv4bq/

Micceri, T. (1989). The Unicorn, The Normal Curve, and Other Improbable Creatures. Psychological Bulletin, 105 (1), 156-166.

Munafò, M.R.; Nosek, B.A.; Bishop, D.V.M.; Button, K.S. Chambers, C.D.; du Sert, N.P.; Simonsohn, U.; Wagenmakers, E.-J.; Ware, J.J.; Ioannidis, J.P.A. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1, Article number: 0021, doi:10.1038/s41562-016-0021.

Open Science Collaboration. (2015) Estimating the reproducibility of psychological science. Science, 349 (6251): aac4716. doi: 10.1126/science.aac4716. pmid:26315443.

Simmons, J.P.; Nelson, L.D.; & Simonsohn, U. (2011) False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22 (11), 1359 – 1366.