…but is it evidence?

Have you ever found yourself in an argument with someone who claims they have “evidence” for something and will not shut up about it? Do you find yourself feeling uncomfortable quarrelling with said “evidence”, even though you know the other person is wrong?

Here is a simple guide to spotting what is good evidence and what is not.

Is it even research?

This shouldn’t even need to be said, but I have seen a lot of people who believe a hypothesis to be evidence. It is not. This is particularly true in the case of evolutionary psychology: much of it is hypothesising without research. In other words, the authors publish a paper about what they think might be the case without actually testing whether it is the case.

It is perfectly possible to publish hypotheses without tests to stimulate discussion and debate. There is a whole journal dedicated to this! Medical Hypotheses has published some gems, including a particularly offensive paper about how “mongoloid” is an adequate name for people with Down’s Syndrome because like people from the Far East, people with Down’s Syndrome sit cross-legged and like eating food with MSG in it. Seriously. That was actually published.

Check whether the “evidence” contains data collection and statistical tests. If it does not, it is likely to be wild speculation, not evidence.

What sort of research is it?

This graphic is called the “pyramid of evidence”. It is a good way of looking at the best sorts of evidence in medicine, although it can be applied elsewhere. At the bottom is “background information”–upon which hypotheses are formed and, as seen above, sometimes published to stimulate debate. Moving up through the pyramid, we see better types of evidence: case studies, cohort studies, randomised controlled trials, and then, right at the top, systematic reviews. A systematic review is the “gold standard” of evidence. It takes the data from all of the tests of a theory, drug, medical intervention, etc., and puts it together into one data set, spitting out an “effect size” which tells us how effective or “correct” the thing in question is.
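To make the “effect size” idea concrete, here is a minimal sketch of how a fixed-effect meta-analysis pools individual study results using inverse-variance weighting. The effect sizes and variances are invented purely for illustration.

```python
# Minimal sketch of a fixed-effect meta-analysis: pool per-study effect sizes
# using inverse-variance weights. All numbers below are made up for illustration.
effect_sizes = [0.30, 0.45, 0.25, 0.50]   # e.g. standardised mean differences
variances    = [0.02, 0.05, 0.03, 0.04]   # sampling variance of each estimate

weights = [1 / v for v in variances]                       # inverse-variance weights
pooled = sum(w * d for w, d in zip(weights, effect_sizes)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5                      # standard error of pooled effect

print(f"pooled effect size: {pooled:.3f} (SE {pooled_se:.3f})")
```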

The differentiation between different types of research is important. Cohort studies are usually correlational: while it is not entirely true that correlation does not imply causation, correlational studies can only point us in the right direction. To properly establish causation, we need to manipulate some variables. Say, for example, we want to test whether exposure to feminist thought leads to lower levels of sexism. This can be tested by exposing one group of people to feminist thought while leaving a control group unexposed, and measuring levels of sexism in both groups before and after the intervention. This study has actually been done, and found that sexism decreased following exposure to feminism.
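For illustration only, here is a rough sketch of how such an experiment might be analysed, comparing the change in sexism scores between the exposed and control groups. The numbers and the use of a simple t-test are my assumptions, not the actual study’s data or analysis.

```python
# Hypothetical analysis of a pre/post experiment with a control group.
# The change scores are invented; the real study used its own data and tests.
import numpy as np
from scipy import stats

# Change in sexism score (post minus pre); negative means sexism decreased.
exposed = np.array([-4, -2, -5, -3, -1, -6, -2, -4, -3, -5])
control = np.array([ 0, -1,  1,  0, -2,  1,  0, -1,  2,  0])

t, p = stats.ttest_ind(exposed, control)
print(f"t = {t:.2f}, p = {p:.4f}")  # a small p suggests exposure made a difference
```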

If the “evidence” being provided is one correlational study, then it might not be very good evidence. Ask if there are any systematic reviews available, or at the very least an experimental study.

Quality of the evidence

On the evidence pyramid, there is a second dimension: quality of the research. Quality is made up of a number of important attributes, and it is important to check whether the evidence is good quality or not.

One crucial indicator is the sample. To get good results, the experiment needs to be conducted on a large group of people. The sample should, ideally, comprise different people from different walks of life. Unfortunately, a lot of psychological research is conducted on psychology students, which throws a lot of it off-kilter, as students are younger and richer than most of the rest of us, and a lot wiser to taking psychological tests. Look and see who was in the study: it is a useful way of understanding how well the results apply to everyone else.
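To put the sample-size point in numbers, here is a rough sketch of the standard two-group sample-size calculation under a normal approximation. The effect size, alpha and power are conventional defaults I have picked for the example, not figures from any particular study.

```python
# Rough two-group sample-size calculation (normal approximation):
# n per group ≈ 2 * ((z_(1-alpha/2) + z_(1-beta)) / d)^2, where d is the expected
# standardised effect size. All values below are conventional defaults.
from scipy.stats import norm

alpha, power, d = 0.05, 0.80, 0.5           # 'medium' effect (Cohen's d = 0.5)
z_a = norm.ppf(1 - alpha / 2)               # ≈ 1.96
z_b = norm.ppf(power)                       # ≈ 0.84
n_per_group = 2 * ((z_a + z_b) / d) ** 2
print(f"≈ {n_per_group:.0f} participants per group")   # ≈ 63
```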

Another aspect of quality is the state of the comparison group. If there is no comparison group whatsoever, be very cautious: the evidence is probably terrible quality. I have seen many people try to draw conclusions about the differences between men and women based on studies of only men, or only women. The comparison group, if present, needs to be, of course, comparable. If a study is testing the differences between men and women, and the women in the comparison group are less educated, for example, then the results could be down to education rather than gender.

For the sake of brevity, I point you towards this excellent (freely available) paper which teaches readers to critically evaluate the quality of a paper. Knowledge of this is power.

Popular science books are not evidence

Anyone can write and publish a book, particularly in the age of self-publishing. Even books from “big names” such as Steven Pinker are not good evidence, as books are not subject to peer review. Peer review is a process used in the academic community for checking whether a paper is valid: before anyone publishes the paper, it will be read through by several other experts in the same field of research. Often, the reviewers will want to see some of your data to verify your findings. They also, more often than not, send the paper back to you and tell you that perhaps you might want to reinterpret your findings or clarify certain bits of the research, or that you’ve made a massive honking error. They also ask you to draw attention to the limitations of your research, so readers can be aware of possible pitfalls like those outlined above. It’s a lengthy process, but it means that journals aren’t publishing any old crap.

For books, this is not the case. Often, the text is read by an editor with no experience in the field of research. If the writer fucks up somewhere, it won’t get caught and will be published anyway.

One example of this is the book The Spirit Level. There are a few holes in the evidence presented in the book which are dealt with in the reply book The Spirit Level Delusion. The author of Delusion rightly criticises problems which appear in the book but, unfortunately, is largely tilting at windmills: most of the peer-reviewed evidence upon which The Spirit Level is based stands up pretty well. It is only some of the bits that didn’t get peer-reviewed and were thrown into the book anyway which can be picked apart. Essentially, The Spirit Level stands up, but due to the sloppiness of the book publication process, it left itself with some open goals in the form of downright shoddy analysis, leaving many (wrongly) thinking the entire theory disproved.

If the only evidence linked is books, be wary. Demand to see peer-reviewed evidence instead. These days, a lot of it is available for free, and even if a paper is not, you can usually see the abstract.

I hope this guide will be helpful for would-be troll-slayers. Use your knowledge. Use it wisely. Happy hunting!

20 thoughts on “…but is it evidence?”

    1. It’s getting a lot better: you can almost always get hold of a summary of it now–and often, if you search the authors, they’ll be hosting a free copy.

  1. I agree with most of that, but I have so much to add, I should probably write my own blog post. Here goes:

    First, whilst peer-reviewed academic papers can usually be relied upon, I would not treat them as being much more reliable than well-sourced, well-argued books. I’ve been through the peer-review process quite a few times and I know how flawed it can be. It’s luck of the draw: if you get good reviewers who really know your topic they can improve your paper enormously. However, you quite often get reviewers who don’t understand your paper that well and some (most!?) of them have big egos. As you gain experience in the process, you learn how to massage and flatter those egos if you want to get your paper published. This tends to happen more when you’re doing more original or ‘cutting edge’ research. It also means that it is quite possible to publish garbage, if you know how to get it through the review process. Believe me, it’s tedious but not that hard and when academics are under pressure to produce papers, garbage will get published.

    Another problem with peer-review is that it is inherently inimical to research that is in any way controversial but honest. The more reviewers you have, the more difficult it will be to stay honest and publish a controversial thesis, because the probability that one of your reviewers will hate it approaches 1, especially if it contradicts previous work undertaken by that reviewer. There are other problems with peer review, but I’ll stop there.

    No book or academic paper is ever beyond criticism, regardless of whether it has been through peer-review or not: I like the fact that you distinguish between different ‘grades’ of academic study on the basis of the breadth and quality of evidence presented. As you say, The Spirit Level stands up very well, despite its minor flaws. Its main detractor, Chris Snowdon (author of The Spirit Level Delusion), is not a statistician or epidemiologist but a historian, paid by right-wing think-tanks to advance the agenda of narrow interest groups. Hidden agendas are something else one needs to be aware of when evaluating a thesis and this is increasingly true also in the academic realm.

    My main point is that one should always assess an argument, a paper, a book or a thesis on its own merits, taking account of the quality of its sources and the integrity of its arguments whilst being aware of possible hidden agendas, rather than relying on the peer-review process as a stamp of quality. I want to add some observations on the nature of evidence, but I’ll do it in a separate comment for ease of discussion.

  2. Hi Stavvers. I’m not clear how, at least until a man (or a lady, of course) invents a time machine, evolutionary psychology theories could be tested to your satisfaction? We can’t test the ‘Big Bang’ directly, but we can see all the astronomical evidence pointing to it being true. That doesn’t stop many millions of religious people denying it.

    Feminists are the secular equivalents of religious fundamentalists. If we open our eyes, we continue to see gender-typical behaviours everywhere we look. Why do women with any choice in the matter seek to pair up with well-off men to finance a good standard of living, rather than through achieving the same resources through work? And why do women seek partners taller than themselves? There’s something fundamental and biological going on here that no amount of wishful feminist theorising is ever going to counter.

    Having read both ‘The Spirit Level’ and ‘The Spirit Level Delusion’ I have reached the conclusion that the first book, far from having a few minor flaws, has more flaws than the Empire State Building.

      1. Stavvers, you want me to ‘learn science’? A typically patronising remark, but well done on avoiding the f-word for once. I did a science degree in the late 70s and have gone to considerable efforts to keep informed about scientific matters for 30+ years.

      2. Stavvers, thanks for the repeated snide comments about my self-publishing my books. Do you imagine writing a blog is a superior form of communication than writing books? To my mind it takes a small fraction of the effort. Maybe you should try writing books before you sneer at people who have actually made the effort.

        Writers who have started their careers by self-publishing include Virginia Woolf, Shelley, Gertrude Stein, Thomas Paine, Stephen King, Bill Bryson, Mark Twain, James Joyce, Anais Nin, Robbie Burns, Alexander Pope, John Galsworthy, Beatrix Potter, Lord Byron and many, many others, while Jeffrey Archer and Jilly Cooper never self-published.

      3. NOT TO GO ON YOUR BLOG

        Stavvers, let me know if you want any (free) advice on self-publishing, or if you’d like to be published by me for a modest fee – [LINK REMOVED] – although I think your titles would sit oddly next to mine! Vive la difference.

        My book ‘POO IN MY PANTS’ retails for £20.00 on Amazon (ebook edition for Kindle, iPad etc under £8.00) and it tells you all you need to know about self-publishing in the current era. Happy to offer the paperback at £9.95 inc p&p to you and the good people who follow your blog. Because that’s the kind of guy I am.

        All the best.

    1. Hello Mike. Just a couple of points to take you up on:

      1) The problem I see with evolutionary psychology is similar to the problem of confirmation bias. It’s not that I think ev psych is necessarily worthless, but its evidence base is generally very poor, because of a failure to evaluate alternative hypotheses. Ev psych comes up with a theory and then concludes that the theory is good simply because it is CONSISTENT with the data, but that’s not sufficient (see the discussion on Bayesian reasoning below).

      Usually, there are numerous alternative interpretations of the data and ev psych needs to consider evidence which distinguishes between them. Unfortunately, it rarely does this, which is why it tends to reduce to mere speculation.

      2) There is a similar problem in ascribing gender differences to nature rather than nurture. The evidence base is poor due to the huge number of potential confounding factors in the environment. To be fair, it is also hard to be sure that any differences should be ascribed purely to social factors, too.

      You do also seem to have a stereotypical view of feminism. Maybe you just haven’t met the right feminist yet? 🙂 The ones I know are not extremist in the sense of being environmental determinists: they just offer a valid critique of biological determinism, as the latter seems to be too much in vogue at present.

      There’s no denying that social and cultural factors play a large part in the development of human behaviour, as does genetics. Discrimination remains a widespread problem and biological determinism has an unfortunate tendency to reinforce it.

      In the end, we are all individuals who are the product of both environmental and biological factors, including our own choices and desires, which may or may not be gender-typical. Gender, like race, is not as precise a concept as our culture makes out. It may indeed be more of a social than a scientific concept, in which case its use as an explanatory variable in scientific theories is strictly limited.

      I would prefer to view gender not as a binary but as a multi-dimensional spectrum, albeit with certain clusters we label for convenience as ‘male’ and ‘female’. Sexuality can be seen in the same way. When you look at things this way, you see human beings in their full, glorious complexity. Then statements like “Women choose high-status males” cease even to be interesting, because they refer to a category (‘women’) which has little descriptive value – it’s a pretty arbitrary way to split the human race in two. That’s why generalisations suck and gender determinism is not scientific.

      1. Chris, thanks for this. I love the idea of not having met the right feminist yet haha. I honestly believe that an overwhelming majority of people in the UK (including myself) believe in equity feminism i.e. equality of opportunity. The problem lies in gender feminism, which seeks equality of outcome, and is driving so much including for example the ‘gender balance in the boardroom’ debate. I don’t personally know a gender feminist, and I don’t think I’ve ever met one. How, in a modern democracy, can such a tiny band of people control the legislative agenda, as they do here with the Equality Bill (2010), a Harriet Harman piece of legislation enacted by David Cameron?

        I accept of course that there’s a multi-dimensional spectrum, but I think if we deny that at least gender-typical men and gender-typical women (who are the majority of people – maybe your ‘clusters’?) have a tendency to think and act differently, much of the real world simply doesn’t make sense. I think we’ve been blinded by what a small minority of women -radical feminists – SAY women want, and what the majority of women DO when presented with choices.

        Just watched on the BBC news a piece about an engineering award under the Queen’s name. Now I understand 95%+ of UK engineering graduates, even in the modern era, are men. Not one of the numerous shots of engineers failed to include female engineers. David Cameron was seen chatting with a black female engineer. Well done Dave, three boxes ticked at one go!

        We see few women becoming engineers or executive directors of major companies and the trotted-out explanation is (as it has been for 30+ years) a lack of role models. I simply don’t believe it – doubtless we’ll still be told the same in another 30 years. You don’t have to be at the extreme end of the nature/nurture spectrum to think that there’s something biological at the very least contributing to it. I think Simon Baron-Cohen’s theories are borne out by the real world, and I know that isn’t a scientifically rigorous comment!

        Quotas for women – let’s call it what it is, positive discrimination – are an insult to women. And when more people realise that women are in positions despite not being ‘up to the mark’, surely that will harm women’s prospects in general? Is it a price worth paying, potentially weakening our companies on the altar of gender feminism? For me, it’s not.

        1. Dude, you insist on banging the same tedious drum in every thread, and it is never relevant.

          This is your first warning. You will not have any more comments approved if you do not MAKE SURE THEY ARE RELEVANT RATHER THAN JUST BORING BALLBAGGERY ABOUT THINGS YOU FIND INTERESTING.

          There will be no more warnings. Get better.

          1. Stavvers, I’m not so needy as to adjust what I write so as to accord with your ideas of what is ‘relevant’, and have no intention of doing so. I have far better things to do with my time than censor myself. I was enjoying the exchange with Chris W, he or she was making sound and often nuanced points. Goodbye.

  3. Stavvers, I’m struggling to equate your interesting original post in this thread with my experience of your blog over recent months. Only very recently you dismissed a number of books by leading academics – mostly women – as ‘particularly shit’. Everything that doesn’t agree with your feminist philosophy is attacked, no nuances are allowed.

    I doubt that 0.1% of ‘feminist theory’ would survive the critical approaches you favour in this thread, which in itself is evidence that your arguments in this area are sound. I appear to have agreed with you. I must go and lie down in a dark room before pressing on with my new book. So many sacred cows to slaughter, a whole damned herd of them in fact!

  4. I can see how this would be useful when dealing with trolls and I think it is a useful primer for some kinds of research, but still, tbh I don’t particularly like the values and ideas this view of knowledge seems to promote (particularly when it’s research which is in some way linked to oppression, so for instance a lot of medical research that overlaps with disability research):

    -objective, impartial experts are the best people to decide what is important to research, the way it should be framed, and the way the research should be carried out.
    -the power involved in research production (for instance some people having the resources to carry out research and other people not) is irrelevant.
    -quantitative data is more valuable than qualitative

    Personally, I’m much more of a fan of ’emancipatory research’ paradigms which suggest that research is an inherently political process, which look at the broader process of research production, and work to produce research which will help people in their everyday lives. I’ve mostly read about this in disability studies, where it emerged to counteract medical research which was perceived as oppressive, patronising and unhelpful, although it exists in other fields as well. For instance, in mental health research, I would probably value a service-user led, qualitative approach above the kind of evidence the bmj is promoting as ‘good’.

    1. I didn’t touch on emancipatory paradigms, but I’m actually a big fan of them myself–I did some research with people with learning disabilities a few years back.

      I think much of what applies here–particularly regarding quality–can be applied to qualitative research too. You’ve reminded me I need to post about how awesome qualitative research is 🙂

  5. OK, what is ‘evidence’? I approach this from the viewpoint of an information-theorist and Bayesian probabilist, so forgive me if I go off on one. Please come along for the ride:

    ‘Evidence’ is any fact which conveys information, in the sense of being new, interesting or even surprising. Evidence is something that makes you change your view about the state of reality or at least distinguishes between alternative possibilities. That’s why tautologies and contradictions are not evidence. It’s also why a not-guilty plea by someone on a murder charge is not considered evidence: they would say that anyway, wouldn’t they? Actually, it is evidence of a sort because some people plead guilty, so a not-guilty plea does convey some information favouring innocence, but not very much.

    In Bayesian terms, evidence forces you to update your belief about something by affecting its (posterior) probability, which is obtained via Bayes’ Theorem:

    Pr(H|D) = Pr(D|H) Pr(H) / Pr(D)

    I’ll explain it in detail. Here, Pr(H) is your ‘prior’ probability for some hypothesis, H (H might be something like ‘All coppers are bastards’). Pr(H) is your baseline level of belief in H, which may come from prejudice, reasonable bias or previous evidence. It is an inherent factor in all reasoning and is even popularised in the concept of ‘Occam’s razor’: Occam’s famous razor advises us to favour the ‘simplest’ hypothesis which explains a set of data, by which it means the hypothesis with fewest explanatory elements. This is because the simplest hypothesis is also usually assumed to be the one with the highest prior probability, Pr(H). Thus, in the absence of any actual evidence, or if the evidence is equal, we favour H such that Pr(H) is greatest and this is usually the simplest hypothesis.

    In mathematical terms, Occam’s razor is a prescription for inducing a ‘prior distribution’ over hypotheses, on the basis of their supposed simplicity. It’s usually a good rule of thumb and even mathematicians use it in modified and more precise forms, for example in regularisation theory. However, it is only a very rough heuristic for use in Bayesian reasoning and it is sometimes horribly misused, especially by historians (becoming ‘Occam’s eraser’). This frequently upsets me and makes me want to teach the world how to use Bayes’ Theorem properly. Anyway, that’s the ‘prior’, Pr(H), taken care of: it encapsulates your prior belief in the truth of a theory or set of theories. What about the rest?

    Well, Pr(D|H) is the really important bit. Statisticians call it the ‘likelihood’: it’s the probability of the data (e.g. experimental outcome, observations, test results, known facts, etc.) GIVEN the hypothesis, H. Bayesians also call Pr(D|H) the ‘evidence’ for H and this is close to the natural language meaning of ‘evidence’. If the data supports a particular theory, then Pr(D|H) will be high. If not, it will be low. So, if you see a copper holding a kitten and stroking it lovingly, the probability of this is low if we assume ‘All coppers are bastards’; hence Pr(D = ‘stroking kitten’ | H = ‘ACAB’) is close to zero.
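    In code, that single update might look like the following toy sketch; every prior and likelihood value is invented for illustration.

```python
# Toy Bayesian update for the kitten example. Prior and likelihood values are invented.
prior = {"ACAB": 0.5, "not ACAB": 0.5}
# Pr(D|H): probability of seeing a copper lovingly stroke a kitten under each hypothesis
likelihood = {"ACAB": 0.02, "not ACAB": 0.30}

unnormalised = {h: likelihood[h] * prior[h] for h in prior}
pr_d = sum(unnormalised.values())                          # Pr(D)
posterior = {h: v / pr_d for h, v in unnormalised.items()}
print(posterior)   # belief shifts sharply away from "ACAB" after the kitten is observed
```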

    At this point, a traditional statistical hypothesis test would simply reject the hypothesis, H (if we call it our null hypothesis), and adopt the alternative hypothesis, H1: ‘Some coppers are not bastards’. Bayesians are rightly suspicious of standard hypothesis testing, because it can be quite misleading. First, the data, D, used to draw conclusions is usually some distilled ‘test statistic’ which is only one summarised aspect of the overall data (in practice an F-statistic, a t-value or some such). In our example, the fact that a copper was stroking a kitten is just one aspect of his behaviour. The test may not account for the fact that he just smacked a harmless young woman over the head with his baton, prior to stroking the adorable little kitten. That’s one danger of using summary statistics rather than assessing the likelihood of the whole set of data (i.e. evaluating ALL the evidence). Bayesians advocate constructing a detailed generative model for the data and evaluating the evidence as a whole, Pr(D|H) rather than just Pr(Test statistic | H). In practice, this is often hard, so we go along with hypothesis testing, even though the resulting p-values depend greatly on the power of the test and the exact statistics used. Still, we need to be aware of the severe limitations of hypothesis testing and start thinking in a more Bayesian way.

    Another deeper problem with hypothesis testing is that we only consider two competing theories: the null hypothesis and its negation. The world is far more complex than this and Bayesian reasoning recognises that, usually by introducing parameterised models, but this is beyond the scope of the discussion. However, one may adopt a Bayesian approach by recognising that one must compare not just two hypotheses, but a whole set of possible hypotheses: H0, H1, H2, H3, … etc. (call them Hn with n = 0, 1, 2, 3 …)

    For example, a Bayesian might introduce H1: ‘Most coppers are bastards’ or H2: ‘All coppers are bastards when in uniform’, H3: ‘All coppers are bastards except around kittens’, and so on. The Bayesian framework allows us to assess the evidence, Pr(D|Hn) for each of these more nuanced theories simultaneously. Combined with the priors, Pr(Hn), this allows us to assess the posterior probabilities, Pr(Hn|D) which represent our conclusions. (Note: Pr(D) is the unconditional probability of the data, which does not depend on any hypothesis and so we can effectively ignore it – it’s just a normalising constant).
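    Here is a toy sketch of that multi-hypothesis comparison, evaluating all the observations together. Every probability is invented, and the observations are treated as independent purely to keep the example short.

```python
# Toy sketch: comparing several hypotheses at once, using ALL the observations.
# Every probability below is invented; observations are treated as independent.
import math

priors = {
    "H0: all coppers are bastards":       0.25,
    "H1: most coppers are bastards":      0.25,
    "H2: bastards only when in uniform":  0.25,
    "H3: bastards except around kittens": 0.25,
}

# Per-observation likelihoods Pr(obs | Hn)
likelihoods = {
    "H0: all coppers are bastards":       {"strokes kitten": 0.05, "baton incident": 0.90},
    "H1: most coppers are bastards":      {"strokes kitten": 0.30, "baton incident": 0.70},
    "H2: bastards only when in uniform":  {"strokes kitten": 0.40, "baton incident": 0.60},
    "H3: bastards except around kittens": {"strokes kitten": 0.90, "baton incident": 0.80},
}

data = ["strokes kitten", "baton incident"]

unnormalised = {h: priors[h] * math.prod(likelihoods[h][d] for d in data)
                for h in priors}
pr_d = sum(unnormalised.values())                  # Pr(D): just a normalising constant
posterior = {h: v / pr_d for h, v in unnormalised.items()}

for h, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{p:.2f}  {h}")
```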

    The posterior probabilities, Pr(Hn|D) are our new states of belief in the various hypotheses. Remember that we started with a set of prior beliefs, represented by the probabilities Pr(Hn). Now, after assessing the evidence for each hypothesis, Pr(D|Hn) we reach a new set of beliefs represented by Pr(Hn|D). That’s Bayesian reasoning and that’s the correct way to assess the evidence for a set of theories in the presence of uncertainty.

    The most important point to take away from this is that theories cannot be assessed in isolation. All theories must be measured relative to their alternatives. For example, the evidence for some theory, H1, may have a very low likelihood – Pr(D|H1) is small – but if all the plausible alternatives are even less consistent with the evidence, then one must favour H1, at least until some better hypothesis can be found. Simple example: the probability of tossing a coin and getting HHTHTTTH is (1/2)^8 which is minuscule (~ 0.0039) but that doesn’t mean the coin is biased: on the contrary, an unbiased coin is the best hypothesis which fits the data.
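    The coin example in numbers, comparing the fair coin against one arbitrarily assumed to land heads 70% of the time:

```python
# Likelihood of the exact sequence HHTHTTTH (4 heads, 4 tails) under a fair coin
# and under a hypothetical biased coin. The 0.7 bias is an arbitrary choice.
sequence = "HHTHTTTH"
heads, tails = sequence.count("H"), sequence.count("T")

def likelihood(p_heads):
    return p_heads ** heads * (1 - p_heads) ** tails

print(f"fair coin   (p=0.5): {likelihood(0.5):.5f}")   # ≈ 0.00391 -- tiny, but...
print(f"biased coin (p=0.7): {likelihood(0.7):.5f}")   # ≈ 0.00194 -- even tinier
# The fair coin, despite its minuscule likelihood, still beats this alternative.
```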

    Conversely, a theory may appear to have a high likelihood but that does not mean it is correct. Confirmation bias is an example of this. People tend to seek confirmation of their prior prejudices by looking for data which is highly likely on the basis of their prior belief. Daily Mail readers are forever having their biases confirmed by tales of illegal immigrants committing benefit fraud, for instance. In this case, they have a prejudice such as H1:”All immigrants are benefit cheats”. They also ascribe a high prior probability, Pr(H1), to this, because they don’t consider alternatives. The stories they read are, of course, highly likely if one assumes H1, so Pr(D=”Story about benefit cheats”|H1) is pretty close to 1. It’s just that there are many other theories which are just as consistent with this limited data.

    The problem is that they fail to consider other hypotheses and other evidence. We should introduce them to H2:”Most immigrants work incredibly hard” and show them the evidence which contradicts their favoured hypothesis, H1.

    This ends the introduction to Bayesian reasoning. It really should be taught more in social sciences and humanities courses.

    1. Chris, thanks for the post, very enlightening!

      One of my objections in the field of gender politics is the common assertion that all gender differences are cultural in origin, or down to that dreaded thing, patriarchal hegemony (I feel a headache coming on).

      For example, the fact that there are some successful female entrepreneurs is taken as evidence that women in general are intrinsically as entrepreneurial as men, when they’re clearly not (they seek government support for female entrepreneurs – you couldn’t make it up). By the same logic, because some women are tall, we would conclude that on average women are as tall as men.

      I don’t think I’ve seen a single illustration of gender differences for which hardworking feminist theorists haven’t managed to dream up a subverting argument. When a man fails to reach the boardroom it’s down to him. When a woman fails to reach the boardroom it’s evidence of the glass ceiling. When a man fails in a senior executive position it’s accepted he has failed for a variety of reasons (lack of competence, application, leadership abilities…) When a woman fails in a senior executive position it’s evidence of the ‘glass cliff’, it’s NEVER attributable to any deficiencies on her part. These are of course but small illustrations of dualism, ‘Women good, men bad’. What a dismal way to look at the world. It’s little wonder feminists are so unremittingly miserable.

      1. Perhaps I could sum up my earlier reply more briefly. A lot of the talk about innate gender differences comes down to saying “Possession of a penis correlates (quite weakly) with X, Y and Z”. This is no more enlightening than saying “Possession of black skin correlates (quite weakly) with X, Y and Z.”

        You see what I’m getting at? It’s arbitrary and not very scientific. It doesn’t leave much room for individual differences and it just reinforces unhelpful stereotypes. Obviously, it’s just as silly to believe ‘women good, men bad’ but I don’t know any women who really do think that (I’ll grant they may exist). However, I do sometimes meet men who think the opposite, so that’s more of a problem in my view.
