Debunking Bad Big Data Science

Author: Prof. Bettina Berendt
Date: 13th March 2015

"The promise of big data is that we do what we’ve been doing all along – profiling – but make it better, less discriminatory, and more individualized. That sounds acceptable if the aim is simply to prevent unwanted actions. But it becomes very dangerous if we use big-data predictions to decide whether somebody is culpable and ought to be punished for behaviour that has not yet happened." (Mayer-Schönberger & Cukier, Big Data, p.166)

Why are arguments about Big Data such as this one so enticing – and at the same time so misleading?

And why does this matter?

"Sciencey-sounding" but in fact misleading or even wrong accounts of science have very real consequences: people believe those stories and act on them. Witness the recent resurgence of deadly infectious diseases, fuelled by an anti-vaccination movement that is inspired by a patently wrong (but repeatedly rehashed) health scare story about MMR vaccines leading to autism in children. Ben Goldacre calls this Bad Science and is doing a great job of documenting and debunking a host of examples, most from the health domain, in his blog and books.

It’s time we started a Bad Big Data Science Debunking Programme. This is an interdisciplinary exercise, in which we computer scientists should become much more active. As engineers, we’re used to constructing – but as scientists, we also have the duty to question and deconstruct! Here’s one data miner’s go at it: in the book-review-plus-essay Big Capta, Bad Science?, I argue

  • why Mayer-Schönberger & Cukier’s Big Data should be approached with caution
  • why Kitchin’s The Data Revolution is essential reading for anyone dealing with data
  • that The Data Revolution leaves one question largely open: how to apply its many excellent conceptual insights

and propose

  • Bad Big Data Science Debunking as one answer to this question: applying a critical data science toolbox – briefly – to three of Kitchin’s own examples (DNA measurements, rubbish collection, and "voluntary" social-media data donations) as well as – in detail – to the above passage about predictive policing.

This text is part of a larger project of Thinking-Through Big Data Ethics – so by definition it’s an invitation to debate! I look forward to your comments.