Many years ago I researched explosives by shining a light on them. It was every bit as exciting as it sounds. We would shine a light, take a picture, then study the explosive to see if it changed. I would painstakingly scour thousands of data points, looking for small fluctuations in intensity, signs of discoloration, or any statistically significant feature. We collected immense amounts of sensor data, but the explosive always looked the same from one snapshot to the next. Eventually we found that if we looked not just at the snapshots but also at the differences between them, computed with a mathematical formula, we could see dramatic changes. Every explosive turned out to be different, and we could effectively detect an explosive from a distance just by shining a light. Today, that research is being used to scan people for bombs before they enter airports.
Today, companies have more customer data than they can handle. Like a digital version of the show Hoarders, they try to keep every bit of detail for as long as possible in the hope that one day these useless bits can be turned into massive new revenue opportunities. Over the past five years, bright engineers have devised open-source solutions to store and process the data deluge. We now even have a “big data stack,” that is, a framework for commoditizing data.
At a previous job, I was essentially the housekeeper of our home-built data infrastructure based on Hadoop. The idea was to create a monolithic tracking system that would record all anonymous user data, batch it up, and store it in a distributed store. We’d then process and query it to find fascinating facts about our users that would help drive product roadmaps. At least that was the idea. Sound familiar? In practice, though, we weren’t exactly sure what to track. Which user attributes were meaningful? How would we identify statistically significant behavior? Were we sure enough of the data’s accuracy to let it influence the product roadmap? It turns out that we, and it seems many other companies, neglected a major component of the “big data stack”: science. So let’s modify our framework.
This diagram may seem a bit strange if you’re a software engineer, because when you talk to engineers about big data you’ll typically hear a litany of tool sets that sound like characters out of a Harry Potter novel: Voldemort, Pig, HDFS, Oozie, Zookeeper, Flume, Hive, Cassandra… you get the picture. We have yet to get to a point where science can be commoditized, and perhaps we never will (though Mahout is a step in the right direction). Despite all these tool sets, scientists will always be needed for their intuition, interpretation, and curiosity. Scientists are needed to analyze the business needs of a customer and ask the right questions to solve critical business problems. Scientists are needed to transform an ugly piece of log data into a beautiful infographic that can spur an organization to launch a new product, bolster existing services, or otherwise remain nimble in a highly competitive economic environment.
These scientists are not scientists in the traditional sense. Their domain isn’t necessarily their bachelor’s, master’s, or dissertation topic. On the surface, my background in explosive materials doesn’t sound like it helps me with my day job of helping clients understand user behavior in their mobile and social applications. Instead, I’ve learned that “data scientists,” no matter what their background, specialize in providing insight by using keen analytical and quantitative skills. If needed, they will clean, explore, and model data sets to create new information products and key metrics. These scientists are not sitting in a cubicle doing mundane research toward an elusive goal. They are highly collaborative and “high-touch”: they constantly communicate with a key stakeholder so that an end goal is reached.
The scientific process and the research I conducted on explosives were instrumental in creating a product that went to market. The organization I was a part of taught me to be curious and to look at data sets in new ways. We had the infrastructure to store, process, and query the data, but ultimately it was our insight that produced a working prototype. In today’s data-driven world, the organizations that best leverage their data and invest in the right people to derive insights from that data will gain large competitive advantages over organizations that fly blind.
——————————————————————————————————————–
About the author: Chris Bates is a data scientist and founder of FitLabs, a Web application that measures your fitness. He has a background in engineering physics, materials science and computer science. Chris wants to encourage people to think about their data in interesting ways: “If we learn to ask the right questions, then we might uncover some interesting aspects about our life.” Check out his blog, The Data Scientist.

Excellent intro, Chris.
then what is science ?
A very insightful post. I’ve recently been thinking the same thing. Everyone is talking about big data, big data, big data, and job descriptions love to include the term “big data,” but unless there’s someone in the organization who knows what they want to do with the data, it’s all BS.
I’ve very rarely (maybe never? I don’t remember) seen examples where big data is used to actually do something useful for an organization. The buzzwords are out there: machine learning, prediction algorithms, etc. But how much of it is real and how much is buzz? I’m not so sure.
I find that if you talk about ‘ideas’ and ‘questions’, it helps people get into the right frame of mind when facilitating enquiry.
My own view of science is very broad, and it definitely includes the activity of undertaking intuitively led (but nonetheless guided) fishing trips. But for sure not everyone shares this view.