Hilary Mason's ApacheCon Keynote: 3 Ways to Improve Data Science
Data science still has a long way to go in developing systems that solve real-world, human problems, said Hilary Mason, data scientist in residence at Accel Partners, in her keynote at ApacheCon in Denver today. The open source community will be key to helping big data evolve into a more accessible technology, she said.
This is the key challenge facing the field of data science and a contributing factor in the rise of the professional data scientist. For the first time, math and statistics, coding, and analysis have come together into one job description with the sole purpose of making data useful, Mason said.
But the rapid evolution of computing power has outpaced the tech industry's ability to analyze the vast quantities of data now being generated, Mason said.
“Most of what we (data scientists) do is count things, or at least count them cleverly,” she said. “Deep analysis is still hard.”
The state of data science
At the turn of the last century, a computer punch card could barely hold enough data to store the characters in a single Tweet in 64-bit ASCII. Back then, the Computing-Tabulating-Recording Company, the company that eventually became IBM, was on the cutting edge of data science with the invention of a machine used exclusively for counting people.
Today, even a tiny computer the size of a Raspberry Pi can store and analyze terabytes of data, and tools such as Hadoop are available to anyone via a web download. But while the tools and technologies are much more advanced, the data engineering process most companies use to analyze data is still inefficient, Mason said. A typical company repeats the following process until something useful emerges:
1. Research offline using the data you have and the problem you want to solve.
2. Do fancy math and look for analytical shortcuts to find the right approach to the problem. This involves running a job in a Hadoop cluster or playing with the data in Python or R.
3. Design the infrastructure and deploy that to see where it fails.
4. Re-design to run at scale and speed.
“You end up writing a lot of code that gets thrown away just to count things, given the overhead of the infrastructure,” Mason said.
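To make "counting things" concrete, here is a minimal Python sketch of the kind of exploratory counting described in step 2. It is an illustration only, not code from Mason's talk: the sample records, the field names, and the `count_by` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical sample records (assumption for illustration);
# a real job would read these from logs or a cluster.
events = [
    {"user": "a", "action": "click"},
    {"user": "b", "action": "view"},
    {"user": "a", "action": "view"},
    {"user": "c", "action": "click"},
    {"user": "a", "action": "click"},
]


def count_by(records, key):
    """Count how many records share each value of `key`."""
    return Counter(record[key] for record in records)


action_counts = count_by(events, "action")
user_counts = count_by(events, "user")

print(action_counts.most_common())  # actions, most frequent first
print(user_counts.most_common(1))   # the single most active user
```

A throwaway script like this is usually enough for the offline research stage. The overhead Mason describes comes from redoing the same count on infrastructure that can run at scale.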
3 Ways to Improve Data Science
Mason is, however, optimistic about the future of data science. There are already many examples of big data success, including the Dark Sky app, which uses public U.S. government weather data to create a micro-forecast and send a mobile alert when it's about to rain in the user's location, she said. She also cited Jawbone, which used data from its app that measures steps taken and hours slept to show that daylight saving time caused the average American to lose 11 minutes of sleep, or millions of hours of sleep when scaled to the entire country.
"Maybe you can learn something from these little devices we wear for personal edification... to influence policy," Mason said.
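The scaling behind the Jawbone figure is simple arithmetic. The sketch below assumes a round U.S. population of 300 million and that every person loses the same 11 minutes. The population figure and the `total_hours_lost` helper are assumptions for illustration; the article does not say which numbers Jawbone or Mason used.

```python
def total_hours_lost(minutes_per_person, population):
    """Scale a per-person loss in minutes to total hours across a population."""
    return minutes_per_person * population / 60


# Assumption: a round figure of 300 million people, not a number from the talk.
ASSUMED_US_POPULATION = 300_000_000
MINUTES_LOST_PER_PERSON = 11  # per-person figure cited in the article

hours = total_hours_lost(MINUTES_LOST_PER_PERSON, ASSUMED_US_POPULATION)
print(f"{hours:,.0f} hours")  # prints "55,000,000 hours"
```

Under these assumptions the total comes to about 55 million hours, which is consistent with the "millions of hours" described.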
Companies are still at the beginning of figuring out what they can do with data. Meanwhile, our technical capabilities are still growing, including advances in the infrastructure used to store and retrieve data, and the software tools that accomplish complex analyses. To build more accessible systems and improve the human side of data science, Mason pointed to three key areas of improvement:
1. Data systems should use natural language – translating computer languages into plain English and vice versa.
2. Data scientists should work more closely with hardware designers. As our sources of data expand from social media and the web into the real world and embedded systems, hardware will be increasingly important to solving problems through data.
3. The community around data science and data technology must grow. “None of this happens without a strong community.”
"Data is making us smarter and data infrastructure is making it possible so let's make more of that," she said.
ApacheCon is happening April 7-9 in Denver. Follow #ApacheCon on Twitter for live coverage.