There was a time when the tools and techniques we call Data Science belonged to a select few initiates. Their secrets were jealously guarded by those with advanced degrees, expensive SAS licenses, and even more expensive training. To extract value, an organization had to make large investments in capital, operating expenses, and personnel. Those with the required size and scale reaped huge rewards in the marketplace, while the have-nots suffered greatly.
In 2012, Harvard Business Review declared Data Scientist the sexiest job of the 21st century. In early 2014, Burtch Works released its first comprehensive survey of Data Scientist salaries and found a weighted average annual salary of $137k. The 2017 version of the same study produced a weighted average of $150k, a gain of roughly 10% over three years (hardly the salary growth one would expect given the skyrocketing demand for the profession…unless there was a commensurate increase in supply).
This supply increase represents the Democratization of Data Science.
Three main forces are driving this newfound accessibility: new tools, packaged scripts, and of course, new people. Many of the barriers that traditionally kept Data Science scarce and holy have been broken down.
Quite a bit of ink has been spilled on the topic of SAS versus R/Python over the years. There was a time not so long ago when predictive models required proc logistic data = mine; and damned were those who forgot the semicolon. Before you could write your SAS procedure, you had to get a license for Base SAS, SAS/STAT, and potentially a host of other expensive add-ons. If you wanted an enterprise installation, you needed an enterprise budget. To learn SAS, you were taught as part of grad school, were tutored by existing programmers, or traveled to one of a number of locations for multi-thousand-dollar classes offered by Dr. Goodnight’s employees. Open source code has finally broken through. It is now just as easy to type glm(dep ~ ind, data = mine, family = binomial()) in R, or from sklearn.linear_model import LogisticRegression in Python and specify your model from there. The big difference is that R and Python cost nothing to install.
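To make the comparison concrete, here is a minimal sketch of the Python route. It assumes NumPy and scikit-learn are installed, and the data is synthetic and invented purely for illustration; the names dep and ind simply mirror the R formula above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, made-up data: one predictor (ind) and a binary outcome (dep).
rng = np.random.default_rng(0)
ind = rng.normal(size=(200, 1))
# The outcome is more likely to be 1 when the predictor is large.
dep = (ind[:, 0] + rng.normal(scale=0.3, size=200) > 0).astype(int)

# Note: scikit-learn applies L2 regularization by default, so the fitted
# coefficients will not exactly match an unpenalized fit such as R's glm().
model = LogisticRegression()
model.fit(ind, dep)

coef = model.coef_[0, 0]
accuracy = model.score(ind, dep)
print(f"coefficient: {coef:.2f}, training accuracy: {accuracy:.2f}")
```

The whole thing is a handful of lines and carries no license fee, which is the point. Real work would still need a proper train/test split and some thought about regularization.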
Packages and libraries are just the tip of the iceberg for coding help. Stack Overflow, GitHub, Kaggle, and Google all yield a wealth of code, from snippets to fully contained scripts, that lets today’s data scientists analyze and model data quickly and efficiently. Personal, company, and public code bases for solving many common problems make it possible to get more work done faster and for less. This extends well into data cleansing and manipulation, as companies such as Trifacta ease once-onerous chores. The dream of fully automated machine learning is here. It just hasn’t been realized…yet.
Advanced degrees are still expected, but the availability and variety of academic programs that contribute to the development of data scientists have been increasing. In-depth online training is available for free and uses real data sets for users to hone their skills. People are flooding in from adjacent functions; DBAs, developers, data analysts, and the like are all upgrading their skills to fill increased demand. They bring a variety of expertise, domain knowledge, and fresh perspectives, which oftentimes yield novel approaches to problem solving.
So, what does all this Democratization of Data Science mean? Quite simply, we can dive further down the marginal return curves than ever before and apply data-driven, model-based approaches to problem solving…at scale. Data Science is no longer limited to solving billion-dollar problems. Incremental improvements on problems worth several million dollars are enough to offset the costs of the analyses. During 2018 and beyond, these trends will continue, making data science increasingly accessible to the masses.