An Essential Big-Data Checklist For The Rest Of Us
By Amir Behbehani
By now, you’ve likely heard the terms ‘Big Data,’ ‘Data Science’ and ‘Machine Learning.’ But, what do these terms really mean for businesses?
These terms collectively refer to the process by which meaning, insights and predictions are extracted from raw data. Essentially, data science teams use machine learning to analyze big (amounts of) data, with the end goal of advancing business objectives such as increasing revenue or reducing costs. It’s not as complicated as you might think, as long as you plan ahead, anticipate common pitfalls and ask the right questions.
My company, Serial Metrics, has built analytical solutions for many companies over the last seven years. Over that time, we’ve developed a checklist to keep in mind when starting to build a big data solution. Following it has saved us time, money and a slew of headaches.
Communication: Unclear Questions or Outcome Metrics
A fundamental challenge facing data scientists has nothing to do with ensemble algorithms, optimization methods, or computing power. Communication – prior to any analysis or data engineering – is crucial for solving a machine learning problem quickly and painlessly. There are many questions machine learning (ML) can answer: it’s a powerful tool for making sense of data. However, these questions have to be specific and formulaic in a way that may be unfamiliar to the people responsible for identifying the problem, such as management or marketing.
While useful for framing and approaching a business problem, questions as posed in a real-world environment are often too vague to translate directly into ML modeling. Because of this, it is crucial to communicate effectively between different branches within your organization: the ‘small’ question being solved by ML modeling has to match the ‘big’ question that constitutes the business problem itself.
One way to break down these big questions is to restate the business objective in the form of a likelihood. For example, if you’re trying to increase revenues, instead of asking, “How do I increase my (total) sales?” ask, “How do I increase the likelihood that each individual prospect buys something?” This question sets up your data scientist to ask a mathematically tractable question (i.e., set up a hypothesis) such as: “What is the likelihood that any individual user will buy something, given features that we can mine from our data describing users on an individual basis?”
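To make the likelihood framing concrete, here is a minimal sketch that estimates the chance of a purchase given a single user feature by simple counting. The records, the field names (visited_pricing_page, bought) and the helper function are hypothetical, invented for illustration; a real project would typically fit a statistical model over many features rather than count over one.

```python
from collections import defaultdict

def purchase_likelihood(records, feature):
    """Estimate P(bought | feature value) by counting buyers per value."""
    totals = defaultdict(int)
    buyers = defaultdict(int)
    for row in records:
        value = row[feature]
        totals[value] += 1
        if row["bought"]:
            buyers[value] += 1
    return {value: buyers[value] / totals[value] for value in totals}

# Hypothetical per-user records; the field names are illustrative only.
users = [
    {"visited_pricing_page": True, "bought": True},
    {"visited_pricing_page": True, "bought": False},
    {"visited_pricing_page": True, "bought": True},
    {"visited_pricing_page": True, "bought": True},
    {"visited_pricing_page": False, "bought": False},
    {"visited_pricing_page": False, "bought": False},
    {"visited_pricing_page": False, "bought": True},
    {"visited_pricing_page": False, "bought": False},
]

likelihood = purchase_likelihood(users, "visited_pricing_page")
for value, p in sorted(likelihood.items()):
    print(f"visited_pricing_page={value}: estimated purchase likelihood {p:.2f}")
```

The point of the sketch is the shape of the question: each row is one user, and the answer is a probability per user rather than a total sales figure.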
Feature Engineering: Getting More Information Out of a Data Set
Feature engineering and feature selection are important parts of any ML task. Even with sophisticated estimation algorithms and inexpensive computing capabilities, data scientists can play an important role in creating a model that is both accurate and efficient. Both time and energy should be spent looking over the data itself to try and identify additional information that may be hiding in the features already included.
For example, the difference between two values (say, the length of time between today’s date and a customer’s most recent transaction) might matter more to predictive accuracy than either of the values themselves. Feature engineering is therefore a combination of subject matter expertise and general intuition: skilled feature engineers can pull the maximum amount of useful information out of a given set of input data, giving an ML model the most informative data set possible to work with.
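The time-since-last-transaction example can be sketched in a few lines. The customer records, the field names and the reference date below are assumptions made up for illustration; the idea is simply that a derived difference becomes a new feature alongside the raw values.

```python
from datetime import date

def days_since_last_purchase(last_purchase, as_of):
    """Derived feature: number of days between two raw date values."""
    return (as_of - last_purchase).days

# Hypothetical customer records; only the raw purchase date is stored.
customers = [
    {"id": "a", "last_purchase": date(2024, 1, 1)},
    {"id": "b", "last_purchase": date(2024, 3, 1)},
]

# Assumed reference date for the analysis.
as_of = date(2024, 3, 31)

for customer in customers:
    customer["days_since_last_purchase"] = days_since_last_purchase(
        customer["last_purchase"], as_of
    )
    print(customer["id"], customer["days_since_last_purchase"])
```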
Logistics: Budgeting Computational Resources
Few things are more frustrating than spending hours preparing a data set only to hit an ‘out of memory’ error when trying to build the finished model. Budgeting computational resources for ML estimation can be tricky: over-budgeting can waste money, but under-budgeting can cause bottlenecks in construction or deployment.
However, cloud computing has taken dramatic steps towards making computational pipelines more expandable. Using a system like Amazon Web Services allows for the deployment of larger virtual machines (or a greater number of machines, if working in parallel) at relatively low cost and high speed. This type of elastic-computing framework makes it easier to budget appropriately when setting up an ML system, especially when working with large data sets.
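Before choosing a machine size, a back-of-the-envelope memory estimate can help. The sketch below assumes a purely numeric table stored at 8 bytes per value, and the overhead multiplier of 3 is a placeholder assumption for the working copies made during preparation and fitting, not a measured figure. Treat the result as a starting point to verify against real runs.

```python
def estimate_memory_gb(n_rows, n_columns, bytes_per_value=8, overhead_factor=3.0):
    """Rough memory estimate for a numeric table, in GiB.

    overhead_factor is an assumed multiplier for intermediate copies;
    tune it to what you actually observe on your own workloads.
    """
    raw_bytes = n_rows * n_columns * bytes_per_value
    return raw_bytes * overhead_factor / 1024**3

# Hypothetical data set: 50 million rows and 40 numeric columns.
needed_gb = estimate_memory_gb(50_000_000, 40)
print(f"Plan for roughly {needed_gb:.1f} GiB of memory")
```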
Generalizability: Conflation of Training and Testing Data Sets
Particularly for those just getting into data science, separating training data from testing data can be an easy step to miss. ML models are built for estimation: their purpose is to take in new data and generate values that can be used to guide future decisions. Because of this, it’s crucial to separate ‘training’ data that is used to fit an original ML model from ‘testing’ data that is used to assess the model’s accuracy. Failure to do some type of out-of-sample testing can result in a model that looks fantastic in terms of accuracy and fit statistics, but fails miserably when faced with new, unfamiliar data.
Generalizability is key to creating usable long-term ML solutions. As such, models need to be tested on independent, out-of-sample data before being put into regular use. A solid rule of thumb is to hold back 20-25 percent of the original data set: this is testing data, and should be kept entirely separate from the 75-80 percent of data used to build the ML model itself.
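The hold-out rule of thumb can be sketched as a simple shuffled split. The function name, the fixed seed and the 100 stand-in rows are assumptions for illustration; many ML libraries ship their own splitting utilities, and data with a time ordering usually calls for a split by date rather than a random shuffle.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold back a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows are held out; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data set of 100 rows, represented here by their row numbers.
rows = list(range(100))
train, test = train_test_split(rows, test_fraction=0.2, seed=42)
print(len(train), "training rows;", len(test), "testing rows")
```

Once the split is made, the testing rows should not be touched again until the model is ready to be assessed.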
Focus: Algorithm Choice
There’s a huge range of algorithms available for ML problem solving. Random forests, support vector machines, neural networks, Bayesian estimation methods – the list goes on. However, the question of what algorithm is best for a given ML problem is often less impactful than you’d think. While some approaches work better than others for certain questions, it’s rare that one modeling approach will dominate all others for answering a given question.
A useful middle ground for selecting an algorithm is to assemble a group of robust modeling approaches that can be run quickly and easily for day-to-day use. Running a battery of models on a given data set allows a data scientist to pick whatever approach has the greatest marginal gain on that particular data set. However, going far afield for exotic new algorithms, or adopting different programming languages, is rarely necessary or even worth the time.
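A battery of models can be as simple as a dictionary of candidates scored on the same held-out rows. The three toy models below (a majority-class baseline, a one-feature threshold rule and a nearest-neighbor lookup) and the tiny data set are stand-ins invented for illustration; in practice the candidates would be the robust approaches your team already uses. If the held-out rows are used to choose the winner, a further untouched set is needed for the final accuracy estimate.

```python
from collections import Counter

def fit_majority(train):
    """Baseline: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def fit_threshold(train):
    """One-feature rule: predict 1 when x is at or above a learned cutoff."""
    def training_hits(cutoff):
        return sum(int(x >= cutoff) == y for x, y in train)
    best_cutoff = max((x for x, _ in train), key=training_hits)
    return lambda x: int(x >= best_cutoff)

def fit_nearest_neighbor(train):
    """Predict the label of the closest training point."""
    return lambda x: min(train, key=lambda row: abs(row[0] - x))[1]

def accuracy(predict, rows):
    return sum(predict(x) == y for x, y in rows) / len(rows)

# Hypothetical (feature, label) pairs, already split into train and held-out sets.
train = [(1, 0), (2, 0), (3, 0), (4, 1), (6, 1), (7, 1), (8, 1), (5, 0)]
held_out = [(2.5, 0), (4.4, 1), (7.5, 1), (1.5, 0)]

candidates = {
    "majority": fit_majority,
    "threshold": fit_threshold,
    "nearest_neighbor": fit_nearest_neighbor,
}

# Fit every candidate on the same training rows, then score on held-out rows.
scores = {name: accuracy(fit(train), held_out) for name, fit in candidates.items()}
best_name = max(scores, key=scores.get)
print(scores, "-> best on held-out data:", best_name)
```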
Many companies that are looking towards big data or dealing with big data issues may benefit from a streamlined data process; through proper planning and asking the right questions, a company can run its data projects far more efficiently.
Amir Behbehani is the founder of Serial Metrics.