Every analytics project requires data to be sourced, transmitted, organized, and analyzed. This work may not be the largest part of a project, but it is often the most painful. In fact, one of the greatest challenges we see companies face is implementing and maintaining a sustainable, successful data collection process.
Our work across clients and industries has shed light on the three steps that lead to successful data collection.
The Three Steps:
Step 1: Collection
Data collection is often misunderstood as the step where one collects all the data and dumps it into a data lake. In fact, if data is collected the wrong way (and there is a wrong way), it may need to be collected over and over again. To succeed in this step, one must start with a strategy that addresses the kinds of business and operational decisions the data will support.
After these larger goals have been identified, one can work on identifying the data needed to support them. Once collection begins, small organizational tweaks, like providing data owners with templates and unifying data sources electronically, can help get the data into its best shape for Step 2.
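To make the template idea concrete, here is a minimal sketch of how a submission from a data owner might be checked against an agreed template before it moves on. The column names and the `check_against_template` helper are hypothetical, chosen only for illustration; a real template would come out of the goals identified above.

```python
import csv
import io

# Hypothetical template: the columns every data owner is asked to supply.
TEMPLATE_COLUMNS = ["region", "week_start", "units_sold", "revenue"]


def check_against_template(csv_text, template_columns=TEMPLATE_COLUMNS):
    """Return a list of problems; an empty list means the file matches the template."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = []

    # Compare the submitted header with the template.
    missing = [c for c in template_columns if c not in header]
    extra = [c for c in header if c not in template_columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")

    # Flag blank values in any template column that was supplied.
    for row_no, row in enumerate(reader, start=1):
        blanks = [c for c in template_columns
                  if c in row and not (row[c] or "").strip()]
        if blanks:
            problems.append(f"data row {row_no}: blank values in {blanks}")
    return problems
```

Running a check like this at hand-off, rather than weeks later during analysis, is one way to avoid having to collect the same data twice.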
Step 2: Processing
To extract something meaningful from your data, you must map it in a way that allows you to analyze, compare, and contrast it alongside other data sets. This is where processing comes into play.
Processing is a complicated skill that draws on several others, such as technical expertise, mathematical and statistical knowledge, and data and computer science. To build good data sets you need the right skills plus a detailed understanding of how the data is going to be used, so that you can collect and blend it in an efficient, effective, and repeatable way. This is not always straightforward.
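As a small illustration of mapping and blending, the sketch below renames the columns of two sources to a shared schema and then joins them on common keys. The source names, column mappings, and sample rows are all invented for the example; the point is only that the mapping is written down once and can be rerun on every refresh.

```python
# Hypothetical column mappings: each source names the same fields differently.
SALES_MAP = {"Store ID": "store_id", "Wk": "week", "Sales ($)": "revenue"}
TRAFFIC_MAP = {"store": "store_id", "week_no": "week", "visits": "visitors"}


def standardize(rows, column_map):
    """Rename source-specific columns to the shared schema and drop the rest."""
    return [{new: row[old] for old, new in column_map.items()} for row in rows]


def blend(left, right, keys=("store_id", "week")):
    """Inner-join two standardized data sets on the shared keys."""
    index = {tuple(r[k] for k in keys): r for r in right}
    blended = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:
            blended.append({**row, **match})
    return blended


# Made-up sample rows, for illustration only.
sales = [
    {"Store ID": "S1", "Wk": 1, "Sales ($)": 1200.0, "Notes": "promo"},
    {"Store ID": "S2", "Wk": 1, "Sales ($)": 800.0, "Notes": ""},
]
traffic = [{"store": "S1", "week_no": 1, "visits": 300}]

combined = blend(standardize(sales, SALES_MAP), standardize(traffic, TRAFFIC_MAP))
```

Because this is an inner join, store S2 drops out of `combined` (it has no traffic record). Whether unmatched rows should be dropped or kept is exactly the kind of decision that depends on how the data will be used.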
Step 3: Validation
Just because you’ve collected and processed your data doesn’t mean it’s an accurate representation of what actually happened. Validation is the crucial step that verifies the accuracy of the data set. We often see breakdowns during this step because the people who work with the data are not necessarily the people equipped to validate that it’s correct.
The companies that succeed in this step do so by creating a standardized outreach process that gives subject matter experts everything they need to validate the data correctly. This includes, but is not limited to, a description of the data, what it’s designed for, and what it’s capturing, along with known anomalies, cross-validated sub-totals, checks against business logic, and general trends over time or across cross-sections. Finishing the process with a formal, verified sign-off from the subject matter expert helps uphold accountability and allows any future mistakes to be traced back to their source.
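Two of the checks mentioned above, cross-validating sub-totals and checking against business logic, lend themselves to automation. The following is a rough sketch under assumed field names (`region`, `units_sold`, `revenue`) and assumed rules; the actual rules would need to come from the subject matter expert who signs off on the data.

```python
def validate(rows, reported_total, tolerance=0.01):
    """Run simple checks and return findings for a subject matter expert to review."""
    findings = []

    # Cross-validate sub-totals: regional revenue should add up to the reported total.
    computed = sum(r["revenue"] for r in rows)
    if abs(computed - reported_total) > tolerance:
        findings.append(
            f"sub-totals sum to {computed:.2f}, reported total is {reported_total:.2f}"
        )

    # Business logic (assumed rules): no negative values, no revenue without units.
    for r in rows:
        if r["revenue"] < 0 or r["units_sold"] < 0:
            findings.append(f"{r['region']}: negative value")
        if r["units_sold"] == 0 and r["revenue"] > 0:
            findings.append(f"{r['region']}: revenue recorded with zero units sold")
    return findings
```

An empty list of findings does not replace the sign-off; it simply gives the reviewer a shorter, more focused list of questions to answer.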
Implementing a consistent and responsible data collection process using the three steps above takes hard work, but when done correctly it can benefit your company immensely. Quicker model refreshes, ongoing forecasting and model diagnostics, and other ad hoc analyses are just a sample of what you can do effectively once your collection process is refined.