It is imperative to understand what it means to go “Data-Driven” and why and how one should be careful about it. In a series of blog posts, I’d like to summarize some of the learnings I had while navigating the “Data-Driven” quicksand: a wonderland where faulty analysis can lead you to misleading and dangerous interpretations of how the world works, while the right analysis can provide scalable solutions to tough problems.
First, I’d like to share my take on what solving a problem with data is all about.
One can come up with an innovative solution to a problem through intuition, learning, and creativity. This constitutes the “research” component of a solution, and it need not necessarily follow a “process”. However, innovative solutions should also adhere to the cornerstones of scientific thinking, such as repeatability, testability, and a certain degree of objectivity, since these bring in robustness and clarity. This can be achieved by adhering to a set of standard practices (without abusing them!). Therefore, a certain degree of process is required, although it brings in rigidity and eliminates (correctly or incorrectly) part of the solution space in favour of increasing the “average” likelihood of the solution being reliable and replicable. Finally, another constraint on the solution is how viable it is to “implement” or “instantiate” using engineering resources; this is essential to actually solve the problem, at scale or otherwise. Striking the right mix of these three aspects, namely research, process, and engineering, yields the “best” solution for a given problem.
Now, before we delve into all the things that can go wrong, let’s understand the basic workflow for solving a “data-driven” problem:
1. Broad Business Problem
Pain points can be identified based on discussions or experience with customers. After ascertaining that the affected customer segment is large enough for a solution to make a tangible impact on the company’s offering, the problem is handed over to the technical folks to find a solution. At this step, the description often consists of:
(a) The context of when a problem occurs
(b) The emotions accompanying the problem
(c) The ideal effects of the outcome of a possible solution in qualitative ways – akin to a rainbow accompanying the rainfall ;), and
(d) Allusions to the amount of revenue it can generate
2. Problem Definition
Once the business problem has been handed over, the tech team mulls over it and defines it in more concrete terms. The objective is to move from a “subjective description” of the problem, coloured by emotions, to a dispassionate description of facts, and to study whether it is “solvable” under the given conditions of data availability, ethics and legality, and resources – hardware, software, and human. If it is prima facie solvable, the expected outcome is also defined in detail.
The outcome of this step would be:
(a) a short technical document of the facts pertaining to the problem.
(b) metrics that can be used to measure a “solution” for the aforementioned pain points.
(c) a definition of the outcomes of the “solution” that is measurable with the aforementioned metrics, and lastly,
(d) an enumeration of various data sources and resources – financial, human, hardware, and software – for investigating this further.
3. Experiment Design
Once basic due diligence has been done to establish the prima facie viability of the problem, one needs to enumerate various hypotheses, define processes to collect data to prove or refute them, and also do a detailed literature survey on the topic. “Process documents” for collecting data, road maps for experiments, and resource allocation all have to happen at this step. There are just too many things that can go wrong here — this is part of another post, when we actually take up the topic of “Anti-Patterns”.
4. Data Collection
Solving a problem with data means that we first have the data! So, once we have an adequately well-defined problem, the most important task is to collect data from various sources that can help in the following ways:
(a) Help define the viability of the problem better, by providing “ground truth” describing the outcome (constituting the “dependent” or “target” variables).
(b) Be indirectly predictive of an outcome or a phenomenon to some extent, while being easily collectible.
(c) Capture the “input space” that forms the starting basis for studying a phenomenon.
(b) and (c) constitute the “independent variables” in standard statistical parlance and “input space” in machine learning.
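To make the distinction concrete, here is a minimal sketch in Python of organising collected records into the “input space” and the “ground truth”, using a hypothetical customer-churn example (the field names and data are invented for illustration):

```python
# Hypothetical collected records: two easily collectible signals
# (sessions per week, support tickets filed) plus the observed outcome.
records = [
    # (sessions_per_week, support_tickets, churned)
    (5, 0, False),
    (1, 3, True),
    (4, 1, False),
    (0, 5, True),
]

# (b) and (c): the independent variables, i.e. the "input space"
X = [(sessions, tickets) for sessions, tickets, _ in records]

# (a): the dependent (target) variable, i.e. the "ground truth"
y = [churned for _, _, churned in records]
```

However the data is stored in practice, this separation into input space and target is what the later analysis and modelling steps operate on.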
5. Analyzing the data and devising the solution
Now, the solution can take one of two forms:
5.1. Analytical: where a small number of hypotheses, derived deductively from theories or inductively from prior experiments, are checked against the collected data using standard statistical techniques and consequently interpreted to drive a decision; that decision forms the solution.
5.2. Predictive/Learning-Based: where an “artificially intelligent” approach is used to automatically sift through many hypotheses to arrive at the best one that can predict the outcome from the “input space”; this is often done through statistical machine learning or expert systems.
Please note that I will largely ignore qualitative research methods here, both for want of space and to simplify the discussion, even though, when dealing with analytical solutions, they are just as good as, and sometimes better than, quantitative research techniques.
5.1. Defining an Analytical Solution
In this route, standard statistical techniques such as descriptive statistics, correlations, linear regression, and statistical hypothesis testing are used to prove or refute hypotheses. The interpretations of the theory and of the “valid” hypotheses are used to power decisions. This inevitably changes the “state” of the system, and the analysis cycle repeats. The main steps are:
(a) Establishing a set of hypotheses to be validated.
(b) Conducting experiments and extracting quantitative data that can be used with standard techniques like statistical hypothesis testing.
(c) Drawing inferences or conclusions: based on the outcomes of the statistical techniques, deciding which hypotheses hold water and, in turn, how the “world of interest” works. This leads to the solution to the problem.
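The steps above can be sketched in a few lines of Python. This is a purely illustrative example with invented numbers: we hypothesise that a treatment improves some measured quantity, and compute Welch’s t-statistic for two independent samples using only the standard library. (In practice one would use a proper statistical package and the t-distribution’s critical value for the degrees of freedom, rather than the rough |t| > 2 rule of thumb used here.)

```python
from statistics import mean, variance

# Hypothetical experiment: does variant B improve session length over A?
a = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]   # control group (minutes)
b = [10.9, 11.2, 10.8, 11.0, 11.1, 10.7] # treatment group (minutes)

# Welch's t-statistic: difference of means over its standard error
na, nb = len(a), len(b)
se = (variance(a) / na + variance(b) / nb) ** 0.5
t = (mean(b) - mean(a)) / se

# Step (c): decide whether the null hypothesis ("no difference") survives.
# A rough rule of thumb in place of a proper critical value lookup.
reject_null = abs(t) > 2.0
```

If `reject_null` is true, we conclude (with the usual caveats about significance thresholds) that the data supports the hypothesis, and that conclusion feeds the decision.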
5.2 Defining a Predictive/Learning-based solution
Predictive or Learning-based solutions are based on the premise that complicated mathematical functions or a combination of a large set of logical rules can be used to predict an outcome. Further, the mapping from the “input space” to the “space of outcomes” is generally considered to be approximate and therefore is “learned” iteratively either through inferences (in the case of expert systems) or through stochastic optimization (in statistical machine learning systems).
Such an approach aims at arriving at the right choice of the following (and iterating along those lines):
(a) Representation of the Model: Which describes the constraints around the mapping between the input space and the outcomes. This could, for example, be SVMs, Logistic Regression, Linear Regression, etc.
(b) An evaluation metric: Which is a quantitative metric that measures how well the model performs and how the outcomes ought to be assessed for quality; examples of this include precision, recall, accuracy, nDCG, etc., and
(c) The optimization: Which is the process for arriving at the right “parameters” inside the “Representation of the Model”; this could be stochastic gradient descent, the conjugate gradient method, quadratic programming, etc.
Therefore, the process also entails considerable programming effort; depending on the scale of the problem, it may require tangible engineering effort as well.
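To make the three components concrete, here is a minimal, self-contained sketch in Python on invented toy data: the representation is logistic regression, the evaluation metric is accuracy, and the optimization is stochastic gradient descent. It is illustrative only; in practice one would use a library such as scikit-learn rather than hand-rolled SGD.

```python
import math
import random

random.seed(0)

# Toy data: the true label is 1 when x1 + x2 > 1.0, else 0.
X = [(random.random(), random.random()) for _ in range(200)]
y = [1 if x1 + x2 > 1.0 else 0 for x1, x2 in X]

# (a) Representation: logistic regression, p = sigmoid(w.x + b)
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b = [0.0, 0.0], 0.0

# (c) Optimization: stochastic gradient descent on the log loss
lr = 0.5
for epoch in range(50):
    for (x1, x2), target in zip(X, y):
        p = sigmoid(w[0] * x1 + w[1] * x2 + b)
        err = p - target          # gradient of log loss w.r.t. the logit
        w[0] -= lr * err * x1
        w[1] -= lr * err * x2
        b -= lr * err

# (b) Evaluation metric: accuracy of the learned model
preds = [1 if sigmoid(w[0] * x1 + w[1] * x2 + b) > 0.5 else 0
         for x1, x2 in X]
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
```

Iterating on a real problem typically means revisiting all three choices: swapping the representation, refining the metric to better reflect the business outcome, and tuning the optimization.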
A good reference describing the above abstraction is Domingos, 2012.
Now that we have understood the general process followed while arriving at a data-driven solution to a problem, in my next post we shall investigate, with examples, various pitfalls that one needs to be careful about while devising a solution.
Domingos, Pedro. “A few useful things to know about machine learning.” Communications of the ACM 55.10 (2012): 78-87.