Navigating the Data-Driven Quicksand – 2

This is the second part of a series of posts covering anti-patterns in Data Science. Read the first part here.

In this post, I’ll cover the first set of anti-patterns that one should avoid, or use with utmost care. The discussion will be split along the lines of the various blocks described in the first post, namely:

  1. Problem definition
  2. Experiment Design and Data Collection
  3. Model Building
  4. Inference and Interpretation, and finally,
  5. Engineering

0.0 On picking meaningful subjects

Even before we get into “Problem definition”, a very common trend I’ve noticed is that discussions with folks in the industry tend to revolve around subjects that are, in reality, offshoots of already existing literature. The figure below summarizes this. I’m not apathetic or averse to the subjects at the bottom of this hierarchy – just doubly careful about what occurs there. If one wants to solve problems involving data, signals, and information, it helps to build a strong foundation in subjects that are grounded in academic pursuit rather than in ones that are offshoots of marketing meant to gain credibility, popularity, or funding. I’ve heard that even academia is strongly coloured these days by short-term financial interests, but I have always felt that it is still to be held in high regard, as the objective is to pursue knowledge for its own sake, and the years of learning always have a lot of value.

[Figure: Meaningful definitions, or the lack thereof]

0.1 Tools != knowledge: a tool is only as good as the one who uses it…

Writing code and knowing scikit-learn or Theano means that you know a tool. That isn’t sufficient. It always pays off to know what you’re doing – so a strong focus on theory is just as essential as focusing on tools and software engineering. I have worked in teams and conversed with folks who separate out the “engineering” and “science” bits of work. While that may be required to scale up work and to get a “team” involved, it is not an excuse for one side to not know about the other. As described in the previous post, it is the “sweet spot” we are trying to identify: between research, process, and engineering. Everybody has to have some knowledge of each (although in different mixes) and not sweep questions under the carpet because they are not their “domain”.

Now, let’s go through a few anti-patterns categorized by the blocks described in the process diagram in my previous post; I’m pasting the diagram below, again, for easy reference.

[Figure: The Data Science process diagram]

1. Anti-patterns in Problem Definition

a. Assuming the biz guy has gotten it right

It is not uncommon for an organization with a strong sales force, a marketing team, or a non-technical DNA to frame the problem for the technical guy to solve. Often, the biz folks are assumed to have a good understanding of the market and its needs, and therefore to know what problems are to be solved and in what order.

This could not be further from the truth. Yes, the data procured from valuable conversations with peers, customers, and vendors is a very vital source of information. However, its interpretation should be done by the biz and tech folks together. There are real situations where a sales team comes up with “wacky” problem statements: a client had once asked for access to a customer segment containing women who have had their shower, are at their make-up desks, and are between 20 and 30 years old. Now, I’ll have to have someone peek into a lot of houses to validate my segment! 🙂

Another time, I was told to identify, on social media and in an “online manner”, all customers of a company that dealt with a very specialized product in a small country. Yes, that is valuable. But is the effort worth it? Couldn’t we just call them? One has to be open to such suggestions and to dropping the “data-driven” approach altogether if a more commonsensical alternative exists. A lot of the time this problem occurs when the sales team has grossly oversold an offering.

b. Relying on Anecdotal Evidence (aka drawing numbers out of the hat)

One can lie, mislead, or be misled by relying on anecdotal evidence that is derived from sloppy experiments and off-hand hacks.

For instance, if a variable has an impact on the performance of a module, instead of randomly trying out a few values for it and reporting observations, it is always better to document the assumptions and context, vary the variable in a meaningful manner, and tabulate the results before drawing conclusions (see the sketch at the end of this subsection).

Customer segments in the DMP industry are another good example of this problem (a related question I’d asked on Quora): by attempting to identify customer segments that can never be validated directly, nobody knows what’s going on, but everybody claims to have some “sense” of it.

Developing a discipline of conducting carefully defined experiments to analyse the variables you’d like to use is key.
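
As a small illustration, a disciplined version of the “try a few values” exercise might look like the sketch below. The evaluate_module function, the variable names, and the context fields are all hypothetical placeholders; the point is that the assumptions, the context, and every configuration tried end up in one table rather than in scattered anecdotes.

```python
import itertools
import pandas as pd

# Context and assumptions recorded up front; all names and values here are hypothetical.
CONTEXT = {
    "dataset": "2015-Q3 sample, n = 10,000",
    "random_seed": 42,
    "metric": "validation accuracy",
}

def evaluate_module(threshold, window_size):
    """Hypothetical stand-in for the module whose performance depends on these variables."""
    return 0.7 + 0.01 * window_size - 0.1 * abs(threshold - 0.5)

# Vary the variables over a meaningful grid instead of poking at random values.
records = []
for threshold, window_size in itertools.product([0.3, 0.5, 0.7], [5, 10, 20]):
    records.append({
        "threshold": threshold,
        "window_size": window_size,
        "score": evaluate_module(threshold, window_size),
    })

# Tabulate before drawing conclusions, keeping the context next to the numbers.
results = pd.DataFrame(records)
print(CONTEXT)
print(results.pivot(index="threshold", columns="window_size", values="score"))
```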

c. Using “Arbitrary” Standards

A lot of the time, in deciding how good or bad we’re doing, we set standards for ourselves and for the rest of the world. Under practical conditions, one may need to set a threshold on some quantitative variable (like a minimum TOEFL or SAT score). However, a culture of relying on standards set by somebody else is also detrimental to understanding a phenomenon holistically. For example, in one of my previous posts, I had mentioned that an arbitrary threshold is set on the “p-value” to determine statistical significance. So, p=0.009 is significant, but p=0.011 is not. And a whole bunch of inferences are drawn based on such standards.

A better way to deal with this would be to state the standard used, factor the uncertainty into the inferences, and transparently display how a data point stands with respect to other data points for the given metric. “Subjective” decisions like “standards” should be a part of “interpretation” and not of “problem definition” or “experimentation”.
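
As a minimal sketch of this (with synthetic data, not from any real experiment), one can report the estimate, its uncertainty, and the p-value side by side, rather than a binary verdict at an arbitrary cut-off:

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for two groups being compared.
rng = np.random.default_rng(0)
a = rng.normal(loc=0.50, scale=0.10, size=200)
b = rng.normal(loc=0.52, scale=0.10, size=200)

t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)

# Report the estimate and its uncertainty, not just "significant or not".
diff = b.mean() - a.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"difference = {diff:.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}], p = {p_value:.3f}")
```

The reader can then see how the result stands relative to whatever threshold they care about, instead of inheriting a yes/no decision.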

d. Incorrect “Comparisons” (aka comparing apples with oranges)

This one is hard to detect. A lot of the time we compare outcomes that have occurred in different contexts or under different assumptions. For example, comparing the verbal ability of people across languages may not be fair because one language is more complicated or nuanced than the other. One has to be observant and catch this fallacy early on. A way around the issue is to keep track of the assumptions and contexts of any phenomenon we study.
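
A minimal sketch of that bookkeeping, with purely illustrative field names and values, is to make the context part of every recorded outcome, so that an incompatible comparison fails loudly instead of slipping through:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Context:
    population: str   # who or what the outcome was measured on
    metric: str       # how it was measured
    period: str       # when it was measured

@dataclass(frozen=True)
class Outcome:
    value: float
    context: Context

def compare(x: Outcome, y: Outcome) -> float:
    """Only compare outcomes that were measured under the same context."""
    if x.context != y.context:
        raise ValueError(f"Comparing apples with oranges: {x.context} vs {y.context}")
    return x.value - y.value

a = Outcome(0.81, Context("English speakers", "verbal test v2", "2015"))
b = Outcome(0.77, Context("French speakers", "verbal test v2", "2015"))

try:
    compare(a, b)
except ValueError as err:
    print(err)   # the mismatched populations are flagged instead of silently compared
```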

e. Assuming the outcomes of previous experiments will hold

Scientific research relies on building upon the foundations laid down by previous work. While this is essential for progress, one has to be aware of, and contemplate, the methodological limitations of previous work and how they might affect the interpretations made. For example, ignoring results that were previously statistically insignificant could lead to incorrectly undermining a result that might show up in future work.

Overcoming this means spending time understanding the methodological limitations of experiments, being bold and honest about accepting the outcomes of one’s own results, and having an open mind. Keeping a list of good journals/conferences with strong reviewing processes and reading several works of a researcher to understand their experimental rigour are steps that can also be of practical help.

f. Non-compliance with law and ethics

A lot of the time, we just want to get things done – without the hassle of going through painful paperwork or asking tough questions of ourselves. This is a case of the ends justifying the means.

Non-compliance with the law is dealt with seriously in larger organizations that have clear policies on what can and cannot be done. But in general, it always pays to be cognisant of the legal framework and one’s obligations.

Additionally, although what they are is debatable, my personal take is that the problem solver (I’d like to avoid “industrial titles” like researcher or engineer) should be very interested in perfecting their moral compass. This helps one do work that does not harm others, and it also brings a sense of self-satisfaction and pride in what one does. It is often ignored in the aforementioned “let’s get things done” culture, which in all probability will not yield good long-term consequences for the individual, the organization, or society.

In my next post, I will take up more specific points around experiment design and model building.
