Navigating the Data Driven Quicksand – 4

This is the fourth and final part of a series of posts on “Anti-patterns” in data science. View the previous posts here: 1 2 3

In this (final) post of the series, I will cover a few points to be careful about while building models and drawing inferences and interpretations based on such models.

Model Building and Engineering

1. Incorrect hold-out/training/test data splits and not using cross-validation

Data available for “learning” a model is often split into training, hold-out, and test datasets before the final model is built, as part of a step known as model selection. The three sets are usually mutually exclusive and together partition the entire dataset available. The training data is used to “learn” the model, the hold-out data to tune its free parameters, and the test data to evaluate the final model. When little data is available and/or no parameter tuning is required, a separate hold-out set is often skipped. The hold-out set is also sometimes referred to in the literature as the “validation” set.

This process of splitting the data, possibly multiple times, is known as cross-validation. Several cross-validation schemes exist, and it is up to the modeller to select one suited to the problem.

While every cross-validation scheme ensures that the partitions do not overlap, it is equally important that each partition is used only for its intended purpose. Misuse occurs, for example, when we skip having a validation set and tune the parameters on the test set instead; the model’s performance on the test set will then likely be inflated relative to a “real” scenario. Another example is studies that assume the presence of some “oracle” that magically provides statistics about the very set the model is being tested against.

Also, principled cross-validation for tuning models and benchmarking performance to pick the “best” model is an absolute prerequisite to getting a good solution. A single split into training/hold-out/test sets is not sufficient to indicate the stability of a model.

Lastly, although uncommon, new “students” of data science sometimes rely on the performance of a model on the training set – this is a strict no-no: a model can always over-fit the training set (some non-linear ones will even give near-perfect performance, if so desired)!
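To make this concrete, here is a minimal sketch of a principled split using scikit-learn: hyper-parameters are tuned by cross-validation on the training portion only, and the test set is touched exactly once for the final estimate. The dataset and hyper-parameter grid are illustrative placeholders, not a recommendation.

```python
# A minimal sketch of principled model selection: tune on the training data
# via cross-validation, report on a test set that played no role in tuning.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Carve out a test set that plays no role in tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 5-fold cross-validation on the training data serves as the "hold-out" step.
search = GridSearchCV(
    SVC(), param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=5)
search.fit(X_train, y_train)

print("Cross-validated score of best model:", search.best_score_)
print("Test-set score (reported once):", search.score(X_test, y_test))
```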

2. Assuming the training population distribution holds for the test set

One has to be aware that a model’s predictive performance can depend strongly on the distribution of classes in the training population. If the same distribution does not hold in the test set, performance can drop drastically. To alleviate this problem, re-training the model and re-computing population estimates become essential.

There are a few interesting ideas around estimating the population distribution without running full-blown classification, such as the work of King et al. Do take a look at them.
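As a rough illustration of why the training-time class distribution matters, the sketch below rescales a classifier’s predicted probabilities by the ratio of test-time to training-time class priors and renormalises. The priors and probabilities are made-up numbers, and the sketch assumes the test-time priors are known or estimated separately; it is not the estimation method referenced above.

```python
import numpy as np

def adjust_for_priors(proba, train_priors, test_priors):
    """Rescale predicted class probabilities when the class distribution
    at test time differs from the one seen during training.
    proba: (n_samples, n_classes) posteriors from a trained classifier."""
    w = np.asarray(test_priors) / np.asarray(train_priors)
    adjusted = proba * w  # reweight each class column by the prior ratio
    return adjusted / adjusted.sum(axis=1, keepdims=True)  # renormalise rows

# Toy example: a model trained on a 50/50 class split, deployed where the
# positive class is actually rare (10%).
proba = np.array([[0.3, 0.7],
                  [0.8, 0.2]])
print(adjust_for_priors(proba, train_priors=[0.5, 0.5], test_priors=[0.9, 0.1]))
```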

3. Making up for lack of data with complex models / over-valuing model representation

There are three parts to building any model:

  1. Defining the evaluation metric, which is closely tied to how we are going to measure the outcomes of solving the problem (as defined in the previous post).
  2. Choosing the model representation, which is essentially the mathematical function that maps the input to the output, and
  3. The optimization process: a rule-based or mathematical procedure for arriving at a solution to the problem that is “optimal” in some sense. Although previous work [Domingos’12] treats this part as consisting only of routines like stochastic gradient descent, conjugate gradient descent, quadratic programming, etc., I will go a level higher and include the experimental procedure used for model selection: the cross-validation scheme, the feature-selection process, and any other meta-optimization procedure such as hyper-parameter optimization, annealing, etc.

All three parts define a solution in equal measure. Simply compensating for a lack of data with a complicated model representation may not yield good performance on the test set if the process generating the data does not fit the model’s assumptions well. Blind reliance on “non-linear” methods with over-complicated functions is therefore not a solution to a lack of data.
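A small hypothetical experiment makes the point: on a tiny, noisy, roughly linear dataset, a high-degree polynomial fit can look excellent on the training points while doing worse than a simple linear fit under cross-validation. The data-generating function below is an assumption for illustration; the exact numbers will vary, but the gap between training and cross-validated scores is the pattern to watch for.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(20, 1))                    # only 20 points
y = 0.5 * X.ravel() + rng.normal(scale=1.0, size=20)    # roughly linear + noise

for degree in (1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    train_r2 = model.fit(X, y).score(X, y)              # score on training data
    cv_r2 = cross_val_score(model, X, y, cv=5).mean()   # cross-validated score
    print(f"degree={degree:2d}  train R^2={train_r2:.2f}  CV R^2={cv_r2:.2f}")
```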

4. Ignoring training and test-time complexity

While choosing a model for a task, it is also essential to keep the “engineering” and “usage” context in mind, so that we choose models that can scale to the desired size. To see what I mean, consider a few simplistic cases: if memory is highly constrained, parametric methods may be preferable; extra training time may be an acceptable trade-off if the model offers, on average, lower computational cost when predicting the class of a data point (an argument put forth in favour of relevance vector machines over support vector machines: the former requires more training time but is claimed to yield greater sparsity in its choice of “representative vectors”). This design choice is also affected by whether real-time processing is necessary, the computational resources available in a given context (such as wearables), and so on.
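A crude way to make this trade-off visible is to time fitting and prediction separately. The models compared below (a kernel SVM versus logistic regression) are just convenient stand-ins for illustration, not the RVM/SVM pair mentioned above, and the dataset is synthetic.

```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)

for model in (SVC(kernel="rbf"), LogisticRegression(max_iter=1000)):
    t0 = time.perf_counter()
    model.fit(X, y)                 # training cost
    t1 = time.perf_counter()
    model.predict(X)                # prediction cost
    t2 = time.perf_counter()
    print(f"{type(model).__name__:>18}: fit {t1 - t0:.2f}s, predict {t2 - t1:.2f}s")
```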

5. Hard-coding parameters, labels, and models

This one has to do with implementing, or engineering, a pipeline. Data science frameworks should always be written so that parameters, labels, and models are passed in as configuration rather than hard-coded, so that each of them can be replaced with different values or representations without modifying the code. All three aspects of model building described in point 3 will change over the course of “getting it right”, while the core experimental framework remains fixed; the best engineering value is therefore derived by parametrizing these three parts properly.
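One hypothetical way to keep the model, its parameters, and the labels out of the code itself is to drive the pipeline from a configuration object. The keys and the small model registry below are an illustrative convention of my own, not a standard API; in practice the configuration would typically live in an external YAML/JSON file.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical configuration: swap the model, its hyper-parameters, or the
# label column by editing only this dict (or the external file it came from).
CONFIG = {
    "label_column": "churned",
    "model": "random_forest",
    "model_params": {"n_estimators": 200, "max_depth": 5},
}

MODEL_REGISTRY = {
    "logistic_regression": LogisticRegression,
    "random_forest": RandomForestClassifier,
}

def build_pipeline(config):
    """Assemble the estimator purely from configuration values."""
    model_cls = MODEL_REGISTRY[config["model"]]
    return make_pipeline(StandardScaler(), model_cls(**config["model_params"]))

pipeline = build_pipeline(CONFIG)
print(pipeline)
```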

Inference and Interpretation

1. Using statistical significance incorrectly

Discussed in detail in my previous post.

2. Incorrectly alluding to authorities

One has to refrain from alluding to results from well-known people and highly cited articles incorrectly: this could take the form of selectively quoting results, ignoring the context in which results were obtained, or placing blind faith in the findings of a popular study.

Such blind reliance on authorities can also have a “reverse” effect: our studies may sometimes not be accepted by the community at large because they deviate from a study by a well-known or popular person. In such cases, although it is good to go the extra mile to verify our results, we should not conform unless we find a genuine reason to believe the established result.

3. Selective treatment of results

Biased reporting becomes a problem when reasons like the need for funding, the fear of time being “wasted” after a long investment in something, the pressure to publish, and sensationalism make us cherry-pick results, within our own work and elsewhere, to tell a story we would like to believe is true (when it actually is not). Although this one is obvious, it is good to keep in mind while evaluating the interpretations made by any piece of work.

4. Lack of separation between observations and interpretations

The last point to take note of is to separate the observations of an experiment (which are measured using the models) from the interpretations, which are the learnings and inferences we draw from the study. This separation helps us replicate results, match observations against assumptions, pin-point potential errors, and build on the obtained results in subsequent studies.

This brings us to the end of a general list of “anti-patterns” in data science.
