In this article, I cover a few points to be careful about during experiment design and data collection. As with the previous two articles, the block diagram describing the “process flow” for solving a problem is reproduced below.
a. Picking a non-measurable outcome
An outcome defined at the problem definition step should be a phenomenon that can be measured using tools that exist or that are to be built as a part of the solution. A hazy definition that cannot be put down into a “process” should not be used. This will ensure that the hypotheses that we define can be evaluated and the best one picked.
An example would be that of measuring psychological constructs like “happiness”, “aggression”, or “intelligence”. Having qualitative definitions is just not enough in this case: the definition should refer to an outcome that is measurable quantitatively, say, on a Likert scale, through a properly calibrated instrument; if we’re dealing with “intelligence”, we could go with the “Wonderlic” inventory or a derivative of it that measures intelligence quotient.
b. Relying on subjective or self-reported measures
One has to be especially careful when dealing with measures that are inherently subjective or self-reported. Self-reported measures are affected by factors like self-deception and lack of self-awareness. For an “engineering solution”, these measures are harder to model, as the variables required to account for the subjectivity are usually not all available.
For example, let’s say we need to analyse users of social media. Happiness, love, sadness, aggressiveness, and extraversion of users are all measures that are not obtained directly as “ground truth” through some variable. They have to be acquired through “instruments” that have been shown to correlate with a behaviour in a scientific study. Also, the quality of measurement is dependent on (a) the quality of the research work that has led to the instrument, and (b) the applicability of the assumptions of that research work in the current scenario. Using machine learning or analysis on such measures usually turns out to be difficult. In contrast, measures such as interest in certain topics, ability to influence conversations, or being an early adopter of electronic devices might have a “reduced degree of subjectivity” and hence may be captured more reliably.
c. Relying on a single metric
A lot of the time, reducing the performance of a system down to a single number is a bad idea. Multiple metrics should be computed (and later used for interpreting results) wherever possible.
As a simple example, let’s take a binary outcome with one of the outcomes being highly unlikely in the first place (say, the likelihood of getting a rare type of cancer). In this case, evaluating a solution with just accuracy would not be sufficient: a model that always predicts “non-cancerous” would score a very high accuracy while detecting nothing. The “class imbalance” of there being more “non-cancerous” cases has to be captured and handled by the metric. Therefore, macro-averaged accuracy and F-measure, and per-class accuracy, precision, recall, and F-measure may also have to be computed for a proper understanding of the performance of a model.
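To make the point concrete, here is a minimal sketch in plain Python. The function name `binary_report`, the label encoding (1 for the rare class), and the toy data are all my own assumptions for illustration. I use balanced accuracy (the mean of the two per-class recalls) as one way of macro-averaging; other definitions are possible.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return several metrics for a binary task instead of accuracy alone."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    def ratio(num, den):
        # Report 0.0 when a ratio is undefined (e.g. no positive predictions).
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)        # per-class accuracy of the rare class
    specificity = ratio(tn, tn + fp)   # per-class accuracy of the common class
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": ratio(2 * precision * recall, precision + recall),
        "balanced_accuracy": (recall + specificity) / 2,
    }


if __name__ == "__main__":
    # Hypothetical data: 10 rare cases among 1000, and a "model" that
    # always predicts the common class.
    y_true = [1] * 10 + [0] * 990
    y_pred = [0] * 1000
    for name, value in binary_report(y_true, y_pred).items():
        print(f"{name}: {value:.3f}")
```

On this toy data the accuracy is 0.99 while recall and F-measure for the rare class are 0 and balanced accuracy is 0.5, which is exactly the kind of failure a single number hides.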
d. Replacing outcomes with inaccurate proxies
A lot of the time, the desired outcome (as discussed in a theory) is not measurable with the tools available (or to be built) for a variety of reasons. In such a case, a proxy may be used in place of the outcome; a “proxy” captures the required phenomenon “approximately”. This, however, has to be done with caution.
For example, instead of capturing “frequent online shoppers”, one may resort to picking all users in a dataset that have “credit card information stored on file”. This is a valid proxy if we believe that most people who have this information actually shop frequently online. If, on the other hand, the product in question makes storing credit card information mandatory, then every user satisfies the proxy and it tells us nothing about who shops frequently.
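One way to check a proxy before relying on it is to compare it against the real outcome on a small, hand-verified sample. The sketch below assumes such a sample exists; the function name `proxy_quality` and the numbers are hypothetical and only illustrate the credit card example above.

```python
def proxy_quality(target, proxy):
    """Compare a boolean proxy against the verified boolean outcome."""
    if len(target) != len(proxy) or not target:
        raise ValueError("target and proxy must be non-empty and equally long")
    n = len(target)
    n_target = sum(1 for t in target if t)
    n_proxy = sum(1 for p in proxy if p)
    both = sum(1 for t, p in zip(target, proxy) if t and p)

    base_rate = n_target / n                       # share of true frequent shoppers
    precision = both / n_proxy if n_proxy else 0.0  # proxy users who really are frequent
    recall = both / n_target if n_target else 0.0   # frequent shoppers the proxy catches
    # Lift of 1.0 means the proxy is no better than picking users at random.
    lift = precision / base_rate if base_rate else 0.0
    return {"base_rate": base_rate, "precision": precision,
            "recall": recall, "lift": lift}


if __name__ == "__main__":
    # Hypothetical verified sample: 20 frequent shoppers out of 100 users.
    is_frequent = [True] * 20 + [False] * 80

    # Case 1: storing a card is optional.
    card_optional = [True] * 18 + [False] * 2 + [True] * 10 + [False] * 70
    print("optional :", proxy_quality(is_frequent, card_optional))

    # Case 2: storing a card is mandatory, so the proxy is true for everyone.
    card_mandatory = [True] * 100
    print("mandatory:", proxy_quality(is_frequent, card_mandatory))
```

In the mandatory case the precision collapses to the base rate and the lift is 1.0, which is the numeric version of "this proxy may not be very useful".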
a. Collecting data without a process document
In the thick of solving problems, the preparation of field notes, technical reports, and process documents is often overlooked. A combination of time pressure and a lack of business/product direction in an organisation can especially result in this problem. Although it is a time-consuming task, documenting everything we do to solve a problem is absolutely essential. Otherwise, our experiments become ad hoc, lack structure, and have a multiplying effect in terms of adding noise to our interpretations.
Furthermore, “process documents” for collecting data, be it the independent variables that capture a certain behaviour describing a phenomenon or the outcome (labels), carry a lot of importance. Such a document describes the assumptions made and the definitions of our outcomes in concrete form. If labelling is being done, it describes how the labelling should be done; this ensures consistency across the people who are collecting data.
b. Relying on poor labelling processes and not accounting for noise: e.g. Crowdsourcing
If we are procuring labelled data, with the labels having a good deal of “subjectivity” or “complexity”, labelling processes have to ensure that the measuring instrument is stable and the data obtained is clean. If it is not, there should be some metric by which “clean” or “reliable” data points can be picked up for further analysis, while the bad ones are rejected, corrected, or sent back for re-labelling. Majority voting, annotator modelling, and inter-rater reliability (IRR) measures such as Cohen’s kappa are some ways to get around this issue.
This problem is especially accentuated when dealing with crowdsourcing platforms like Mechanical Turk; the large scale of annotation tends to bring low SNR, adversarial annotators (say, due to financial motivations), etc.
Additionally, capturing complex phenomena like sentiment, dependency labels, code-mixing, etc. that require either domain expertise or a complex/sensitive process is better done with a dedicated “knowledge engineering team” and not through crowdsourcing, as human errors in an already complicated/approximate process will almost certainly yield poor quality data.
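Two of the checks mentioned in this section, majority voting and Cohen’s kappa, can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the function names, the rule of returning `None` on a tie, and the toy labels are hypothetical, and kappa here is the standard two-annotator version.

```python
from collections import Counter


def majority_vote(labels):
    """Return (winning label, share of votes); the label is None on a tie."""
    if not labels:
        raise ValueError("labels must not be empty")
    ranked = Counter(labels).most_common()
    top_label, top_count = ranked[0]
    share = top_count / len(labels)
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None, share  # no clear majority: send back for re-labelling
    return top_label, share


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if each rater labelled at random with their own
    # label frequencies.
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(rater_a) | set(rater_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical sentiment labels from three annotators for one item.
    print(majority_vote(["pos", "pos", "neg"]))
    # Hypothetical labels from two annotators over four items.
    a = ["pos", "pos", "neg", "neg"]
    b = ["pos", "neg", "neg", "neg"]
    print(cohens_kappa(a, b))
```

Items with a low vote share, or annotator pairs with a low kappa, are natural candidates for the "reject, correct, or re-label" step described above.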
c. Sampling errors and problems with self-selection
Once the type of data to be collected has been decided upon, the first step is to decide on how to “seed” the collection process: by issuing offline invitations, by picking “influencers” in social media, or by posting advertisements on the internet or elsewhere. The choice of channel for procuring data points has to be made carefully and should reflect the type of data points to be seen in the future, under “production”, “real”, or “test” settings.
For example, it may not be a very wise idea to recruit people through “advertisements” for a study on differences in behaviour between “frequent shoppers” and “infrequent shoppers”. This is because mostly “frequent shoppers” might respond to an advertisement, thereby creating a bias within the sample being used for analysis.
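A small simulation shows how quickly such a bias builds up. Everything in it is assumed for illustration: the population share of frequent shoppers (20%) and the response rates to the advertisement (30% for frequent shoppers, 5% for others) are made-up numbers, not measurements.

```python
import random


def simulate_self_selection(n=20_000, seed=0):
    """Return (share of frequent shoppers in population, share among respondents)."""
    rng = random.Random(seed)
    # Hypothetical population: True marks a frequent shopper.
    population = [rng.random() < 0.20 for _ in range(n)]
    # Hypothetical response rates to an advertisement, by shopper type.
    respond_prob = {True: 0.30, False: 0.05}
    respondents = [is_frequent for is_frequent in population
                   if rng.random() < respond_prob[is_frequent]]
    if not respondents:
        raise RuntimeError("nobody responded; increase n")
    population_share = sum(population) / len(population)
    respondent_share = sum(respondents) / len(respondents)
    return population_share, respondent_share


if __name__ == "__main__":
    pop_share, resp_share = simulate_self_selection()
    print(f"frequent shoppers in population:  {pop_share:.2f}")
    print(f"frequent shoppers among responses: {resp_share:.2f}")
```

With these assumed rates, frequent shoppers make up roughly a fifth of the population but well over half of the respondents, so any comparison between the two groups starts from a skewed sample.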
Wording of invitations, advertisements, or rules used to sample from a population may also have significant effects on the interpretation, assumptions, and even metrics like the p-value.
d. Reducing the amount of data collected because it is expensive
We can divide the difficulty in solving problems into three kinds:
(a) Difficulty in framing the problem clearly
(b) Difficulty in collecting data
(c) Difficulty in solving it due to low SNR (Signal to Noise Ratio) or because the underlying “process” or “phenomenon” is complicated
Problems of kind (b) are often dealt with by using small sample sizes and sophisticated analysis to prove/refute hypotheses. This can lead to many assumptions and limit the generalisability or usefulness of results. A lesson I learnt by working on several problems in fields like psychology is that the cost of data collection is not an excuse for conducting experiments with small amounts of data: the results just turn out to be very limited in their scientific significance.
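A rough way to see what a small sample costs is to look at the width of a confidence interval as the sample grows. The sketch below uses the normal-approximation (Wald) interval for a proportion, which is my choice for simplicity and is known to be unreliable for very small samples or proportions near 0 or 1; the function name and the sample sizes are hypothetical.

```python
import math


def proportion_ci_halfwidth(p, n, z=1.96):
    """Half-width of the approximate 95% interval for an observed proportion p."""
    if n <= 0 or not 0.0 <= p <= 1.0:
        raise ValueError("need n > 0 and 0 <= p <= 1")
    return z * math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # Worst case p = 0.5, for a few hypothetical sample sizes.
    for n in (25, 100, 400, 10_000):
        hw = proportion_ci_halfwidth(0.5, n)
        print(f"n = {n:>6}: observed 50% could plausibly be 50% +/- {100 * hw:.1f} points")
```

The half-width shrinks only with the square root of the sample size, so quadrupling the data halves the uncertainty; with a few dozen data points, most effects of practical interest are simply not distinguishable from noise.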
e. Collecting a lot of (meaningless) data
At a very high level, we may be inclined to approach problems in one of two ways. The first: let’s collect all the data that’s available, then figure out an issue with it and find solutions (in a lighter vein, this is akin to finding new “wants” in a consumerist world). The second: let’s find out an issue that someone has and only then define the problem, so that it allows for meaningful data collection to arrive at a solution.
In the ideal case of the latter approach, the kind of data to be collected is well defined, and a lot of the time the amount of data to be collected also becomes very manageable. In the former approach, on the other hand, we end up with a lot of variables that we just process to get a “predictive” advantage; this may not really help us model and/or understand the knowledge pertaining to the problem in an interpretable way.
I’ll stop here and continue in my next post about building models and subsequently interpreting them.