Discovering truths in data

Posted on June 29, 2014 by Christopher Berry

Discovering truth in data always begins with you, and your judgement.

Assume that you have some idea about the world. Something that you believe is true, and you want to discover if you’re right.

Here’s how I draw that out.

It becomes a matter of organizing a dataset along those thoughts.

I call causal variables X1, X2, X3… I call the single variable that I’m trying to explain the Y variable.

There can be only one Y variable. For your own sanity, there can only be one Y variable at a time.

Figuring out whether X1, X2, and X3 cause Y involves a large number of tasks.

One of them is to run any one of the many correlation algorithms out there, which apply to different types of data.

Those procedures generate a lot of data, which can be summarized in a matrix like the one below.

In it, a, b, c, d, e, f, g, h, and i each represent the strength of the causal arrow between a pair of variables. For instance, if c were high, you could draw a line from X1 to Y.
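To make the idea of a matrix of relationships concrete, here is a minimal sketch that computes pairwise Pearson correlations for three X variables and one Y. The data values, the `pearson` and `correlation_matrix` helpers, and the choice of Pearson itself are assumptions for illustration; other correlation measures fit other types of data, and a high correlation on its own does not establish the causal arrow.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return {name: {name: r}} for every pair of named columns."""
    return {
        a: {b: pearson(columns[a], columns[b]) for b in columns}
        for a in columns
    }

# Made-up illustrative data, not taken from the post.
data = {
    "X1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "X2": [2.0, 1.0, 4.0, 3.0, 5.0],
    "X3": [5.0, 3.0, 4.0, 1.0, 2.0],
    "Y":  [1.1, 2.3, 2.9, 4.2, 5.0],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Each printed row is one line of the matrix: the entry for X1 against Y plays the role of a single cell such as c.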

So a matrix like this:

Is related to a graph like this:

And, that’s a model.
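One way to picture the step from matrix to graph is a simple threshold: keep a line between two variables only when the entry for that pair is high. The sketch below assumes a made-up symmetric matrix and a hypothetical cutoff of 0.7; neither comes from the post, and the direction of each arrow still has to come from your own judgement, since the matrix only says which pairs are related.

```python
def edges_from_matrix(matrix, threshold=0.7):
    """List (a, b, strength) for every pair whose |strength| meets the threshold."""
    edges = []
    names = list(matrix)
    for i, a in enumerate(names):
        # Visit each unordered pair once by only looking at later names.
        for b in names[i + 1:]:
            strength = matrix[a][b]
            if abs(strength) >= threshold:
                edges.append((a, b, strength))
    return edges

# Made-up illustrative values: each X relates strongly to Y, weakly to the other Xs.
clean = {
    "X1": {"X1": 1.0,  "X2": 0.1, "X3": 0.05, "Y": 0.9},
    "X2": {"X1": 0.1,  "X2": 1.0, "X3": 0.2,  "Y": 0.8},
    "X3": {"X1": 0.05, "X2": 0.2, "X3": 1.0,  "Y": 0.75},
    "Y":  {"X1": 0.9,  "X2": 0.8, "X3": 0.75, "Y": 1.0},
}

edges = edges_from_matrix(clean)
for a, b, strength in edges:
    print(f"{a} -- {b} ({strength})")
```

With these values only the three X-to-Y pairs survive the cutoff, which is what makes the model clean. If the entries between the X variables were also high, the same function would return extra lines among them.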

And that’s a particularly clean model that, with a specific type of data, can be summarized into a very clean equation: Y = M1X1 + M2X2 + M3X3 + B. The world is rarely linear, though, so different algorithms can yield different ways of making predictions.
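For the linear case, the coefficients M1, M2, M3 and B can be estimated by ordinary least squares. The following is a rough sketch using the normal equations in plain Python; the helper names `solve` and `fit_linear` and the data are hypothetical, and the Y values were generated from known coefficients (2, -1, 0.5, with B = 3) purely so the fit has something to recover. A real dataset would be noisier, and a numerical library would be a more robust choice.

```python
def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations apply to both sides at once.
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution, from the last unknown to the first.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x

def fit_linear(rows, ys):
    """Least-squares fit of Y = M1*X1 + ... + Mk*Xk + B; returns [M1, ..., Mk, B]."""
    design = [list(row) + [1.0] for row in rows]  # trailing 1.0 carries B
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

# Made-up observations: each row is (X1, X2, X3).
rows = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (0, 1, 1), (2, 1, 3)]
# Y generated as 2*X1 - 1*X2 + 0.5*X3 + 3, so the true answer is known.
ys = [5.0, 2.0, 3.5, 4.0, 2.5, 7.5]

coeffs = fit_linear(rows, ys)
print([round(c, 6) for c in coeffs])
```

Because the made-up Y values are exactly linear in the Xs, the fit recovers the generating coefficients up to floating-point error. When the world is not linear, the same procedure still returns numbers, which is one way to fool yourself.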

If only the world always returned such clean statements of causality.

What is more common is a matrix like this:

Which can be read like this:

These models are far more finicky, and they require a different approach to grapple with. But if you’re aware of the relationships among the variables, you can handle them. There’s less of a chance of getting fooled. Or fooling yourself.

Finding truth in the data begins with you. It begins with understanding your relationship with the data, organizing it, and then discovering if you’re right, or, maybe, quantifying just how far off you were.