Hello cohorts, thanks for clicking on the methodology link.

I downloaded the corpus from Infochimps – here’s the link. (I don’t have the rights to upload it to BuzzData where it also belongs.) You can find the original Reddit thread here.

It contains 7,405,561 votes from 31,927 users over 2,046,401 links. It was harvested by way of the Reddit API in late 2010 or early 2011. I am not the author of the original file, and yes, sure, I’ve been a bit of a slowpoke about analyzing it.

I imported it into SPSS on a dual-core Mac and ran the usual checks for bad data and outliers. I reset my max records setting to 10 million to handle the load. It’s a pretty clean file and SPSS didn’t crash. You can use R or Python if you wish; I’m not religious about toolsets.
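If you’d rather follow along in Python, here’s a rough sketch of the kind of checks I mean. Note that the file path and the column names (username, link_id, vote) are my assumptions for illustration, not the file’s documented layout – adjust them to whatever your copy looks like.

```python
import pandas as pd

# Illustrative path and assumed column names ('username', 'link_id', 'vote').
votes = pd.read_csv("reddit_votes.csv")

print(votes.shape)                    # expect roughly 7.4M rows
print(votes.isna().sum())             # missing values per column
print(votes["vote"].value_counts())   # should be only +1 and -1
print(votes.duplicated().sum())       # exact duplicate rows
```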

I used two aggregate functions to collapse the file – one by user and one by link.
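In pandas terms, continuing from the sketch above (and still assuming those column names), the two collapses would look something like this:

```python
# Collapse by user: upvotes, downvotes, and total votes per username.
by_user = votes.groupby("username")["vote"].agg(
    upvotes=lambda v: (v == 1).sum(),
    downvotes=lambda v: (v == -1).sum(),
    total="count",
)

# Collapse by link: the same counts per link_id.
by_link = votes.groupby("link_id")["vote"].agg(
    upvotes=lambda v: (v == 1).sum(),
    downvotes=lambda v: (v == -1).sum(),
    total="count",
)
```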

I believe that there is some kind of Reddit rate limiting in play, in that we can’t collect more than 1000 upvotes or 1000 downvotes for certain users. This generates bias when examining the very top quintile of the segmentation and underestimates the means. I could have treated this set of users as outliers from the beginning and run through the analysis. The choice not to was made to inform the reader about these problems, as well as to close the loop on the 7,000,000+ votes observed.

I could have excluded any username where upvotes = 1000 OR downvotes = 1000 under the banner of ‘outliers’. I could also have excluded them, calculated the exponential decay equation unique to this sample, and then applied that equation to reinsert the missing cases. That’s not the story I wanted to tell.
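For the curious, here is roughly what that alternative would look like in Python, continuing from the hypothetical by_user table above. It’s a sketch of one simple way to fit the decay (a log-linear regression on rank), not the method I actually used.

```python
import numpy as np

# Flag users who hit the apparent 1000-vote cap, then drop them.
capped = (by_user["upvotes"] == 1000) | (by_user["downvotes"] == 1000)
trimmed = by_user[~capped]

# One simple way to get the decay curve: rank users by total votes and
# regress log(total) on rank, i.e. fit total ≈ a * exp(b * rank).
totals = np.sort(trimmed["total"].to_numpy())[::-1]
ranks = np.arange(1, len(totals) + 1)
b, log_a = np.polyfit(ranks, np.log(totals), 1)

print(f"decay fit: total ≈ {np.exp(log_a):.1f} * exp({b:.6f} * rank)")
print(f"{int(capped.sum())} capped users would need to be reinserted")
```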

I ran the segmentation in such a way as to minimize the impact of that limitation on the core insight or story. It does, however, generate a very large problem for those who want to create predictive algorithms for high-intensity Redditors. It’s my intent that stating this outright, and being really transparent about it, will help people understand that just because there’s a generous API, it doesn’t mean that everything can be understood.

I reject the premise that the whole dataset must be discarded because of the bias imposed by the API. In fact, far more good is done by trying to salvage data from the bottom four quintiles, and being transparent about the impact of the Power Paulas, than by ignoring the whole dataset.

The implications of exponential decay / power law distributions in social analytics are obvious to the ~150 cohorts who do this exclusively for a living, and to the ~50,000 CRM people who don’t deal with social exclusively. I never claimed that I discovered anything new, and I make no claim that there’s anything novel, or even remotely interesting, here for you good people.

Whereas I could have used far more sophisticated methods to do the analysis, I didn’t. I’m hammering on histograms to demonstrate that looking at the data is still important, and to tell a story. We should always be visualizing the data to really understand it.
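For completeness, the kind of histogram I’m talking about is only a few lines of matplotlib, again working from the hypothetical by_user table sketched earlier:

```python
import matplotlib.pyplot as plt

# Votes cast per user, with a log-scaled y-axis because the decay is so steep.
plt.hist(by_user["total"], bins=100, log=True)
plt.xlabel("votes cast per user")
plt.ylabel("number of users")
plt.title("Vote activity per user")
plt.show()
```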

Statistics don’t lie. Statistics are indifferent.

You’re invited to explore the data as well. Or, better yet, download a corpus yourself using the Reddit API and post it up. I could really use new datasets, especially if the 1000 rate limit has been solved.

I thank Reddit for making so many APIs publicly available and enabling this sort of analysis and exploration. Thank you.
