When Data Flows Faster Than It Can Be Processed
With big data come a few challenges: we've already mentioned the curse of big data. But what can we do when data flows faster than it can be processed?
Typically, this falls into two categories of problems:
- No matter what processing or algorithm is used, astronomical amounts of data pile up so fast that you need to delete some of it, a bigger proportion every day, before you can even look at it, let alone analyze it with even the most rudimentary tools. An example is astronomical data used to detect new planets, new asteroids, etc. It keeps arriving faster, and in larger amounts, than it can be processed in the cloud using massive parallelization. Maybe good sampling is a solution: carefully select which data to analyze and which data to ignore, before even looking at the data. Or develop better compression algorithms so that one day, when we have more computing power, we can analyze all the data previously collected but not analyzed, and maybe in 2030 look at the hourly evolution, over a period of 20 years, of a faraway supernova that took place in 2010 but went undetected because the data was parked on a sleeping server.
- Data is coming in very fast and in very large amounts, but all of it can still be processed with modern, fast, distributed, MapReduce-powered algorithms or some other techniques. The problem is that the data is so vast, the velocity so high, and sometimes the data so unstructured, that it can only be processed by crude algorithms, resulting in bad side effects. Or at least, that's what your CEO thinks.
I will focus on this second category, particularly the last item: it is the most relevant to businesses. The types of situations that come to mind include:
- Credit card transaction scoring in real time
- Spam detection
- Pricing/discount optimization in real time (retail)
- Anti-missile technology (numerical optimization issue)
- High frequency trading
- Systems relying on real-time 3D streaming data (video, sound), such as autopilot (large planes flying on autopilot at low elevation over crowded urban skies)
- Facebook "likes", Twitter tweets, Yelp reviews, matching content with user (Google) etc.
Using crude algorithms results in:
- Too many false positives or false negatives and undetected fake reviews
- Fake tweets that can trigger stock market collapses (see recent example) or spread fake press releases about the White House being attacked, yet go undetected in real time by Twitter
- Billions of undetected fake "likes" on Facebook, creating confusion for advertisers and, eventually, a decline in ad spend. They also have a bad impact on users, who eventually ignore the "likes", further reducing an already abysmal click-through rate and lowering revenue. Facebook will have to come up with some new feature to replace the "likes". This new feature will eventually be abused in turn, with the abuse not detected by machine learning algorithms (or the lack thereof), perpetuating a cycle of micro-bubbles that must be permanently kept alive with new ideas to maintain revenue streams.
So what is the solution?
I believe that in many cases, there is too much reliance on crowd-sourcing, and a reluctance to use sampling techniques because very few experts know how to get great samples out of big data. In some cases, you still need to come up with very granular predictions anyway (not summaries), for instance house prices for every single house (Zillow) or weather forecasts for each ZIP code. Yet even in these cases, good sampling would help.
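To make the sampling idea concrete, here is a minimal sketch of one standard technique, reservoir sampling, which keeps a uniform random sample of fixed size from a stream too large (or too fast) to store. This is only an illustration, not a claim about which sampling scheme is best for a given data set; the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Keep a uniform random sample of at most k items from a stream.

    The stream is read once and never stored in full, so memory use
    stays proportional to k regardless of how much data flows through.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i+1 replaces a random slot with probability k/(i+1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Illustrative usage on a synthetic stream (a generator, never held in memory).
readings = (x * x for x in range(100_000))
sample = reservoir_sample(readings, k=1000, seed=0)
```

Uniform sampling is the simplest case. In practice a good sample would likely be stratified or weighted toward rare but important events, which this sketch does not attempt.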
Many reviews, "likes", tweets or spam flags are made by users (sometimes Botnet operators, sometimes business competitors) with bad intent, gaming the system on a large scale. Greed is also part of the problem: if fake "likes" generate revenue for Facebook and advertisers don't notice, let's feed these advertisers (at least the small guys) more fake "likes", because we think we don't have enough relevant traffic to serve all advertisers, and we want good traffic to go to the big guys. When the small guys notice, either discontinue the practice (and come up with a new idea) or wait till you get hit by a class action lawsuit: $90 million is peanuts for Facebook, and that's what Google and others settled for when they were hit by a class action lawsuit for delivering fake traffic.
Yet there is a solution that benefits everyone (users, companies such as Google, Amazon, Netflix, Facebook or Twitter, and clients): better use of data science. I'm not talking about developing sophisticated, expensive statistical technology, but just simply switching to using better metrics, different weights (e.g. put less emphasis on data resulting from crowd-sourcing), better linkage analysis, association rules to detect collusion, Botnets and low frequency yet large-scale fraudsters, and better frequently updated look-up tables (white lists of IP addresses). All this without slowing down existing algorithms.
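As a rough illustration of how cheap linkage analysis can be, the sketch below counts how often two accounts "like" the same items and reports pairs that overlap suspiciously often. It is a simplified stand-in for association rules, not a full implementation of them. The input format (`(user_id, item_id)` pairs), the function name `colluding_pairs` and the `min_shared` threshold are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def colluding_pairs(
    likes: Iterable[Tuple[str, str]], min_shared: int = 3
) -> List[Tuple[str, str, int]]:
    """Return pairs of users who liked at least `min_shared` items in common.

    `likes` is an iterable of (user_id, item_id) pairs. The result is sorted
    by decreasing overlap, then by user ids.
    """
    # Group users by the item they liked.
    users_by_item: Dict[str, Set[str]] = {}
    for user, item in likes:
        users_by_item.setdefault(item, set()).add(user)

    # Count, for every pair of users, how many items they both liked.
    pair_counts: Counter = Counter()
    for users in users_by_item.values():
        for a, b in combinations(sorted(users), 2):
            pair_counts[(a, b)] += 1

    flagged = [(a, b, n) for (a, b), n in pair_counts.items() if n >= min_shared]
    return sorted(flagged, key=lambda t: (-t[2], t[0], t[1]))


# Illustrative data: u1 and u2 always like the same three posts.
example_likes = [
    ("u1", "p1"), ("u2", "p1"), ("u3", "p1"),
    ("u1", "p2"), ("u2", "p2"),
    ("u1", "p3"), ("u2", "p3"),
]
suspects = colluding_pairs(example_likes, min_shared=3)
```

Pair counting is quadratic in the number of users per item, so on real traffic it would have to run on a sample or skip very popular items. A high overlap is only a red flag to investigate, not proof of collusion.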
Here's one example for social network data: instead of simply counting the number of "likes" (not all "likes" are created equal), do the following:
- Look at users that produce hundreds of "likes" a day, and "likes" arising from IP addresses that are flagged by Project Honeypot or Stop Forum Spam. Don't put all your trust in these two websites - they also, at least partially, rely on crowd-sourcing (users reporting spam) and are thus subject to false positives and abuse.
- If two hundred "likes" result in zero comments, or in 50 versions of the same "this is great" or "great post" comment, then clearly we are dealing with a case of fake "likes". The advertiser should not be charged, and the traffic source should be identified and discontinued.
- Look at buckets of traffic with a high proportion of low quality users coming back too frequently with two "likes" a day, with red flags such as no referral domain or tons of obscure domains, IP addresses frequently not resolving to a domain, traffic coming from a sub-affiliate, etc.
- Also, metrics such as unique users are much more robust (more difficult to fake) than page views, but you still need to detect fake users.
- Use a simple, robust, fast, data-driven (rather than model-driven) algorithm to score "likes" and comments, such as hidden decision trees. You can even compute confidence intervals for scores without any statistical modeling.
- Create or improve ad relevancy algorithms with simple taxonomy concepts. This applies to all applications where relevancy is critical, not just ad delivery. It also means better algorithms to detect influential people (in one example you had to be a member of PeerIndex to get a good score), to better detect plagiarism, to better detect friends, to better score job applicants, to improve attribution systems (advertising mix optimization including long-term components in statistical models), and the list goes on and on.
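The first few checks in the list above can be sketched as a simple weighted count that replaces the raw number of "likes". This is a plain rule-based illustration, not the hidden decision trees technique mentioned above. All names (`Like`, `like_weight`, `weighted_like_count`), thresholds and weights are hypothetical, and the set of flagged IP addresses is assumed to come from an external look-up table.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Like:
    user_id: str
    ip: str
    user_likes_today: int  # total "likes" produced by this user today


def like_weight(like: Like, flagged_ips: Set[str], heavy_user_threshold: int = 200) -> float:
    """Weight in (0, 1]: 1.0 for an unremarkable like, lower when red flags fire."""
    weight = 1.0
    if like.user_likes_today >= heavy_user_threshold:
        weight *= 0.1  # assumed penalty for users producing hundreds of likes a day
    if like.ip in flagged_ips:
        # Blacklists are partly crowd-sourced, so discount rather than zero out.
        weight *= 0.5
    return weight


def comment_diversity(comments: Iterable[str]) -> float:
    """Share of distinct comments after normalizing case and whitespace.

    Returns 0.0 when there are no comments at all.
    """
    normalized = [" ".join(c.lower().split()) for c in comments]
    if not normalized:
        return 0.0
    return len(set(normalized)) / len(normalized)


def weighted_like_count(
    likes: List[Like],
    comments: List[str],
    flagged_ips: Set[str],
    min_likes_for_comment_check: int = 200,
) -> float:
    """Replace the raw like count with a quality-weighted count."""
    total = sum(like_weight(like, flagged_ips) for like in likes)
    # Only large batches are checked against their comments: a handful of
    # likes with no comments is normal, two hundred with none is not.
    if len(likes) >= min_likes_for_comment_check:
        total *= comment_diversity(comments)
    return total
```

Because every rule is a set look-up or a comparison, the score is computed in one pass over the "likes" and should not slow down an existing pipeline. The weights themselves would need to be calibrated on labeled traffic.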
The opposite problem also exists:
This happens when you can analyze data (usually in real time with automated algorithms) and extract insights faster than they can be delivered to and digested by the end users (executives and decision makers). It's bad when the decision makers get flooded with tons of unprioritized reports.
Sometimes, this type of situation arises with machine-talking-to-machine, e.g. eBay automatically pricing millions of bid keywords every day and feeding these prices automatically to Google Adwords via the Google API.
Similarly, I produced cluster simulations faster than a real-time streaming device could deliver them to a viewer: I called it FRT, for "faster than real time".