
When Data Flows Faster Than It Can Be Processed

vincentg64

With big data come a few challenges: we’ve already mentioned the curse of big data. But what can we do when data flows faster than it can be processed?

Typically, this falls into two categories of problems:

First category:

No matter what processing or algorithm is used, astronomical amounts of data pile up so fast that a growing proportion must be deleted each day before anyone can even look at it, let alone analyze it with even the most rudimentary tools. An example is astronomical data used to detect new planets, asteroids and so on: it arrives faster, and in larger volumes, than it can be processed in the cloud even with massive parallelization. Good sampling may be one solution: carefully select which data to analyze and which to ignore, before even looking at it. Another is to develop better compression algorithms, so that one day, when more computing power is available, we can analyze all the data previously collected but never examined, and perhaps in 2030 trace, hour by hour, the 20-year evolution of a faraway supernova that erupted in 2010 but went undetected because the data sat parked on a sleeping server.
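To make the sampling idea concrete, here is a minimal Python sketch of reservoir sampling, one classic way to keep a fixed-size, uniformly random subset of a stream that is too large to store in full (the generator standing in for the data feed is, of course, an illustrative assumption):

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Each new item replaces a stored one with probability k / (i + 1),
            # which keeps the sample uniform over everything seen so far.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Usage: keep 1,000 records from a feed far too large to store in full
# (range() stands in for the real telescope or sensor feed).
sample = reservoir_sample(range(10_000_000), 1_000)
```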

Second category:

Data arrives very fast and in very large amounts, but all of it can still be processed with modern, fast, distributed, MapReduce-powered algorithms or other techniques. The problem is that the data is so vast, the velocity so high, and the data sometimes so unstructured, that it can only be handled by crude algorithms, with bad side effects. Or at least, that’s what your CEO thinks.
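As a rough single-machine illustration of that kind of MapReduce-style processing (the chunking and the CSV layout below are assumptions, and a real deployment would spread the map and reduce steps across a cluster):

```python
from collections import Counter
from multiprocessing import Pool

def map_chunk(lines):
    """Map step: count events (e.g. "likes" per user) in one chunk of the feed."""
    counts = Counter()
    for line in lines:
        user_id = line.split(",")[0]
        counts[user_id] += 1
    return counts

def reduce_counts(partial_counts):
    """Reduce step: merge the per-chunk counters into one total."""
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total

if __name__ == "__main__":
    # Hypothetical input: CSV lines of the form "user_id,action,timestamp".
    chunks = [
        ["u1,like,t1", "u2,like,t2"],
        ["u1,like,t3", "u3,like,t4"],
    ]
    with Pool() as pool:
        totals = reduce_counts(pool.map(map_chunk, chunks))
    print(totals)  # Counter({'u1': 2, 'u2': 1, 'u3': 1})
```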

I will focus on this second category, particularly the last item: it is the most relevant to businesses. The types of situations that come to my mind include

  • Credit card transaction scoring in real time
  • Spam detection
  • Pricing/discount optimization in real time (retail)
  • Anti-missile technology (numerical optimization issue)
  • High-frequency trading
  • Systems relying on real-time 3D streaming data (video, sound), such as autopilot systems (large planes flying on autopilot at low elevation over crowded urban skies)
  • Facebook “likes”, Twitter tweets, Yelp reviews, matching content with users (Google), etc.

Using crude algorithms results in:

  • Too many false positives or false negatives, and undetected fake reviews
  • Fake tweets that can trigger stock market collapses, such as the recent fake report that the White House had been attacked, yet fail to be detected in real time by Twitter
  • Billions of undetected fake “likes” on Facebook, creating confusion for advertisers and eventually a decline in ad spend; they also hurt users, who come to ignore the “likes”, further reducing an already abysmal click-through rate and lowering revenue. Facebook will then have to come up with some new feature to replace the “likes”; that feature will eventually be abused in turn, the abuse again undetected by machine learning algorithms (or by the lack of them), perpetuating a cycle of micro-bubbles that must be permanently kept alive with new ideas to maintain revenue streams.

So what is the solution?

I believe that in many cases there is too much reliance on crowd-sourcing, and a reluctance to use sampling techniques because very few experts know how to get great samples out of big data. In some cases you still need very granular predictions anyway (not summaries), for instance house prices for every single house (Zillow) or weather forecasts for each zip code. Yet even in these cases, good sampling would help.
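A minimal sketch of what good sampling could look like when granular output is still required, assuming records arrive as dictionaries with a zip_code field (the field names and per-stratum size are illustrative): stratify by zip code so that every area keeps enough records to support its own estimate.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, per_stratum):
    """Keep up to per_stratum randomly chosen records from each stratum (e.g. zip code)."""
    by_stratum = defaultdict(list)
    for r in records:
        by_stratum[r[key]].append(r)
    sample = []
    for stratum_records in by_stratum.values():
        k = min(per_stratum, len(stratum_records))
        sample.extend(random.sample(stratum_records, k))
    return sample

# Usage (illustrative records):
records = [{"zip_code": "98052", "temp": 61}, {"zip_code": "10001", "temp": 73}]
subset = stratified_sample(records, key="zip_code", per_stratum=500)
```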

Many reviews, “likes”, tweets or spam flags are made by users (sometimes Botnet operators, sometimes business competitors) with bad intent, gaming the system on a large scale. Greed is also part of the problem: if fake “likes” generate revenue for Facebook and advertisers don’t notice, the temptation is to feed those advertisers (at least the small ones) more fake “likes”, on the reasoning that there isn’t enough relevant traffic to serve all advertisers and the good traffic should go to the big players. When the small advertisers notice, either discontinue the practice (and come up with a new idea) or wait until you get hit by a class action lawsuit: $90 million is peanuts for Facebook, and that is what Google and others settled for when they were hit by a class action lawsuit for delivering fake traffic.

Yet there is a solution that benefits everyone (users, companies such as Google, Amazon, Netflix, Facebook or Twitter, and clients): better use of data science. I’m not talking about developing sophisticated, expensive statistical technology, but simply switching to better metrics, different weights (e.g. putting less emphasis on data resulting from crowd-sourcing), better linkage analysis, association rules to detect collusion, Botnets and low-frequency yet large-scale fraudsters, and better, frequently updated look-up tables (white lists of IP addresses). All this without slowing down existing algorithms.
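As a concrete, if simplified, illustration of that kind of re-weighting (the weights, event types and white list below are assumptions, not anyone’s production formula): crowd-sourced signals such as “likes” get less weight than comments or shares, and a frequently refreshed IP white list feeds directly into the score.

```python
# Illustrative weights: crowd-sourced signals count for less than direct signals.
WEIGHTS = {"like": 0.2, "comment": 1.0, "share": 1.5}

# Frequently refreshed white list of known-good IP addresses
# (assumed to be loaded from an external source in a real system).
IP_WHITELIST = {"203.0.113.7", "198.51.100.42"}

def engagement_score(events):
    """Score a piece of content from a list of (event_type, ip_address) pairs."""
    score = 0.0
    for event_type, ip in events:
        weight = WEIGHTS.get(event_type, 0.0)
        if ip in IP_WHITELIST:
            weight *= 2.0  # trust events coming from white-listed IPs more
        score += weight
    return score

print(engagement_score([("like", "192.0.2.1"), ("comment", "203.0.113.7")]))
```

Nothing here is computationally heavier than the naive count, which is the point: the existing pipeline does not slow down.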

Here’s one example for social network data: instead of counting the number of “likes” (not all “likes” are created equal), do:

  • Look at users that produce hundreds of “likes” a day, and “likes” arising from IP addresses that are flagged by Project Honeypot or Stop Forum Spam. Don’t put all your trust in these two websites – they also, at least partially, rely on crowd-sourcing (users reporting spam) and are thus subject to false positives and abuse. 
  • If two hundred “likes” result in zero comments, or in 50 versions of the same “this is great” / “great post” comment, then clearly we are dealing with fake “likes”. The advertiser should not be charged, and the traffic source should be identified and discontinued.
  • Look at buckets of traffic with a high proportion of low-quality users coming back too frequently with two “likes” a day, and with red flags such as no referral domain or tons of obscure domains, IP addresses that frequently do not resolve to a domain, traffic coming from a sub-affiliate, etc.
  • Also, metrics such as unique users are much more robust (more difficult to fake) than page views – but you still need to detect fake users.
  • Use a simple, robust, fast, data-driven (rather than model-driven) algorithm to score “likes” and comments, such as hidden decision trees (see the sketch after this list). You can even compute confidence intervals for the scores without any statistical modeling.
  • Create or improve ad relevancy algorithms with simple taxonomy concepts. This applies to all applications where relevancy is critical, not just ad delivery. It also means better algorithms to detect influential people (in one example, you had to be a member of PeerIndex to get a good score), to detect plagiarism, to detect friends, to score job applicants, to improve attribution systems (advertising mix optimization, including long-term components in statistical models), and the list goes on.
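Here is a minimal sketch of the kind of data-driven scoring described in the list above, in the spirit of a hidden-decision-tree / lookup-table approach (the flags, table entries and scores are illustrative assumptions): each “like” maps to a small combination of binary red flags, and its score is read from a table built from historical traffic rather than from a fitted model.

```python
def like_flags(like):
    """Turn one 'like' record into a tuple of binary red flags."""
    return (
        like["likes_per_day"] > 100,      # hyperactive account
        like["ip_flagged"],               # IP reported by Project Honeypot / Stop Forum Spam
        like["referral_domain"] is None,  # no referral domain
        like["duplicate_comment"],        # one of many identical "great post" comments
    )

# Lookup table of scores per flag combination, assumed to be estimated from
# historical, manually labeled traffic (higher = more likely genuine).
SCORE_TABLE = {
    (False, False, False, False): 0.95,
    (True,  False, False, False): 0.40,
    (True,  True,  False, False): 0.05,
}

def score_like(like, default=0.5):
    """Score one 'like'; unseen flag combinations fall back to a neutral default."""
    return SCORE_TABLE.get(like_flags(like), default)

example = {"likes_per_day": 250, "ip_flagged": True,
           "referral_domain": None, "duplicate_comment": True}
print(score_like(example))  # unseen flag combination -> 0.5 default
```

Because each table entry is just an observed proportion for one bucket of traffic, approximate confidence intervals can be attached to the scores by resampling the historical data, without any parametric modeling.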

The opposite problem also exists:

When you can analyze data (usually in real time, with automated algorithms) and extract insights faster than they can be delivered to and digested by the end users (executives and decision makers). Things go wrong when decision makers get flooded with tons of unprioritized reports.
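One simple remedy, sketched below with assumed field names, is to rank the automated output by estimated impact and deliver only the top few items to the decision maker:

```python
def prioritize_reports(reports, top_n=10):
    """Sort automated reports by estimated business impact and keep the top few."""
    ranked = sorted(reports, key=lambda r: r["estimated_impact"], reverse=True)
    return ranked[:top_n]

reports = [
    {"title": "Checkout conversion dropped 12%", "estimated_impact": 0.9},
    {"title": "Minor tracking-pixel latency",     "estimated_impact": 0.1},
]
for r in prioritize_reports(reports, top_n=1):
    print(r["title"])
```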

Sometimes this type of situation arises with machine-to-machine communication, e.g. eBay automatically pricing millions of bid keywords every day and feeding those prices to Google AdWords via the Google API.

Similarly, I once produced cluster simulations faster than a real-time streaming device could deliver them to a viewer: I called it FRT, for “faster than real time”.
