Cookies help us display personalized product recommendations and ensure you have great shopping experience.

By using this site, you agree to the Privacy Policy and Terms of Use.
Accept
SmartData CollectiveSmartData Collective
  • Analytics
    AnalyticsShow More
    sales and data analytics
    How Data Analytics Improves Lead Management and Sales Results
    9 Min Read
    data analytics and truck accident claims
    How Data Analytics Reduces Truck Accidents and Speeds Up Claims
    7 Min Read
    predictive analytics for interior designers
    Interior Designers Boost Profits with Predictive Analytics
    8 Min Read
    image fx (67)
    Improving LinkedIn Ad Strategies with Data Analytics
    9 Min Read
    big data and remote work
    Data Helps Speech-Language Pathologists Deliver Better Results
    6 Min Read
  • Big Data
  • BI
  • Exclusive
  • IT
  • Marketing
  • Software
Search
© 2008-25 SmartData Collective. All Rights Reserved.
Reading: Analytics at Twitter
Share
Notification
Font ResizerAa
SmartData CollectiveSmartData Collective
Font ResizerAa
Search
  • About
  • Help
  • Privacy
Follow US
© 2008-23 SmartData Collective. All Rights Reserved.
SmartData Collective > Big Data > Data Mining > Analytics at Twitter
Data MiningPredictive Analytics

Analytics at Twitter

TonyBain
TonyBain
10 Min Read
SHARE

Twitter

Last week I spent some time speaking with Kevin Weil, head of analytics at Twitter. Twitter, from a technology perspective, has had a bit of a hard time due to their stability issues in their early days. Kevin was keen to point out that he feels this was due to the incomparable growth Twitter was experiencing at the time and their constant struggle to keep up. Kevin was also keen to show that Twitter prides itself on striving for engineering excellence, the creation and contribution to new technologies, and generally assisting in pushing the boundaries forward. Our conversation naturally centered on analytics at Twitter.

Twitter, like many web 2.0 apps, started life as a MySQL based RBDMS application. Today, Twitter is still using MySQL for much of their online operational functionality (although this is likely to change in the near future – think distributed), but on the analytics side of things Twitter has spent the last six months moving away from running SQL queries against MySQL data marts. This was because their need for timely data was becoming a struggle with MySQL, particularly when dealing with very large data volumes and complicated queries. For Web 2.0 the ability to .. …

More Read

Behind the scenes of REvolution’s 64-bit Windows port of R
Data to the People!
Four Essentials for Enabling Pattern-Based Strategies
Quick notes
Teradata Podcasts on Data Mining And SNA


Twitter

Last week I spent some time speaking with Kevin Weil, head of analytics at Twitter. Twitter, from a technology perspective, has had a bit of a hard time due to their stability issues in their early days. Kevin was keen to point out that he feels this was due to the incomparable growth Twitter was experiencing at the time and their constant struggle to keep up. Kevin was also keen to show that Twitter prides itself on striving for engineering excellence, the creation and contribution to new technologies, and generally assisting in pushing the boundaries forward. Our conversation naturally centered on analytics at Twitter.

Twitter, like many web 2.0 apps, started life as a MySQL based RBDMS application. Today, Twitter is still using MySQL for much of their online operational functionality (although this is likely to change in the near future – think distributed), but on the analytics side of things Twitter has spent the last six months moving away from running SQL queries against MySQL data marts. This was because their need for timely data was becoming a struggle with MySQL, particularly when dealing with very large data volumes and complicated queries. For Web 2.0 the ability to understand, quantify and make timely predictions from user behavior is very much their life blood. When Kevin arrived at Twitter 6 months ago he was tasked with changing the way Twitter analyzed their data. Now the bulk of their analytics is executed using a Hadoop platform with Pig as the “querying language”. 

Hadoop is a distributed shared-nothing cluster which locates data throughout the cluster using a virtualized file system. What has made Hadoop particularly popular for large scale deployment is the comparative ease of writing distributed functions through a process known as map/reduce. Map/reduce hides much of the complexity of running distributed functions, even when running over a very large numbers of nodes. This allows the developer to focus on their “application logic” rather than worrying about specifics of the execution process (Hadoop handles distribution of execution, node failures, etc). But in saying this, expressing complicated application logic directly in map/reduce functions can become quite laborious as many pipelined map/reduce functions may be required to take raw data through to a useful processed result. Because of this complexity several higher level scripting languages have appeared to abstract this.

Twitter

Pig is one such scripting language for Hadoop. Pig takes the developers requirement expressed in the script and produces the underlying map-reduce jobs that are executed on Hadoop. This abstraction is incredibly important as without it the complexity of expressing difficult analytical ‘queries’ directly in map/reduce would be highly time consuming and error prone.  This can be thought of as being similar to the way SQL is a higher level abstraction language that hides all the query plan routines (written in C) that operate on the data in a traditional RDBMS. Of course abstraction provides increased efficiency in creating analytical routines, but comes at a performance cost. Kevin quantified his experience, he found typically a Pig script is 5% of the code of native map/reduce written in about 5% of the time. However, queries typically take between 110-150% the time to execute that a native map/reduce job would have taken. But of course, if there is a routine that is highly performance sensitive they still have the option to hand-code the native map/reduce functions directly.

Ok, so why use Hadoop and Pig instead of more traditional approach like an MPP RDBMS? Kevin explained that there were a few reasons for this. Firstly Twitter, like many Web 2.0 companies, is committed to open source and likes to use software that has a low entry cost but also allows them to contribute to the code base. Kevin mentioned that Twitter did look at some of the open source MPP RDBMS platforms but were less than convinced of their ability to scale to meet their needs at the time. And the second reason is exactly that, scale. Twitter is understandably coy on their exact numbers, but they have hundreds of Terabytes of data (but less than a Petabyte) and one could assume that to get reasonable performance they are running Hapdoop on a few dozen nodes (this is a guess, Twitter didn’t say). As they grow analytics will become more important to their business, this may expand to hundreds (or thousands) of nodes. A “few hundred” nodes is right on the upper limit on what is possible today with the world’s most advanced MPP RBDMS’s. Hapdoop clusters, on the other hand, grow well into the hundreds and even the thousands of nodes (e.g., at Google and Facebook).

So Hadoop was the platform choice, but why Pig? There are other “analytical” scripting languages that sit over Hadoop, notably Hive which was popularized by Facebook (Pig was popularized by Yahoo). On discussing the merits of Pig vs Hive it became apparent that Hive was more in tune with a traditional approach (“database like”). Hive requires data to be mapped to a given structure and the queries (using a SQL like derivative) are submitted against that schema. Pig on the other hand is less prescriptive in terms of schema and individual queries can define the structure of the data for that execution. In addition, Pig is more of a “procedural” language allowing the complicated data flow process to be more easily controlled and understood by the developers.

So, as mentioned, Hadoop is a batch-based job processing platform. Jobs (in this case map/reduce jobs generated from the Pig queries) are submitted and results are returned sometime in the future. Exactly when in the future varies from a few minutes (e.g., they run jobs hourly which only take a few minutes to run) through to many hours for jobs that run over much larger sets of data. This leaves a gap in “near real-time” analytics between the lightweight queries they can run on the transactional system and the more intense Hadoop based analytics. This has been a space that Twitter has been investigating solutions to fill. This space will be used for things like improved abuse detection, issue analysis and so on. Twitter is currently considering their data platform options here including Cassandra, HBase and may even decide to use a closed sourced MPP solution to fill this need (I can’t say what, sorry) due to the lack of suitable open source MPP alternatives.

For more technical info on Twitters use of Hadoop and Pig you can check out Kevin’s slide deck from the recent NoSQL East conference.

Reblog this post [with Zemanta]


Link to original post

TAGGED:hadooptwitter
Share This Article
Facebook Pinterest LinkedIn
Share

Follow us on Facebook

Latest News

sales and data analytics
How Data Analytics Improves Lead Management and Sales Results
Analytics Big Data Exclusive
ai in marketing
How AI and Smart Platforms Improve Email Marketing
Artificial Intelligence Exclusive Marketing
AI Document Verification for Legal Firms: Importance & Top Tools
AI Document Verification for Legal Firms: Importance & Top Tools
Artificial Intelligence Exclusive
AI supply chain
AI Tools Are Strengthening Global Supply Chains
Artificial Intelligence Exclusive

Stay Connected

1.2kFollowersLike
33.7kFollowersFollow
222FollowersPin

You Might also Like

“Our Customers Don’t Use Stuff like Facebook and Twitter”

4 Min Read
Image
Analytics

Twitter Consistently Valuable for Analytics Initiatives

4 Min Read

How BI and Data Analytics Gurus Used Twitter in February

4 Min Read

Top Market Researchers on Twitter

2 Min Read

SmartData Collective is one of the largest & trusted community covering technical content about Big Data, BI, Cloud, Analytics, Artificial Intelligence, IoT & more.

giveaway chatbots
How To Get An Award Winning Giveaway Bot
Big Data Chatbots Exclusive
ai is improving the safety of cars
From Bolts to Bots: How AI Is Fortifying the Automotive Industry
Artificial Intelligence

Quick Link

  • About
  • Contact
  • Privacy
Follow US
© 2008-25 SmartData Collective. All Rights Reserved.
Go to mobile version
Welcome Back!

Sign in to your account

Username or Email Address
Password

Lost your password?