
Analytics at Twitter

TonyBain
Last updated: 2009/11/25 at 7:35 AM


Last week I spent some time speaking with Kevin Weil, head of analytics at Twitter. From a technology perspective, Twitter has had a bit of a hard time because of the stability issues of its early days. Kevin was keen to point out that he attributes this to the unprecedented growth Twitter was experiencing at the time and the constant struggle to keep up. He was also keen to show that Twitter prides itself on striving for engineering excellence, creating and contributing to new technologies, and generally helping to push the boundaries forward. Our conversation naturally centered on analytics at Twitter.

Twitter, like many Web 2.0 apps, started life as a MySQL-based RDBMS application. Today, Twitter still uses MySQL for much of its online operational functionality (although this is likely to change in the near future; think distributed), but on the analytics side Twitter has spent the last six months moving away from running SQL queries against MySQL data marts. The reason was that getting timely data out of MySQL was becoming a struggle, particularly with very large data volumes and complicated queries. For Web 2.0 companies, the ability to understand, quantify and make timely predictions from user behavior is very much their lifeblood. When Kevin arrived at Twitter six months ago he was tasked with changing the way Twitter analyzed its data. Now the bulk of its analytics is executed on a Hadoop platform with Pig as the “querying language”.

Hadoop is a distributed, shared-nothing platform that spreads data across the cluster using a distributed file system (HDFS). What has made Hadoop particularly popular for large-scale deployment is the comparative ease of writing distributed functions through a programming model known as map/reduce. Map/reduce hides much of the complexity of running distributed functions, even over a very large number of nodes. This allows developers to focus on their “application logic” rather than worrying about the specifics of execution (Hadoop handles distribution of execution, node failures, etc.). That said, expressing complicated application logic directly in map/reduce functions can become quite laborious, as many pipelined map/reduce functions may be required to take raw data through to a useful processed result. Because of this complexity, several higher-level scripting languages have appeared to abstract it away.
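To make the map/reduce idea concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and not anything Twitter runs: the function names and the sample records are made up for illustration, and the "shuffle" step stands in for the grouping that the framework would perform across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit (key, value) pairs for one input record: here, (user, 1) per tweet."""
    user, _text = record
    yield (user, 1)


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical sample data (user, tweet text); not real Twitter data.
SAMPLE_TWEETS = [
    ("alice", "hello"),
    ("bob", "hadoop at scale"),
    ("alice", "pig scripts"),
]

tweets_per_user = run_job(SAMPLE_TWEETS)
```

The developer only writes `map_phase` and `reduce_phase`; on a real cluster the equivalents of `shuffle` and `run_job` are the parts the platform takes care of, including distribution and retries after node failures.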


Pig is one such scripting language for Hadoop. Pig takes the developer’s requirements as expressed in a script and produces the underlying map/reduce jobs that are executed on Hadoop. This abstraction is incredibly important, as without it expressing difficult analytical “queries” directly in map/reduce would be highly time-consuming and error-prone. It is similar to the way SQL is a higher-level language that hides the query plan routines (written in C) that operate on the data in a traditional RDBMS. Of course, abstraction makes creating analytical routines more efficient but comes at a performance cost. Kevin quantified his experience: a Pig script is typically about 5% of the code of native map/reduce, written in about 5% of the time. However, queries typically take 110-150% of the time that a native map/reduce job would have taken. And if a routine is highly performance sensitive, they still have the option to hand-code the native map/reduce functions directly.
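As a rough back-of-the-envelope illustration of that trade-off, the sketch below applies Kevin’s approximate ratios to a hypothetical native map/reduce job. The input figures (1,000 lines of code, 40 hours of development, a 60-minute run) are invented for the example, and the function name is my own.

```python
# Rough ratios quoted above, kept as integer percentages to avoid float noise.
PIG_CODE_PCT = 5
PIG_DEV_TIME_PCT = 5
PIG_RUNTIME_PCT_RANGE = (110, 150)


def estimate_pig_tradeoff(native_loc, native_dev_hours, native_run_minutes):
    """Scale a native map/reduce job's figures by the quoted Pig ratios."""
    low_pct, high_pct = PIG_RUNTIME_PCT_RANGE
    return {
        "pig_loc": native_loc * PIG_CODE_PCT / 100,
        "pig_dev_hours": native_dev_hours * PIG_DEV_TIME_PCT / 100,
        "pig_run_minutes": (
            native_run_minutes * low_pct / 100,
            native_run_minutes * high_pct / 100,
        ),
    }


# Hypothetical native job: 1,000 lines, 40 hours to write, 60 minutes to run.
estimate = estimate_pig_tradeoff(
    native_loc=1000, native_dev_hours=40, native_run_minutes=60
)
```

With these made-up inputs, the Pig version comes out at roughly 50 lines and 2 hours of development, with a run time somewhere between 66 and 90 minutes.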

OK, so why use Hadoop and Pig instead of a more traditional approach such as an MPP RDBMS? Kevin explained that there were a few reasons. Firstly, Twitter, like many Web 2.0 companies, is committed to open source and likes to use software that has a low entry cost and also allows them to contribute to the code base. Kevin mentioned that Twitter did look at some of the open source MPP RDBMS platforms but was less than convinced of their ability to scale to meet its needs at the time. The second reason is exactly that: scale. Twitter is understandably coy about exact numbers, but they have hundreds of terabytes of data (though less than a petabyte), and one could assume that to get reasonable performance they are running Hadoop on a few dozen nodes (this is a guess; Twitter didn’t say). As they grow and analytics becomes more important to their business, this may expand to hundreds (or thousands) of nodes. A “few hundred” nodes is right at the upper limit of what is possible today with the world’s most advanced MPP RDBMSs. Hadoop clusters, on the other hand, grow well into the hundreds and even the thousands of nodes (e.g., at Yahoo and Facebook).

So Hadoop was the platform choice, but why Pig? There are other “analytical” scripting languages that sit on top of Hadoop, notably Hive, which was popularized by Facebook (Pig was popularized by Yahoo). In discussing the merits of Pig vs. Hive, it became apparent that Hive is more in tune with a traditional, “database-like” approach. Hive requires data to be mapped to a given structure, and queries (written in a SQL-like dialect) are submitted against that schema. Pig, on the other hand, is less prescriptive about schema, and individual queries can define the structure of the data for that execution. In addition, Pig is more of a “procedural” language, which lets developers control and understand a complicated data flow more easily.
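The “each query defines its own structure” idea can be sketched in plain Python. This is neither Pig nor Hive syntax, only an illustration of the concept: the raw tab-separated lines, the field names and the helper functions are all hypothetical.

```python
from collections import Counter

# Hypothetical raw, tab-separated log lines; no schema is stored with the data.
RAW_LINES = [
    "alice\t2009-11-20\thello world",
    "bob\t2009-11-21\thadoop at scale",
    "alice\t2009-11-21\tpig scripts",
]


def load(lines, schema):
    """Parse each line using the field names supplied by the calling query."""
    for line in lines:
        # zip stops at the shorter input, so unnamed trailing fields are ignored.
        yield dict(zip(schema, line.split("\t")))


def tweets_per_user(lines):
    """This query only needs the first field, so it names only that one."""
    return dict(Counter(row["user"] for row in load(lines, schema=("user",))))


def tweets_per_day(lines):
    """A different query imposes a fuller structure on the same raw lines."""
    rows = load(lines, schema=("user", "day", "text"))
    return dict(Counter(row["day"] for row in rows))


per_user = tweets_per_user(RAW_LINES)
per_day = tweets_per_day(RAW_LINES)
```

Both queries read the same untyped lines, and each step (load, then count) is written out in order, which is the procedural, data-flow style described above. In a fixed-schema approach, the structure would instead be declared once up front and every query would run against it.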

Hadoop is a batch-oriented job processing platform. Jobs (in this case, map/reduce jobs generated from the Pig queries) are submitted and results are returned some time later. Exactly how much later varies from a few minutes (e.g., they run hourly jobs that take only a few minutes) through to many hours for jobs that run over much larger sets of data. This leaves a gap for “near real-time” analytics between the lightweight queries they can run on the transactional system and the more intensive Hadoop-based analytics. Twitter has been investigating solutions to fill this gap, which would support things like improved abuse detection, issue analysis and so on. The data platform options under consideration include Cassandra and HBase, and Twitter may even decide to use a closed-source MPP solution (I can’t say which, sorry) due to the lack of suitable open source MPP alternatives.

For more technical info on Twitter’s use of Hadoop and Pig, you can check out Kevin’s slide deck from the recent NoSQL East conference.
