The MalStone Benchmark, TeraSort and Clouds For Data Intensive Computing

Editor SDC
The TPC Benchmarks have played an important role in comparing databases and transaction processing systems. Currently, there are no similar benchmarks for comparing two clouds.

The CloudStone Benchmark is a first step towards a benchmark for clouds designed to support Web 2.0 type applications. In this note, we describe the MalStone Benchmark, which is a first step towards a benchmark for clouds, such as Hadoop and Sector, designed to support data intensive computing.

MalStone is a stylized analytic computation of a type that is common in data intensive computing.

Detecting Drive-By Exploits from Log Files

We introduce MalStone with a simple example. Consider visitors to web sites. As described in the paper The Ghost in the Browser by Provos et al., presented at HotBots '07, approximately 10% of web pages have exploits installed that can infect certain computers when users visit the pages. These are sometimes called “drive-by exploits.”

The MalStone benchmark assumes that there are log files that record the date and time that users visited web pages. Assume that the log files of visits have the following fields:

   | Timestamp | Web Site ID | User ID

There is a further assumption that if a computer becomes infected, perhaps at a later time, then this is known. That is, for each computer, which we assume is identified by the ID of the corresponding user, it is known whether that computer has become compromised at some later time:

   | User ID | Compromise Flag

Here the Compromise Flag field is a flag, with 1 denoting a compromise. A very simple statistic that provides some insight into whether a web site is a possible source of compromises is, for each web site, the ratio of visits in which the computer subsequently becomes compromised to those in which the computer remains uncompromised.

We call MalStone stylized since we do not argue that this is a useful or effective algorithm for finding compromised sites. Rather, we point out that if the log data is so large that it requires large numbers of disks to manage it, then computing something as simple as this ratio can be computationally challenging. For example, if the data spans 100 disks, then the computation cannot be done easily with any of the databases that are common today. On the other hand, if the data fits into a database, then this statistic can be computed easily using a few lines of SQL.
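As a minimal sketch of the "few lines of SQL" case, the following uses Python's built-in sqlite3 module on the two tables described above. The table names, column names, and the `compromise_ratios` helper are assumptions made for illustration. Since the simple user table carries only a flag and no compromise time, the sketch cannot check that the compromise came after the visit.

```python
import sqlite3


def compromise_ratios(visits, users):
    """Return {site_id: compromised visits / uncompromised visits}.

    visits: iterable of (timestamp, site_id, user_id) tuples.
    users:  iterable of (user_id, compromise_flag) tuples, flag is 0 or 1.
    Table and column names are hypothetical, chosen for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (ts TEXT, site_id TEXT, user_id TEXT)")
    conn.execute("CREATE TABLE users (user_id TEXT PRIMARY KEY, flag INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", visits)
    conn.executemany("INSERT INTO users VALUES (?, ?)", users)

    # Join each visit to the visitor's flag, then count per site.
    rows = conn.execute(
        """
        SELECT v.site_id,
               SUM(u.flag)            AS compromised,
               COUNT(*) - SUM(u.flag) AS uncompromised
        FROM visits AS v
        JOIN users AS u ON u.user_id = v.user_id
        GROUP BY v.site_id
        """
    ).fetchall()
    conn.close()

    # The ratio is undefined when every visitor to a site was compromised.
    return {
        site: (bad / clean if clean else None)
        for site, bad, clean in rows
    }
```

This works only while the data fits on a single machine, which is exactly the case the benchmark is not about.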

The MalStone benchmarks use records of the following form:

   | Event ID | Timestamp | Site ID | Compromise Flag | Entity ID

Here site abstracts web site and entity abstracts the possibly infected computer. We assume that each record is 100 bytes long.

In the MalStone A Benchmark, for each site, the number of records for which an entity visited the site and subsequently became compromised is divided by the total number of records for which an entity visited the site. The MalStone B Benchmark is similar, but this ratio is computed for each week (a window is used from the beginning of the period to the end of the week of interest). MalStone A-10 uses 10 billion records, so that in total there is 1 TB of data. Similarly, MalStone A-100 requires 100 billion records and MalStone A-1000 requires 1 trillion records. MalStone B-10, B-100 and B-1000 are defined in the same way.
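The two definitions can be sketched on a single machine as follows. This is an illustration of the ratios only, not the benchmark implementation. It assumes records arrive as tuples in the field order given above, that the Compromise Flag already encodes "subsequently compromised", and that weeks are counted from a caller-supplied `start` time; the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def malstone_a(records):
    """Per site: flagged records / all records for that site."""
    total = defaultdict(int)
    flagged = defaultdict(int)
    for _event_id, _timestamp, site_id, flag, _entity_id in records:
        total[site_id] += 1
        flagged[site_id] += flag
    return {site: flagged[site] / total[site] for site in total}


def malstone_b(records, start):
    """Per (site, week): the same ratio over the window from `start`
    to the end of that week. Weeks in which a site has no records
    are omitted from the result in this sketch."""
    weekly = defaultdict(lambda: [0, 0])  # (site, week) -> [flagged, total]
    for _event_id, timestamp, site_id, flag, _entity_id in records:
        week = (timestamp - start).days // 7
        cell = weekly[(site_id, week)]
        cell[0] += flag
        cell[1] += 1

    running = defaultdict(lambda: [0, 0])  # site -> cumulative [flagged, total]
    result = {}
    for site_id, week in sorted(weekly):  # ordered by site, then by week
        flagged, total = weekly[(site_id, week)]
        cumulative = running[site_id]
        cumulative[0] += flagged
        cumulative[1] += total
        result[(site_id, week)] = cumulative[0] / cumulative[1]
    return result
```

At benchmark scale the same counting has to be spread across many disks and nodes, which is what makes the computation a useful test.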

I’ll update this post shortly with a technical report describing MalStone.

TeraSort Benchmark

One of the motivations for choosing 10 billion 100-byte records is that the TeraSort Benchmark (sometimes called the Terabyte Sort Benchmark) also uses 10 billion 100-byte records.

In 2008, Hadoop became the first open source program to hold the record for the TeraSort Benchmark. It was able to sort 1 TB of data using 910 nodes in 209 seconds, breaking the previous record of 297 seconds. Hadoop set a new record in 2009 by sorting 100 TB of data at 0.578 TB/minute using 3800 nodes. For some background about the TeraSort Benchmark, see the blog posting by Jamie Hamilton, Hadoop Wins Terasort.

Note that the TeraSort Benchmark is now deprecated and has been replaced by the Minute Sort Benchmark. Currently, 1 TB of data can be sorted in about a minute given the right software and sufficient hardware.

Generating Data for MalStone Using MalGen

We have developed a generator of synthetic data for MalStone called MalGen. MalGen is open source and available from code.google.com/p/malgen. Using MalGen, data can be generated with power law distributions, which is useful when modeling web sites (a few sites have a lot of visitors, but most sites have relatively few visitors).
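To illustrate what a power law distribution over sites looks like, here is a small sketch that draws site IDs with Zipf-like weights. It is not MalGen's code or interface; the function names, the `exponent` parameter, and the reduced record layout (event ID, site ID, entity ID only) are assumptions for illustration.

```python
import random
from collections import Counter


def zipf_site_weights(num_sites, exponent=1.0):
    """Weight for the site of rank r is 1 / r**exponent (rank 1 is most popular)."""
    return [1.0 / (rank ** exponent) for rank in range(1, num_sites + 1)]


def generate_visits(num_records, num_sites, num_entities, exponent=1.0, seed=0):
    """Return (event_id, site_id, entity_id) tuples with power-law site popularity."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    weights = zipf_site_weights(num_sites, exponent)
    sites = rng.choices(range(num_sites), weights=weights, k=num_records)
    return [
        (event_id, site_id, rng.randrange(num_entities))
        for event_id, site_id in enumerate(sites)
    ]


def visits_per_site(visits):
    """Count how many generated records landed on each site."""
    return Counter(site_id for _event_id, site_id, _entity_id in visits)
```

Counting the output with `visits_per_site` should show a handful of site IDs receiving most of the records and a long tail of sites with few visits.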

Using MalStone to Study Design Tradeoffs

Recently, we did several experimental studies comparing different implementations of MalStone on 10 billion 100-byte records. The experiments were done on 20 nodes of the Open Cloud Testbed. Each node was a Dell 1435 computer with 12 GB of memory, a 1 TB disk, two dual-core 2.0 GHz AMD Opteron 2212 processors, and a 1 Gb/s network interface card.

We compared three different implementations: 1) Hadoop HDFS with Hadoop’s implementation of MapReduce; 2) Hadoop HDFS using Streams and coding MalStone in Python; and 3) the Sector Distributed File System (SDFS) and coding the algorithm using Sphere User Defined Functions (UDFs).
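For readers unfamiliar with the second approach, a streaming job in Python is a pair of programs that read lines and write tab-separated key/value lines, with the framework sorting by key in between. The sketch below shows that shape for the MalStone A ratio. It is not the code used in the experiments, and it assumes pipe-delimited records in the field order given earlier; the function names are hypothetical.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'site_id<TAB>flag' for each well-formed record line."""
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) != 5:
            continue  # skip malformed lines rather than fail the whole job
        _event_id, _timestamp, site_id, flag, _entity_id = fields
        yield f"{site_id}\t{flag}"


def reduce_lines(sorted_lines):
    """Reducer: input must be sorted by key, as a streaming framework provides.
    Emits 'site_id<TAB>ratio' once per site."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for site_id, group in groupby(pairs, key=lambda pair: pair[0]):
        flagged = 0
        total = 0
        for _key, flag in group:
            flagged += int(flag)
            total += 1
        yield f"{site_id}\t{flagged / total}"


def run(step, infile, outfile):
    """Wire a step to file-like objects, for example sys.stdin and sys.stdout."""
    for out in step(infile):
        outfile.write(out + "\n")
```

Locally, the pipeline can be imitated with `reduce_lines(sorted(map_lines(lines)))`, where the `sorted` call stands in for the shuffle and sort phase.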

   | Implementation | MalStone A | MalStone B
   | Hadoop MapReduce | 454m 13s | 840m 50s
   | Hadoop Streams/Python | 87m 29s | 142m 32s
   | Sector/Sphere UDFs | 33m 40s | 43m 44s
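To read the timings as relative factors, the following sketch converts each time to seconds and divides by the fastest time for that benchmark. The timing values are copied from the results above; the helper names are made up for this example.

```python
import re

TIMINGS = {
    "MalStone A": {
        "Hadoop MapReduce": "454m 13s",
        "Hadoop Streams/Python": "87m 29s",
        "Sector/Sphere UDFs": "33m 40s",
    },
    "MalStone B": {
        "Hadoop MapReduce": "840m 50s",
        "Hadoop Streams/Python": "142m 32s",
        "Sector/Sphere UDFs": "43m 44s",
    },
}


def to_seconds(text):
    """Convert a string such as '87m 29s' to a number of seconds."""
    match = re.fullmatch(r"(\d+)m (\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized timing: {text!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


def slowdown_vs_fastest(timings):
    """Return each implementation's run time as a multiple of the fastest one."""
    seconds = {name: to_seconds(text) for name, text in timings.items()}
    fastest = min(seconds.values())
    return {name: value / fastest for name, value in seconds.items()}
```

With these numbers, Hadoop MapReduce takes roughly 13.5 times as long as Sector/Sphere on MalStone A and roughly 19.2 times as long on MalStone B, while Hadoop Streams with Python takes roughly 2.6 and 3.3 times as long.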

If you have 1000 nodes and want to run a data intensive or analytic computation, then Hadoop is a very good choice. What these preliminary benchmarks indicate, though, is that you may want to compare the performance of Hadoop MapReduce and Hadoop Streams. In addition, you may also want to consider using Sector.

Disclaimer: I am involved in the development of Sector.

There are related posts in my blog: blog.rgrossman.com

 
