The MalStone Benchmark, TeraSort and Clouds For Data Intensive Computing

The TPC Benchmarks have played an important role in comparing databases and transaction processing systems. Currently, there are no similar benchmarks for comparing two clouds.

The CloudStone Benchmark is a first step towards a benchmark for clouds designed to support Web 2.0 type applications. In this note, we describe the MalStone Benchmark, which is a first step towards a benchmark for clouds, such as Hadoop and Sector, designed to support data intensive computing.

MalStone is a stylized analytic computation of a type that is common in data intensive computing.

Detecting Drive-By Exploits from Log Files

We introduce MalStone with a simple example. Consider visitors to web sites. As described in the paper The Ghost in the Browser by Provos et. al. that was presented at HotBot ‘07, approximately 10% of web pages have exploits installed that can infect certain computers when users visit the web pages. Sometimes these are called “drive-by exploits.”

The MalStone benchmark assumes that there are log files that record the date and time that users visited web pages. Assume that the log files of visits have the following fields:

   | Timestamp | Web Site ID | User ID

Th…

Contents

Detecting Drive-By Exploits from Log Files Detecting Drive-By Exploits from Log Files TeraSort Benchmark Generating Data for MalStone Using MalGen Using MalStone to Study Design Tradeoffs

The TPC Benchmarks have played an important role in comparing databases and transaction processing systems. Currently, there are no similar benchmarks for comparing two clouds.

MalStone is a stylized analytic computation of a type that is common in data intensive computing.

Detecting Drive-By Exploits from Log Files

The MalStone benchmark assumes that there are log files that record the date and time that users visited web pages. Assume that the log files of visits have the following fields:

   | Timestamp | Web Site ID | User ID

There is a further assumption that if the computers become infected, at perhaps a later time, then this is known. That is for each computer, which we assume is identified by the ID of the corresponding user, it is known whether at some later time that computer has become compromised:

   | User ID | Compromise Flag

Here the Compromise field is a flag, with 1 denoting a compromise. A very simple statistic that provides some insight into whether a web page is a possible source of compromises is to compute for each web site the ratio of visits in which the computer subsequently becomes compromised to those in which the computer remains uncompromised.

We call MalStone stylized since we do not argue that this is a useful or effective algorithm for finding compromised sites. Rather, we point out that if the log data is so large that it requires large numbers of disks to manage it, then computing something as simple as this ratio can be computationally challenging. For example, if the data spans 100 disks, then the computation cannot be done easily with any of the databases that are common today. On the other hand, if the data fits into a database, then this statistic can be computed easily using a few lines of SQL.

The MalStone benchmarks use records of the following form:

   | Event ID | Timestamp | Site ID | Compromise Flag | Entity ID

Here site abstracts web site and entity abstracts the possibly infected computer. We assume that each record is 100 bytes long.

In the MalStone A Benchmarks, for each site, the number of records for which an entity visited the site and subsequently becomes compromised is divided by the total number of records for which an entity visited the site. The MalStone B Benchmark is similar, but this ratio is computed for each week (a window is used from the beginning of the period to the end of the week of interest). MalStone A-10 uses 10 billion records so that in total there is 1 TB of data. Similarly, MalStone A-100 requires 100 billion records and MalStone A-1000 requires 1 trillion records. MalStone B-10, B-100 and B-1000 are defined in the same way.

I’ll update this post shortly with a technical report describing MalStone.

TeraSort Benchmark

One of the motivations for choosing 10 billion 100-byte records is that the TeraSort Benchmark (sometimes called the Terabyte Sort Benchmark) also uses 10 billion 100-byte records.

In 2008, Hadoop became the first open source program to hold the record for the TeraSort Benchmark. It was able to sort 1 TB of data using using 910 nodes in 209 seconds, breaking the previous record of 297 seconds. Hadoop set a new record in 2009 by sorting 100 TB of data at 0.578 TB/minute using 3800 nodes. For some background about the TeraSort Benchmark, see the blog posting by Jamie Hamilton Hadoop Wins Terasort.

Note that the TeraSort Benchmark is now deprecated and has been replaced by the Minute Sort Benchmark. Currently, 1 TB of data can be sorted in about a minute given the right software and sufficient hardware.

Generating Data for MalStone Using MalGen

We have developed a generator of synthetic data for MalStone called MalGen. MalGen is open source and available from code.google.com/p/malgen. Using MalGen, data can be generated with power law distributions, which is useful when modeling web sites (a few sites have a lot of visitors, but most sites have relatively few visitors).

Using MalStone to Study Design Tradeoffs

Recently, we did several experimental studies comparing different implementations of MalStone on 10 billion 100-byte records. The experiments were done on 20 nodes of the Open Cloud Testbed. Each node was a Dell 1435 computer with 12 GB memory, 1TB disk, 2.0GHz dual dual-core AMD Opteron 2212,
and 1 Gb/s network interface cards.

We compared three different implementations: 1) Hadoop HDFS with Hadoop’s implementation of MapReduce; 2) Hadoop HDFS using Streams and coding MalStone in Python; and 3) the Sector Distributed File System (SDFS) and coding the algorithm using Sphere User Defined Functions (UDFs).

MalStone A
Hadoop MapReduce	454m 13s
Hadoop Streams/Python	87m 29s
Sector/Sphere UDFs	33m 40s
MalStone B
Hadoop MapReduce	840m 50s
Hadoop Streams/Python	142m 32s
Sector/Sphere UDFs	43m 44s

If you have 1000 nodes and want to run a data intensive or analytic computation, then Hadoop is a very good choice. What these preliminary benchmarks indicate though is that you may want to compare the performance of Hadoop MapReduce and Hadoop Streams. In addition, you may also want to consider using Sector.

Disclaimer: I am involved in the development of Sector.

There are related posts in my blog: blog.rgrossman.com