The Technology of VoltDB
VoltDB is a company fielding a technology designed by DBMS pioneer Mike Stonebraker. It is designed to address the performance limitations of existing systems, and it also offers significant potential cost savings, which puts it in the virtuous position of delivering more functionality at a lower cost.
In conversations with Stonebraker I learned a bit more about the VoltDB approach and would like to share a bit of context here.
First, for background, consider that most transaction-focused databases were designed decades ago. The need for new approaches has led to the movement many call “Big Data” and also gave rise to a group of efforts called the “NoSQL” approach. Mike underscored that the benefits of NoSQL for some problems are clear, but for transactions SQL is still key. What is needed, he says, is a NewSQL: something designed fresh to take advantage of new compute capabilities. The result is VoltDB, a re-engineered approach to SQL.
VoltDB describes itself as “The NewSQL database for high velocity applications.” It is an in-memory database, which means it primarily relies on main memory for data storage. Main memory is the memory directly accessible by the CPU; it is not secondary storage like hard disks or offline storage like tapes. Being based in main memory provides speed benefits in two ways: reads and writes avoid the latency of disk access, and the database's internal optimization algorithms can be simpler because they do not have to manage the movement of data between disk and memory.
VoltDB’s design provides the benefits of ACID (atomicity, consistency, isolation, durability) at very high transaction rates. The design also provides the benefits of a “shared nothing” architecture, which lets each database node operate in an independent and self-sufficient way. There need be no single point of contention across the system. Shared nothing architectures are known for their scalability. They scale by simply adding nodes.
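To make the shared nothing idea concrete, here is a minimal Python sketch. It is an illustration only and not VoltDB code: the `Node` and `Cluster` classes, the CRC32 hash, and the modulo routing are all assumptions chosen for brevity. Each node owns its own rows, and a key is routed to exactly one node, so no node needs to coordinate with another to serve a single-key request.

```python
import zlib


class Node:
    """One self-sufficient node: it owns its rows and shares nothing."""

    def __init__(self, name):
        self.name = name
        self.rows = {}


class Cluster:
    """Hypothetical router that sends each key to exactly one node."""

    def __init__(self, node_count):
        self.nodes = [Node(f"node-{i}") for i in range(node_count)]

    def node_for(self, key):
        # Deterministic hash, so the same key always lands on the same node.
        index = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return self.nodes[index]

    def put(self, key, value):
        self.node_for(key).rows[key] = value

    def get(self, key):
        return self.node_for(key).rows.get(key)


cluster = Cluster(node_count=3)
for i in range(1000):
    cluster.put(f"customer-{i}", {"balance": i})

# Every row lives on exactly one node; the per-node counts add up to 1000.
rows_per_node = {node.name: len(node.rows) for node in cluster.nodes}
```

Plain modulo routing is the simplest choice for a sketch. It would move most keys if the node count changed, so a production system that scales by adding nodes needs a more careful redistribution scheme than this one.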
VoltDB is a relational database that provides SQL access from within pre-compiled Java stored procedures, with the SQL statements interspersed within the Java code. Since a stored procedure can be the unit of transaction, time and compute power are saved on its execution: data does not have to make a round trip between the application and the database for each SQL statement. Stored procedures can be executed serially and to completion in a single thread without locking or latching. Since data is in memory and local to the partition, a stored procedure can execute in microseconds.
The results of this new design are much faster execution and more of the computing power being focused on results rather than administrative overhead. In benchmark testing of old systems, up to 90% of the computing effort is spent on things like managing the buffer pool, concurrency control, record locks, crash recovery, and managing multiple access threads. This overhead is eliminated with the newly engineered approach of NewSQL.
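The serial execution model can be sketched in a few lines of Python. This is a toy model under stated assumptions, not VoltDB's engine or its Java procedure API: the `Partition` class, the `deposit` procedure, and the queue-based hand-off are hypothetical names invented for the illustration. The point it shows is that when one thread owns a partition's data and runs each procedure to completion before starting the next, the procedure body needs no locks even though many clients submit work concurrently. It does not model rollback or durability.

```python
import queue
import threading


class Partition:
    """Owns its data and runs procedures one at a time on a single thread."""

    def __init__(self):
        self.data = {}
        self._work = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            item = self._work.get()
            if item is None:  # shutdown signal
                break
            procedure, args, reply = item
            try:
                # The procedure runs to completion before the next one starts.
                reply.put((True, procedure(self.data, *args)))
            except Exception as exc:
                reply.put((False, exc))

    def call(self, procedure, *args):
        """Submit a procedure and wait for its result."""
        reply = queue.Queue(maxsize=1)
        self._work.put((procedure, args, reply))
        ok, result = reply.get()
        if not ok:
            raise result
        return result

    def close(self):
        self._work.put(None)
        self._thread.join()


def deposit(data, account, amount):
    # Read-modify-write with no lock: safe only because the partition
    # thread is the sole code path that ever touches `data`.
    data[account] = data.get(account, 0) + amount
    return data[account]


def client(partition, calls):
    for _ in range(calls):
        partition.call(deposit, "alice", 1)


partition = Partition()
clients = [threading.Thread(target=client, args=(partition, 100)) for _ in range(8)]
for c in clients:
    c.start()
for c in clients:
    c.join()

balance = partition.call(deposit, "alice", 0)  # read the final balance
partition.close()
```

Eight clients each make 100 deposits, and because every deposit is applied in full before the next begins, none of the updates is lost.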
Here is a bit more from the VoltDB website:
VoltDB is a blazingly fast relational database system. It is specifically designed for modern software applications that are pushed beyond their limits by high velocity data sources. This new generation of systems – real-time feeds, machine-generated data, micro-transactions, high performance content serving – requires database throughput that can reach millions of operations per second. What’s more, the applications that use this data must be able to scale on demand, provide flawless fault tolerance and give real-time visibility into the data that drives business value.
The volume and velocity of data are exploding, fueled by social applications, sensor automation, mobile networking, and other data-intensive forces. Moore’s Law signals massive data tier scale-outs as networks and servers become faster and less expensive. Enabling that scale-out is a new generation of relational DBMSs, led by VoltDB, designed to exploit networked and virtualized computing environments. VoltDB provides the throughput, scale and accuracy needed to handle high velocity applications.
I asked Mike about use cases for this new approach. In his view, any organization which has racks of old style SQL should consider the cost savings and performance benefits of this approach, and I think he is right. When you get the benefits of needing less hardware, spending less on power, spending less on cooling data centers, but getting higher performance on transaction databases, that is very virtuous.
I also asked Mike about how they fit into an architecture where non-SQL-type analytics must be done rapidly over big data. He immediately pointed out how they fit with Hadoop. Organizations can selectively stream high velocity data from a VoltDB cluster into the Hadoop Distributed File System (HDFS) by leveraging Cloudera’s Distribution Including Apache Hadoop (CDH), which has SQL-to-Hadoop integration technology (Apache Sqoop) built in.
The following is from a recent VoltDB press release on that topic:
VoltDB Announces Enterprise-grade Hadoop Integration
Billerica, Mass., June 22, 2011 – VoltDB, a leading provider of high-velocity data management systems, today announced the release of VoltDB Integration for Hadoop. The new product functionality, available in VoltDB Enterprise Edition, allows organizations to selectively stream high velocity data from a VoltDB cluster into Hadoop’s native HDFS file system by leveraging Cloudera’s Distribution Including Apache Hadoop (CDH), which has SQL-to-Hadoop integration technology, Apache Sqoop, built in.
“The term ‘big data’ is being applied to a diverse set of data storage and processing problems related to the growing volume, variety and velocity of data and the desire of organizations to store and process data sets in their totality,” said Matt Aslett, senior analyst, enterprise software, The 451 Group. “Choosing the right tool for the job is crucial: high velocity data requires an engine that offers fast throughput and real-time visibility; high volume data requires a platform that can expose insights in massive data sets. Integration between VoltDB and CDH will help organizations to combine two special purpose engines to solve increasingly complex data management problems.”
Volume, Velocity and Variety
The volume, velocity and variety of data are exploding, fueled by social applications, sensor automation, mobile networking, and other data intensive forces. Organizations are increasingly turning to specialized, task-specific data management solutions. Leading examples include VoltDB, which is designed to process high velocity data in real time, and Cloudera’s Distribution Including Apache Hadoop (CDH), which provides organizations with a reliable and elastic infrastructure for data processing and deep analytics. VoltDB’s Integration for Hadoop allows customers to rapidly move high velocity data from VoltDB to CDH for long term storage and analysis.
“Customers across a wide variety of industries, from retail and web services to government and telecommunications, are using Cloudera’s Distribution Including Apache Hadoop to identify new value from a wide variety of data sources and then process that data into new product features for their end users,” said Ed Albanese, Head of Business Development for Cloudera. “It’s exciting that companies using CDH are now able to collect data from VoltDB – a next-generation, real-time database, process that data into high value insights and then deliver the results back to VoltDB for real-time consumption. This integration introduces new opportunities for processing and delivering information derived from a previously untapped class of data.”
VoltDB Integration for Hadoop is designed specifically to handle the widest variety of customer deployment scenarios including end-user applications, site-based OEM installations and Cloud-based deployments. It combines VoltDB’s enterprise-grade export environment with Apache Sqoop, a Cloudera-sponsored solution for integrating relational databases with Hadoop infrastructures, and delivers the following capabilities:
- Simple, fast set-up. Establishing integration between VoltDB and a Hadoop installation is fast and easy. A user identifies which VoltDB data will be exported to Hadoop, configures the VoltDB export client with the location of Hadoop, the location of a VoltDB cluster, Sqoop options such as output formatting, and other installation-specific instructions (e.g., frequency of import). The VoltDB export client automatically manages periodic Sqoop jobs based on this configuration. The entire set-up process can be completed in about 15 minutes.
- Loosely-coupled, push-pull operation. VoltDB automatically pushes copies of export data, in real-time, to the VoltDB export client, which in turn automatically queues that data. The Sqoop receiver then pulls data from the VoltDB export client and imports that data into HDFS on whatever frequency and in whatever amounts the user has defined. VoltDB’s export client manages its data buffer in a way that eliminates possible “impedance mismatches” (i.e., VoltDB exporting data faster than Sqoop imports that data).
- Automatic overflow management. VoltDB’s export client also automatically writes overflow data to disk to optimize memory utilization. This feature protects against large-scale overflows that could occur if the Sqoop receiver terminates, and allows export data to be retained across sessions if the VoltDB database is stopped.
“Big Data applications come with a complex combination of operational and analytical challenges,” said VoltDB CEO Scott Jarr. “In response, many organizations are evolving rapidly toward specialized database engines that must function in a co-ordinated way. Recognizing this need, VoltDB and Cloudera are working co-operatively to deliver high-powered product integrations that are easy to use, fast to deploy, and reliable to operate in production.”
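As a rough picture of the push-pull export flow and overflow handling described in the release, here is a minimal Python sketch. It is not VoltDB's export client and it does not use Sqoop: the `ExportBuffer` class, its `memory_limit` parameter, and the JSON-lines overflow file are assumptions made for the illustration. The producer pushes rows as fast as it likes, the consumer pulls batches at its own pace, and rows that do not fit in memory spill to disk in arrival order.

```python
import collections
import json
import os
import tempfile


class ExportBuffer:
    """Hypothetical bounded queue that spills to a file when memory is full."""

    def __init__(self, memory_limit, overflow_path):
        self.memory_limit = memory_limit
        self.overflow_path = overflow_path
        self.memory = collections.deque()
        self.spilled = 0      # rows currently waiting on disk
        self.read_offset = 0  # byte offset of the next unread spilled row

    def push(self, row):
        # To keep arrival order, once anything is on disk new rows go to disk too.
        if self.spilled == 0 and len(self.memory) < self.memory_limit:
            self.memory.append(row)
            return
        with open(self.overflow_path, "ab") as f:
            f.write(json.dumps(row).encode("utf-8") + b"\n")
        self.spilled += 1

    def pull(self, max_rows):
        """Return up to max_rows rows, oldest first."""
        batch = []
        while self.memory and len(batch) < max_rows:
            batch.append(self.memory.popleft())
        if len(batch) < max_rows and self.spilled:
            batch.extend(self._read_spilled(max_rows - len(batch)))
        return batch

    def _read_spilled(self, count):
        rows = []
        with open(self.overflow_path, "rb") as f:
            f.seek(self.read_offset)
            while len(rows) < count:
                line = f.readline()
                if not line:
                    break
                rows.append(json.loads(line))
            self.read_offset = f.tell()
        self.spilled -= len(rows)
        if self.spilled == 0:
            # Everything on disk has been consumed; start fresh next time.
            os.remove(self.overflow_path)
            self.read_offset = 0
        return rows


overflow_file = os.path.join(tempfile.mkdtemp(), "export-overflow.jsonl")
buffer = ExportBuffer(memory_limit=3, overflow_path=overflow_file)

for i in range(10):          # the fast producer pushes ten rows
    buffer.push({"seq": i})

first = buffer.pull(4)       # the slower consumer pulls in batches
rest = buffer.pull(100)
```

With a memory limit of three, the first three rows stay in memory and the remaining seven go to the overflow file, yet the consumer still receives all ten in the order they were pushed. Retaining export data across restarts, as the release describes, would additionally require persisting the buffer's bookkeeping, which this sketch leaves out.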
With all the progress being made in the direction of NewSQL, you can expect a bit of drama in the community. That is something I love about our field: folks are not afraid to voice opinions, and some of the giants have been rumbling about on this topic. For more reading on the drama see:
But perhaps more importantly, if you are designing changes to your organization’s data approaches, consider how VoltDB will fit in your approach. And stick on your path to leverage Cloudera as well, of course.