VectorWise - SmartData Collective

I was fortunate enough to speak with Marcin Zukowski earlier about VectorWise. If you missed it, VectorWise came out of stealth mode a day or two ago. They have announced a partnership with Ingres and are essentially claiming impressive analytic RDBMS performance gains on conventional hardware.

To start with, a key message that I think needs to be communicated here is that this is not a product announcement. Ingres and VectorWise have announced a partnership in which they of course plan to build products together, but today those products are still in the works.

VectorWise is a spin out of CWI based on research that was undertaken by Marcin and others, research that centered on MonetDB. Explaining the essence of VectorWise is difficult because it is largely internal DBMS data storage & processing logic, but I will have a go.

The modern RDBMS is based around design principles that stem from general purpose OLTP roots and historical hardware architectures (this is partially true even for some of the newest analytic platforms). In a nutshell, these design principles assume that disk is slow & CPU is fast. Data is fetched from disk by seeks or partial scans and then cached. Row-by-row (tuple-by-tuple) operators process that data, each passing its output to the next operator in the query's execution plan until the result is ultimately produced.
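To make the tuple-by-tuple model concrete, here is a minimal sketch of an operator pipeline in which each operator pulls one row at a time from its child. The operator names (`scan`, `select`, `project`) and the sample `orders` table are hypothetical and purely illustrative; this is not code from Ingres, MonetDB or any other product.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, float]

def scan(table: List[Row]) -> Iterator[Row]:
    """Leaf operator: hand rows to the parent one at a time."""
    for row in table:
        yield row

def select(child: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Filter operator: one predicate call per row."""
    for row in child:
        if predicate(row):
            yield row

def project(child: Iterable[Row], expr: Callable[[Row], float]) -> Iterator[float]:
    """Projection operator: one expression evaluation per row."""
    for row in child:
        yield expr(row)

# Hypothetical sample data.
orders = [
    {"price": 10.0, "qty": 3},
    {"price": 4.0, "qty": 1},
    {"price": 2.5, "qty": 8},
]

# Plan: scan -> select(qty > 2) -> project(price * qty) -> sum
plan = project(
    select(scan(orders), lambda r: r["qty"] > 2),
    lambda r: r["price"] * r["qty"],
)
total = sum(plan)
print(total)
```

Every row travels through the whole chain individually, so the per-row call overhead is paid once per operator per tuple.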

Traditionally I/O is the main bottleneck, so to make the database faster you add more I/O bandwidth. Today, the disk capacity installed may be up to 100x what the data actually needs, simply because many disks are necessary to achieve the I/O bandwidth an analytical RDBMS implementation requires. Even though RDBMSs may parallelize query operators across cores, this typically works by partitioning data between cores, and each core is still processing on a tuple-by-tuple basis.

Conventional wisdom? Well maybe. You see disk is only really “slow” when it is doing random seeks. Give a disk something sequential to do on the other hand and things are very different. Modern disks are able to sequentially scan in the range of 150MB per second. An array of 10 disks should therefore be able to return sequentially read data in the range of 1GB per second.

When it comes to databases, column based storage has been found to effectively structure data for a) high levels of compression and b) sequential access. VectorWise makes use of both to help it achieve high levels of sequential I/O. The problem now, however, is that disk may no longer be the bottleneck. While we can get 1GB a second sequentially off disk relatively easily & cheaply, processing tuple-by-tuple at this rate is very difficult. As it turns out, an RDBMS may only achieve a data processing rate of 50MB a second per CPU core. This makes CPU processing a big bottleneck for analytic data sets: assuming the above figures, we would need 20 or more cores to keep up with 10 disks (and of course CPU cores don't scale linearly).
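The arithmetic behind those figures can be sketched as follows. The two rates come from the text above; the function names are hypothetical, and the calculation assumes perfectly linear scaling across cores, which real systems will not achieve.

```python
import math

DISK_MB_PER_SEC = 150  # sequential scan rate per disk, from the text
CORE_MB_PER_SEC = 50   # tuple-by-tuple processing rate per core, from the text

def aggregate_scan_rate(disks: int) -> int:
    """Best-case sequential bandwidth of an array of disks, in MB/s."""
    return disks * DISK_MB_PER_SEC

def cores_needed(scan_mb_per_sec: float) -> int:
    """Cores required to keep up, assuming ideal linear scaling."""
    return math.ceil(scan_mb_per_sec / CORE_MB_PER_SEC)

raw_rate = aggregate_scan_rate(10)
print(raw_rate)                # 1500 MB/s at the raw per-disk rate
print(cores_needed(1000))      # 20 cores for the rounded 1GB/s figure
print(cores_needed(raw_rate))  # 30 cores at the raw 1.5GB/s
```

Because the scaling assumption is optimistic, these core counts are lower bounds.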

If we step out of the database world for the moment into the world of high end computer games, or high end scientific processing, we find their use of current CPU technology is much more advanced than what we are used to. They are using newer CPU extensions (MMX, SSE, SSE2, the SSE3 instructions introduced with Intel's Prescott core, etc) to parallelize & pipeline computation within a CPU core, meaning they are processing orders of magnitude more instructions per core than what a traditional RDBMS has typically been able to. The exact details are too low level to discuss here (many of the research papers are available online), but it is fair to say that modern CPU architectures contain advanced features that to date haven't been effectively exploited by database vendors.

Enter VectorWise. Their aim is to marry storage technologies which allow high levels of sequential I/O with query processing logic which is designed for modern CPU architectures. Rather than process tuple-by-tuple they process "vectors", groups of tuples, leveraging modern CPU extensions and high levels of on-chip cache to allow the CPU to achieve higher data processing throughput. The result is that instead of the 50MB a second of a tuple-by-tuple approach, VectorWise are able to achieve processing rates in the range of 500MB-1GB a second per core in some situations. This means processing rates of 8GB a second or more could be possible with relatively low end hardware.
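As a rough illustration of the vector-at-a-time idea, the sketch below runs a filter and a multiplication over small batches of column values instead of single rows. All names, the sample columns and the `VECTOR_SIZE` value are hypothetical, and this is not VectorWise's implementation. Python itself gains no SIMD benefit here; the sketch only shows the control flow, where each operator call handles a whole vector in one tight loop of the kind a compiled engine can pipeline and keep in on-chip cache.

```python
from typing import Iterator, List, Sequence

VECTOR_SIZE = 4  # hypothetical; kept tiny so the example is easy to follow

def scan_vectors(column: Sequence[float], size: int = VECTOR_SIZE) -> Iterator[Sequence[float]]:
    """Yield consecutive slices (vectors) of one stored column."""
    for start in range(0, len(column), size):
        yield column[start:start + size]

def select_gt(vec: Sequence[float], threshold: float) -> List[int]:
    """Return the positions within the vector whose value exceeds threshold."""
    return [i for i, value in enumerate(vec) if value > threshold]

def multiply(a: Sequence[float], b: Sequence[float], positions: List[int]) -> List[float]:
    """Multiply two vectors element-wise at the selected positions only."""
    return [a[i] * b[i] for i in positions]

# Hypothetical columns, stored separately as in a column store.
price = [10.0, 4.0, 2.5, 1.0, 6.0, 3.0]
qty = [3, 1, 8, 5, 2, 4]

total = 0.0
for price_vec, qty_vec in zip(scan_vectors(price), scan_vectors(qty)):
    keep = select_gt(qty_vec, 2)  # one call filters a whole vector
    total += sum(multiply(price_vec, qty_vec, keep))
print(total)
```

The per-call overhead is now paid once per vector rather than once per tuple, which is the amortization the vectorized approach relies on.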

“In some situations” is the key point to stress here. This obviously isn’t a blanket gain that applies to all analytic data sets, workloads and query requirements. Just what those situations are, and how well the approach actually applies to real world data sets and queries, will be the key to the technology’s success. I wouldn’t expect to see too many specific examples of this until a product beta appears. But the theory is that VectorWise can offer high levels of processing capability on existing mainstream hardware. At this point VectorWise isn’t even focusing on MPP; instead they are focused on a single node. If their scalability claims pan out, you can imagine how this could allow a single node solution to be competitive with existing low to mid scale MPP solutions that are based on a more conventional query processing architecture.

This isn’t the only trick up VectorWise’s sleeve. They are also leveraging research around column based storage, compression, piggy-backed (shared) scans and so on. Much of the research that has been adopted by VectorWise is referenced on their web site.

So VectorWise have impressive technology. Why then partner with Ingres rather than a larger vendor (or go it alone)? Marcin offers a few reasons. Firstly, as academics they feel strongly that open source is cool, so this path was greatly preferred over a relationship with a non-open vendor. Secondly, Ingres will allow them to deliver their technology in an uncompromised fashion. Marcin mentioned that if they had partnered with one of the big three vendors, that vendor’s existing product strategies and investments would likely have meant their ideas could only have been implemented in partial form. Ingres on the other hand is going to allow them more of a green field. And of course, a partnership with Ingres makes sense from a go-to-market perspective, as Ingres already has a worldwide reputation, a global customer base, sales & marketing capabilities etc.

Marcin confirmed that Ingres have an exclusive license to their technology, and first option to acquire them for a certain period of time. This allows Ingres to really invest in the relationship without the fear of the carpet being pulled out from under them.

VectorWise clearly are applying innovative research to analytical RDBMS requirements. But as interesting as the technology sounds, the proof of the pudding will be how well these design principles translate to real-world analytical processing requirements in mainstream product form. This remains to be seen, but Ingres and their community clearly have high hopes.

VectorWise is clearly differentiated when compared with a traditional mainstream RDBMS running on mainstream hardware. However, in the current market we have lots of different approaches to the problems described. Kickfire, for example, use their own SQL Chip processor to increase data processing rates, and other appliance vendors are using FPGAs etc for similar purposes. The relative effectiveness of these different approaches still needs to be examined, but a mainstream hardware approach has obvious benefits.
