NoSQL Vs. RDBMS for Interactive Analytics: Leveraging the Right and Left Brain of Data
Comparing NoSQL and relational databases is a lot like comparing the left and right sides of the brain. Too much focus on structural differences and attributes can overshadow the fact that we’re stuck with both sides of the brain, and we need both to make the best use of sensory data.
Similarly, for organizations that are grappling with the question of semi-structured versus structured data, it’s very easy to oversimplify the advantages of NoSQL and relational databases (RDBMS).
RDBMS are the logical, reliable ‘left brain’ of companies that need to get information correct. This is done by guaranteeing a transaction’s ACID (Atomicity, Consistency, Isolation and Durability) properties. Above all else, relational databases allow businesses to enforce the referential integrity of their data. For order entry, manufacturing systems or even medical applications where data needs to be complete and accurate, RDBMS are the gold standard.
However, for mobile, social and web 2.0 applications where the notion and number of transactions are significantly different from the relational world (e.g. I view a web page far more often than I order something), where data attributes continually grow and application changes are frequent, relational databases become impractical. Whereas an RDBMS would require a data model change every time, NoSQL databases are highly tolerant of such schema and application changes. NoSQL databases relax the ACID guarantees: per the CAP theorem, a distributed system can provide only two of the three properties of Consistency, Availability and Partition tolerance, and since network partitions are unavoidable at scale, most NoSQL databases trade strict consistency for availability. So with NoSQL, companies sacrifice the certainty of being right and gain the flexibility needed for the ‘right brain’ creativity of big data applications.
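That schema tolerance is easy to see with two illustrative event records (the field names here are made up for the example). A document store accepts both shapes as-is, while a relational table would need an ALTER TABLE and a migration before the second could be stored:

```python
import json

# Two event records from the same application, captured weeks apart.
# The newer record carries extra attributes; a document store ingests
# both without any schema change.
v1 = {"user": "alice", "page": "/home", "ts": "2013-05-01T10:00:00Z"}
v2 = {"user": "bob", "page": "/cart", "ts": "2013-06-12T09:30:00Z",
      "device": "mobile", "referrer": "twitter"}

for doc in (v1, v2):
    print(json.dumps(doc))  # both serialize and store, no migration needed
```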
However, the data lifecycle goes beyond use in just transactional systems. In today’s Big Data world, it is as important to quickly and easily analyze this river of data as it is to capture and store it for operational use. Limiting the structured versus unstructured debate to just operational use cases ignores three key factors for downstream analytics: the tools, domain expertise and SQL compatibility gaps in the current NoSQL ecosystem, the challenges of exporting and warehousing volumes of this changing, semi-structured data and hidden costs of leveraging operational databases for complex, ad hoc analysis. Here is what organizations must additionally consider for their analytics needs as they evaluate NoSQL and RDBMS.
1. NoSQL is an emerging field
RDBMS have now been around since the 1980s while NoSQL has only grown over the last 6 or 7 years. If you were to compare relational databases to NoSQL feature by feature, relational databases would ‘win.’
RDBMS are still more stable, easier to use, more secure and more dependable than NoSQL. The mature SQL ecosystem surrounding them provides a wide selection of tools to massage, manipulate and analyze data, as well as access to storage management, backups and other data management services unavailable for NoSQL. RDBMS also ensure that typical business users can get the business insights and reports they expect using familiar SQL grammar and SQL-based Business Intelligence tools.
While RDBMS champions like Oracle, IBM and Microsoft adhere to common SQL standards, NoSQL database systems and services are all under development. Behind the more popular names like MongoDB and HBase there are dozens of NoSQL technologies that have little in common. Today, NoSQL is semi-structured data’s Wild West.
2. Traditional data warehouses don’t get along with NoSQL
Despite the flexibility and scalability of NoSQL, the lack of standardization hurts. To get the freshest data from a NoSQL database into a traditional data warehouse, you have to Extract and Transform semi-structured data (JSON documents, key-value pairs, etc.) into a structured format that can be Loaded into the schemas prevalent in the relational world. This E-T-L process is cumbersome and not resilient to data changes, which are frequent in the NoSQL world. This means that every time you add or change attributes in your application and store them in a NoSQL database, you have to change the ETL process before transferring changed data into your data warehouse.
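A minimal sketch of that brittleness, with a hypothetical column list standing in for the warehouse schema: any attribute the application adds later is silently dropped until both this mapping and the target table are updated by hand.

```python
# Hypothetical ETL step: flatten semi-structured JSON events into the
# fixed columns a relational warehouse expects.
COLUMNS = ["user", "page", "ts"]  # the warehouse table's schema

def to_row(doc):
    """Project a JSON document onto the fixed warehouse columns."""
    return [doc.get(col) for col in COLUMNS]

events = [
    {"user": "alice", "page": "/home", "ts": "2013-05-01T10:00:00Z"},
    {"user": "bob", "page": "/cart", "ts": "2013-06-12T09:30:00Z",
     "device": "mobile"},  # new attribute the application just added
]

rows = [to_row(e) for e in events]
print(rows)  # the "device" attribute never reaches the warehouse
```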
Data warehouses were modeled after relational databases, so they’re poorly equipped to handle new types of data such as geolocation or network addresses in their native form.
These new web-friendly data types can be converted to a generic data type like text during the ETL process, but the data will shed its analytical richness. For example, if you store IP addresses in text or numeric fields, you must first convert back to the original value at analysis time in order to derive unique information like the user’s location. This can be expensive and slow over billions of records. Moreover, business users can’t use their analytical tools for interactive data exploration within the data warehouse until the data undergoes this time-consuming makeover, so insights are delayed.
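The round trip can be sketched with Python’s standard-library `ipaddress` module: an IP stored as plain text must be parsed back into its numeric form at query time before any geolocation lookup can happen, a per-row cost that adds up over billions of records.

```python
import ipaddress

def ip_to_int(ip_text: str) -> int:
    """Parse a dotted-quad string back into the integer form that
    geolocation range lookups typically need."""
    return int(ipaddress.ip_address(ip_text))

print(ip_to_int("192.168.1.1"))  # this parsing repeats for every row queried
```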
3. RDBMS and Traditional Data Warehouse costs grow prohibitively with data volume
Today’s high-traffic web and mobile apps can generate terabytes of data in a day. Along with the contemporary penchant for keeping all data forever, even a midsized company could easily end up with tens to hundreds of terabytes of data on its hands in short order. The relational world wasn’t designed to deal with this kind of volume cost-effectively. The typical costs for a traditional data warehouse, including specialized hardware and software, run in the $10K-$20K per terabyte range, all payable upfront. It’s enough to turn a CFO pale.
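The back-of-envelope arithmetic, using the per-terabyte figures quoted above:

```python
# At the quoted $10K-$20K per terabyte, payable upfront, the warehouse
# bill escalates quickly with volume.
COST_PER_TB = (10_000, 20_000)  # low and high estimates, USD

for tb in (10, 100):
    low, high = (tb * c for c in COST_PER_TB)
    print(f"{tb} TB: ${low:,} to ${high:,} upfront")
```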
Not so with NoSQL databases. With their open source software model, ability to run on commodity hardware and ability to scale horizontally with additional capacity over time, NoSQL databases significantly lower the costs of entry.
4. There are hidden costs with NoSQL
But there are added downstream costs to consider, especially for analytics. MongoDB, for example, is great for fast storage and retrieval of JSON (JavaScript Object Notation) data, but won’t work for joins across two sets of data (collections in Mongo) or for interactive, ad hoc queries with multiple values, conditions and ranges where you don’t know the arguments upfront. Additionally, as your data grows, the cost of adding and managing more shards, indexes and memory to solve for both operational and analytic speed can take up significant time and resources. And remember, the surrounding ecosystem is still developing, so there is a dearth of management tools, knowledge and support to help bring operational costs down.
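Without server-side joins, relating two collections (say, `users` and `orders`; the names and fields here are invented for illustration) means pulling both result sets into the application and joining them by hand, fine for small sets, costly at scale. Plain dicts stand in for query results in this sketch:

```python
# Application-side join: what the developer must write because the
# database won't do it for them.
users = [{"_id": 1, "name": "alice"}, {"_id": 2, "name": "bob"}]
orders = [{"user_id": 1, "item": "book"}, {"user_id": 1, "item": "pen"},
          {"user_id": 2, "item": "lamp"}]

# Build an in-memory lookup table, then stitch the two sets together.
by_id = {u["_id"]: u["name"] for u in users}
joined = [{"name": by_id[o["user_id"]], "item": o["item"]} for o in orders]
print(joined)
```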
Furthermore, NoSQL’s use of procedural languages for querying, rather than declarative SQL, restricts its use to developers and programmers. It’s like taking all your business users and shipping them off to a foreign land where they don’t speak the language, rendering them cranky and reliant on a few costly domain experts who “speak the language” to answer their questions.
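The language gap is concrete. Here is the same question, “page views per user,” expressed once as declarative SQL and once as a MongoDB aggregation pipeline (shown as the Python dicts a driver like pymongo would send; the collection and field names are assumed for the example). An analyst can read the first; the second takes a developer.

```python
# Declarative: what an analyst would write against a warehouse table.
sql = "SELECT user, COUNT(*) AS views FROM pageviews GROUP BY user;"

# Procedural/operator-based: the equivalent MongoDB aggregation
# pipeline, expressed as the documents a driver would submit.
pipeline = [
    {"$group": {"_id": "$user", "views": {"$sum": 1}}},
]
```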
Nevertheless, it is an exciting time for the data world as we see the dominance of relational systems being challenged by newer NoSQL technologies. As the very nature of data changes from static, well structured and predictable to shape-shifting in format over time, new strategies and paradigms will be needed for warehousing and analyzing this data if we wish to derive insights from it at web speed.
Soren is the Chief of Cloud Services for BitYota, a Data Warehouse Services (DWS) provider based in Silicon Valley. His early tech experience was building control systems for rockets for the European Space Agency. He went on to build real-time systems at British Telecom and Kabira/Tibco. He was the CTO for Whereonearth, whose technology is the basis of Yahoo's geo-targeting platform. At Yahoo! ...