The “Garbage in, garbage out” principle has even more magnitude in the case of Big Data. Estimates show a 650% growth in the volume of information that will be available over the next 5 years. Therefore organizations need a way to clean their info or risk getting buried in meaningless, valueless strings. One in five developers highlight the quality of data as their biggest problem when building new apps, and the average company loses over $10 million due to data quality. These problems are inherited from a time when data meant only neat tables, accessible via SQL. Now, anything from payroll to CCTV and online reviews qualify as data and needs to be tested before being used. But where to start?
Challenges of Big Data testing
It’s best to take the bull by the horns and list the sensitive areas and the subsequent problems before designing an attack strategy. When it comes to Big Data, challenges emerge from the very structure of the data, as described by the 4Vs, the lack of knowledge and the technical training of the testers as well as from the costs associated with such an initiative.
The 4Vs of Big Data
Not all large sets of data are truly Big Data. If it is only a matter of volume, it’s just a high load. When the size is given by velocity (high frequency), variety (numbers, text, images, and audio) and more so by veracity, it is a genuinely interesting situation. Testing for volume should eliminate bottlenecks and create parallel delivery. Velocity testing is necessary to counteract attacks that only last a few seconds and to prevent overloading the system with irrelevant data. Testing the accuracy of various information is the biggest challenge and requires continuous transformation and communication between Big Data and SQL managed systems through Hadoop.
Not all QA teams are comfortable with automation, let alone the problems posed by Big Data. The different formats require a different initial set-up. Only the correct data format validation requires more time and energy than an entire software based on conditional programming. The speed of development and the size of data does not allow a step by step approach. Agile is the only way, and it is not a mindset that has been adopted by all QA experts.
Since there are no standard methodologies for Big Data testing as yet, the total length of a project depends greatly on the expertise of the team, which, as previously mentioned is another variable. The only way to cut costs is to create sprints of testing. Proper validation should reduce total testing times. This aspect is also related to the architecture, as storage and memory can increase costs exponentially. When working with Big Data software testing company A1QA recommends implementing Agile and Scrum to keep spending under tight control and scale requirements to fit the budget dynamically.
Big Data Testing Aspects
When it comes to Big Data, testing means successfully processing petabytes (PB) of data, and there are functional and non-functional steps necessary to ensure end-to-end quality.
- Data validation (structured & unstructured), also called pre-Hadoop – check the accuracy of loaded data compared to the original source, file partitioning, data synchronization, determining a data set for testing purposes.
- Map Reduce Process Validation – compress data into manageable packages and validate outputs compared to inputs.
- ETL (extract, transform, load) Process Validation – validate transformation rules, check for data accuracy and eliminate corrupted entries, also confirms reports. This is usually a good place to include process automation, since data at this point is correct, complete and ready to be moved in the warehouse, so few exceptions are possible.
Since Big Data is too large to check each entry, the way around this problem is to use “checksum” logic. Instead of looking at individual entries, you look at combined products of them and decide on the validity of these aggregated products. To make a simple parallel, instead of checking the entire payroll, you just look at the amount paid and the inbox for employees’ complaints. If the amount is the same as the sum of the salaries and there is no complaint, you can conclude the payment was made correctly.
To make sure the system is running smoothly you also should check the meta-data. These are the non-functional tests:
- Performance testing – deals with the metrics such as response time, speed, memory utilization, storage utilization, virtual machine performance and so on.
- Failover testing – verifies how the process will perform in case some nodes are not reachable.
Differences from traditional testing
The testing process described is much different from a classic software testing approach. First, the introduction of unstructured data requires a rethinking of the validation strategies, so dull work is transformed into R&D. Sampling data for testing is no longer an option, but a challenge, to ensure representability of data for the entire batch.
Even the architecture of the testing environment is different. A browser window and an emulator software are no longer enough, Big Data can only be processed in Hadoop. The validation tools have also evolved from simple spreadsheets to dedicated tools like MapReduce, and they require dedicated specialists with extensive training.
The most important difference is that validation through manual testing alone is no longer possible anymore.