Approaches to Big Data Visualization
We can expect the data visualization process to continue to evolve, perhaps as more of a mixture of art and science than a number-crunching technology.
Data visualization is the practice of organizing and displaying data, manually or otherwise, in a pictorial or graphic format so that your audience can:
- See the results of your analysis efforts more clearly
- Make sense of the complexities within the data you are using
- Understand and grasp the point that you are using the data to make
Not a new concept
This concept of using pictures (typography, color, contrast, and shape) to communicate or understand data is not new. It has been around for centuries, from the manual creation of maps and graphs in the 17th century to the invention of the pie chart in the early 1800s.
Today, computers can process large amounts of data at lightning speed, which makes visualizations tremendously more valuable. Going forward, we can expect the data visualization process to continue to evolve, perhaps as more of a mixture of art and science than a number-crunching technology.
An exciting example of this evolution is how the industry has moved past simply generating and publishing charts and graphs for an audience to review and deliberate on. Audiences now expect interactive visualizations.
Interactive visualization takes the concept of data visualization much further by using technology to let the audience interact with the data. Users get the self-service ability to drill down into the generated pictures, charts, and graphs (to access more, or more specific, details) and to change, in real time or near real time, what data is displayed (perhaps a different time frame or event) and how it is processed and presented (maybe a bar graph rather than a pie chart).
This allows visualizations to be much more effective and personalized.
In Chapter 5, Displaying Results with D3, we will go through displaying the results of big data analysis in a typical web browser using Data-Driven Documents (D3), with a variety of examples. D3 lets you apply pre-built data visualizations to datasets.
Data-Driven Documents is referred to within the open source community as D3.
The D3 library components give you excellent tools for big data visualization and a data-driven approach to DOM manipulation. D3’s functional style lets you reuse code modules that you (or others) have already built, adding whatever features you need and leaving out those you don’t. The result can be as powerful as you want (or have the time) to make it, giving your data visualizations a unique style and making them interactive, exactly how you want or need them to be.
As discussed earlier in this chapter, big data is accumulating daily (in fact, minute by minute), and organizations rely on this information for a variety of reasons.
Various reporting formats are used to present this data, including data dashboards.
As with most things, opinions differ as to the most accurate definition of a data dashboard.
For example, A. Chiang writes:
“A dashboard is a visual display of the most important information needed to achieve one or more objectives; consolidated and arranged on a single screen so the information can be monitored at a glance.”
Refer to the following link for more information: http://www.dashboardinsight.com/articles/digital-dashboards/fundamentals/what-is-a-dashboard.aspx.
Whatever the definition, any dashboard has the capacity for supplying timely, important information for its audience to use in decision making, if it is well designed and constructed.
It is critical that dashboards present data in a relevant, concise, and well-thought-out manner (not just as a collection of visual representations in a workbook or spreadsheet). In addition, dashboards need a supporting infrastructure capable of refreshing the dashboard in a timely manner, as well as some form of data quality assurance (DQA). Making decisions based upon a dashboard with incorrectly presented, stale, or even incorrect data can lead to disaster.
Chapter 6, Dashboard for Big Data – Tableau, of this book examines the topic of effective dashboarding and includes working examples demonstrating solutions for effectively presenting results based upon your big data analysis in a real-time dashboard format using Tableau.
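The freshness and quality requirements described above can be sketched as a small pre-display check. This is a minimal illustration only: the function name `check_dashboard_data`, the `value` field, and the 15-minute threshold are assumptions made for the example, not part of any dashboard product.

```python
from datetime import datetime, timedelta

def check_dashboard_data(rows, last_refresh, now, max_age=timedelta(minutes=15)):
    """Return a list of problems found before the data is shown on a dashboard.

    rows, last_refresh, and max_age are hypothetical inputs for illustration.
    """
    problems = []
    # Freshness: data older than max_age is treated as stale.
    if now - last_refresh > max_age:
        problems.append("stale")
    # Basic quality checks: no rows at all, or rows with missing values.
    if not rows:
        problems.append("empty")
    if any(row.get("value") is None for row in rows):
        problems.append("missing values")
    return problems

now = datetime(2016, 11, 1, 12, 0, 0)
rows = [{"region": "East", "value": 120}, {"region": "West", "value": None}]
problems = check_dashboard_data(rows, last_refresh=datetime(2016, 11, 1, 11, 0, 0), now=now)
print(problems)
```

A real deployment would run a check like this as part of the refresh pipeline and flag or withhold the affected views rather than just printing the result.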
Tableau is categorized as business intelligence software designed to help people see and understand data; more than just a code library, Tableau is considered to be a suite or a family of interactive data visualization products.
Tableau’s structure lets us combine multiple views of data from multiple sources into a single, highly effective dashboard that can provide data consumers with much richer insights. Tableau also works with a variety of data formats (both structured and unstructured) and can handle the volumes of big data (terabytes or petabytes, millions or billions of rows), turning that big data into valuable visualizations for targeted audiences.
To address the velocity of today’s big data world, you can use Tableau to connect directly to local and cloud data sources, or just import your data for fast in-memory (more on in-memory later in this book) performance.
Another goal of Tableau is self-service analytics (which we mentioned earlier in this chapter and will talk more about later on), where a user can have a dialog with selected data, asking questions in real time (not in batch mode) using easy point-and-click analytics. This makes it possible to mine big data intuitively and to discover the insights and opportunities that may exist within the dataset or datasets.
Some of the more exciting abilities Tableau offers include:
- Real-time drag-and-drop cluster analysis
- Cross data source joining
- Powerful data connectors
- Mobile enabled
- Real-time territory or region data exploration
In Chapter 7, Dealing with Outliers Using Python, we will dive into outliers.
As was defined earlier in this chapter, an outlier is an observation point that is distant or vastly different from the other observed data points within the data.
Although outliers typically represent (only) about 1 to 5 percent of your data, when you’re working with big data, investigating, or even just viewing, 1 to 5 percent of that data is rather difficult.
Investigation and adjudication
Outliers, you see, can be determined to be non-influential or very influential to the point you are trying to make with your data visualization.
Making this determination is critically important to your analysis, but it is also very problematic when dealing with the large volumes, many varieties, and high velocities of big data. For example, a fundamental step in making this determination is sizing your samples: calculating the percentage of outliers relative to the size of the data sample. That is not so simple a task when the data runs to terabytes or petabytes!
Identifying and removing outliers can be tremendously complicated, and there are many differences of opinion as to how to determine the percentage of outliers in your dataset, how to determine their effect on the data, and what to do with them. It is, however, generally accepted that an automated process can be created to facilitate at least the identification of outliers, possibly even through the use of visualization.
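As a rough illustration of automating that identification, the sketch below flags outliers in a small sample and reports what percentage of the sample they represent. It assumes the interquartile range (IQR) rule as the detection method, which is only one of several common choices and is not necessarily the one used in the book; the function name `outlier_percentage` and the sample values are made up for the example.

```python
import statistics

def outlier_percentage(values, k=1.5):
    """Flag outliers with the IQR rule and return (outliers, percent of sample)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr  # fences beyond which a value is flagged
    outliers = [v for v in values if v < low or v > high]
    return outliers, 100.0 * len(outliers) / len(values)

# Hypothetical sample: twenty readings, one of them far from the rest.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
          12, 10, 11, 13, 12, 11, 12, 10, 13, 95]
outliers, pct = outlier_percentage(sample)
print(outliers, pct)
```

On data that does not fit in memory, the same calculation would have to run on a sample or in a distributed framework, which is exactly where the difficulty described above comes from.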
All of the approaches for the investigation and adjudication of outliers (sorting, capping, graphing, and so on) require manipulating and processing the data using a tool that is feature-rich and robust.
Chapter 7 offers working examples demonstrating solutions for effectively and efficiently identifying and dealing with big data outliers (as well as some other dataset anomalies) using Python.
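Two of these approaches, sorting and capping, can be sketched in a few lines of Python. The readings and the lower and upper bounds below are hypothetical values chosen for illustration; in practice the bounds would come from domain knowledge or from percentiles of the data.

```python
def cap_values(values, lower, upper):
    """Clamp every value into the range [lower, upper] (capping)."""
    return [min(max(v, lower), upper) for v in values]

readings = [10, 12, 11, 95, 13, -40, 12]

# Sorting pushes extreme values to the ends, where they are easy to inspect.
ranked = sorted(readings)

# Capping keeps every observation but limits the influence of the extremes.
capped = cap_values(readings, lower=0, upper=20)

print(ranked)
print(capped)
```

Capping keeps the row count unchanged, which can matter when the outliers should be dampened rather than dropped.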
Python is a scripting language that is extremely easy to learn and incredibly readable, since its coding syntax so closely resembles the English language.
According to the article, The 9 most in-demand programming languages of 2016, by Bouwkamp, available at http://www.codingdojo.com/blog/9-most-in-demand-programming-languages-of-2016, Python is listed among the most in-demand programming languages (at the time of writing).
Born as far back as 1989 and created by Guido van Rossum, Python is actually very simple in nature, but the industry also considers it extremely powerful and fast, and it can run in almost any environment.
As per www.python.org:
“Open sourced (and free!), Python is part of the winning formula for productivity, software quality, and maintainability at many companies and institutions around the world.”
There is growing interest within the industry in using Python for data analysis, and even for big data analysis. It is an excellent choice for a data scientist’s typical day-to-day activities because it provides an extensive standard library, along with third-party libraries (some aimed at big data or scientific computing, such as Pydoop and SciPy), to accomplish almost anything you need or want to do with the data you have or are accumulating, including:
- Building websites and web pages
- Accessing and manipulating data
- Calculating statistics
- Creating visualizations
- Building predictive and explanatory models
- Evaluating models on additional data
- Integrating models into production systems
As a final note here, Python’s standard library is very extensive, offering a wide range of built-in modules that provide access to system functionality, as well as standardized solutions to many problems that occur in everyday programming. This makes Python an obvious choice to explore for dealing with big data outliers and related processing.
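To give a feel for how far the standard library alone can go, here is a minimal sketch that reads a small CSV extract and calculates summary statistics using only the built-in `csv`, `io`, and `statistics` modules. The column names and values are invented for the example.

```python
import csv
import io
import statistics

# A tiny in-memory CSV stands in for a real data file (hypothetical data).
raw = io.StringIO("machine,temp\nA,70.5\nB,71.0\nC,69.5\nD,71.0\n")

# Accessing and manipulating data: parse rows and convert the column to floats.
temps = [float(row["temp"]) for row in csv.DictReader(raw)]

# Calculating statistics with the built-in statistics module.
summary = {
    "count": len(temps),
    "mean": statistics.mean(temps),
    "median": statistics.median(temps),
    "stdev": statistics.stdev(temps),
}
print(summary)
```

For larger datasets, third-party libraries would normally take over, but the same read, convert, summarize pattern applies.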
In Chapter 8, Big Data Operational Intelligence with Splunk, of this book, we concentrate on big data Operational Intelligence.
Operational intelligence (OI) is a type of analytics that attempts to deliver visibility and insight from (usually machine generated) operational or event data, running queries against streaming data feeds in real time, producing analytic results as operational instructions, which can be immediately acted upon by an organization, through manual or automated actions (a clear example of turning datasets into value!).
Sophisticated OI systems also provide the ability to associate metadata with certain metrics, process steps, channels, and so on, found within data. With this ability, it becomes easy to acquire additional related information. For example, machine-generated operational data is typically full of unique identifiers and result or status codes. These codes or identifiers may be efficient for processing and storage, but are not always easily interpreted by human beings. To make this data more readable (and therefore more valuable), we can associate additional, more user-friendly information with the data results, possibly in the form of a status or event description, or perhaps a product name or machine name.
Once the challenges of applying basic analytics and visualization techniques to operational big data are understood, the value of that data can be realized better and more quickly. Chapter 8 offers working examples demonstrating solutions for extracting value from operational or event big data with operational intelligence using Splunk.
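The idea of attaching friendlier descriptions to raw codes can be sketched with plain lookup tables. This is a generic illustration, not how any particular OI product implements it; the lookup contents, field names, and the `enrich` function are all hypothetical.

```python
# Hypothetical lookup tables mapping raw codes to human-readable text.
STATUS_LOOKUP = {200: "OK", 404: "Not Found", 503: "Service Unavailable"}
MACHINE_LOOKUP = {"m-001": "Press Line A", "m-002": "Packaging Line B"}

def enrich(event):
    """Return a copy of the event with readable descriptions added."""
    enriched = dict(event)  # leave the original event untouched
    enriched["status_text"] = STATUS_LOOKUP.get(event["status"], "Unknown status")
    enriched["machine_name"] = MACHINE_LOOKUP.get(event["machine_id"], "Unknown machine")
    return enriched

event = {"machine_id": "m-002", "status": 503}
readable = enrich(event)
print(readable)
```

Keeping the raw codes alongside the descriptions preserves the compact form for processing while making results readable in a report or dashboard.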
So, what is Splunk? H. Klein says:
“Splunk started out as a kind of “Google for Log files”. It does a lot more… It stores all your logs and provides very fast search capabilities roughly in the same way Google does for the internet…”
Splunk software is a great tool to help unlock hidden value in machine-generated, operational data (as well as other types of data). With Splunk, you can collect, index, search, analyze, and visualize all your data in one place, providing an integrated method to organize and extract real-time insights from massive amounts of machine data (big data) from virtually anywhere.
Splunk stores data in flat files, assigning indexes to the files, and doesn’t require any database software running in the background to make this happen. Splunk calls these files indexes; the components that do the indexing are called indexers. Splunk can index any type of time-series data (data with timestamps), making it an optimal choice for big data OI solutions. During data indexing, Splunk breaks data into events based on the timestamps it identifies.
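To make the idea of timestamp-based event breaking concrete, here is a simplified Python sketch. It is not Splunk's actual algorithm: it assumes a single fixed timestamp format at the start of each event, and the log lines and function name `break_into_events` are made up for the example.

```python
import re
from datetime import datetime

# Assumed timestamp format for this sketch: "YYYY-MM-DD HH:MM:SS" at line start.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def break_into_events(text):
    """Start a new event at each timestamped line; attach other lines to the previous event."""
    events = []
    for line in text.splitlines():
        match = TIMESTAMP.match(line)
        if match:
            stamp = datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S")
            events.append({"time": stamp, "raw": line})
        elif events:
            # Continuation line (for example, a stack trace) belongs to the last event.
            events[-1]["raw"] += "\n" + line
        # Lines before the first timestamp are dropped in this sketch.
    return events

log = (
    "2016-11-01 10:00:00 machine=m-001 status=200\n"
    "2016-11-01 10:00:05 machine=m-002 status=503\n"
    "  retrying connection\n"
    "2016-11-01 10:00:09 machine=m-001 status=200\n"
)
events = break_into_events(log)
print(len(events))
```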
Although simple search terms (for example, a machine ID) will work, Splunk also offers its own Search Processing Language (SPL). SPL (think of it as somewhat like SQL) is an extremely powerful tool for searching enormous amounts of big data and performing statistical operations on what is relevant within a specific context.
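The general pattern of searching first and then computing statistics on the matching events can be illustrated in plain Python. This is only an analogy for the search-then-summarize workflow, not SPL itself; the events and field names are hypothetical.

```python
from collections import Counter

# Hypothetical, already-parsed events.
events = [
    {"machine_id": "m-001", "status": 200},
    {"machine_id": "m-002", "status": 503},
    {"machine_id": "m-001", "status": 503},
    {"machine_id": "m-002", "status": 503},
]

# Search step: keep only the events relevant to the question (server errors).
failures = [e for e in events if e["status"] >= 500]

# Statistical step: count the matching events per machine.
failures_by_machine = Counter(e["machine_id"] for e in failures)
print(failures_by_machine.most_common())
```

A query language performs the same two steps declaratively and at a scale that in-memory lists cannot handle.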
This tutorial has been taken from Big Data Visualization by James D. Miller.