Better than Brute Force: Big Data Analytics Tips

February 21, 2012

Long before I encountered the term “Big Data,” questions about dealing with large datasets were a routine part of my work. This situation was by no means unique. Government agencies and many businesses have been amassing repositories of detailed data for a long time.

Consider, for example, the transaction history of a large retailer. A single transaction might include one item, or dozens. For each item, there may be several descriptors, such as a product ID and price. Besides the items purchased, there is the time at which the transaction took place, the register, the cashier, customer payment information and more. Each item corresponds to a lot, distributor, manufacturer, shipper and so on. The customer corresponds to another stream of information, covering previous purchases, loyalty program status and marketing history. This is adding up to a lot of data, and we’re still describing just one transaction. A department store chain, grocer or big box retailer may handle a thousand transactions or more each day, year round, at each of hundreds, even thousands, of outlets.

Lately, I have heard “Big Data” used to describe everything from baseball statistics, which are far smaller in scale than the retailer example, to online retailing and social media, where the data resources can become enormous. Big is relative, yet when people ask, “Can you handle our data?” they all have certain shared concerns.

The prospect who asks if you can handle a large volume of data is looking for reassurance, and proof that you have something of value to offer. Here are some thoughts behind the question…

It’s hard for us to store this much data, let alone do anything useful with it.

We’ve tried some things that didn’t work.

Some of the things that didn’t work crashed the system.

We spent a lot of money on the last thing that didn’t work. Spending a lot of money on another thing that doesn’t work could be a career-ending move.

Of course, your answer will boil down to “Of course we can handle your data!” but if you say it that way, most people won’t believe you, and for good reason. They want an answer that addresses those unspoken concerns.

So what’s the right way to respond? Answer the question with questions. At least, that is the way to begin. (This is true whether you are a vendor, outside consultant, or an internal resource such as a business analyst.) First, ask about goals. What kind of questions does your prospect expect to answer? Why? How will the information be of use to the business? What’s the vision for integrating analytics into decision-making and business processes?

Why begin with these questions? Primarily to learn the answers, because the answers will guide you in everything else you ask and do. But there are other reasons as well. Asking questions about business goals helps your clients to become aware of gaps in their own reasoning and other challenges of meeting expectations. The process of describing goals can easily lead someone to a realization that the data doesn’t support their wants, and some redirection is necessary. You’ll be saving yourself and your client a lot of trouble by getting the big questions on the table from the start.

Your clients are, most likely, a smart bunch of people. They are experts in their own professions, not yours. They may be unsure of how to evaluate you and the services you offer. Asking questions in a respectful manner is a way to show that you care about them and what they do. If you have put serious thought into the questions you ask and listen to the answers, you’re very likely to uncover some issues of importance, including some your prospect had not previously considered. Your client develops an appreciation for what you know and what you can do for them. As you develop an understanding of their needs, your prospect develops respect and trust for you.

While you ask questions about Big Data analysis goals, keep asking yourself how much data is required to address the client’s goals. Just because an organization has endless heaps of data doesn’t mean there is a reason to touch and feel every bit of it.

Is the client asking for information about every single person in the data – individually? Very often, that’s not the case. And if that’s not the case, handling the data becomes simpler. Focusing on particular segments? OK, then you need data relevant to those segments, not everybody and her mother. Is every field in the database relevant to your research? Are there open-ended text fields, and if so, do you need them? Narrowing the data to just the relevant cases and fields can easily reduce the volume of data by a factor of ten, a hundred, a thousand… you get the picture.
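Here’s a minimal sketch of that narrowing step. The records and field names (`customer_id`, `segment`, `amount`, `notes`) are made up for illustration; the point is simply that filtering to the relevant segment and dropping unneeded fields shrinks the working set dramatically.

```python
# Hypothetical transaction records; the field names are illustrative only.
transactions = [
    {"customer_id": i,
     "segment": "loyalty" if i % 4 == 0 else "other",
     "amount": 20.0 + i,
     "notes": "open-ended text we do not need"}
    for i in range(1000)
]

# Keep only the segment under study, and only the fields the analysis uses.
relevant = [
    {"customer_id": t["customer_id"], "amount": t["amount"]}
    for t in transactions
    if t["segment"] == "loyalty"
]

print(len(transactions), len(relevant))  # 1000 250
```

A quarter of the cases and half the fields survive here; on real retail data, the reduction is often far larger.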

If the goals center on understanding behaviors of large numbers of people (or transactions, or some other things) as a group, then you have discovered another fine opportunity to reduce scale, because you don’t need every single individual to do a good job of describing the group. What you need is a sample. There are statistical approaches to sampling, and there are data mining approaches; use what suits your client’s needs and your own work processes.
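The simplest version of this is a random sample without replacement, which the standard library handles directly. The population below is a stand-in for a large customer table; stratified or cluster sampling would follow the same pattern with extra grouping logic.

```python
import random

random.seed(42)  # fix the seed so the sample is reproducible

population = list(range(1_000_000))  # stand-in for a large customer table

# Simple random sample without replacement: 1% of the cases.
sample = random.sample(population, k=10_000)

# Group-level summaries computed on the sample estimate the
# same summaries on the full population, at a fraction of the cost.
mean_pop = sum(population) / len(population)
mean_sample = sum(sample) / len(sample)
```

The sample mean won’t match the population mean exactly, but for describing the group it is typically close enough, and you got it by touching one percent of the data.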

If you don’t trust or use sampling, you don’t know drivel about data analysis; go back to school and stop wasting clients’ time and money.

What if your client really does need to be able to address every single case in a huge repository? This is certainly possible. Perhaps your client wants to rate the profitability or purchasing potential of every customer. Or maybe you’re dealing with insurance claims – which ones are potentially fraudulent? Tax payment information – who’s cheating? Situations like these do call for handling lots and lots of data.

Still, there are opportunities to minimize the resources required. Break big questions into small pieces, use sampling early and often, and educate yourself about the resource requirements for the things you do. Scoring a million cases using your predictive model, for example, may require less computing power than building the model itself on a sample of a few thousand cases. This information isn’t always easy to find; making nice with vendor tech support and your client’s IT staff may lead you to invaluable tricks of the trade.

Make it a point to minimize your demands on the data at every step of the analytic process. Exploring the data? If you’re looking to understand what’s typical, use a sample, not the whole dataset. Seeking the extreme and unusual? You’ll need more data to work with, but still may be better off with a subset of the data, at least to begin. You’re ready to build models? Often, you’ll get the best results by modeling segments one at a time. Throwing everybody into one giant equation is asking for a weak model. Take a sample of training data from a single segment and you can work faster, often producing more accurate models.
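The segment-at-a-time idea can be sketched in a few lines. The segment labels and spend figures below are invented, and the per-segment “model” here is just a mean, standing in for fitting a separate predictive model on each segment’s training sample.

```python
from collections import defaultdict
from statistics import mean

# Illustrative data: (segment, spend) pairs; names are hypothetical.
records = [("new", 10.0), ("new", 14.0), ("loyal", 80.0), ("loyal", 90.0)]

# Group cases by segment, then fit one model per segment.
by_segment = defaultdict(list)
for segment, spend in records:
    by_segment[segment].append(spend)

# A segment mean stands in for a fitted model here; in practice you
# would train a separate model on a sample drawn from each segment.
models = {segment: mean(values) for segment, values in by_segment.items()}
print(models)  # {'new': 12.0, 'loyal': 85.0}
```

Each per-segment fit runs on a smaller, more homogeneous slice of the data, which is exactly why it tends to be both faster and more accurate than one giant equation.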

When you tell a prospect that you can handle the data, it shouldn’t mean you’ll do it by brute force. A little finesse yields stronger results with less strain on resources.