The recent leaks about the NSA’s use data mining and predictive analytics has certainly raised awareness of our field and has resulte
The recent leaks about the NSA’s use data mining and predictive analytics has certainly raised awareness of our field and has resulted in hours of discussions with family, relatives, friends and reporters about what predictive analytics can (and can’t) do with phone records, emails, chat messages, and other structured and unstructured data. Eric Siegel and I have been interviewed on multiple occasions to address this issue from a Predictive Analytics perspective, and in case, in the same article: “What the NSA can’t do with your data (probably)”. Part of my goal in these conversations has been to bring back to reality many of the inflated expectations of what can be done with predictive analytics: predictive analytics is a powerful approach to finding patterns in data, but it isn’t magic, nor is it fool-proof.
First, let me be clear: I have no direct knowledge of the analytics the NSA is doing. I have worked on many fraud detection projects for the U.S. Government and private sector, some including what I would describe as a “social networking” component to them where the connections between parties is an important part of the risk factors.
The phone call meta data shows simple information about each phone call: origination, destination, date of the call, duration, and perhaps some geographic information about the origination and destination. One of the valuable aspects of the data is that connections can be made between origination and destination numbers, and as a results, one can build social networks of every origination phone number in the data. The U.S. has more than 326.4 millions cell phones subscriptions as of December 2012 according to CTIA. The Pew Research survey found that individual cell phone users had on average 664 social connections (not all of which are cell connections). The number of links needed to build a U.S.-wide social map of phone call connections easily outstrips any possible visualization method, and therefore, without filtering connections and networks, these social maps would be useless. One of the factors working in our favor, if we are concerned with privacy issues related to this meta data, is therefore the sheer size of the network.
The networks of phone calls, I believe, are particularly useful in connecting high-risk individuals with others whom the NSA may not know beforehand are connected to the person of interest. In other words, a starting point is needed first and the social network is built from this starting point. If one has multiple starting points, one can also find linkages between networks even if the networks themselves don’t overlap significantly.
The strength of a link can include information such as number of calls, duration of calls, regularity of calls, most recent call, oldest call, and more. Think of these are a cell-phone version of RFM analysis. The networks can be pruned easily based on thresholds for these key features, simplifying the networks considerably.
But even if the connections are made, this data is woefully incomplete on its own. First, there is no connection to the person who actually made the call, only the phone number and who it is registered to. Finding who made the calls requires more investigation. Second, it doesn’t necessarily connect all the phones an individual might use. If a person uses 5 or 6 cell phones, one doesn’t know that the same person is behind these phone numbers. Third, one certainly doesn’t know the intent or content of the call.
Given these limitations, what value is there to the network of calls? These networks are usually best used as lead-generation engines. Which other phone numbers in a network are connected to multiple high-risk individuals (but weren’t here-to-fore considered high risk)? Is the timeline of calls temporally correlated with other known events?
Analytics, and link analysis in particular, provide tremendously powerful techniques to identify new leads and remove unfruitful leads by finding connections unlikely to occur randomly.