Big Data Success in Government

On January 19, Carahsoft hosted a webinar on Big Data success in government with Bob Gourley and Omer Trajman of Cloudera. Gourley began by explaining the current state of Big Data in the government, where there are four areas of significant activity. First, federal integrators are making large investments in research and development of solutions; large firms like Lockheed Martin as well as boutique organizations have made major contributions. Second, the Department of Defense and the Intelligence Community have been major adopters of Big Data solutions to handle intelligence and information overload. Typically, they use Big Data technology to help analysts “connect the dots” and “find a needle in a haystack.” Third, the national labs under the Department of Energy have been developing and implementing Big Data solutions for research, primarily in bioinformatics, the application of computer science to biology. This work ranges from organizing millions of short reads to sequence a genome to better tracking of patients and treatments. The last area is the Office of Management and Budget and the General Services Administration, which primarily ensure the sharing of lessons and solutions.

Gourley also recapped the Government Big Data Solutions Award presented at Hadoop World last year, which highlighted the best uses of Big Data in the Federal Government. The winner was the GSA for USASearch, which uses Hadoop to host search services for more than 500 government sites effectively and economically. The other top nominees were GCE Federal, which provides cloud-based financial management solutions for federal agencies using Apache Hadoop and HBase; Pacific Northwest National Laboratory, for the work of leading researcher Dr. Ronald Taylor in applying Hadoop, MapReduce, and HBase to bioinformatics; Wayne Wheeles’ Sherpa Surfing, which uses Cloudera’s Distribution including Apache Hadoop in a cybersecurity solution for DoD networks; and the Bureau of Consular Affairs, for the Consular Consolidated Database, which searches and analyzes travel documents from around the world for fraud and security threats.

Omer Trajman then gave some background on Apache Hadoop, the technology that powers many of these solutions. Narrowly defined, Hadoop consists of the Hadoop Distributed File System (HDFS), which allows for distributed storage and analysis on clusters of commodity hardware, and MapReduce, the processing layer that coordinates work across the cluster. But when people say Hadoop, they typically mean the entire ecosystem of solutions. Hadoop is scalable, fault tolerant, and open source, and can process all types of data. Trajman described some members of the Hadoop ecosystem: Sqoop, developed by Cloudera and contributed to Apache, which transfers data between SQL databases and Hadoop; Flume, which moves massive amounts of data into Hadoop as it is generated; HBase, a Hadoop database; Pig, a data-flow oriented language for processing data; and Hive, which delivers SQL-based data warehousing in Hadoop. All of these solutions are integrated in Cloudera’s Distribution including Apache Hadoop, available as a free download. Trajman also described Cloudera’s enterprise software, which helps enterprises manage their Hadoop deployments.
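For readers new to the MapReduce model mentioned above, here is a minimal single-machine sketch in plain Python. It is not Hadoop's API and was not shown in the webinar. The function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen for illustration. It shows only the shape of the computation: a map step emits key/value pairs, the framework groups them by key, and a reduce step combines each group.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data in government", "big data on Hadoop"])
print(counts)
```

On a real cluster, the map and reduce functions would run on many machines against data stored in HDFS; here everything runs in one process to keep the idea visible.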

Listeners asked how they could learn about Hadoop on their own and were pointed to the Cloudera website, which has free resources, documentation, and tutorials, as well as regular courses on Hadoop around the country. Another attendee asked how big your data has to be to warrant Hadoop and how big is too big. Trajman replied that if a job is too large for a single machine to handle effectively, Hadoop is a good option. As of yet, no job has been found to be too large, since you can add as many machines as you need to a cluster, and Hadoop now supports federation, or clusters of clusters, for truly massive jobs. When asked who uses Hadoop, he explained that we all do through services like Twitter, Facebook, LinkedIn, and Yahoo. Gourley explained when it makes sense to migrate data to Hadoop and which types of problems are best taken on by Hadoop. If you have too much data to analyze on your current infrastructure, you should consider moving it to Hadoop and, while not every problem is well suited for distributed computing, “if a problem is partitionable, it’s Hadoopable.”
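As a rough illustration of what “partitionable” can mean, the sketch below computes a mean by splitting the data into chunks, summarizing each chunk independently, and combining the small summaries at the end. It uses Python's standard thread pool rather than Hadoop. The names partition, partial_stats, and distributed_mean are hypothetical, and the sketch assumes non-empty input.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(values, n_parts):
    """Split the data into n_parts independent chunks."""
    return [values[i::n_parts] for i in range(n_parts)]


def partial_stats(chunk):
    """Summarize one chunk as (sum, count); needs no other chunk's data."""
    return sum(chunk), len(chunk)


def distributed_mean(values, n_parts=4):
    """Compute the mean from per-chunk summaries (assumes non-empty input)."""
    chunks = partition(values, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(partial_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


result = distributed_mean(list(range(1, 101)))
print(result)
```

The mean works because each chunk's (sum, count) pair can be merged with the others. An exact median, by contrast, cannot be recovered from per-chunk medians alone, which is one example of a problem that needs more care to distribute.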