When big data gets too big, agencies seek creative approaches

As access to data gets easier, agencies are looking for ways to sift through the noise and find the most valuable pieces that can help them make more targeted, impactful decisions.

With so much data, many agencies compared that task to searching for a needle in a haystack.

“[Data] is not the final answer, but it definitely tells you where you might want to look further,” Caryl Brzymialkiewicz, assistant inspector general and chief data officer for the Health and Human Services Department, said during a discussion at IBM’s Government Analytics Forum in Washington May 5. “What requires or what warrants further scrutiny? When you talk about looking for the needle in the haystack … part of this is how do you reduce the haystack?”

That’s the question many agencies, from the departments of Homeland Security and Health and Human Services to the National Institutes of Health, are experimenting with and seeking industry input on.

Michael Schoenbaum, senior adviser for mental health services at NIH’s Office of Science Policy, Planning and Communication, worked with the Army and Defense Department to identify, define and compare data sets to develop better suicide prevention programs in the military.

“We are absolutely about using predictive analytics to build smaller haystacks with a higher concentration of needles,” he said. “We’ve now demonstrated for our applications that we can do that when the data is available.”

Schoenbaum said the Army and NIH started with more than 40 different data sets when they first began studying characteristics of suicide in the military. The information became more meaningful once the team whittled the data down and looked more closely at the most important information.

Homeland Security Investigations seized nearly six petabytes of data last year, which amounts to about 77 years of high-definition video, unit chief Jamie Holt said. Homeland Security Investigations has 6,000 agents and 300 forensic analysts, not nearly enough manpower to review such a vast amount of information.

“Trying to review all that data and understand where all the pieces fit together is a bit of an impossible job,” Holt said. “A lot of that data is related to child exploitation and child pornography, and we recognized that it was a significant problem. We had a nine-month backlog on trying to review all the computers and the data to look through these images.”

Homeland Security Investigations reached out to industry and non-profit organizations for help. Together, they developed Project VIC, a program that sifts through hundreds of thousands of images, Holt said.

“We have a program that will automatically cull through all of that data and separate images that have already been identified,” she said. “That may only leave 300,000 images for an agent to go through and try to identify additional victims. It cut down the amount of time that we are analyzing the data.”
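
Holt did not describe how the program tells known images from new ones. One common way to do this kind of culling, offered here only as a minimal sketch and not as Project VIC's actual method, is to hash each file and compare the result against a set of hashes for material that has already been identified. The function names, the choice of SHA-256 and the sample data are all illustrative assumptions.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def triage(files: Dict[str, bytes], known_hashes: Set[str]) -> Tuple[List[str], List[str]]:
    """Split file names into (already_identified, needs_review).

    `files` maps a file name to its contents; `known_hashes` is a
    hypothetical set of digests for previously identified material.
    """
    already_identified: List[str] = []
    needs_review: List[str] = []
    for name, data in files.items():
        if sha256_of(data) in known_hashes:
            already_identified.append(name)
        else:
            needs_review.append(name)
    return already_identified, needs_review


# Made-up sample data: one file matches the known set, one does not.
known = {sha256_of(b"previously identified image bytes")}
seized = {
    "img_001.jpg": b"previously identified image bytes",
    "img_002.jpg": b"new image bytes",
}
already_identified, needs_review = triage(seized, known)
print(len(already_identified), "already identified;", len(needs_review), "left for an agent")
```

An exact hash only matches byte-for-byte copies, so under this sketch a resized or re-encoded copy of a known image would still land in the review pile.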

The unit’s backlog is now down to about a month, Holt said.

She asked for industry’s help to develop tools that could help DHS make better sense of its data.

“A lot of the tools or solutions operate in a stovepipe,” she said. “One vendor may be able to sell me a translation capability. One vendor may be able to sell me a dark-net scraping capability. Another one will scrape the Clearnet for information, but they aren’t all integrated. It takes a lot of time for the human to sit there and process all of that information within each individual tool.”

The lag between when an agency gets data and when it analyzes that information is also an issue, but it’s slowly getting shorter.

For example, it used to take the Centers for Medicare and Medicaid Services a year or more to analyze administrative health and claims data and pull out the trends. CMS is now working closer to real time, analyzing its data in two to three months, said Allison Oelschlaeger, senior adviser for the CMS Office of Enterprise Data and Analytics.

For the Homeland Security Investigations unit, better IT solutions certainly play a role in sorting through massive amounts of data. But more manpower helps as well.

In partnership with the U.S. Special Operations Command, DHS is training veterans in forensic data analysis through a year-long internship program. Roughly 80 percent of the veterans who finish the program are hired by DHS or another organization in the field, Holt said.

Like nearly every other federal agency, Homeland Security Investigations is vigorously looking to recruit and hire top cybersecurity talent. Holt said that search for new cybersecurity professionals is important, but there’s often another piece missing from those conversations.

“There’s not a lot of focus on cyber crime, and they’re two different things,” she said. “That’s one of the things that we’re pushing our recruiter to reach out to the different colleges and develop different programs that focus on cyber crime as well as cybersecurity.”