NSA's Accumulo data store has strict limits on who can see the data
Given its much-discussed enthusiasm for collecting large amounts of data, the NSA naturally took a strong interest in highly scalable NoSQL databases.
But the U.S. intelligence agency needed some security of its own, so it developed a NoSQL data store called Accumulo, with built-in policy enforcement mechanisms that strictly limit who can see its data.
At the O’Reilly Strata-Hadoop World conference this week in New York, one of the former National Security Agency developers behind the software, Adam Fuchs, explained how Accumulo works and how it could be used in fields other than intelligence gathering. The agency contributed the software’s source code to the Apache Software Foundation in 2011.
“Every single application that we built at the NSA has some concept of multi-level security,” said Fuchs, who is now the chief technology officer of Sqrrl, which offers a commercial edition of the software.
The NSA started building Accumulo in 2008. Much as Facebook did with its Cassandra database around the same time, the NSA used Google's Bigtable architecture as a starting point.
In the parlance of NoSQL databases, Accumulo is a simple key/value data store, built on a shared-nothing architecture that allows for easy expansion to thousands of nodes holding petabytes' worth of data. It features a flexible schema that allows new columns to be added quickly, and it comes with some advanced data analysis features as well.
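To make the "flexible schema" point concrete, here is a minimal sketch of a Bigtable-style key/value layout in which the column name is simply part of the key. It is a toy model, not Accumulo's API: the class `ToyKeyValueStore`, its methods, and the sample rows are all hypothetical names chosen for illustration.

```python
class ToyKeyValueStore:
    """Toy sorted key/value store; keys are (row, column) pairs."""

    def __init__(self):
        self._cells = {}

    def put(self, row, column, value):
        # The column is part of the key, so writing a brand-new
        # column requires no schema change beforehand.
        self._cells[(row, column)] = value

    def scan(self, start_row, end_row):
        """Yield (row, column, value) in sorted key order for rows in range."""
        for row, column in sorted(self._cells):
            if start_row <= row <= end_row:
                yield row, column, self._cells[(row, column)]


store = ToyKeyValueStore()
store.put("user42", "name", "Alice")
store.put("user42", "email", "alice@example.com")
# A column nobody declared in advance is just another key.
store.put("user42", "last_login", "2013-10-30")

rows = list(store.scan("user42", "user42"))
```

A real distributed store would keep the keys sorted on disk and split key ranges across nodes; the sketch only shows why adding a column is cheap in this kind of layout.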
Accumulo’s killer feature
Accumulo’s killer feature, however, is its “data-centric security,” Fuchs said. When data is entered into Accumulo, it must be accompanied by tags specifying who is allowed to see that material. Each entry carries a label specifying the roles within an organization that can access it, and those roles can map back to specific organizational security policies.
It adheres to the RBAC (role-based access control) model. This approach allowed the NSA to categorize data by its multiple levels of classification (confidential, secret, top secret) and to specify who in an organization could access the data, based on their official role within the organization. The database is accompanied by a policy engine that decides who can see what data.
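The labeling scheme described above can be sketched as a small filter that runs at read time. The code below is an illustrative toy, not Accumulo's implementation or API: the function names, the sample cells, and the exact label grammar (labels combined with `&`, `|`, and parentheses, with `&` binding tighter than `|`) are assumptions made for this example.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expression):
    """Split a label expression into names, operators, and parentheses."""
    expression = expression.strip()
    tokens, pos = [], 0
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if not match:
            raise ValueError(f"bad label expression: {expression!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expression, authorizations):
    """Return True if the reader's authorizations satisfy the label."""
    tokens = tokenize(expression)
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            right = parse_and()  # always parse, so the position advances
            result = result or right
        return result

    def parse_and():
        nonlocal pos
        result = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            right = parse_atom()
            result = result and right
        return result

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of label expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if token in ("&", "|", ")"):
            raise ValueError(f"unexpected token {token!r}")
        return token in authorizations

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


def scan(cells, authorizations):
    """Return only the (key, value) pairs the reader is allowed to see."""
    return [(key, value) for key, label, value in cells
            if is_visible(label, authorizations)]


# Hypothetical data: (key, visibility label, value).
cells = [
    ("report-1", "secret&analyst", "intercept summary"),
    ("report-2", "confidential&(analyst|auditor)", "budget memo"),
    ("report-3", "topsecret&analyst", "source identity"),
]

analyst_view = scan(cells, {"confidential", "secret", "analyst"})
auditor_view = scan(cells, {"confidential", "auditor"})
```

In this toy, holding `secret` does not imply `confidential`; the analyst is simply granted both. How classification levels nest is a policy decision that the sketch leaves to whoever issues the authorizations.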
This model could be used anywhere that security is an issue. For instance, if used in a health care organization, Accumulo can specify that only a patient and the patient’s doctor can see the patient’s data. The patient’s specific doctor may change over time, but the role of the doctor, rather than the individual doctor, is specified in the database.
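The health care case can be sketched in a few lines to show why labeling data with roles, rather than with named individuals, keeps the stored records stable. Everything here is hypothetical: the role names, the users, and the `visible_notes` helper are invented for illustration and do not reflect Accumulo's API.

```python
# The record is labeled with roles, not with any individual's name.
records = [
    {"label": {"patient:p100", "doctor_of:p100"},
     "note": "blood panel normal"},
]

# Role assignments live outside the data and can change over time.
role_assignments = {
    "alice": {"patient:p100"},
    "dr_lee": {"doctor_of:p100"},
    "dr_kim": set(),
}


def visible_notes(user):
    """Return the notes whose label shares at least one role with the user."""
    roles = role_assignments.get(user, set())
    return [record["note"] for record in records if roles & record["label"]]


before = {user: visible_notes(user) for user in role_assignments}

# The patient switches doctors: only the role assignment changes,
# the stored record and its label stay exactly as written.
role_assignments["dr_lee"].discard("doctor_of:p100")
role_assignments["dr_kim"].add("doctor_of:p100")

after = {user: visible_notes(user) for user in role_assignments}
```

After the reassignment, the new doctor sees the note and the old one no longer does, without any rewrite of the patient's data.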
The NSA found that the data-centric approach “greatly simplifies application development,” Fuchs said.
Because data today tends to be transformed and reused for different analysis applications, it makes sense for the database itself to keep track of who is allowed to see the data, rather than repeatedly implementing these rules in each application that uses this data.
“Since the applications in this model can push down the security model into the database and companion components, you don’t have to solve that in the application,” Fuchs said. As a result, “it is a lot cheaper to build that application,” he added.
This is not the NSA’s first foray into releasing open-source applications built on the role-based access model. In 2000, the agency released SELinux (Security-Enhanced Linux), which allows administrators to create policies that dictate what actions each program on a computer can execute, based on the user’s role. SELinux was subsequently rolled into the mainline Linux kernel.