As an industry, “big data” is still in its infancy: the practice of storing data in its raw, often massive form and gleaning from it the meaningful items you need to build a regular database. (If you disagree, see the first part of this story.) Big data sprang from a solution created by the search engine industry, but it was adopted by businesses and even governments so fast that its security model hadn’t yet been considered, let alone implemented.
Only in recent months has the rapid rise of Hadoop opened the way for new commercial interests, including three companies whose business models are based solely on Hadoop support: Hortonworks (named for a different elephant), MapR (short for the “MapReduce” function used in Hadoop analytics), and Cloudera. Their sudden commercial success has enabled the rapid emergence of a new and tighter Hadoop security model, one whose principles could also apply to Cassandra and MongoDB. It’s based on five principal components:
Session Encryption. In principle, Kerberos is by no means the wrong choice for encryption and authentication. It just needs to be implemented in a way that doesn’t create new opportunities for exploitation that didn’t exist before.
“Big data is typically cloud-based, be it a private or public cloud. We can thrive in those environments,” boasts Todd Thiemann, a senior product marketing director with Vormetric. Thiemann’s company makes a data center-based appliance that enables encryption at all levels, especially for data that must traverse the public Internet to reach its cluster.
“Encryption, historically, has been a performance burden, chewing up a fair number of CPU cycles,” Thiemann notes. “Our data security manager, which is a hardened appliance with the keys, sits in the data center. You have control of that data in the data center, even though the data itself might be distributed across private and public clouds.” Vormetric supports Intel Advanced Encryption Standard New Instructions (AES-NI), which adds seven AES-related instructions to the microprocessor. “By taking those instructions from software into hardware, you can significantly speed encryption and decryption operations, and as a result, get a lot better performance,” he says. Vormetric has made such appliances for some time, but has been shipping this adapted version only since May 2012.
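As a rough illustration of what hardware-assisted AES buys you, the sketch below times a bulk encryption with Java’s standard javax.crypto API; this is not Vormetric’s code, just a generic measurement. On CPUs with AES-NI, recent HotSpot JVMs compile these Cipher calls down to the hardware instructions (the -XX:+UseAES and -XX:+UseAESIntrinsics flags, enabled by default where the hardware supports them).

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.security.SecureRandom;

// Rough, single-shot AES throughput check (no JIT warm-up, so treat the
// number only as an indication). On AES-NI hardware the same code runs
// substantially faster than a pure-software AES implementation.
public class AesThroughput {
    public static void main(String[] args) throws Exception {
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();

        byte[] iv = new byte[12];                    // 96-bit GCM nonce
        new SecureRandom().nextBytes(iv);
        byte[] buffer = new byte[64 * 1024 * 1024];  // 64 MB of throwaway data

        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));

        long start = System.nanoTime();
        cipher.doFinal(buffer);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("Encrypted 64 MB in %.3f s (%.1f MB/s)%n",
                          seconds, 64 / seconds);
    }
}
```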
“Now that encryption is easy, and key management is sophisticated, and the cost of doing it in the cloud is so minimal, quite frankly, it should probably be a best practice that, if you have anything in the cloud and it’s sitting somewhere in storage, on disk, you should encrypt all of it.” This from Larry Warnock, one of the original marketing masterminds behind the Vignette content management system, now the CEO of key management system provider Gazzang. His product, called zTrustee, is a universal key management system that manages the encryption of objects: nebulous things whose identities or functions it doesn’t have to know or understand. Coupled with Gazzang’s zEncrypt (this company loves the letter “z”), zTrustee can be applied transparently to Hadoop, so keys don’t have to be replicated and transferred to a million different nodes.
“We address encryption at several layers. Definitely, the core for big data is at the file system layer,” explains zEncrypt’s chief architect, Eddie Garcia. So suppose you have a petabyte of data in Amazon’s S3 storage, on which Hadoop is preparing to run a MapReduce job (a computation, such as extracting the most relevant samples or averages, distributed across a large cluster). Garcia explains that the entire petabyte is encrypted at the time it reaches Amazon, not afterward.
“Of course, all the different pieces are also encrypted,” he adds. “There’s an encryption key, and the protocol that talks between the key server and the data nodes is itself encrypted, using SSL plus a secure protocol between them. Then on the actual key server itself, the data is also encrypted there. So there’s encryption going on at various levels: on the key server, at the transport level with data in transit, as well as data at rest, where you have this petabyte of big data, in this case.”
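What Garcia describes maps onto a familiar layering often called envelope encryption: a data key protects the data at rest, the key server protects the data key, and the exchange between them travels over TLS. The sketch below is only a generic illustration of that pattern, not Gazzang’s protocol; the key-server URL, its plain base64 response, and the file layout (IV followed by ciphertext) are all assumptions made for this example.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import javax.net.ssl.HttpsURLConnection;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Base64;

// Illustrative layered-encryption client: the data key arrives over TLS,
// and the data block stays encrypted on disk until that key is in hand.
public class LayeredDecryption {
    public static void main(String[] args) throws Exception {
        // 1. Transport layer: fetch the data key over TLS.
        //    The endpoint and its base64 response format are hypothetical.
        URL keyServer = new URL("https://keyserver.example.com/keys/dataset-42");
        HttpsURLConnection conn = (HttpsURLConnection) keyServer.openConnection();
        byte[] dataKey;
        try (InputStream in = conn.getInputStream()) {
            dataKey = Base64.getDecoder().decode(new String(in.readAllBytes(), "UTF-8").trim());
        }

        // 2. At-rest layer: assume the block was written as IV || ciphertext.
        byte[] blob = Files.readAllBytes(Paths.get("/data/dataset-42/part-00000.enc"));
        byte[] iv = Arrays.copyOfRange(blob, 0, 16);
        byte[] ciphertext = Arrays.copyOfRange(blob, 16, blob.length);

        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(dataKey, "AES"),
                    new IvParameterSpec(iv));
        byte[] plaintext = cipher.doFinal(ciphertext);
        Files.write(Paths.get("/tmp/part-00000"), plaintext);
    }
}
```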
Key Management. In every security platform, the weakest link is passwords. One of the original goals of Kerberos was to replace ordinary passwords (many of which are still as stupid as “4444”) with impossible-to-guess certificates. But when the key management system for those certificates is governed by password-based user access control, the certificates are pointless. And when the keys themselves are managed within ordinary databases, many of which are transmitted in their entirety in the clear every day as a management procedure, it doesn’t matter how impossible to guess a certificate may be.
One solution being tried by Gazzang involves disassociating trust from awareness: more explicitly, discarding the assumption that an agent authorized to allow access to an object must have any idea what that object is.
The trustee in Gazzang’s key management system is a new kind of agent (though not as new as Hadoop) that’s like a postman you can trust to deliver keys without reading them. As Larry Warnock explains, “A process or a person is required to authorize the release of a key, but they never see it.” One example involves an SSL certificate that must be pushed down to an Apache server; eight active trustees are notified of the event. Five of those trustees vote “yes,” and majority rule triggers the release, even though none of those five ever sees the certificate itself. This policy may be altered depending on how critical the data is: perhaps one trustee is enough, or perhaps the vote must be unanimous. And perhaps certain trustees should be consulted in sequence.
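The voting rule Warnock describes is, in effect, a threshold policy: trustees approve or refuse, and the key is released only when the approvals reach a configurable bar. Here is a minimal, generic sketch of that idea; the class and names are invented for this article and are not zTrustee’s API.

```java
import java.util.Map;

// Generic threshold-approval check: trustees vote, and the secret is released
// only if "yes" votes reach the configured quorum. The trustees authorize the
// release but never handle the key material itself.
public class ReleasePolicy {
    private final int requiredApprovals;   // how many "yes" votes release the key

    public ReleasePolicy(int requiredApprovals) {
        this.requiredApprovals = requiredApprovals;
    }

    public boolean mayRelease(Map<String, Boolean> votes) {
        long approvals = votes.values().stream().filter(v -> v).count();
        return approvals >= requiredApprovals;
    }

    public static void main(String[] args) {
        // Warnock's example: eight trustees notified, five approve, majority rules.
        ReleasePolicy majorityOfEight = new ReleasePolicy(5);
        Map<String, Boolean> votes = Map.of(
                "trustee-1", true,  "trustee-2", true,  "trustee-3", true,
                "trustee-4", true,  "trustee-5", true,  "trustee-6", false,
                "trustee-7", false, "trustee-8", false);
        System.out.println("Release the certificate? " + majorityOfEight.mayRelease(votes));
    }
}
```

A unanimous policy would simply set the bar to the number of trustees; sequential consultation would replace the single vote count with an ordered chain of approvals.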
Access Control. A traditional operating system assigns an account to every agent making a transaction, in order to ensure each agent is authorized for that transaction. Hadoop had nothing like this in the beginning. Creating a separate access control system for Hadoop would effectively subordinate, or even invalidate, existing systems at the conventional OS level, unless there were some type of federation between them, and we don’t want to go there again. So Hadoop is learning to turn existing access control inside out as well, bringing familiar, existing constructs such as Active Directory into the cluster.
“The important piece to start with is to get the file system secure and the computation engine secure, so that you can trust that the user is who he says he is,” says Owen O’Malley, software and security architect at commercial Hadoop distributor Hortonworks. “Of course, that has to fit into the architecture of the surrounding ecosystem.” So rather than employ yet another volatile password-based system, Hortonworks and others are making it possible for businesses to leverage Active Directory as the key distribution center (KDC), setting up Hadoop users around the existing tree of AD organizational units.
“Fortunately, Kerberos was set up as a distributed authentication mechanism, and Active Directory is just an implementation of a Kerberos KDC from Hadoop’s point of view,” he adds.
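In practice, putting a Kerberos KDC (Active Directory included) in front of Hadoop is largely a configuration exercise: the cluster is switched from “simple” to “kerberos” authentication, and users and services authenticate from keytabs rather than passwords. The fragment below uses Hadoop’s standard UserGroupInformation API to do that from the client side; the principal name and keytab path are placeholders for whatever the directory actually issues.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

// Client-side sketch: tell the Hadoop libraries the cluster expects Kerberos,
// then log in from a keytab instead of a password. The principal and keytab
// path are placeholders.
public class KerberosLogin {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        conf.set("hadoop.security.authorization", "true");

        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(
                "analyst@CORP.EXAMPLE.COM",               // AD account as Kerberos principal
                "/etc/security/keytabs/analyst.keytab");

        System.out.println("Logged in as "
                + UserGroupInformation.getCurrentUser().getUserName());
    }
}
```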
In June 2012, MapR made its commercial-grade distribution of Hadoop available over Amazon’s cloud, which many businesses are using in lieu of purchasing or leasing warehoused storage. As MapR Chief Application Architect Ted Dunning tells us, in doing so, MapR is able to leverage Amazon’s existing security architecture — which is already a shell around storage clusters — for Hadoop’s purposes, including user access control.
“There’s a level of security for some environments that is adequate today,” states Dunning. “If you have a small team, and you want to fix your system on Amazon, the network security that Amazon provides is entirely adequate. And as long as you don’t have a very persistent mole in your team, you should be relatively good to go.” Dunning goes on to warn that Amazon’s quality of security may not be adequate for PCI compliance or for national security purposes.
Policy Management. Once users are authorized, that authority should be limited. “A lot of security isn’t about trusting one’s motives,” says Hortonworks’ Owen O’Malley (who knows from a lot of experience), “but about preventing mistakes from getting out of control.”
Vormetric’s key management system, Todd Thiemann reminds us, rests in the data center, where it acts as what he describes as a firewall for data. “Who’s making this request? Is it appropriate? Is it consistent with policy?” he asks. “If yes, the request goes through... We’re providing a segregation of duties for privileged users which you would not find with operating system-level controls, where the root user might have access to everything. We’re allowing enterprises to slice things up so that the application, the user, or the process that should have access to that data, can access it, and other users who maybe don’t need access to the data itself, but do need access to the file to do their job — backup, recovery, what have you — can do that without seeing the cipher text, the encrypted data.”
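Thiemann’s “firewall for data” comes down to a per-request decision: who is asking, what operation they want, and whether they should see plaintext at all or only the encrypted bytes. The sketch below expresses that kind of rule generically; the roles and outcomes are invented for illustration and are not Vormetric’s policy language.

```java
import java.util.Map;
import java.util.Set;

// Generic "firewall for data" decision: some identities may read cleartext,
// others (backup, recovery) may copy the file but only ever see ciphertext,
// and everyone else, including root, is refused. Roles are illustrative only.
public class DataFirewall {
    enum Decision { ALLOW_PLAINTEXT, ALLOW_CIPHERTEXT_ONLY, DENY }

    private static final Set<String> CAN_DECRYPT = Set.of("analytics-app", "fraud-team");
    private static final Set<String> CIPHERTEXT_ONLY = Set.of("backup-agent", "recovery-agent");

    static Decision evaluate(String requester, String operation) {
        if (CAN_DECRYPT.contains(requester)) return Decision.ALLOW_PLAINTEXT;
        if (CIPHERTEXT_ONLY.contains(requester) && operation.equals("copy"))
            return Decision.ALLOW_CIPHERTEXT_ONLY;
        return Decision.DENY;
    }

    public static void main(String[] args) {
        Map<String, String> requests = Map.of(
                "analytics-app", "read",
                "backup-agent", "copy",
                "root", "read");
        requests.forEach((who, op) ->
                System.out.printf("%s wants to %s -> %s%n", who, op, evaluate(who, op)));
    }
}
```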
Auditability. With a traditional, relational database, transactions are logged. Oftentimes, whether a user has the authority to make a transaction depends on whether she has been granted access to the resource that’s hosting the data. But as Gazzang’s Garcia tells us, a typical MapReduce process may involve hundreds of servers simultaneously. So the authorization model changes.
“What used to be the role of a sysadmin or a CSO [chief security officer], to log on and key in the password for encryption, is no longer feasible when you’re talking about hundreds or thousands of servers, and you’re turning these hundred-server MapReduce jobs for a couple of hours, shutting them down, and then doing them again. It’s no longer feasible to have this one trusted person who knows the password to encrypt the data. Now you need something that can automatically provision secrets.”
Gazzang’s zTrustee system segregates duties among multiple parties. But to account effectively for which party is responsible for what, systems must be auditable, especially because the handoffs of encryption keys become somewhat complex negotiations precisely in order to minimize their number. Tracing the cause of discrepancies (and in any network of millions of nodes, there inevitably will be some) requires logs that can be logically deciphered by something other than human eyes alone, because there won’t be enough human eyes to scan them.
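One practical consequence is that every key handoff should leave behind a record a machine can filter and correlate, not just a line for a person to skim. The sketch below emits such a record; the field names and values are invented for this article rather than drawn from any vendor’s schema.

```java
import java.time.Instant;

// Minimal machine-parseable audit record for a key handoff. The field names
// are illustrative; the point is that software, not human eyes, does the
// first pass over millions of these entries.
public class KeyAuditRecord {
    static String entry(String keyId, String requester, String node,
                        String decision, int approvals) {
        return String.format(
            "{\"ts\":\"%s\",\"event\":\"key_release\",\"key\":\"%s\"," +
            "\"requester\":\"%s\",\"node\":\"%s\",\"decision\":\"%s\",\"approvals\":%d}",
            Instant.now(), keyId, requester, node, decision, approvals);
    }

    public static void main(String[] args) {
        System.out.println(entry("dataset-42", "mapreduce-job-7731",
                                 "worker-0183", "released", 5));
    }
}
```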
A chief data scientist for an organization is different from a security officer. Most people define a data scientist as an analyst who lives in California and gets paid more. But a data scientist is someone who notices massive growth in noise, and recognizes that people want to find new signals in it without getting lost in it.
Most large companies are beholden to their income statements and must drive their businesses based on those metrics. Smaller companies can look for new and unique signals in the noise. But you need someone whose responsibility this is. Without someone qualified to mine, dig, and search for those signals, the business will never find them. You’ll just collect data for data’s sake.
Which brings us right back to where we started: big data is a function in search of meaning, a thesis in need of someone who can properly articulate it. That makes people the most important security component in any big data cluster. In the meantime, finding someone to help make sense of big data is a chicken-and-egg scenario. With any Internet-based technology, the signal-to-noise ratio always seems impossibly small, until you finally come down to the one person capable of making all the connections.