This is a story about what happens when disruption — that critical quality that venture capitalists look for in a new product — gets overly disruptive. Up front, I assure you this tale has not only a happy ending but a positive payoff.
It begins with a phenomenon for which marketers have yet to find a better catch-phrase than “big data.” If I were to define it, I would be wrong. At least, that’s what the results of a Harris Interactive survey would indicate. In the survey, released in June 2012, 154 C-level executives from U.S.-based multinational companies were asked whether they perceived the onset of “big data” in their business as more of a challenge or an opportunity. On a six-point scale, 76% of responses leaned toward the “opportunity” side.
And then they were asked to define the phrase “big data.” Given five possible choices, one of them being “Other,” the responses were roughly evenly split.
There appears to be general agreement that big data is big, and about 70% of respondents were certain their investments in big data technologies would provide positive payoffs within one year. But there is no consensus as to what big data actually is.
The Big Split
Poring over the Harris poll data shows that 42% of respondents worked for companies with $500 million or more in annual revenue. When responses were broken out along that revenue scale, the definitions started clearing up. In other words, the more a company earns, the more the function of big data changes.
For smaller companies, the challenge is dealing with large numbers of transactions from a Web presence, especially mobile purchases. Larger companies already have infrastructure to deal with that. So they focus more, if not exclusively, on machine-generated data (from cell phones, devices, sensors, and so forth) as well as social data.
The biggest selling point for new entrants in the big data space in recent months has been something called “consumer sentiment analytics” — essentially, the ability to ascertain, in real time, what customers and prospective customers think about your product based on their public conversations on Twitter, Facebook, and elsewhere. For this function to really work, these companies argue, you first need to capture a whole lot of social media conversations, replicating cyberspace in your own little warehouse. As a result, some companies actually treat big data and consumer sentiment analysis as synonymous, and this could be where some of the confusion rests.
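As a toy illustration of what such a function computes, here is a deliberately naive sentiment scorer. Real products use far more sophisticated models; the word lists and sample posts below are invented for the example.

```java
import java.util.List;
import java.util.Set;

public class SentimentSketch {
    // Tiny invented lexicons; commercial systems use trained models instead.
    private static final Set<String> POSITIVE = Set.of("love", "great", "fast");
    private static final Set<String> NEGATIVE = Set.of("hate", "broken", "slow");

    // Score a post: +1 per positive word, -1 per negative word.
    static int score(String post) {
        int s = 0;
        for (String token : post.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(token)) s++;
            else if (NEGATIVE.contains(token)) s--;
        }
        return s;
    }

    public static void main(String[] args) {
        List<String> capturedPosts = List.of(
                "I love how fast the new app is",
                "Checkout is broken again and I hate it");
        for (String post : capturedPosts) {
            System.out.println(score(post) + "  " + post);
        }
    }
}
```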
So here’s my definition, and if I’m wrong, go take it up with Louis Harris: Big data is any collection of raw data, usually unprocessed, perhaps not yet even formatted, that is warehoused across multiple data nodes (and by multiple, I may mean thousands, I may mean millions) prior to being analyzed and prior to being prepared for a formal, transactional database. Essentially, you’ve got to have some place to keep this stuff, and there’s no longer time to make it neat first.
In the past, a collection of data was a database. That’s not really the case anymore, because “database” implies some level of organization. The purpose of big data systems (the one I focus on here being Hadoop) is simply to store unprocessed data in a form that allows it to be organized later. Such systems then also provide tools for analyzing data in its raw state, so that the most important elements of that data can be identified and extracted — a process Hadoop calls MapReduce, borrowing the name from the Google paper that inspired it.
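To make that process concrete, here is the canonical example of a MapReduce job, a word count written against Hadoop’s Java API. The class names are illustrative and the input and output paths come from the command line; nothing here is specific to any company in this story. The map step runs in parallel across raw chunks of text; the reduce step distills the results into something manageable.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs in parallel over raw splits of the input,
    // emitting (word, 1) for every token it finds.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: collapses each word's stream of 1s into a single count.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // raw input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // reduced output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```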
The Accidental Operating System
Had the traditional database industry devised this solution to the big data problem, it might have cannibalized its own legacy products. That’s because big data warehousing does require, to some extent, an upending of the foundation. So it was that Hadoop emerged instead from the search engine business, home to arguably the largest assemblies of data anywhere in the world.
By 2008, engineers at Yahoo had been developing what they would only later realize was an actual operating system: a file and folder storage facility for large numbers of volumes. This was Hadoop, named (as legend has it) after one developer’s child’s toy elephant. Although Hadoop’s job was fourfold, each part of it was radically simple (a minimal storage sketch follows the list):
- Create a huge pool of storage from a collection of any available resources
- Replicate the data to minimize loss
- Scatter the chunks in raw, unprocessed form throughout the pool
- Spawn massively parallel background processes that digest that data into reduced, manageable blocks just before it is accessed
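What that looks like in practice can be sketched with Hadoop’s Java FileSystem API. In this minimal example, the namenode address, file path, replication factor, and block size are all illustrative assumptions, not values from this story; the point is that the client hands HDFS raw bytes, and the system chunks, scatters, and replicates them on its own.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RawIngest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical entry point to the storage pool.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);

        // Write raw bytes straight into the pool. HDFS splits the file into
        // 64 MB blocks, scatters them across data nodes, and keeps three
        // replicas of each block to minimize loss.
        FSDataOutputStream out = fs.create(
                new Path("/raw/clickstream/2012-06-18.log"), // illustrative path
                true,                // overwrite if present
                4096,                // client buffer size
                (short) 3,           // replication factor
                64L * 1024 * 1024);  // block size in bytes
        out.writeBytes("unprocessed, unformatted event data\n");
        out.close();
        fs.close();
    }
}
```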
“In a lot of ways, Hadoop acts like an operating system. It’s got the file system piece, and it’s got the compute piece. So you could look at it like a new kind of operating system [that’s] running over a whole cluster of machines,” says Owen O’Malley, one of those Yahoo engineers. Now O’Malley is a software architect at Hortonworks, a commercial venture created specifically to support and develop Hadoop. “The important piece to start with is to get the file system secure and the computation engine secure, so that you can trust that the user is who he says he is.”
Hadoop is just one player in the emerging “big data” market, which also includes the Apache Cassandra project and the MongoDB storage system for JSON-format data. But it now claims the highest-profile users with the largest data stores anywhere in the world, including Facebook and LinkedIn.
Because Hadoop broke the volume barrier, it broke the prevailing security model for servers. Think about it: Historically, if a database resided on a server, and the server was secure, the database was presumed secure. Even if you rendered the server virtual, spreading it out over multiple processors but leaving it contiguous, the server security model was the same. So database developers relied on whatever operating system or virtualization scheme served as their platform, to provide such things as authentication, encryption, and session management. For decades, security was out of their purview.
Hadoop turned the database model inside out. Rather than living inside a server, its system contains servers, assigning them processes and marshaling data in colossal chunks between them. Suddenly, rather than depending on an outer shell of security, what had been the outer shell became the inner core, and security became the system’s own responsibility.
“When we started with Hadoop, there were no provisions at all. Everyone could see everything, everyone could change everything,” O’Malley admits. As he warned in a 2010 presentation for his own company (PDF available here), Hadoop’s first users — literally called “Yahoos” — were trusted implicitly. After all, who outside of Yahoo knew the project even existed?
Where Openness Fails
O’Malley tells us his team realized this could become a problem once Hadoop broke through Yahoo’s laboratory walls, and had to discover what security actually meant in this new context. “We were very concerned that we not bolt security on, but rather bake it into the internals,” he says. “So we got into the guts and baked security from the inside out. But as we were doing that work, some of the companies that were using Hadoop were saying, ‘Why are you bothering? Don’t you trust the other employees at your company?’ The answer is, you need to trust someone, of course, or you can’t get anything done. On the other hand, you want to adjust that set that you trust to as small as possible, you also want to be able to audit what they’ve done, and you want to make sure they only do the things that they’re authorized to do.”
They created a rudimentary user provisioning model, but, he explains, it didn’t prevent Yahoos from causing accidents: “People accidentally deleted project directories for someone else’s project. One time when Hadoop was being used by a class, [a student] was trying to create his own directory, and instead wiped out all the users’ directories. You definitely had accidents happen.”
In 2009, to collect more input on security and other architectural aspects from the best and the brightest, Yahoo wisely transferred oversight responsibility for Hadoop to the Apache Software Foundation, rendering its core and its analytical components open source. The following year, under Apache’s auspices but with the leadership of the Yahoo team, Hadoop’s contributors began building a formal security strategy, starting by adopting the same authentication framework used for Java applications: Simple Authentication and Security Layer (SASL). It seemed like a good choice at the time, especially since SASL supports the same Kerberos framework used to secure systems such as Windows Server.
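From the developer’s side, that baked-in security surfaces through Hadoop’s UserGroupInformation API. The following is a minimal sketch of a Kerberos login from a keytab; the principal name and keytab path are hypothetical stand-ins.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureLogin {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Switch Hadoop from "simple" (implicitly trusted) auth to Kerberos.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Authenticate against the KDC; subsequent HDFS and MapReduce calls
        // run as this verified principal rather than an implicitly trusted user.
        // The principal and keytab path here are hypothetical.
        UserGroupInformation.loginUserFromKeytab(
                "etl-service@EXAMPLE.COM",
                "/etc/security/keytabs/etl-service.keytab");
        System.out.println("Logged in as "
                + UserGroupInformation.getLoginUser().getUserName());
    }
}
```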
But as Hadoop clusters transformed the infrastructure of the Internet colossi, it became clear SASL was too small a solution. As iSEC Partners senior security consultant Andrew Becherer demonstrated at a Black Hat conference (PDF available here), in order to speed things up (an excuse we’ve all heard too often before), Yahoo had granted some processes “super-user” status — meaning they were beyond the need for authentication. Even more critical — potentially catastrophic — was the matter of key management: how nodes in a cluster recognized the credentials of all the other active nodes. Specifically, in Hadoop’s architecture, an authenticated client is granted a Block Access Token (BAT). For the data nodes to be able to verify that token, the symmetric key used to generate it had to be replicated to all of them — which, at Facebook scale, means potentially millions of systems.
The opportunity for those tokens to be intercepted grew exponentially as the clusters grew linearly. Sound familiar?
“If the shared key is disclosed to an attacker the data on all Data Nodes is vulnerable,” warned Becherer. “Given Data IDs the attacker could craft Block Access Tokens, reducing security of Hadoop to the previous level.” By “the previous level,” he meant zero.
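To see why a shared key is so fragile, consider this sketch of a symmetric-key token scheme in the spirit of the Block Access Token. This illustrates the general technique, not Hadoop’s actual token code; the key material, block ID, and expiry are invented. Because the verifier and the forger perform exactly the same computation, anyone holding the key can mint a “valid” token.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class TokenSketch {
    // Mint (or verify, or forge) a token: an HMAC over the block ID and expiry.
    static byte[] mintToken(byte[] sharedKey, String blockId, long expiryMillis)
            throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(sharedKey, "HmacSHA1"));
        return mac.doFinal((blockId + "|" + expiryMillis)
                .getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        // The same key sits on every data node in the cluster.
        byte[] sharedKey = "replicated-to-every-node".getBytes(StandardCharsets.UTF_8);
        long expiry = System.currentTimeMillis() + 600_000; // ten minutes out

        byte[] issued = mintToken(sharedKey, "blk_1073741825", expiry);
        // An attacker who obtains the key and a block ID crafts an
        // indistinguishable token, exactly as Becherer warned.
        byte[] forged = mintToken(sharedKey, "blk_1073741825", expiry);
        System.out.println("forged token accepted: " + Arrays.equals(issued, forged));
    }
}
```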
The news of potentially zero security came around the same time the National Security Agency became one of Hadoop’s largest users. Sources tell me that the way the NSA manages Hadoop security, at least for now, is to separate its clusters from Internet access entirely. Or at least, maybe, that’s the plan. The basic authentication system first employed by Hadoop’s HDFS file system layer uses proxy IP addresses to represent users. Can you imagine a Web browser for one of those users not having access to the public Web?
“The airtight box is one approach to security,” says Ted Dunning, chief application architect for commercial Hadoop distributor MapR. “It’s always going to be the approach that systems with extensive cryptography and extensive security in mind will have to follow; the NSA is always going to have to have an airtight box. And eBay will have to have its PCI-compliant clusters in an airtight box, much the way that they probably do with their Oracle database. That’s because the threat and the risks are so high, you have to have all levels of protection in-depth.
“On the other hand, there have been some improvements, but they’re not by any means done,” Dunning continues. “The open source systems have added a minimal level of network security. They’ve added a token which, unfortunately, is easily hijacked and can be used in a man-in-the-middle attack on the network layer. MapR has started at the other end, [by] integrating authentication, process hardening, and compliance, though it has not yet done the network security that it will be needing.”
Nobody is sitting on his hands. A big data security framework is being developed — and quite rapidly — by commercial entities, some of which, like MapR, are just now coming onto the scene for this very purpose. But that’s a discussion for another time.