When software developer Doug Cutting began working eight years ago on his dream of creating an open source search engine to compete with Google, he never imagined that his fledgling project, Hadoop, would instead reshape the world of enterprise data analysis.
But that's what happened once the Hadoop project drifted away from its original open source search goal and morphed into a creative answer to a drastically different problem. Cutting needed to build a way to handle, sort, store, and process the huge amount of data a complex search engine needed. In doing so, he realized that the process had incredible promise to help large corporations deal with the same kinds of big data problems in their own operations.
One of Hadoop's strengths is that it can process and analyze huge amounts of unstructured data – video, audio, social media postings, images, and more – in ways that weren't possible before. That powerful capability opened many eyes in the IT world when Hadoop was established as an open source project through The Apache Software Foundation in 2008.
Why CIOs and IT managers should actively look into Hadoop
Apache Hadoop, as the open source version is known, is a data analysis application framework that allows large data sets to be processed across clusters of computers. It uses massively parallel computing power to tear through data far faster than traditional data analysis tools. Using Hadoop, businesses can analyze petabytes of unstructured data far more easily than with other applications. And that's what makes it so invaluable to a growing number of businesses, from JPMorgan Chase to eBay, Google, and Yahoo, all of which deal every day with unbelievably large amounts of data.
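The programming model at the heart of Hadoop, MapReduce, can be sketched in a few lines. The toy Python below simulates the three phases of a job (map, shuffle, reduce) in a single process; on a real cluster, Hadoop runs the map and reduce phases in parallel across many machines. The function names and the word-count task here are illustrative only, not Hadoop's actual Java API.

```python
# Toy single-process illustration of the MapReduce model that Hadoop
# implements across a cluster. Names are illustrative, not Hadoop APIs.
from collections import defaultdict

def map_phase(record):
    """Mapper: turn one input record into (key, value) pairs."""
    for word in record.split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group together every value emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: aggregate one key's grouped values into a result."""
    return key, sum(values)

def run_job(records):
    """Run all three phases over the input records, in order."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

logs = ["error disk full", "error network timeout", "info disk ok"]
print(run_job(logs))  # {'error': 2, 'disk': 2, 'full': 1, ...}
```

The appeal for enterprises is that the mapper and reducer are the only pieces a developer writes; Hadoop handles distributing the data, running those functions in parallel, and recovering from machine failures.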
"That wasn't my focus at the time," Cutting says in a telephone interview. "I started it because I was working on a big data problem without having access to this [kind of] technology."
And that's where the light bulb turned on in his head: Those same kinds of big data problems had been a growing challenge for businesses as they sought new ways to deal with the constantly increasing amount of data they are generating.
"Hard drives have been getting larger at a phenomenal rate and processors are always getting phenomenally faster," Cutting says. "The question became, 'What are we going to do with all of this ever-more-data that's being generated?' Classic enterprise technology wasn't designed to handle all of this and doesn't really take advantage of this. And enterprises want those capabilities because there's lots of interesting data out there that they've been throwing away."
In the past, companies couldn't financially afford to keep all the data their customers and sales generated, so over time the data was archived or deleted. Now, though, cheaper storage and faster processing capabilities, matched with efficient analysis tools like Hadoop, allow large companies to save all of their valuable data. Over time, they can find new uses for it and drive sales, profits, and marketing.
"Retailers may not keep every transaction that's ever happened in the past, but now [they] can afford to," Cutting says. "Now they can go back and see things in the data about seasonal sales, sales based on location and demographics, and more. You can see what someone is buying in a city like Atlanta this year and compare it to last year's hot-selling goods and you can make a special offer to them," all using Hadoop. "You have the data and you can make much better predictions."
For businesses of any size, this can be a huge boon – but especially so for bigger companies.
"Now you can bring it online and analyze it and do better job understanding it," Cutting says. "That's the problem that we were able to address here. Now companies can find patterns that were impossible to detect before."
So what kinds of companies can take advantage of the massively parallel processing inherent in Hadoop? All kinds of mainstream businesses, he says, from retailers to credit card companies, are using Hadoop today to analyze transaction data and protect against fraudulent credit card use. Banks are also using Hadoop to analyze huge amounts of data to predict consumer creditworthiness, he says.
Power companies are also using Hadoop as they analyze their power grids and optimize electricity use for consumers. Medical imaging operations are using it to improve processes that produce data-intensive images of patients.
Meanwhile, the use of Hadoop is spreading as the needs of businesses continue to evolve.
"People find some data problem, then ask what's the best technology they can use, and then they start using Hadoop," Cutting says. "It's usually one big problem that just seems insurmountable with anything else, and then they find it has other uses, too."
How businesses can get started with Hadoop
Hadoop is an ongoing Apache open source project, so your developers can get right to work on it without having to ask for additional budget. But businesses don't have to fear a lack of commercial support. A growing number of companies, including Amazon Web Services, Cloudera, Greenplum, Hortonworks, MapR, DataStax, and Datameer, offer enterprise-ready, supported, and customized versions. By bringing in a partner that offers enterprise-class features and support, IT departments can ramp up more smoothly and begin solving their big data problems with Hadoop.
Cutting, who today is the chief architect at Cloudera, says Hadoop is still a young and evolving technology. It is still ripe for additional enterprise-needed features, such as versions that are custom-built and aimed at specific industry verticals, including financial services and medical informatics companies. "We're beginning to see a little of that now, and I think we'll see a lot more," he adds.
Improved user interfaces are also on the horizon from vendors, he says, as well as additional tools and other user and deployment aids. "Sure, there are obstacles; but I don't think that any of them are overwhelming at this point," he says. "First adopters have included a lot of Web companies. They are used to having a lot of developers writing a lot of code using lower level tools."
Devise a Hadoop strategy for your enterprise
If huge quantities of unstructured data are hampering your company's ability to get things done efficiently with your existing database and data analysis tools, you aren't alone. Taking a hard look at Hadoop's capabilities could go a long way toward resolving your corporate data roadblocks. Hadoop doesn't replace your existing databases, but it adds powerful resources to your data handling toolbox.
Dan Olds, principal analyst at Gabriel Consulting Group, says he is hearing more discussions about Hadoop in the IT marketplace today, including at a supercomputing conference in November 2011. The reason, he says, is that Hadoop can help companies finally use huge stores of unstructured data that were tough to access in the past.
"This is the kind of data that companies want to analyze but don't have the time and resources to put it into a relational database," he says. "Plus they need to be able to do it quickly [for that analysis] to have value, and that doesn't happen quickly" when trying to put it into a database. "This is something that we're going to see more and more of in the enterprise."
These kinds of data handling needs require much higher computing power to make it happen, he says. "What you're looking at with these kinds of data problems is essentially a supercomputing data problem," Olds says. "You can't use traditional computing to deal with it. The data is moving too fast and it's too big."
All of this is being fueled not by IT departments, he says, but by something even bigger: global economics. Consumers can buy anything they want from anywhere in the world with one click or phone call.
"It's always getting more competitive," Olds says. "Buyers have all the power. That's where tools like this can help. If your IT people aren't hearing voices like that from the people on the business side, it's only a matter of time until they do. There's a deluge of data and requests from business people to analyze it in quicker ways. The need for this is rising because of basic economics, globalization, and improved communications."
Olds calls it "the age of analytics."
"Companies need to do this to stay competitive, to rise above their competitors, and to just survive," Olds says. "We're entering a period where perhaps the key differentiator between companies will be how quickly they can use the data they can get. It's all about the availability of data and the ability to analyze it quickly so companies can make decisions."