Big data has become a buzzword of late, with many businesses adopting it. By some estimates, up to 80% of the world's data is unstructured and therefore not readily usable by traditional analytics. Big data projects often combine different kinds of data from multiple sources and analyze the resulting large data set to discover patterns that can inform business decisions.
Using Hadoop for Big Data Projects
Gil Allouche, VP of Marketing at Qubole, a leading provider of big data services, recommends Hadoop for implementing big data solutions. Hadoop is an open source software project that enables distributed storage and processing of such data. Allouche stated, "Hadoop offers two huge advantages for working with big data. One, it's scalable to handle massive volumes of data, allowing businesses to remove data silos and no longer pick and choose which data to store. Two, Hadoop is able to capture unstructured data, such as social media messages, opening up new possibilities for analysis and insight."
Guy Harrison, Executive Director of R&D at Dell, also believes Hadoop offers a truly compelling advantage for big data projects. He explained in more detail, "Firstly, it's very economical: the cost of storing data in Hadoop is a fraction of the cost of storing it in most alternatives, since it leverages open source software and commodity hardware. Hadoop also presents very little barrier to ingesting new data. Finally, Hadoop has a rich and growing ecosystem of tools for ingesting, managing and analyzing data. The capability of Hadoop is being bolstered all the time."
Insights on getting started with Hadoop
However, before getting into a Hadoop project, there are some crucial aspects to consider. As with any other project, careful planning is required to ensure the Hadoop project generates meaningful data and insights for the business. Allouche offered some advice for businesses starting off with their Hadoop projects. “When working on your first Hadoop project, being able to demonstrate ROI quickly is important. Start with a small, achievable business objective that can have an immediate impact. Then expand your initiative from there.”
Abhra Mitra, Data Analytics Manager at OnDeck, shared practical insights from years of managing large Hadoop projects, and is openly biased towards Amazon’s platform. “Preferably use AWS to run your cluster. It makes it easier to grow your cluster later. Ensure that all your data is stored in a structured format (or at least semi-structured) for easy access and analysis. Note that de-normalized data is often better for applications like Hadoop streaming.”
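Mitra's point about de-normalized data follows from how Hadoop streaming works: each mapper sees one self-contained record at a time, so repeating fields on every row (rather than joining them in from a separate table) keeps the map step trivial. The following is a minimal sketch in Python, using hypothetical tab-separated order records with the region denormalized onto each row; in a real streaming job the mapper and reducer would run as separate scripts reading stdin.

```python
import sys
from itertools import groupby

# Hypothetical denormalized order records, one tab-separated line per order,
# with the region repeated on every row, e.g. "1001\twest\t25.50".

def mapper(lines):
    """Map step: emit (region, amount) pairs from raw input lines."""
    for line in lines:
        order_id, region, amount = line.rstrip("\n").split("\t")
        yield region, float(amount)

def reducer(pairs):
    """Reduce step: sum amounts per region. Assumes pairs arrive grouped
    by key, which the Hadoop shuffle/sort phase guarantees; here we sort
    locally to simulate that."""
    for region, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield region, sum(amount for _, amount in group)

if __name__ == "__main__":
    # Chain both stages locally for illustration; Hadoop streaming would
    # invoke each script separately via stdin/stdout.
    for region, total in reducer(mapper(sys.stdin)):
        print(f"{region}\t{total:.2f}")
```

Because every row already carries the region, the mapper never needs to look anything up, which is exactly why denormalized layouts suit this model.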
Given the scale of the infrastructure requirements for Hadoop, new projects can get locked into analysis paralysis while trying to align all the moving pieces. Harrison emphasized the need to strike a balance: "Try to strike a balance between 'analysis paralysis' and 'hadumping.' Analysis paralysis, or at least analysis delay, occurs when you spend too much time working on the business problem to be solved, only to find that when you are ready with the problem, some of the data you need has already been discarded. To avoid this, it is best to start capturing as much data as you can early on. The more data you have, the more problems you can solve, and Hadoop storage is cheap."
There are two main challenges that businesses often run into while working on Hadoop projects: infrastructure and expertise. "Many companies lack the expertise they need to implement a big data solution. Going with a big-data-in-the-cloud provider can alleviate this problem, but businesses will still need the skills to integrate Hadoop with other BI technologies in order to make the data accessible throughout the organization," noted Allouche.
Harrison elaborated on these challenges. "Hadoop is really an ecosystem of open source projects that requires new skills and careful administration. Hadoop administration experts are in strong demand, so most of your technical staff will be in learning mode as you build up the infrastructure. The other major challenge relates to analytic expertise, what we've come to call data science. Data scientists on a Hadoop project need a converged skill set including some knowledge of Hadoop tools such as Pig and Hive, strong statistical and data mining skills, as well as an ability to understand enough business context to use these skills in service of business objectives."
Mitra also emphasized the need for strong security measures when using cloud-based solutions: "…Ensuring that confidential data is stored securely on the cloud. This includes actual security (whitelisting IP addresses, encrypting data) as well as managing perceptions in the organization."
Businesses like Yahoo, Microsoft, Facebook and Google are using Hadoop for their big data analysis, and Cloudera claims that half of the Fortune 50 companies use open source Hadoop. With wide-scale adoption by these early leaders, it is only a matter of time before big data analysis becomes mainstream. Businesses will need to build the expertise to turn that data into meaningful insights for business strategy, as Harrison concluded: "Data on its own doesn't provide value. There does need to be a data scientist or analyst team working on extracting information from the data. As the old adage goes, 'Data is not information, information is not knowledge, knowledge is not understanding.' It takes real skills in statistics and business problem solving to move from data to understanding."