I recently read a very informative white paper by TDWI’s Philip Russom. The research in this report was sponsored by some of the key BI players shown below, so it had significant backing.
I wanted to share the 10 myth-busting facts about Hadoop from TDWI’s report. I found them insightful, and you may as well:
Credit: Integrating Hadoop into Business Intelligence and Data Warehousing by Philip Russom @ TDWI
- Fact #1: Hadoop consists of multiple products
We talk about Hadoop as if it’s one monolithic thing, but it’s actually a family of open source products and technologies overseen by the Apache Software Foundation (ASF). (Some Hadoop products are also available via vendor distributions; more on that later.) The Apache Hadoop library includes (in BI priority order): the Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, HBase, HCatalog, Ambari, Mahout, Flume, and so on. You can combine these in various ways, but HDFS and MapReduce (perhaps with Pig, Hive, and HBase) constitute a useful technology stack for applications in BI, DW, DI, and analytics. More Hadoop projects are coming that will apply to BI/DW, including Impala, which is a much-needed SQL engine for low-latency data access to HDFS and Hive data.
- Fact #2: Hadoop is open source but available from vendors, too
Apache Hadoop’s open source software library is available from ASF at http://www.apache.org. For users desiring a more enterprise-ready package, a few vendors now offer Hadoop distributions that include additional administrative tools, maintenance, and technical support. A handful of vendors offer their own non-Hadoop-based implementations of MapReduce.
- Fact #3: Hadoop is an ecosystem, not a single product
In addition to products from Apache, the extended Hadoop ecosystem includes a growing list of vendor products (e.g., database management systems and tools for analytics, reporting, and DI) that integrate with or expand Hadoop technologies. One minute on your favorite search engine will reveal these.
- Fact #4: HDFS is a file system, not a database management system (DBMS)
Hadoop is primarily a distributed file system and therefore lacks capabilities we associate with a DBMS, such as indexing, random access to data, support for standard SQL, and query optimization. That’s okay, because HDFS does things DBMSs do not do as well, such as managing and processing massive volumes of file-based, unstructured data. For minimal DBMS functionality, users can layer HBase over HDFS and layer a query framework such as Hive or SQL-based Impala over HDFS or HBase.
- Fact #5: Hive resembles SQL but is not standard SQL
Many of us are handcuffed to SQL because we know it well and our tools demand it. People who know SQL can quickly learn to hand code Hive, but that doesn’t solve compatibility issues with SQL-based tools. TDWI believes that over time, Hadoop products will support standard SQL and SQL-based vendor tools will support Hadoop, so this issue will eventually be moot.
- Fact #6: Hadoop and MapReduce are related but don’t require each other
Some variations of MapReduce work with a variety of storage technologies, including HDFS, other file systems, and some relational DBMSs. Some users deploy HDFS with Hive or HBase, but not MapReduce.
- Fact #7: MapReduce provides control for analytics, not analytics per se
MapReduce is a general-purpose execution engine that handles the complexities of network communication, parallel programming, and fault tolerance for a wide variety of hand-coded logic and other applications—not just analytics.
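To make the division of labor concrete, here is a minimal single-process sketch in Python of the MapReduce pattern. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample log lines are made up for illustration. The point is that the developer hand-codes only the map and reduce logic, while the framework (simulated here by `shuffle` and `run_job`) handles grouping and, in a real cluster, distribution and fault tolerance.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Hand-coded map logic: emit a (token, 1) pair for every token in one record."""
    return [(token.lower(), 1) for token in record.split()]


def shuffle(pairs):
    """Framework step (simulated): group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Hand-coded reduce logic: combine all values seen for one key."""
    return key, sum(values)


def run_job(records):
    """Framework driver (simulated): map every record, shuffle, then reduce."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical Web-log fragments standing in for files stored in HDFS.
records = ["GET /home", "GET /cart", "POST /cart"]
counts = run_job(records)
print(counts)
```

Swapping in different map and reduce functions would turn the same skeleton into a data-loading or filtering job, which is why the report describes MapReduce as general purpose rather than as an analytics tool.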
- Fact #8: Hadoop is about data diversity, not just data volume
Theoretically, HDFS can manage the storage and access of any data type as long as you can put the data in a file and copy that file into HDFS. As outrageously simplistic as that sounds, it’s largely true, and it’s exactly what brings many users to Apache HDFS and related Hadoop products. After all, many types of big data that require analysis are inherently file based, such as Web logs, XML files, and personal productivity documents.
- Fact #9: Hadoop complements a DW; it’s rarely a replacement
Most organizations have designed their DWs for structured, relational data, which makes it difficult to wring BI value from unstructured and semistructured data. Hadoop promises to complement DWs by handling the multi-structured data types most DWs simply weren’t designed for. Furthermore, Hadoop can enable certain pieces of a modern DW architecture, such as massive data staging areas, archives for detailed source data, and analytic sandboxes. Some early adopters offload as many workloads as they can to HDFS and other Hadoop technologies because they are less expensive than the average DW platform. The result is that DW resources are freed for the workloads at which they excel.
- Fact #10: Hadoop enables many types of analytics, not just Web analytics
Hadoop gets a lot of press about how Internet companies use it for analyzing Web logs and other Web data, but other use cases exist. For example, consider the big data coming from sensory devices, such as robotics in manufacturing, RFID in retail, or grid monitoring in utilities. Older analytic applications that need large data samples—such as customer base segmentation, fraud detection, and risk analysis—can benefit from the additional big data managed by Hadoop. Likewise, Hadoop’s additional data can expand 360-degree views to create a more complete and granular view of customers, financials, partners, and other business entities.
Philip also did a nice job in this white paper of clarifying the status of current HDFS implementations. It is represented well in the graphic below.
To sort out which Hadoop products are in use today (and will be in the near future), this report’s survey asked: Which of the following Hadoop and related technologies are in production in your organization today? Which will go into production within three years? These questions were answered by a subset of 48 survey respondents who claim they’ve deployed or used HDFS. Hence, their responses are quite credible, being based on direct, hands-on experience.
HDFS and a few add-ons are the most commonly used Hadoop products today. HDFS is near the top of the list (67%) because most Hadoop-based applications demand HDFS as the base platform. Certain add-on Hadoop tools are regularly layered atop HDFS today:
- MapReduce (69%) for the distributed processing of hand-coded logic, whether for analytics or for fast data loading and ingestion
- Hive (60%) for projecting structure onto Hadoop data so it can be queried using a SQL-like language called HiveQL
- HBase (54%) for simple, record-store database functions against HDFS’s data
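The idea of "projecting structure onto Hadoop data" mentioned for Hive in the list above can be illustrated with a small Python sketch. This is not Hive or HiveQL. It only mimics the schema-on-read idea: the raw file stays as plain delimited text, and a schema is applied when the data is read for a query. The file contents, column names, and helper functions here are all hypothetical.

```python
import csv
import io

# Hypothetical raw, tab-delimited file contents as they might sit in HDFS.
RAW_FILE = (
    "2013-04-01\tstore_12\t19.99\n"
    "2013-04-01\tstore_07\t5.00\n"
    "2013-04-02\tstore_12\t42.50\n"
)

# The "projected" structure: column names and types, declared apart from the data.
SCHEMA = [("sale_date", str), ("store_id", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Apply the schema at read time, yielding one typed dict per raw line."""
    reader = csv.reader(io.StringIO(raw_text), delimiter="\t")
    for fields in reader:
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def total_by_store(rows):
    """A query over the projected rows: sum the amount per store_id."""
    totals = {}
    for row in rows:
        totals[row["store_id"]] = totals.get(row["store_id"], 0.0) + row["amount"]
    return totals


totals = total_by_store(read_with_schema(RAW_FILE, SCHEMA))
print(totals)
```

Because the schema lives outside the file, the same raw data could be read again later with a different schema, without reloading or converting it.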
If this information has been helpful to you, check out the full report from TDWI below. How is your organization using Hadoop?
- Integrating Hadoop into Business Intelligence and Data Warehousing (by Philip Russom @ TDWI)
- Busting 10 Myths about Hadoop (by Philip Russom)
- 5 reasons why the future of Hadoop is real-time (relatively speaking) (gigaom.com)