Hadoop – facts and myths


When it comes to Big Data, Hadoop comes into the picture: it holds a unique position within the data platform for processing and managing multi-structured data for analytics and visualisations. Business Intelligence and Data Warehousing are key strategies for managing a data platform, and Big Data is now just as important in day-to-day life. On the lighter side, this is how Big Data is often understood (source: Dilbert.com):

 

The Apache Software Foundation (ASF) changed the data world by making Hadoop an open source software project (refer to the ASF for more information). Hadoop is a collection of technologies and software; the family includes the Hadoop Distributed File System (HDFS), MapReduce, Hive and HBase. Because they are open source, vendors like Cloudera, Hortonworks and MapR have integrated their own enhancements, which elevate Hadoop’s role in the Big Data world. So the next time anyone says they are working in Hadoop, make sure you understand which part of Hadoop, and which vendor’s distribution, they mean. Based on my understanding and what I have collected from the web, here is what I would like to share about the myths and facts of the Hadoop family:

 

 

  • Hadoop is a technology: it is an open source technology from the ASF and consists of a collection of software libraries such as HDFS, MapReduce, HBase, HCatalog, Ambari, Flume, Pig, Mahout and Impala. Many other projects are emerging at the ASF in the BI/DW space.
  • Hadoop is open source: this means any vendor can enhance Hadoop’s capability by adding their own enterprise-ready features and administrative/management tools. The major players are Hortonworks, Cloudera and MapR. Beyond them, Google, Facebook, LinkedIn and Amazon have added their own best-of features around the Hadoop family. We can call these products Hadoop distributions, or a platform of choice.
  • Which is the best distribution: each of these three major players has its own strengths and weaknesses when compared with the others. I suggest looking at “Cloudera vs Hortonworks vs MapR: Comparing Hadoop Distributions”; it’s an old link but still valid. What they offer is the kind of ecosystem any data platform solution needs, not a simple database management system.
  • Is it an ecosystem or a database management system: as the references above show, the Hadoop ecosystem has an enormous list of features that go beyond traditional Data Warehousing and Business Intelligence management systems. We can add HDFS to an existing DBMS, taking advantage of the distributed file system to obtain scalability and performance on huge volumes of data. You can then choose your favourite SQL engine to query data across the distribution.
  • Where is SQL in Hadoop: when we talk about a DBMS, the query language plays an important role in querying and accessing data across the distribution. A few of the players here are Hive, SQL on MapR, Spark SQL, Impala, and HAWQ from Pivotal. That leaves the question of analytics on this distribution: how do we crunch and visualise the data?
  • What tools are available to crunch big data: there are many tools available to crunch the data. MapReduce is one of the best known: it handles fault tolerance and complex processing logic, and it is used in vendor products. Searching for “Analytics with Hadoop” in your favourite search engine will turn up a long list of links to explore. So will Hadoop replace traditional BI/DW tools?
  • Hadoop is here to provide a range of different features: when Hadoop is mentioned, everyone believes it is only for huge data volumes. That is true to some extent in these times of social networking and data explosion. Think about your personal use of social networking, the data footprint you leave, and how organisations can best use it for their business growth. Having said that, relational data methods will not die that easily, but managing multi-structured data is an essential complement, and a good architecture helps us handle huge workloads without breaking a sweat.
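
To make the MapReduce point above a little more concrete, here is a minimal sketch of the map, shuffle and reduce steps, written in plain Python as a word count. It is only an illustration: it does not use Hadoop’s API, it runs in a single process with no distribution or fault tolerance, and the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input; on a real cluster the lines would come from HDFS.
    sample = ["big data needs hadoop", "hadoop stores big data"]
    print(word_count(sample))
```

On a real cluster, the framework runs many map and reduce tasks in parallel across machines and reruns the ones that fail; the sketch only shows the shape of the computation.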

Along with the text above, I would like to point you to a few good links as well: