What is Big Data? Brief notes and technology players

Data, data and data everywhere.

Relational and analytical forms of data have been around for a very long time; the form that has evolved more recently is Big Data. It is a natural evolution of data: processing large data sets efficiently has become important, and doing so needs a framework. Big Data is usually characterised by the 5 Vs: Volume, Variety, Velocity, Veracity and Value (the fifth V is very important). These are mostly self-explanatory. Volume is simply the amount of data generated every second. Variety is the different forms and formats of the data (structured, semi-structured and unstructured). Velocity is the speed at which data is generated (think of trending topics). Veracity, the trustworthiness of the data, is a newer term that cropped up largely around social media sources (very much unstructured). Value is the ability to turn the data into business value (revenue).

The framework matters most for working out how to integrate your existing data platform with the newly added Big Data capabilities: determine the requirements of each line of business (LOB), illustrate the use cases with real-time scenarios that combine the capabilities of the hardware platform, and treat the Big Data platform as a separate entity. No industry is an exception to this kind of activity. Business value is driven by choosing affordable technologies that turn data into value for the business and into data assets. It is not only about the data platform: the next part of a successful implementation depends on how quickly visualisation and data movement can be provided.

Keeping aside the relational database platform biggies such as IBM, Oracle and Microsoft, open source has gained its foothold here by offering efficient ways to analyse and use massive, disparate datasets. In particular, it is essential to mention Apache Hadoop, a robust framework for processing large data sets across clusters; it is easy to scale and is distributed without the need for a sprawl of high-end hardware. Being open source, the standard is fairly straightforward, but it is not an out-of-the-box implementation that every organisation can simply adopt. There are several restrictions in its methods, and development work is needed to make the framework enterprise-ready; keep an eye on data management and data governance.
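
To give a feel for the programming model behind Hadoop's processing of large data sets across clusters, here is a minimal sketch of the map, shuffle and reduce steps in plain Python. It runs on a single machine and does not use any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample lines are made up for illustration. On a real cluster the framework would run the map and reduce steps in parallel across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input, not taken from any real data set.
    sample = ["big data big value", "data everywhere"]
    print(word_count(sample))
```

The point of the split is that each map call only needs its own line and each reduce call only needs the values for its own key, which is what lets the work be spread over many machines.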

Taking advantage of open source, these are the recognised (industry-standard) Hadoop distributions: Hortonworks (HDP), Cloudera (CDH) and MapR.

Here is a clear representation of these three players (source: Hadoop Buyer’s guide):


Each of these technologies has its own advantages (upper hand) and its own ways of tackling certain business needs. I’m not taking anyone’s side here, but the comparison shows MapR in the lead over the other two. The natural way of distributing data within Hadoop is a traditional key-value store at the base, along with HBase (columnar), which aligns with OLAP-style capabilities. Hence Hadoop is used in most online streaming applications and manages document-oriented databases without a hitch, which isn’t the case in the relational and data warehouse worlds.
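
As a rough illustration of the difference between a plain key-value store and an HBase-style columnar layout, here is a small sketch using nested Python dictionaries. It is a toy model and not the HBase client API: the table contents, the helper names (get_cell, scan_column) and the sample values are all invented, and real HBase cells also carry a timestamp, which is left out here.

```python
# Plain key-value store: one opaque value per key.
kv_store = {"user:42": '{"name": "Ada", "city": "London"}'}

# HBase-style layout: row key -> column family -> column qualifier -> value.
wide_table = {
    "user:42": {
        "profile": {"name": "Ada", "city": "London"},
        "activity": {"last_login": "2015-01-01"},
    },
    "user:43": {
        "profile": {"name": "Grace"},
    },
}


def get_cell(table, row_key, family, qualifier):
    """Read one cell; returns None if the row, family or qualifier is missing."""
    return table.get(row_key, {}).get(family, {}).get(qualifier)


def scan_column(table, family, qualifier):
    """Collect one column across all rows that have it (rows may be sparse)."""
    return {
        row_key: families[family][qualifier]
        for row_key, families in table.items()
        if qualifier in families.get(family, {})
    }


if __name__ == "__main__":
    print(get_cell(wide_table, "user:42", "profile", "city"))
    print(scan_column(wide_table, "profile", "city"))
```

With the plain key-value form the whole value has to be fetched and parsed to read one field, whereas the column layout lets a query touch a single column across many rows, which is the access pattern analytical workloads tend to favour.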

The real strength of implementing a new data platform lies in what is important for your business growth, the technical support surrounding existing systems, and the functionality that can leverage the enormous range of features in the Hadoop ecosystem. Every technology has its own dos and don’ts in addition to strengths and weaknesses; what we need to look at is how to distribute the processing, balancing the cost of implementation against the risk of introducing something new to developers without proper training and learning opportunities.

Each of the three, Cloudera, Hortonworks and MapR, is here to help you embrace Hadoop; choosing the relevant distribution is up to you. For me, the three important aspects are:

  • Management and administration tools
  • SQL-on-Hadoop offerings
  • Performance benchmarks

Lately, Microsoft has joined this side of the world with HDInsight, which is based on the Apache Hadoop framework and mixes in the integral BI and analytical technology range from Microsoft.

Check out Hortonworks for Windows – bringing Apache Hadoop to Windows.