In my opinion, big data is the most important development in the ICT industry, the one that will have the biggest impact on businesses and society. I dare say big data will be as impactful and disruptive as the introduction of the Internet at the end of the last century. But the longer I dig into the world of big data, the clearer it becomes to me that industry standards have not settled yet.
One might say I am kicking in an open door with that statement. That may be true, but let me take you back to the early days of RDBMS engines from organizations such as Oracle, Sybase, Informix, Microsoft, Ingres and Progress. From day one, all vendors supported the same relational model, including a dialect of ANSI/ISO SQL. The SQL SELECT statement was universal, even in those days. I also remember that one vendor supported triggers or stored procedures in its database while others did not. Still, there was a kind of industry standard, and it is still applicable today. This is not the case with big data.
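To illustrate that universality: the query below uses only standard SQL constructs (projection, filtering, aggregation) and would run essentially unchanged on any of the engines mentioned above. The table and data are invented for this sketch; SQLite merely stands in for an ANSI-compliant engine.

```python
import sqlite3

# Hypothetical table, for illustration only. The point: this is plain
# ANSI-style SQL that any classic RDBMS vendor would accept.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Alice", "IT", 70000), ("Bob", "Sales", 55000), ("Carol", "IT", 80000)],
)

# Standard SELECT with GROUP BY -- portable across vendors, then and now.
rows = conn.execute(
    "SELECT department, COUNT(*), AVG(salary) "
    "FROM employees GROUP BY department ORDER BY department"
).fetchall()
print(rows)  # → [('IT', 2, 75000.0), ('Sales', 1, 55000.0)]
```

No big data query language today enjoys that degree of portability.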
Sure, Hadoop is mainstream big data technology, with distributions from Hortonworks, Cloudera and Pivotal. But there is new life beyond Hadoop, such as Apache Spark: the same kind of distributed big data computing, but offering features like streaming, machine learning, a SQL interface and even a graph database. Hadoop's functionality might eventually be reduced to one of its core components: HDFS, large-scale file storage.
Personally, I love graph databases, because they manage relationships between entities. Apache Spark's GraphX implements a Pregel-style API (Spark's earlier Bagel module was also a Pregel implementation). Mind you, graph database companies like Neo4j offer a useful noSQL graph database engine, although their query language Cypher is yet another implementation of a DDL and DML for graph databases, let alone Facebook's API for its own graph database. There is currently no standard for noSQL graph databases or their query languages.
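The Pregel model mentioned above is worth a sketch: every vertex runs a small program per "superstep", updating its state from incoming messages and sending messages to its neighbours, until no messages remain. Below is a minimal single-source shortest-path computation in plain Python under that model; GraphX's actual API is Scala, so this is an illustration of the idea, not of the library.

```python
INF = float("inf")

def pregel_sssp(vertices, edges, source):
    """Single-source shortest paths, Pregel style: each superstep, every
    vertex that received a message updates its state and sends messages
    along its outgoing edges; the computation halts when no messages remain."""
    state = {v: INF for v in vertices}
    messages = {source: 0}                  # initial message to the source
    while messages:                         # one loop iteration == one superstep
        next_messages = {}
        for v, dist in messages.items():
            if dist < state[v]:             # vertex program: keep the minimum
                state[v] = dist
                for dst, w in edges.get(v, []):     # message passing
                    cand = dist + w
                    if cand < next_messages.get(dst, INF):
                        next_messages[dst] = cand
        messages = next_messages
    return state

# Toy graph, invented for illustration: A --1--> B --2--> C, A --4--> C.
graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 2)], "C": []}
print(pregel_sssp("ABC", graph, "A"))   # → {'A': 0, 'B': 1, 'C': 3}
```

The appeal of the model is that the per-vertex program parallelizes naturally across a cluster, which is exactly what GraphX exploits.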
To support streaming data in big data applications, multiple solutions exist, such as Apache Spark and Spring XD. Here too: the same kind of streaming data capture, (pre)processing and output options, but completely different implementations of the same functionality.
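Both products share the same conceptual pipeline shape, roughly source → processor → sink. The sketch below expresses that shape with plain Python generators standing in for a real distributed stream; the event data is invented for illustration and neither product's API is used.

```python
# Sketch of the source -> processor -> sink pipeline common to streaming
# frameworks. In a real system the source would be a socket, log tail or
# Kafka topic, and the sink a store or dashboard.
def source(events):
    for event in events:            # capture
        yield event

def preprocess(stream):
    for event in stream:            # (pre)processing: normalize and filter
        event = event.strip().lower()
        if event:
            yield event

def sink(stream):
    counts = {}
    for event in stream:            # output: here, a running count per event
        counts[event] = counts.get(event, 0) + 1
    return counts

raw = ["Click ", "view", "click", "  ", "VIEW", "click"]
print(sink(preprocess(source(raw))))  # → {'click': 3, 'view': 2}
```

The functional shape is identical across frameworks; it is the wiring, deployment and APIs that differ completely, which is exactly the standardization problem.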
Regarding machine learning (ML): Apache Spark brings MLlib, while I had been focusing on learning Apache Mahout. MLlib on Spark appears to run faster than Mahout on Hadoop using MapReduce. But today Mahout has also been ported to Spark. So which ML implementation should one use? Which architecture is future-proof?
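The overlap between the two libraries is real: k-means clustering, for instance, is offered by both MLlib and Mahout, each behind a completely different API. A toy one-dimensional k-means in plain Python shows the algorithm they both implement; this illustrates the technique itself, not either library, and the data points are invented.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Toy 1-D k-means: alternate assigning points to the nearest
    centroid and recomputing each centroid as its cluster's mean."""
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        for p in points:                               # assignment step
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centroids = [sum(ps) / len(ps) if ps else c    # update step
                     for c, ps in clusters.items()]
    return sorted(centroids)

# Two obvious clusters around 1.5 and 11.0:
data = [1.0, 1.5, 2.0, 10.0, 11.0, 12.0]
print(kmeans_1d(data, [0.0, 5.0]))  # → [1.5, 11.0]
```

The algorithm is the same everywhere; what differs per framework is how the assignment and update steps are distributed across a cluster, and that is where the architectures diverge.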
My conclusion: the open source big data projects and big data technology vendors are still striving for market dominance and have not yet settled on industry standards. Big data is still very much in a development stage, although many applications are already running in production.
Regarding graph databases: I forecast that technologies like GraphX will eventually conquer the graph database market, because the data does not need to be structured, which in my opinion is the most important feature of noSQL.