Lack of industry standards for Big Data

In my opinion, big data is the most important development in the ICT industry and the one that will have the biggest impact on businesses and society. I dare say that big data will be as impactful and disruptive as the introduction of the Internet at the end of the last century. But the longer I dig into the world of big data, the clearer it becomes to me that industry standards haven’t settled yet.

One might say “you are stating the obvious”. That might be true, but let me take you back to the early days of RDBMS engines from organizations such as Oracle, Sybase, Informix, Microsoft, Ingres and Progress. From day one, all vendors supported the same relational model, including a dialect of ANSI/ISO SQL. The SQL “select” statement was universal even in those days. I also remember that some vendors supported triggers or stored procedures in their database while others did not. Still, there was a kind of industry standard, and it is still applicable today. This is not the case for big data.
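
As a small illustration of that portability, here is a minimal sketch in Python using the standard library's sqlite3 module. The customer table and its rows are made up for this example; the point is only that the SELECT itself uses nothing beyond core SQL, so it should run essentially unchanged on other relational engines.

```python
import sqlite3

# Hypothetical table and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name VARCHAR(50), city VARCHAR(50))"
)
conn.executemany(
    "INSERT INTO customer (id, name, city) VALUES (?, ?, ?)",
    [(1, "Alice", "Utrecht"), (2, "Bob", "Amsterdam"), (3, "Carol", "Utrecht")],
)

# A plain SELECT with WHERE and ORDER BY: core SQL, no vendor-specific syntax.
query = "SELECT name FROM customer WHERE city = 'Utrecht' ORDER BY name"
rows = conn.execute(query).fetchall()
print(rows)
```

Only the connection setup is specific to SQLite here; the query string is the part a standard makes reusable across vendors.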

Sure, Hadoop is the mainstream big data technology, including distributions from Hortonworks, Cloudera and Pivotal. But there is new life beyond Hadoop, such as Apache Spark: the same kind of distributed big data computing, offering features like streaming, machine learning, a SQL interface, and even graph processing. Hadoop’s role might be reduced in the future to one of its core functions: HDFS, its distributed storage for large files.

Personally I love graph databases, because they manage relationships between entities. Apache Spark’s GraphX, the successor to Bagel, offers a Pregel-style API. Mind you, graph database companies like Neo4j offer a useful NoSQL graph database engine, although their query language Cypher is yet another implementation of a DDL and DML for graph databases, let alone Facebook’s API for its own graph database. There is currently no standard for NoSQL graph databases or their related languages.

To support streaming data for big data applications, multiple solutions exist, such as Apache Spark and Spring XD. The same applies here: the same kind of streaming data capture, (pre)processing and output options, but completely different implementations of the same functionality.

Regarding machine learning (ML): Apache Spark brings MLlib, whilst I was focusing on learning Apache Mahout. MLlib on Spark appears to run faster than Mahout on Hadoop using MapReduce. But today Mahout has also been ported to Spark. So which ML implementation should one use, and which architecture is future-proof?

My conclusion: the open source big data projects and big data technology vendors are still competing for market dominance and haven’t yet settled on industry standards. Big data is still very much in a development stage, although many applications are already running in production.

Regarding graph databases: I forecast that technologies like GraphX will eventually conquer the graph database market because the data doesn’t need to be structured, which is in my opinion the most important feature of NoSQL.


Working in ICT since the 1990s in multiple domains, including network infrastructures and protocols, software application development, office automation, enterprise resource planning / business process automation and cloud solutions. Current professional focus: big data and machine learning. Personal interest: global economics.
