I’ve been studying big data technologies for one year now. I do this for 3 reasons:
- Big data is the future, starting with my own profession, ICT, but ultimately impacting businesses in every other segment, from government to retail, from FMCG manufacturers to healthcare. In short: everywhere. Cloud technologies are important, mobile is the most important user interface now and in the future, and social media will keep gaining importance for consumers and businesses, but all these applications are or will be driven by big data. If I want to continue my career in ICT, I had better build knowledge and some experience with big data, right?
- I have already been managing a graph database of my own since 2009. It is built in the mind mapping application TheBrain and it is my own marketing database. One could consider the underlying technology of TheBrain a graph database engine, but the functionality of TheBrain is focused on personal or group-based knowledge management. It doesn’t offer query functionality, it cannot be extended; it’s a mind mapping tool, not a scalable graph database engine. So I had been toying for a long time with the idea of migrating to a real graph database. To support my first objective, studying big data, I decided to migrate my graph database to big data technology.
- You never know, maybe I will launch a startup myself. I have a business idea that could grow into another social media platform. The business model is scalable: it can grow towards multiple target audiences and multiple geographic regions. Most importantly, I haven’t seen an implementation of my business idea yet. But it needs big data technologies, which is another reason for my study.
In search of a Graph Database Engine
The core of the applications described in the previous chapter is based on a graph database. I am enthusiastic about graph databases; I dare say they are one of the most important developments in the NoSQL arena. All major social media organizations, including Google, Facebook and LinkedIn, use graph databases to support their business. My TheBrain database is also a graph database. I want to store my information in vertices and edges, so I went in search of a graph database engine.
Neo4j, awesome graph database functionality but limited to one server
When I Googled “graph database”, Neo Technology with their Neo4j graph database immediately popped up. I became acquainted with Neo4j while they were moving away from their graph traversal API in favor of their Cypher query language, so I faced a learning curve twice. But I saw some excellent example graph database applications built on Neo4j, including graph database visualizations, and I learned to value Neo4j as a benchmark for storing and querying information in graph databases. Their textbook “Graph Databases”, published by O’Reilly, is still lying in my restroom. Considering Neo4j a benchmark turned out to be a mistake, as I learned later whilst studying Apache Spark GraphX.

But although Neo4j is NoSQL, it is not big data. It is not horizontally scalable. Sure, I tried an implementation of Neo4j at Heroku, and I was able to scale vertically by growing the virtual machine. But big data is all about horizontal scalability, and Neo4j couldn’t offer that flexibility. Secondly, Neo4j uses an ID formatted as a 64-bit Long to identify vertices and edges. 64 bits offers a large addressing space, but for my application I would rather use a 128-bit identifier such as a GUID, because I don’t know yet what addressing space I will need in the future. Later I discovered that Apache Spark GraphX also uses Longs to identify vertices and within its edge definitions, limiting GraphX’s addressing space to the same extent as Neo4j’s (a small sketch below makes this concrete). However, within the Apache Spark development community an issue has been registered to allow any field type as a vertex identifier, including GUIDs, so my future addressing-space problem would eventually have been solved by Apache Spark GraphX. Third, I want to stick to open source technologies and industry standards. I consider big data query languages such as Pig Latin and HiveQL industry standards; they have been widely adopted by the big data community. Unfortunately, Cypher hasn’t reached that status yet. So Neo4j was not an option for my application.
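To make the identifier limitation concrete: GraphX defines its VertexId as a plain 64-bit Long, so a 128-bit GUID has to be folded down before it can serve as a vertex identifier. Here is a minimal sketch of such a workaround in Scala; the XOR folding and the function name are my own assumptions, not an official recipe, and the folding accepts a small collision risk:

```scala
import java.util.UUID
import org.apache.spark.graphx.VertexId // type alias for Long in GraphX

// Fold a 128-bit GUID into GraphX's 64-bit vertex identifier by XOR-ing
// its two halves. This shrinks the addressing space and introduces a
// small collision risk; it is a workaround, not a real 128-bit ID.
def guidToVertexId(guid: UUID): VertexId =
  guid.getMostSignificantBits ^ guid.getLeastSignificantBits
```

Until the issue mentioned above is resolved, this kind of folding is about the best one can do in both Neo4j and GraphX.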
Apache Spark GraphX, promising but not a graph database
Whilst I was in search of a graph database engine for big data applications I ran into Apache Spark. I became acquainted with Apache Spark when I was looking at Apache Mahout for machine learning: Mahout stopped supporting MapReduce on Hadoop and turned to Apache Spark as its processing engine. Apache Spark seemed to offer all the functionality I needed: Spark SQL, a Hive-like component for querying rows and columns, GraphX as a graph engine, and MLlib for machine learning. And with the latest release, Spark Streaming is also included; I already use streaming in my application via an implementation of Spring XD. Spark is horizontally scalable and thus suitable for big data, it offers fast response times thanks to in-memory computing and, most important of all, it is rapidly becoming an industry standard in big data technologies. It all sounded very promising, what more could I need? So I decided to go for Apache Spark as the base technology for my application, and I managed to migrate and store my graph database in Apache Spark GraphX, running on top of YARN in my Hadoop system. But I learned that Apache Spark GraphX would not be suitable for my application, for three reasons:
- Apache Spark is heavily based on transformations of data. It is still a batch process, even though it runs very fast in-memory. But my application will include an interactive web-based user interface, so response times have to be fast and predictable. I doubt whether Apache Spark GraphX could offer me the necessary performance; it’s another ballgame, and publications on the Web confirm my doubts.
- Apache Spark GraphX is meant to execute graph statistical functions such as PageRank. It is not meant to store and query information such as names and addresses, and it doesn’t offer a graph query language such as Neo4j’s Cypher. This was the mistake I made when I left Neo4j looking for a big data graph database: I assumed that all graph databases would support graph traversal and/or querying. That was wrong; GraphX is a graph computing engine, not a graph database. One can traverse a GraphX graph, but the implementation is based on sending and testing messages to adjacent vertices (see the sketch after this list), which I consider too complex for my application. I also studied other GraphX functionalities such as subgraph, map and join. They all looked promising for my application, but I miss in GraphX a kind of “union” function to join two subgraphs.
- Apache Spark GraphX is based on Spark’s RDDs, and RDDs are immutable, whereas I need mutable vertices and edges in my application. My first thought was to solve this issue by modifying the source data of the graph, for example in HBase, because HBase can be used as a data source in Apache Spark. But still, immutable RDDs do not fit in the architecture of my application.
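To illustrate the message-passing traversal mentioned in the second point, here is a minimal, self-contained sketch; the graph contents, property types and object name are hypothetical. It builds a small GraphX graph from vertex and edge RDDs and computes breadth-first hop counts from a start vertex with the Pregel operator. Note that even this simple traversal requires three functions: a vertex program, a message-sending function and a message-merging function.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

object GraphXTraversalSketch {
  def main(args: Array[String]): Unit = {
    // Local master for a self-contained demo; my real setup runs on YARN.
    val sc = new SparkContext(
      new SparkConf().setAppName("graphx-sketch").setMaster("local[*]"))

    // Vertices: GraphX forces the key to be a Long (VertexId).
    val vertices = sc.parallelize(Seq(
      (1L, "Acme Corp"), (2L, "Jane Doe"), (3L, "Amsterdam")))

    // "From-to" edges with a relationship label as edge property.
    val edges = sc.parallelize(Seq(
      Edge(2L, 1L, "WORKS_FOR"), Edge(1L, 3L, "LOCATED_IN")))

    val graph = Graph(vertices, edges)

    // Breadth-first hop counts from vertex 2, expressed as Pregel message
    // passing: each superstep sends candidate distances to adjacent
    // vertices and every vertex keeps the minimum it has seen.
    val start: VertexId = 2L
    val distances = graph
      .mapVertices((id, _) => if (id == start) 0.0 else Double.PositiveInfinity)
      .pregel(Double.PositiveInfinity)(
        (_, dist, msg) => math.min(dist, msg),      // vertex program
        t =>                                        // send messages
          if (t.srcAttr + 1.0 < t.dstAttr)
            Iterator((t.dstId, t.srcAttr + 1.0))
          else Iterator.empty,
        (a, b) => math.min(a, b)                    // merge messages
      )

    distances.vertices.collect.foreach(println)     // (id, hop count) pairs
    sc.stop()
  }
}
```

Compare this with a one-line Cypher MATCH clause and it becomes clear why I consider GraphX a graph computing engine rather than a queryable graph database.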
Back to the future: Apache HBase
As part of my big data study I had already looked at HBase, so I knew that major social media organizations use HBase to store and query their graphs. And originally the only information in my graph is a list of vertices with their properties and a list of “from-to” edges with their properties. I could imagine storing this information in an HBase database. Why didn’t I store my information in HBase earlier? Because my ultimate goal was Apache Spark, including GraphX, and Spark doesn’t make a distinction between data sources. For my application, HCatalog on top of my Pig Latin ETL implementation offered enough functionality to store data for use in Apache Spark and to inherit metadata such as field names and types. And my learning curve was already steep enough without adding HBase as yet another big data technology. Now, however, Apache HBase offers me the graph functionality I need:
- It is able to store graph databases; others have proven this successfully.
- It supports mutable data, including multiple, time-related versions of that data, which I could use in my application.
- It suits big data with online response times; as you know, I need response times suitable for online applications on web and mobile.
- It’s an industry standard that is widely adopted.
I haven’t given thought yet to my HBase schema design, e.g. what I will store in the column families and how I will implement (indexed) row keys; a first sketch of one possible design follows below. Naturally, I also have to develop the logic to store and query data in my HBase table, and I hope to find some example implementations on GitHub. At least I have enough confidence that HBase will fulfill my graph database requirements, including a further implementation of Apache Spark GraphX as a graph computing engine.
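As a starting point, here is one possible adjacency-list design, explicitly a sketch and not a final schema: one row per vertex keyed by its GUID, one column family for vertex properties and one for outgoing edges, with the target GUID as column qualifier and the edge label as value. The table name, family names and helper functions are my own placeholders.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical adjacency-list schema in a table "graph":
//   row key           = vertex GUID (my 128-bit identifier requirement)
//   column family "p" = vertex properties (qualifier = property name)
//   column family "e" = outgoing edges (qualifier = target GUID,
//                       value = edge label). HBase cell versions give
//                       me the time-related history of each value.
object HBaseGraphSketch {
  val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("graph"))

  def putVertex(guid: String, props: Map[String, String]): Unit = {
    val put = new Put(Bytes.toBytes(guid))
    props.foreach { case (name, value) =>
      put.addColumn(Bytes.toBytes("p"), Bytes.toBytes(name), Bytes.toBytes(value))
    }
    table.put(put)
  }

  def putEdge(fromGuid: String, toGuid: String, label: String): Unit = {
    val put = new Put(Bytes.toBytes(fromGuid))
    put.addColumn(Bytes.toBytes("e"), Bytes.toBytes(toGuid), Bytes.toBytes(label))
    table.put(put)
  }

  // One Get fetches a vertex row: its properties and its outgoing edges.
  def getVertex(guid: String) = table.get(new Get(Bytes.toBytes(guid)))
}
```

Whether edges also need a table of their own, for example to support fast reverse traversal, is exactly the kind of schema question I still have to settle.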