Why I left Apache Spark GraphX and returned to HBase for my graph database

Introduction

I’ve been studying big data technologies for a year now. I do this for three reasons:

  1. Big data is the future. It starts in my own profession, ICT, but it will mainly impact businesses in every other segment, from government to retail, from FMCG manufacturers to healthcare. In short: everywhere. Cloud technologies are important, mobile is the most important user interface now and in the future, and social media will gain importance for consumers and businesses, but all of these applications are or will be driven by big data. If I want to continue my career in ICT, I had better build knowledge and some experience with big data, right?
  2. I have been managing a graph database myself since 2009. It is built in the mind mapping application TheBrain and serves as my own marketing database. One could consider the underlying technology of TheBrain a graph database engine, but its functionality is focused on personal or group-based knowledge management. It doesn’t offer query functionality and it cannot be extended: it’s a mind mapping tool, not a scalable graph database engine. So I had long been playing with the idea of migrating to a real graph database. To support my first objective, studying big data, I decided to migrate my graph database to big data technology.
  3. You never know, maybe I am going to launch a startup myself. I have a business idea that could grow into another social media platform. The business model is scalable; it can grow towards multiple target audiences and multiple geographic regions. Most important: I haven’t seen an implementation of my business idea yet. But it needs big data technologies, which is another reason for my study.

In search of a Graph Database Engine

The core of the applications described in the previous section is based on a graph database. I am an enthusiast when it comes to graph databases; I dare say they are one of the most important developments in the NoSQL arena. All major social media organizations, including Google, Facebook and LinkedIn, use graph databases to support their business. My TheBrain database is also a graph database. I therefore want to store my information in vertices and edges, and so I went in search of a graph database engine.

Neo4j, awesome graph database functionality but limited to one server

When I Googled “graph database”, Neo Technology with their Neo4j graph database immediately popped up. I became acquainted with Neo4j when they were moving away from their graph traversal API in favor of their Cypher query language, so I faced the learning curve twice. But I saw some excellent example graph database applications built on Neo4j, including graph visualizations, and I learned to value Neo4j as a benchmark for storing and querying information in graph databases. Their textbook “Graph Databases”, published by O’Reilly, is still lying in my restroom. Considering Neo4j a benchmark turned out to be a mistake, as I learned later whilst studying Apache Spark GraphX. But although Neo4j is NoSQL, it is not big data. First, it is not horizontally scalable. Sure, I tried an implementation of Neo4j at Heroku, and I was able to scale vertically by growing the virtual machine. But big data is all about horizontal scalability, and Neo4j couldn’t offer that flexibility. Secondly, Neo4j uses an ID formatted as a 64-bit Long to identify vertices and edges. 64 bits offers a large addressing space, but for my application I would rather use a 128-bit identifier such as a GUID, because I don’t know yet what addressing space I will need in the future. Later I discovered that Apache Spark GraphX also uses Longs to identify vertices and within its edge definitions, limiting the GraphX addressing space to the equivalent of Neo4j’s. However, within the Apache Spark development community an issue has been filed to allow any field type as a vertex identifier, including GUIDs, so my future addressing-space problem would eventually have been solved by Apache Spark GraphX. Third, I want to stick to open source technologies and industry standards. I consider big data query languages such as Pig Latin and Hive industry standards; they have been widely adopted by the big data community. Unfortunately Cypher hasn’t reached that status yet. Neo4j is not an option for my application.
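To make the addressing concern concrete: both Neo4j and GraphX identify vertices with a 64-bit Long, so a 128-bit GUID can only be used after folding it down to 64 bits, with a collision risk attached. A minimal sketch of that hypothetical workaround in Scala (the toVertexId helper is my own illustration, not part of either API):

    import java.util.UUID

    // GraphX defines: type VertexId = Long (64 bits).
    // Hypothetical workaround: fold a 128-bit UUID into 64 bits.
    // XOR-ing the two halves is simple but makes collisions possible,
    // which is exactly the addressing-space limitation described above.
    def toVertexId(id: UUID): Long =
      id.getMostSignificantBits ^ id.getLeastSignificantBits

    val guid = UUID.randomUUID()
    println(s"$guid -> ${toVertexId(guid)}")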

Apache Spark

Whilst in search of a graph database engine for big data applications I ran into Apache Spark. I became acquainted with Apache Spark when I was looking at Apache Mahout for machine learning: Mahout stopped supporting MapReduce on Hadoop and turned to Apache Spark as its processing engine. Apache Spark seemed to offer all the functionality I needed: Spark SQL, a Hive lookalike in Spark to query rows and columns; GraphX as a graph database engine; and MLlib for machine learning. And with the latest release, Spark Streaming is also included; I already use streaming in my application via an implementation of Spring XD. Spark is horizontally scalable and thus suitable for big data, it offers fast response times because of in-memory computing and, most important of all, it is rapidly growing into an industry standard in big data technologies. It all sounded very promising; what more could I need? So I decided to go for Apache Spark as the base technology for my application, and I managed to migrate and store my graph database in Apache Spark GraphX, running on top of YARN in my Hadoop system. But I learned that Apache Spark GraphX would not be suitable for my application, for three reasons:

  1. Apache Spark is heavily based on transformations of data. It is still a batch process, although it runs very fast in-memory. But my application will include an interactive web-based user interface: response times have to be fast and bounded. I doubt that Apache Spark GraphX could offer me the necessary performance; it’s another ballgame. Publications on the web confirm my doubts.
  2. Apache Spark GraphX is meant to execute graph-statistical functions such as PageRank. It is not meant to store and query information such as names and addresses, and it doesn’t offer a graph query language like Neo4j’s Cypher. This was the mistake I made when I left Neo4j looking for a big data graph database: I assumed that all graph databases would support graph traversal and/or graph querying. That assumption was wrong. One can traverse a GraphX graph, but the implementation works by sending and aggregating messages between adjacent vertices (see the sketch after this list), which I consider too complex for my application. I also studied other GraphX functions such as subgraph, map and join. They all looked promising for my application, but I am missing in GraphX some kind of “union” function to join two subgraphs.
  3. Apache Spark GraphX is based on Spark’s RDDs, and RDDs are immutable. I need mutable vertices and edges in my application. My thought was to work around this by modifying the source data of the graph, for example in HBase, because HBase can be used as a data source in Apache Spark. But still, immutable RDDs do not fit the architecture of my application.
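To illustrate the traversal style mentioned in point 2: in GraphX you do not “walk” from a vertex; you send messages along edges and aggregate them per vertex. A minimal, self-contained Scala sketch (toy data, local mode; not my production model):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Edge, Graph}

    object GraphXSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("graphx-sketch").setMaster("local[*]"))

        // Vertex ids must be Long; properties are plain Scala values.
        val vertices = sc.parallelize(Seq(
          (1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
        val edges = sc.parallelize(Seq(
          Edge(1L, 2L, "knows"), Edge(2L, 3L, "knows")))
        val graph = Graph(vertices, edges)

        // One-hop "traversal": every vertex collects the names of its
        // in-neighbours by receiving messages sent along the edges.
        val inNeighbours = graph.aggregateMessages[List[String]](
          ctx => ctx.sendToDst(List(ctx.srcAttr)),
          _ ++ _)
        inNeighbours.collect().foreach(println)

        // Filtering works via subgraph, but there is no built-in
        // "union" of two subgraphs, as noted above.
        val noCarol = graph.subgraph(vpred = (id, name) => name != "Carol")
        println(noCarol.vertices.count())

        sc.stop()
      }
    }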

Back to the future: Apache HBase

As part of my big data study I had already looked at HBase, so I knew that major social media organizations use HBase to store and query their graphs. And in essence the only information in my graph is a list of vertices with their properties and a list of “from-to” edges with their properties. I could well imagine storing this information in an HBase database. Why hadn’t I stored my information in HBase already? Because my ultimate goal was Apache Spark, including GraphX, and Spark doesn’t make a distinction between data sources. For my application, HCatalog on top of my Pig Latin ETL implementation offered enough functionality to store data for use in Apache Spark and to inherit information such as field names and types, and my learning curve was already steep enough without HBase as an additional big data technology. But Apache HBase offers me the graph functionality I need:

  1. It is able to store graph databases; others have proven this successfully.
  2. It supports mutable data, including multiple time-related versions of that data, which I could use in my application.
  3. It suits big data while offering online response times, and I need response times suitable for online applications on the web and mobile.
  4. It’s an industry standard that is widely adopted.

I haven’t given thought yet to my HBase schema design, e.g. what I will store in the column families and how I will implement (indexed) row keys; a rough sketch of one possible direction follows below. I also still have to develop the logic to store and query data in my HBase tables, and I hope to find some example implementations on GitHub. At least I am confident that HBase will fulfill my graph database requirements, including a further implementation of Apache Spark GraphX as a graph computing engine.
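For the record, a rough sketch of the kind of layout I have in mind, using the standard HBase client API from Scala. The table name, column family names and row-key scheme (“v:&lt;id&gt;” for vertices, “e:&lt;from&gt;:&lt;to&gt;” for edges) are my own placeholder assumptions, not a finished design:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
    import org.apache.hadoop.hbase.util.Bytes

    object HBaseGraphSketch {
      def main(args: Array[String]): Unit = {
        val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
        // Assumed table "graph" with column families "p" (properties);
        // the table must have been created beforehand, e.g. in the shell.
        val table = conn.getTable(TableName.valueOf("graph"))

        // A vertex row: key "v:<id>", properties in family "p".
        val vertex = new Put(Bytes.toBytes("v:42"))
        vertex.addColumn(Bytes.toBytes("p"), Bytes.toBytes("name"), Bytes.toBytes("Acme"))
        table.put(vertex)

        // An edge row: key "e:<from>:<to>", properties in family "p".
        val edge = new Put(Bytes.toBytes("e:42:43"))
        edge.addColumn(Bytes.toBytes("p"), Bytes.toBytes("label"), Bytes.toBytes("supplies"))
        table.put(edge)

        // Mutability comes for free: a second put on the same cell adds a
        // new timestamped version, and older versions stay retrievable.
        val get = new Get(Bytes.toBytes("v:42"))
        val name = Bytes.toString(
          table.get(get).getValue(Bytes.toBytes("p"), Bytes.toBytes("name")))
        println(name)

        table.close(); conn.close()
      }
    }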

Working in ICT since the ’90s in multiple domains, including network infrastructures and protocols, software application development, office automation, enterprise resource planning / business process automation, and cloud solutions. Current professional focus: big data and machine learning. Personal interest: global economics.

Posted in Apache Spark GraphX, HBase
14 comments on “Why I left Apache Spark GraphX and returned to HBase for my graph database”
  1. etchevarryxm says:

    Did you try http://tinkerpop.incubator.apache.org/ for graph traversal?

    • lbartkowski says:

      Nope, but thanks for your reply. I have chosen Titan as the graph database engine. Titan is based on TinkerPop.

      In my opinion the future belongs to big data and its technologies, from HDFS to driverless cars. Part of this disruptive development, in my opinion, is the victory of open source software over commercial/proprietary software. I consider Apache the leading source of open source software technologies, so I will seriously invest time in finding out if and how Apache TinkerPop, compared to Titan, can help me reach my objectives. Thanks.

      Sorry for the delay in my reply.

  2. ABenhenni says:

    You should have a look at Titan: https://thinkaurelius.github.io/titan/

    It is from the guys who made the TinkerPop stack and the Gremlin query language. It offers a graph abstraction over different “Big Data” storage backends, like HBase and Cassandra. If you go with HBase, you could just as well add Titan.

    • lbartkowski says:

      Yes, Titan is the choice for my prototype. It is already running on my development system, including HBase storage. I learned that TinkerPop is the basis of many developments, including Neo4j support. I have studied Titan and Gremlin and I am convinced that it fits my functional requirements.

      In fact I was aware of Titan when I wrote this post, and I already knew that I would use HBase as the Titan storage layer. But I hadn’t studied Titan/Gremlin much, and I didn’t know whether I would run into functional or technical issues like the ones I hit with Apache Spark. But I have seen HBase graph implementations, so I was sure about HBase. I have spent months of free time studying Apache Spark and I wanted to find a solution/technology ASAP. But it will be Titan. I am convinced.

      Mind you, I won’t abandon Apache Spark; I still consider it the statistics/graph computing solution of choice.

      Sorry for the delay in my reply. I really like your reply and the replies of others. Stack Overflow doesn’t like discussions, but I need discussion to learn faster. I have already learned from all your replies; many thanks to you all.

  3. Victor says:

    Hi, also have a look at the Cassandra+Titan graph database; it can be a good fit for your needs, as you can also use Spark for batch processing on Cassandra tables.
    http://thinkaurelius.github.io/titan/

    • lbartkowski says:

      I have thought about using Cassandra. However, I am using the Hortonworks Sandbox as my Hadoop implementation, and that distribution comes with HBase, including Ambari management.

      In order to realize my objective of building a graph database using big data technologies I am still climbing a steep learning curve that includes HDFS, Pig, Hive, Scala, Spark and now Groovy and Gremlin. I am not a developer; it is not my professional job and it never has been. As you can imagine, I am at “rookie” level in all of these technologies.

      I want to reach my end goal, “a working prototype”, defined as a complete implementation of the software stack (from HDFS files to web services consumed by a JavaScript based front-end) and a working end-to-end process (from ETL to GUI), as soon as possible. My Hadoop implementation is based on the Hortonworks Sandbox. Hortonworks does an excellent job with Hadoop and related technologies, and the Sandbox ships with HBase and Ambari. Adding Cassandra would mean yet another learning curve.

      But you are right: I have looked at Cassandra, because of its relationship to Titan, and it seems to offer functionality equivalent to HBase. I am working on a prototype. Maybe I will find some investors, maybe I will have the money to hire specialists to evaluate the architecture, and maybe Cassandra would then be the choice for a pilot or production; you never know. I am not an opponent of Cassandra, I just don’t know enough about it to become a supporter.

      Hope you liked the reply. Sorry for the delay; blogging is also not a daytime job.

  4. Victor says:

    I understand, there are so many new technologies that it is not possible to get at and try everything. Perhaps just keep this in mind: Cassandra is more for transactional/OLTP applications (replacing Oracle), whereas HBase (with HDFS) can be seen more as a “data warehouse”.

    • lbartkowski says:

      Thanks, Victor, for your reply. I haven’t seen any Internet publication confirming your statement. Apache HBase itself states on its homepage: “Use Apache HBase™ when you need random, realtime read/write access to your Big Data.” That sounds to me like (almost) OLTP on big data, doesn’t it? Regarding Cassandra: it might be the choice for a production environment, but for learning-curve reasons I will stick to HBase for my prototype. Please find more information in my other replies on this post.

  5. lbartkowski says:

    Reblogged this on Luc Bartkowski's Blog and commented:

    I thank you all for your replies. Regarding the replies about Cassandra: yes, Cassandra is an option as the storage back-end for Titan, but Titan and HBase will be my choice for my prototype because of learning curve limitations. What I hope, but haven’t proven yet, is that I will be able to query HBase directly and make sense of the Titan database model inside HBase. Graphs offer great advantages over the relational model, but often (No)SQL offers simplicity and results.

    Currently I am at the beginning of the learning curve of Titan and Groovy. I haven’t implemented my graph database in Titan yet, but I have studied the blog posts about importing data; for my database I should write a Groovy script. My Groovy REPL in a terminal is already working, but I would like the assistance of an IDE like IntelliJ IDEA. Finding out how to connect my Apple OS X based IntelliJ installation to the remote JVM of my VirtualBox/Ubuntu/Hortonworks/Java environment is still in progress.

    Another issue I am thinking of is the design of my Titan based graph database. Labels of vertices cannot be altered once assigned, but labels improve search speed. So which attributes should I choose to implement as a label? In my logical database model I have two types of entities/vertices: master data and production data. A distinction at the “master”/“production” level alone will not give me the advantages of labels, so I would like to implement labels at a lower level of granularity, but what should I use?
    Secondly: suppose I define two types of organizations at the master data level, “public” and “private”. Implementing these organization types as two master data vertices would lead to a huge number of edges from the production data vertices about organizations to their master data organization vertices. That many edges would reduce the capacity of the graph database in Titan. I have to give my data model a lot of thought.
    Still, I am convinced of my master data/production data concept; it gives me the model to implement my functionality, and I find Gremlin’s backward traversal possibilities interesting in relation to it. But although we have GUIDs for identifying global objects, I haven’t found a global classification system, so I am still thinking about how to map my master data/production data model onto Titan labels. A sketch of one direction I am considering follows below.
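    A minimal sketch of that direction, assuming the Titan 1.0 management API: keep the vertex label coarse (“master”/“production”) and push the finer classification into an indexed property, which, unlike a label, can vary per vertex. The label, key and index names here are placeholders of mine, not a settled schema:

      import com.thinkaurelius.titan.core.TitanFactory
      import org.apache.tinkerpop.gremlin.structure.Vertex

      // Assumes a titan-hbase.properties file pointing at the local HBase.
      val graph = TitanFactory.open("conf/titan-hbase.properties")
      val mgmt = graph.openManagement()

      // Coarse, immutable labels for the two entity types.
      mgmt.makeVertexLabel("master").make()
      mgmt.makeVertexLabel("production").make()

      // Finer-grained classification as an indexed property instead of a label.
      val dataClass = mgmt.makePropertyKey("dataClass").dataType(classOf[String]).make()
      mgmt.buildIndex("byDataClass", classOf[Vertex]).addKey(dataClass).buildCompositeIndex()
      mgmt.commit()
      graph.close()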

    I realize that my personal and free-time capacity might not be enough to implement a full software stack and an end-to-end process in a prototype myself.
    I have therefore invested a few hundred bucks in the first release of ebrain.technologies. They offer cognitive web browsing, and expanding my graph database using this AI functionality is worth the investment. ebrain.technologies promises to offer the user a team of eHumans that improve the user’s knowledge and capacity, so I have bought a software license. I will point an eHuman to the bookmarks/favorites of my web browser regarding big data and software development; these bookmarks are becoming a database in themselves.
    Take Java, for example. Java is the lingua franca of the open source software development community. Social media vendors like Facebook, Twitter and LinkedIn offer example code for using their APIs from Java. Languages such as Scala and Groovy are based on the JVM. Java is a mandatory standard. Period.
    So I hope that I can train an ebrain.technologies eHuman into a Java developer using “its” cognitive capacity. I will point the eHuman to sites with information on Java syntax and semantics, but also to sites covering software architectures, design patterns, Git best practices and Stack Overflow Q&As tagged “java”. I hope to train the eHuman to produce Java code. At the least I expect that an eHuman will bring me better and faster answers than I find myself using Google and the search engines of technology sites.

    And if I get an eHuman working at a satisfying level, I will use the same approach to develop a team of eHumans: one for Java, but also one for Perl scripting, one for HBase NoSQL, and so on.
    So I am currently studying ebrain.technologies’ eBrain as much as possible, trying to understand their 100+ screenshots. Their release of “retail” version 1 is expected on May 25th. I can’t wait to download it.
    Unfortunately the first release of ebrain.technologies is for Windows only. I have invested in an OS X based iMac with a 6-core processor, 32 GB of memory and a 3 TB disk to assist me in my big data journey, so I guess I will have to buy a Windows 8.1 or 10 license that I install in a VirtualBox virtual machine (emitting a sigh). I do have a company laptop running Windows, and I am entitled to install eBrain on three interconnected machines, but I fear that the capacity of the company laptop is not enough to run eBrain satisfactorily.

    So my current focus is Artificial Intelligence (AI), and specifically the eBrain implementation of cognitive computing, learning and decision making. I feel that eBrain offers me the next level to improve myself, at least regarding system design and development.

    I will keep you posted. Thanks for your interest. Reply if you feel like it.

  6. […] Bartkowski explains how he tried to use Spark GraphX for his graph database, but ended up using plain old […]

  7. varun says:

    Have you tried out Blazegraph? It is a horizontally scalable graph database.

    • lbartkowski says:

      Hi Varun, thanks for your comment. Sorry for my late reply; I had a traffic accident and I’m still recovering.
      To answer your question: no, I didn’t. I returned to Neo4j, which I know from the past; its Cypher language enabled me to restructure a flattened graph model from XML. Currently I am focusing on the next layers in the software stack beyond the database. From that perspective Neo4j offers a further advantage through its support for the Spring Framework in Java. Greetings, Luc

  8. GraphX is available only in Scala. You can have a look at GraphFrames, which works with DataFrames (Java, Python, Scala) instead of low-level RDDs. It has a few more advantages, as it can make use of the Catalyst query optimizer and Project Tungsten optimizations.

    You can convert GraphX to GraphFrames and vice versa. Vertex IDs can be of any type, unlike GraphX where they must be Long. Return types are DataFrames or GraphFrames, whereas in GraphX they are only Graph[VD,ED] or RDDs.

    Even so, I think HBase is at some point a good choice as a datastore; let the compute be Spark. Check carefully how to model m:1, m:m and hierarchical storage. It is interesting.
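    For illustration, a minimal GraphFrames sketch along those lines (it assumes the graphframes package is on the classpath, e.g. via spark-shell --packages; the data and column values are made up):

      import org.apache.spark.sql.SparkSession
      import org.graphframes.GraphFrame

      val spark = SparkSession.builder()
        .appName("graphframes-sketch").master("local[*]").getOrCreate()
      import spark.implicits._

      // Unlike GraphX, the "id" column may be any type, e.g. GUID-like strings.
      val vertices = Seq(
        ("6f1c3a", "Alice"), ("9b2d4e", "Bob")).toDF("id", "name")
      val edges = Seq(
        ("6f1c3a", "9b2d4e", "knows")).toDF("src", "dst", "relationship")

      val g = GraphFrame(vertices, edges)
      g.inDegrees.show()   // results come back as DataFrames
      g.toGraphX           // conversion to a GraphX Graph is built in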

    • lbartkowski says:

      Hi Raja, thanks for your comment. Sorry for the late reply. I’ve experimented with HBase, but I returned to graph technology for my application; NoSQL is currently more important than big data for my application, so to speak. My graph database is the core of my application and I’ve imported it into Neo4j. Currently I’m focusing on the next layers in the software stack to implement that graph functionality in Java. Neo4j offers another advantage in this area with its support for Spring Data to bootstrap application development. But again, thanks for your comment. Greetings, Luc
