How To Use Cache Sharding: Scale Out Neo4j
Neo4j is a high availability graph database that support ACID properties for transactions.
As per the CAP theorem, any system can provide only two of the properties from Consistency, Availability and Partition tolerance.
Just like relational databases, Neo4j has also chosen the Consistency and Availability options.
Consistency in Neo4j is guaranteed by supporting ACID properties to transaction.
Availability is achieved using high availability clustering that is supported by a master slave architecture with one master for all write operations and many slaves for read operations.
The architecture of Neo4j is fairly scalable, however it can only scale UP. For most of cases adding more nodes to the Neo4j cluster will be sufficient however if you have a really huge dataset adding node may not scale your system linearly.
In case you are looking for a scaling OUT option you may need to use your own technique to make best use of Neo4j resources. I recently attended GraphConnect developer tutorials session. I was fortunate enough to meet with the creators of Neo4j and discuss some of the options with them. One of the experts from Neo4j at GraphConnect suggested the option of Cache Sharding that many developers liked and wanted to get more understanding on it. I am trying to provide one example where we can use cache sharding and almost linearly scale a Neo4j Graph Database.
In fact, this technique of cache sharding is not really new to Neo4j. It has been used a lot with relational and other data storage as well.
What Is Sharding?
Sharding is a technique to horizontally partition your data. It helps scale out a application linearly by reducing the need of shared access. For example, lets say you have 100M records in your database table. If you want to shard it into 10 parts you can store 10M records on each server. When you need to store more data you can add more servers as the data grows.
Sharding implementations are complex. Sharding can also result in poor performance if done incorrectly. One must have good understanding of the data and its access methods to get best results.
Why Neo4j Does Not Do Sharding Yet?
Many Nosql databases inherently support sharding, but Neo4j does not. I got chance to speak with some experts at the GraphConnect and they mentioned that sharding graph database with a traditional sharding approach may not scale good enough for real time transactions. Each graph is different and therefore a naive sharding may not be good enough to get the best performance. At the same time, current Neo4j clusters are highly scalable for most of the needs. May be in near future, Neo4j will be able to figure out a smarter way of sharding that can make Neo4j even more scalable.
If you are looking for a graph database solution that supports sharding and can scale out for a real time transaction system you may want to check Titan Graph Database as well. Its an open source project and provides three different options of Nosql DB storage including (Cassandra, HBase and BerkleyDB)
How Can I Scale Out Neo4j Read Performance?
You can "almost" linearly scale out Neo4j reads using a technique called cache sharding. I used "almost" since the solution may heavily depend on the type of graph database and you may still see situations when it will not scale out linearly.
What Is Cache Sharding?
Cache sharding is a simple technique that will allow you to have each Neo4j server machine cache a part of the graph data. This can be possible since Neo4j implements its own caching on each cluster machine based on the transaction it receives.
You can put a load balancer in front of Neo4j cluster and shard your transactions based on some logic. This way one type of read transactions will always go to one cluster machine. That will enable Neo4j on this cluster machine to efficiently cache only a subset of the graph.
A Simple Example Of Neo4j Cluster With Cache Sharding
Lets take an example of a graph of people with 100 Million nodes
Lets assume we have 10 Machine Cluster of Neo4j. (m1,m2....m10)
Lets assume we implement a sharding based on GEO location (may be consistent hashing) to divide the transactions. So a transaction related to one city (say San Francisco) will go always to Machine m1 and a transaction for New York city will always go to Machine m2 and likewise for 10 different cities.
Once this sharding is implemented and transactions start flowing through each cluster machines. You will notice that machine m1 is going to cache more data related to city San Francisco whereas machine m2 is going to cache more data related to city New York.
Using this technique, and fine tuning it may result in a almost perfect situation where each machine on cluster is caching around 10M resulting in the whole graph being in memory and being served quickly.
This example is just to explain the concept of cache sharding. Using city or location as your sharding may not always be very efficient. Every graph is different and each domain may have different techniques to shard cache. However you can always find a different parameter that can divide your graph in multiple smaller sub graphs that can be separately cached.
Cluster And RAM Size Calculations For Scalable Reads
If you are planning to do cache sharding you may need to plan accordingly for the cluster and machine RAM size. Lets take a simple example.
Assume the total size of you graph database on hard drive is 100GB
If you can provide at least 10Gb RAM + (some extra) on each machine to Neo4j for caching
Than you may require at least 10 Machine cluster to be able to cache all the database in RAM.
Also keep in mind that you may need to plan in advance for scaling up, therefore keep some extra nodes and memory based on the speed of your data growth.
Limitations Of Cache Sharding
Below are some limitation you may observe based on your application and domain
- Neo4j has a single master node that handles all write transaction. The master node in cluster needs to keep working harder as the size of cluster grows. This will lead you into some limits since you have single node bottleneck.
- Each domain have different types of graphs. In case your graph is heavily connected and sharding is resulting in a completely different subgraph caching.
What About The Write Transactions?
With current Neo4j cluster architecture writes are required to be managed by one master node. This may be a potential bottleneck in case you have really high volume of write transaction in a really large cluster.
Scaling out a system like Neo4j is not always easy, however these techniques may help you improve the performance to some extent. Neo4j has chosen consistency and availability from CAP theorem therefore it will suffer with partition tolerance.
Many times a really huge graph database is consisting of many relatively smaller isolated graphs. Creating Multiple clusters with multiple isolated graph may help you scale out write transactions in such situations, however this may not be always possible. In case everything in your graph database can connect to everything you may face a potential bottleneck.
I hope this article will help you scale out your Neo4j cluster.