10 Hadoop Alternatives - When To Use Hadoop
Using big data technologies for your business is an attractive idea, and Apache Hadoop makes it even more appealing nowadays.
Hadoop is a massively scalable data storage platform that is used as a foundation for many big data projects. Learn more about Hadoop on the Hadoop FAQ and Hadoop best books pages.
Hadoop is powerful, but it has a steep learning curve and requires a significant investment of time and other resources from the company.
Hadoop can be a real game changer for your company if applied the right way, but there are plenty of ways it can go wrong.
On the other hand, many businesses (unlike Google, Facebook or Twitter) do not really have "Big Data" that needs a huge Hadoop cluster to analyze, yet the Hadoop buzzword attracts them.
As David Wheeler said, "All problems in computer science can be solved by another level of indirection". Hadoop offers one such level of indirection. As a software architect, it can be really difficult to make the right decision when your top management is biased toward certain buzzwords.
In this article, I suggest some non-MapReduce alternative approaches that should be tried before investing in Hadoop. You may also want to check out other MapReduce implementations in Hadoop Alternatives - Part 2.
When To Use Hadoop
Here are four simple cases in which to use Hadoop for big data analysis.
- Size Of Overall Data is too large and cannot be handled by current systems.
- Speed Of Data Growth (Growth Velocity) is high and cannot be managed by current systems.
- Very Complex Data Reporting and Analysis from multiple structured and unstructured data sources.
- Data processing can be done offline (not in real time).
Use Cases For Hadoop : When To Use
- Large data size use cases are the most obvious. Websites like Facebook, Twitter and Google face a large volume of user visits, and their databases contain several terabytes of user data including user information, files, images and videos.
- The speed of data growth use case is common in systems where billions of transactions are processed within hours. Common examples are online trading websites and stock exchanges.
- Very Complex Data Reporting and Analysis - These use cases often apply to fraud analysis. The available structured data is often not sufficient for fraud analysis, so you need a lot of unstructured data that can be difficult to analyze using relational systems.
- There can be many scenarios where the data processing can be done offline and does not require user interaction. The biggest example is friend suggestions on social networks. You need not calculate a potential friend immediately; this data can be processed offline and aggregated. Facebook, Twitter and LinkedIn use this technique to suggest friends to you.
When Not To Use Hadoop
Here are the corresponding cases in which not to use Hadoop for big data analysis. We discuss each of these later in the article.
- Size Of Overall Data can be handled by current systems.
- Speed Of Data Growth (Growth Velocity) can be managed by current systems.
- Simple Data Reporting and Analysis Use cases
Use Cases When Not To Use Hadoop
- Content websites, blogs and small eCommerce stores need not use Hadoop and big data processing. Even the world's most visited blogs and news websites may have no need to process big data if they are just serving static content.
- Some companies have years of data sitting in one database. The data has grown very large over time, but the growth rate is not high. In such cases, focus on finding the best way to use relational systems before moving to Hadoop-like systems.
Know Your Data
Let's discuss how we can avoid using Hadoop and make the best use of existing systems and resources.
Size Of Overall Data
Hadoop is designed to work efficiently on large data sets. To give an idea:
- Hadoop has a file system (HDFS) that prefers file sizes of several gigabytes. Therefore, if your data files are only a few megabytes each, it is recommended that you combine (zip or tar) several files into one, to bring it into the range of hundreds of megabytes or a few gigabytes.
- HDFS splits files and stores them in blocks of 64MB, 128MB or bigger.
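To get a feel for why small files are a poor fit, here is a minimal sketch that counts blocks for many small files versus one combined file. It assumes a 128MB block size and that every file occupies at least one block entry; the file counts and sizes are made up for illustration.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size; the article mentions 64MB or 128MB


def blocks_needed(file_size_mb: float, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of blocks a single file occupies (at least one per file)."""
    return max(1, math.ceil(file_size_mb / block_size_mb))


# Hypothetical data: 10,000 files of 2MB each, versus the same data as one file.
small_files_blocks = 10_000 * blocks_needed(2)
combined_blocks = blocks_needed(10_000 * 2)

print(small_files_blocks)  # 10000
print(combined_blocks)     # 157
```

The same 20,000MB of data needs far fewer block entries once the files are combined, which is the reason for the zip or tar advice.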
If your data set is relatively small, it may not be the best use of the giant Hadoop ecosystem. This requires you to understand your data even better, analyze what types of queries are needed and see if your data is really BIG.
On the other hand, measuring data only by the size of your database can be deceptive, since your computations may be larger than the data itself. At times you may be doing mostly mathematical computing, or analyzing the permutations of a small data set, which can be much larger than the actual data. Therefore the key is to "Know Your Data, and know it well".
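A quick back-of-the-envelope check shows how fast the computation can outgrow the data. The sketch below assumes a hypothetical data set of only 20 items and counts its pairs and its possible orderings.

```python
import math

records = 20  # hypothetical: a tiny data set of 20 items

pairs = math.comb(records, 2)        # every unordered pair of items
orderings = math.factorial(records)  # every possible ordering (permutation)

print(pairs)      # 190
print(orderings)  # 2432902008176640000
```

Twenty rows fit on a sticky note, yet enumerating every ordering is far beyond what a single machine can do in reasonable time.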
Speed Of Data Growth (Growth Velocity)
You may have several terabytes of data in your data warehouse or various other data sources. However, one very important factor you must consider before you start creating a Hadoop cluster is the growth of that data.
Ask your data analysts some simple questions, such as:
- How fast is this data growing? Is it growing at a very fast pace?
- What will the size of the data be a few months or years from now?
Many companies have accumulated their data over years, not days or months. If your data is not growing fast, I would suggest considering Archiving and Purging (described later in this article) instead of jumping directly to creating a Hadoop cluster.
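One rough way to answer the questions above is a compound growth projection. This is only a sketch: the starting size and the monthly growth rates are hypothetical numbers, and real growth is rarely this smooth.

```python
def projected_size_tb(current_tb: float, monthly_growth_rate: float, months: int) -> float:
    """Project data size assuming a constant monthly growth rate (an assumption)."""
    return current_tb * (1 + monthly_growth_rate) ** months


# Hypothetical warehouse of 2TB, projected two years ahead.
slow = projected_size_tb(2.0, 0.01, 24)  # 1% growth per month
fast = projected_size_tb(2.0, 0.20, 24)  # 20% growth per month

print(round(slow, 1))  # 2.5
print(round(fast, 1))  # 159.0
```

At 1% per month the warehouse barely changes in two years, so archiving and purging are likely enough. At 20% per month it grows roughly eightyfold, which is a very different conversation.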
Complexity of Data Reporting and Analysis
At times the overall data may not be at the scale of terabytes, but the reporting needs are so complex that they require big data processing. Most heavy mathematical calculations fit this use case. There can also be complex business reports that require big data analysis. These use cases are limited, but you should be able to spot them easily.
How To Reduce Your Data
If you think you really have a large amount of data, you may want to consider some of the approaches below for reducing it to a relatively manageable size. The following options have been used successfully in industry for several years.
Data archiving is the process of moving stale data to a separate data storage for long-term retention (if required).
This requires a really good understanding of your data and of the use cases for your applications. eCommerce companies that process big data often keep the last 3 months of order details in a live database, while older order information is kept in separate storage.
This approach can also be applied to your data warehouse. You can keep more recent data in storage that allows faster reporting and querying. Less frequently used data can live on a different storage device.
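As a minimal sketch of the archiving idea, the snippet below moves old orders from a live table into an archive table. It uses an in-memory SQLite database so it can run anywhere; the table names, columns and cutoff date are hypothetical, and in practice the archive would usually live on separate storage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, created_date TEXT, total REAL);
    CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, created_date TEXT, total REAL);
    INSERT INTO orders VALUES
        (1, '2023-01-15', 40.0),
        (2, '2024-05-20', 15.5),
        (3, '2024-06-01', 99.0);
""")


def archive_orders(conn: sqlite3.Connection, cutoff_date: str) -> int:
    """Move orders created before cutoff_date (ISO format) into orders_archive."""
    with conn:  # one transaction: the copy and the delete succeed or fail together
        conn.execute(
            "INSERT INTO orders_archive SELECT * FROM orders WHERE created_date < ?",
            (cutoff_date,),
        )
        cur = conn.execute("DELETE FROM orders WHERE created_date < ?", (cutoff_date,))
    return cur.rowcount


moved = archive_orders(conn, "2024-03-01")
print(moved)  # 1
```

Dates are stored as ISO strings here so that plain string comparison orders them correctly.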
Consider Purging Data
At times we are busy collecting data without being sure how much of it we should keep. If you store a lot of data that may not be useful, it will make processing of your recent data slower. Understand your business needs and see if older data can be removed and only the trends from that data stored instead. This will not only save space but also speed up analysis of recent data.
One common best practice is to have standard columns in your data warehouse, such as created_date, created_by, update_date and updated_by. Based on these columns, create a daily or monthly cron job that purges data from your data warehouse once it is older than the retention period you need. The logic for purging data may vary by domain, so give it some thought before implementing it.
If you are using an archiving tool, it can usually also be configured to purge data when it is archived.
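The purge job described above could look roughly like the following. This is a sketch under assumptions: the events table, its rows and the 90 day retention period are hypothetical, and SQLite stands in for your real warehouse. A cron entry would call a script like this daily or monthly.

```python
import sqlite3
from datetime import date, timedelta


def purge_events(conn: sqlite3.Connection, retention_days: int, today: date) -> int:
    """Delete rows of the (hypothetical) events table older than the retention period."""
    cutoff = (today - timedelta(days=retention_days)).isoformat()
    with conn:  # commit on success, roll back on error
        cur = conn.execute("DELETE FROM events WHERE created_date < ?", (cutoff,))
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, created_date TEXT, created_by TEXT)"
)
conn.executemany(
    "INSERT INTO events (created_date, created_by) VALUES (?, ?)",
    [("2023-01-10", "etl"), ("2024-05-01", "etl"), ("2024-06-10", "etl")],
)
conn.commit()

deleted = purge_events(conn, retention_days=90, today=date(2024, 6, 15))
print(deleted)  # 1
```

Passing today in as a parameter keeps the job easy to test against a fixed date.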
All Data Is Not Important
You may be tempted to keep all the data you have for your business. Your data may come from a variety of sources, e.g. log files, live transactions, vendor integrations, ETL jobs, marketing campaign data etc. Not all of this data is business critical, and some of it may not be worth keeping in a data warehouse. Filter unwanted data at the source, before it even gets stored in your data warehouse. Be ruthless about each and every column you store in your database tables and ask whether you really need it.
Be Mindful Of What You Want To Collect As Data
Let's say you are in the online video editing business. Would you like to keep all the changes made by your users to each video? This can be a huge volume. You may need to think about storing only metadata in cases where you feel your data warehouse may not be able to handle the full data. Video editing is a purely hypothetical example, but the same idea may apply to a lot of other information related to the data you are storing.
In general, if you have relational data, there is a chance that you get it from multiple sources and that not all of it needs to be stored in your data warehouse.
Hire Analysts Who Understand The Business
By now you may have understood that knowing your data is really important for managing it efficiently. Trust me, this step will help you even more when you conclude that you have tried all of these things and it is time to move to a big data solution like Hadoop.
Hadoop will be almost useless if your data analysts do not understand what to extract from it. Invest in people who understand the business. Encourage them to experiment and learn new ways to look at the same data. Figure out what the low-hanging wins are with your existing infrastructure.
Use Statistical Sampling For Decision Making
Statistical sampling is one of the oldest techniques used by researchers and mathematicians to extrapolate reasonable conclusions about a large data set.
By taking a statistical sample, we can cut the volume down immensely. Instead of tracking millions or billions of data points, we only need to randomly pick a few hundred or a few thousand.
This technique will not provide exact results, but it can give a high-level understanding of a large data set.
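Here is a minimal sketch of the idea using only the Python standard library. The population is synthetic (one million made-up order values), so the numbers mean nothing in themselves; the point is how close the sample average lands to the full average.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: one million order values between 5 and 500.
population = [random.uniform(5, 500) for _ in range(1_000_000)]

# Simple random sample without replacement: 2,000 of the 1,000,000 points.
sample = random.sample(population, 2_000)

full_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"full mean:   {full_mean:.2f}")
print(f"sample mean: {sample_mean:.2f}")
```

The sample touches 0.2% of the data, and its mean typically differs from the full mean by only a few units. Whether that error is acceptable depends on the decision you are making.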
Scaling Techniques
Have You Really Hit The Edge Of Relational Database Processing?
Before you explore other avenues, I would like you to see whether a relational database can handle your data. People have used relational databases for a long time and have managed data warehouses of up to a few terabytes. You may want to try the approaches below before you decide to move to Hadoop.
Data partitioning is the process of logically and/or physically dividing data into parts that are more easily maintained or accessed. Partitioning is supported by most popular open source relational databases (see MySQL Partitioning and Postgres Partitioning).
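To illustrate the concept only, here is a toy sketch of range partitioning by month in plain Python. A real database does this for you through its own partitioning syntax; the row layout and helper names below are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Partition key "YYYY-MM" -> rows stored in that partition.
partitions: dict[str, list[dict]] = defaultdict(list)


def partition_key(created: date) -> str:
    """Range partitioning by calendar month."""
    return f"{created.year:04d}-{created.month:02d}"


def insert(row: dict) -> None:
    """Route each row to the partition that owns its created_date."""
    partitions[partition_key(row["created_date"])].append(row)


def rows_for_month(year: int, month: int) -> list[dict]:
    """Read only the one partition that can contain matching rows."""
    return partitions.get(f"{year:04d}-{month:02d}", [])


insert({"id": 1, "created_date": date(2024, 5, 20)})
insert({"id": 2, "created_date": date(2024, 6, 1)})
insert({"id": 3, "created_date": date(2024, 6, 18)})

print(len(rows_for_month(2024, 6)))  # 2
print(sorted(partitions))            # ['2024-05', '2024-06']
```

A query for June never looks at the May rows, and an old partition can be dropped or archived as a unit. That is the maintenance and access benefit partitioning gives you.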
Try Database Sharding Approach For Relational Database
Database sharding can be used as a last resort when you hit the edge of a relational database's processing speed. This approach can be applied if you can logically separate the data into different nodes and have few cross-node joins in your analysis. In web applications, a common approach is to shard by user, so that all information related to one user is stored on one node, which keeps per-user queries fast.
Sharding is not easy, and this scheme may not suit you if you have a lot of complex relations and there is no easy way to separate the data into different nodes. The purpose of sharding is defeated if your application requires a lot of cross-node joins.
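The user-based sharding described above can be sketched as a small routing function. The node names are hypothetical, and hashing the user id is just one possible scheme; a lookup table that maps users to nodes is another.

```python
import hashlib

# Hypothetical database nodes; in practice these would be connection settings.
SHARDS = ["db-node-0", "db-node-1", "db-node-2", "db-node-3"]


def shard_for_user(user_id: str) -> str:
    """Pick the node that stores all data for this user."""
    # A stable hash, so the same user maps to the same node on every
    # machine and in every process.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]


node = shard_for_user("alice")
print(node == shard_for_user("alice"))  # True
```

One design caveat: with plain modulo hashing, changing the number of nodes remaps most users to different nodes, so plan the node count, or a remapping strategy, up front.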
I have been asked by top management in different companies to choose Hadoop as an option for doing something. It has always been difficult to convince them otherwise, but when I bring this information to them they are forced to think twice. I was fortunate enough to save some money for the companies I worked for.
You may find that you have tried all possible options for scaling up your relational database. Maybe this is the time to start looking into setting up a Hadoop cluster.
To start with, you may want to use the VM images provided by Cloudera. They are really handy for doing a quick proof of concept on Hadoop with your existing infrastructure.
What is your experience with big data? Please share it with us in the comments section.