Using big data technologies for your business is an attractive idea, and Apache Hadoop makes it even more appealing nowadays.
Hadoop is a massively scalable data storage platform that is used as the foundation for many big data projects.
Hadoop is powerful; however, it has a steep learning curve and requires a significant investment from the company in time and other resources.
Hadoop can be a real game changer for your company if applied the right way, but there are plenty of ways it can go wrong.
On the other hand, many businesses (unlike Google, Facebook or Twitter) do not really have "The Big Data" that needs a huge Hadoop cluster to analyze, yet the Hadoop buzzword attracts them.
As David Wheeler said, "All problems in computer science can be solved by another level of indirection". Hadoop offers one such level of indirection. As a software architect, you may find it really difficult to make the right decision when your top management is biased toward certain buzzwords.
In this article, I suggest some alternatives that should be tried before investing in Hadoop.
Know Your Data
Size Of Overall Data
Hadoop is designed to work efficiently on large data sets. Just to give an idea:
- Hadoop has a file system (HDFS) which likes file sizes of several gigabytes. Therefore, if your data files are only a few megabytes each, it is recommended that you combine (zip or tar) several files into one to bring it into the range of hundreds of megabytes or a few gigabytes.
- HDFS splits files and stores them in blocks of 64 MB, 128 MB or bigger.
If your data set is relatively small, it may not be the best use of the giant Hadoop ecosystem. This requires you to understand your data even better: analyze what types of queries are needed and see if your data is really BIG.
On the other hand, measuring data only by the size of your database may be deceptive, since your computations may be larger. At times you may be doing heavy mathematical computation, or analyzing the permutations of a small data set, which can be much larger than the data itself. Therefore the key is to “Know Your Data, and know it well”.
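As a first step toward knowing your data, a quick survey of the files you already have can be useful. The following is a minimal Python sketch, not a definitive tool: the function name summarize_data_dir is made up for illustration, and the 128 MB block size is an assumption that you should replace with the value your cluster would actually use.

```python
import os

# Assumed HDFS block size in bytes (128 MB); adjust to your own setup.
HDFS_BLOCK_SIZE = 128 * 1024 * 1024


def summarize_data_dir(root, block_size=HDFS_BLOCK_SIZE):
    """Return (total_bytes, file_count, small_file_count) for files under root.

    A file counts as "small" when it is smaller than one block.
    """
    total_bytes = 0
    file_count = 0
    small_file_count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # skip files that vanish or cannot be read
            total_bytes += size
            file_count += 1
            if size < block_size:
                small_file_count += 1
    return total_bytes, file_count, small_file_count
```

Calling summarize_data_dir on your export or log directory gives you the total volume and tells you how many files are smaller than a single block. If the total is a few gigabytes, or nearly every file is tiny, that is a hint that the data may not be "BIG" in the sense Hadoop is designed for.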
Speed Of Data Growth (Growth Velocity)
You may have several terabytes of data in your data warehouse or various other data sources. However, one very important factor you must consider before you start creating a Hadoop cluster is the growth of that data.
Ask your data analysts some simple questions like:
- How fast is this data growing? Is it growing at a very fast pace?
- What will be the size of the data a few months or years from now?
Several companies have grown their data over years, not days or months. If your data is not growing that fast, I would suggest considering the options of archiving and purging (described later in this article) instead of jumping directly to creating a Hadoop cluster.
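To put rough numbers on the questions above, you can project your data size forward. This is a back-of-the-envelope Python sketch that assumes a constant compound monthly growth rate, which real data rarely follows exactly. The function names and all figures in the example are hypothetical.

```python
def project_size(current_tb, monthly_growth_rate, months):
    """Project data size assuming compound growth at a fixed monthly rate."""
    return current_tb * (1 + monthly_growth_rate) ** months


def months_until(current_tb, monthly_growth_rate, limit_tb):
    """Count the months until the projected size first reaches limit_tb.

    Returns None when the data is not growing and the limit is never reached.
    """
    if current_tb >= limit_tb:
        return 0
    if monthly_growth_rate <= 0:
        return None
    months = 0
    size = current_tb
    while size < limit_tb:
        size *= 1 + monthly_growth_rate
        months += 1
    return months


if __name__ == "__main__":
    # Hypothetical figures: 2 TB today, growing 3% per month, 10 TB capacity.
    print(f"Size in 24 months: {project_size(2.0, 0.03, 24):.2f} TB")
    print(f"Months until 10 TB: {months_until(2.0, 0.03, 10.0)}")
```

If the answer is measured in years rather than months, archiving and purging are likely to buy you far more time than a new cluster would.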
How To Reduce Your Data
If you think you really have a large amount of data, you may want to consider some of the approaches below for reducing it to a relatively manageable size. The following options have been used successfully in the industry for several years.
Data archiving is the process of moving stale data to separate data storage for long-term retention (if required).
This requires a really good understanding of your data and of the use cases for your applications. For example, eCommerce companies that process big data keep the last 3 months of order details in a live database, while older order information is kept in separate storage.
This approach can also be applied to your data warehouse. You can keep more recent data in storage that allows faster reporting and querying. Less frequently used data can live on a different storage device.
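The order archiving idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses SQLite from the standard library so that it runs anywhere, the tables orders and orders_archive are hypothetical and share the same columns, and created_date is assumed to be stored as an ISO date string (YYYY-MM-DD). In a real setup the archive would normally live in separate storage.

```python
import sqlite3
from datetime import date, timedelta


def archive_old_orders(conn, today, keep_days=90):
    """Move orders older than keep_days from orders to orders_archive.

    Returns the number of rows moved.
    """
    cutoff = (today - timedelta(days=keep_days)).isoformat()
    with conn:  # one transaction: the copy and the delete succeed or fail together
        conn.execute(
            "INSERT INTO orders_archive SELECT * FROM orders WHERE created_date < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM orders WHERE created_date < ?", (cutoff,)
        ).rowcount
    return moved


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    schema = "(id INTEGER PRIMARY KEY, created_date TEXT, total REAL)"
    conn.execute("CREATE TABLE orders " + schema)
    conn.execute("CREATE TABLE orders_archive " + schema)
    conn.executemany(
        "INSERT INTO orders (created_date, total) VALUES (?, ?)",
        [("2024-01-15", 20.0), ("2024-06-01", 35.5)],
    )
    print("Rows archived:", archive_old_orders(conn, date(2024, 6, 30)))
```

Wrapping both statements in a single transaction is a deliberate choice: if the delete fails, the copy is rolled back as well, so a row is never lost or duplicated between the two tables.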
Consider Purging Data
At times we are busy collecting data and not really sure how much of it we should keep. If you are storing a lot of data that may not be useful, it is going to make processing of your recent data slower. Understand your business needs and see if older data can be removed and only the trends from that data stored instead. This will not only save you space but also help you analyze recent data faster.
One common best practice for this is having standard columns in your data warehouse like created_date, created_by, update_date, updated_by. Based on these columns, create a daily or monthly cron job that purges data from your data warehouse once it is older than the period you want to keep. The logic of purging data may vary based on your domain, so give it some thought before implementing it.
If you are using an archiving tool, it can also be easily configured to purge data on archive.
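A purge job of the kind described above, one that keeps the trends and removes the detail, might look like the following Python sketch. It is a simplified illustration, not a production job: it uses SQLite for portability, the tables events and monthly_trends and the amount column are hypothetical, and created_date is assumed to be an ISO date string. The cutoff is rounded down to the first day of a month so that each month is summarized exactly once.

```python
import sqlite3
from datetime import date, timedelta


def purge_with_trends(conn, today, retention_days=365):
    """Summarize expired rows per month, then delete them.

    Only whole months are purged, so every month gets a single summary row.
    Returns the number of detail rows deleted.
    """
    cutoff_day = (today - timedelta(days=retention_days)).replace(day=1)
    cutoff = cutoff_day.isoformat()
    with conn:  # summarize and delete inside one transaction
        conn.execute(
            """
            INSERT INTO monthly_trends (month, row_count, total_amount)
            SELECT substr(created_date, 1, 7), COUNT(*), SUM(amount)
            FROM events
            WHERE created_date < ?
            GROUP BY substr(created_date, 1, 7)
            """,
            (cutoff,),
        )
        purged = conn.execute(
            "DELETE FROM events WHERE created_date < ?", (cutoff,)
        ).rowcount
    return purged
```

A script like this could be scheduled from cron once a day or once a month. What you summarize (counts, sums, averages) depends entirely on your domain, so treat the chosen aggregates as placeholders.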
All Data Is Not Important
You may be tempted to keep all the data you have for your business. Your data may be coming in from a variety of sources, e.g. log files, live transactions, vendor integrations, ETL jobs, marketing campaign data, etc. You should know that not all data is business critical, and keeping it in a data warehouse may not be helpful. Filter unwanted data at the source itself, before it even gets stored in your data warehouse. Be obsessive about each and every column you are storing in your database tables and analyze whether you really need it.
Be Mindful Of What You Want To Collect As Data
Let's say you are in the online video editing business. Would you like to keep all the changes made by your users on each video? This can be a huge volume. You may need to think about storing only metadata in cases where you feel your data warehouse may not be able to handle it. Video editing is a hypothetical example, but the same reasoning may apply to a lot of other information related to the data you are storing.
In general, if you have relational data, there is a chance that you get it from multiple sources, and not all of it needs to be stored in your data warehouse.
Hire Analysts Who Understand The Business
By now you may have understood that knowing the data is really important for managing it efficiently. Trust me, this step is going to help you even more when you reach the point of thinking, "I have tried all these things and it's time we move to a big data solution like Hadoop."
Hadoop will be almost useless if your data analysts do not understand what to extract out of it. Invest in people who understand the business. Encourage them to experiment and learn new ways to look at the same data. Figure out what the low-hanging wins are with your existing infrastructure.
Use Statistical Sampling For Decision Making
Statistical sampling is a very old technique used by researchers and mathematicians to extrapolate reasonable conclusions about a large data set.
By taking a statistical sample, the volume can be cut down immensely. Instead of tracking millions or billions of data points, we only need to randomly pick a few hundred or a few thousand.
This technique will not provide exact results; however, it can be used to get a high-level understanding of a large data set.
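As an illustration of the idea, the following Python sketch estimates the mean of a large data set from a simple random sample and reports an approximate 95% margin of error. It is a minimal example using only the standard library. The function name estimate_mean and the generated population are made up for the demonstration, and the margin relies on the usual normal approximation, which assumes a reasonably large sample.

```python
import math
import random
import statistics


def estimate_mean(values, sample_size, seed=None):
    """Estimate the mean of values from a simple random sample.

    Returns (estimate, margin), where margin is an approximate 95% margin
    of error based on the normal approximation.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)  # sampling without replacement
    estimate = statistics.fmean(sample)
    margin = 1.96 * statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, margin


if __name__ == "__main__":
    # Hypothetical population: 200,000 order values.
    rng = random.Random(7)
    population = [rng.gauss(100.0, 15.0) for _ in range(200_000)]
    estimate, margin = estimate_mean(population, 1_000, seed=7)
    print(f"Sample estimate: {estimate:.2f} +/- {margin:.2f}")
    print(f"Full-scan mean:  {statistics.fmean(population):.2f}")
```

The sample touches 0.5% of the rows in this example, and the margin shrinks with the square root of the sample size, so quadrupling the sample only halves the uncertainty.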
Scaling Techniques
Have You Really Hit The Edge Of Relational Database Processing?
Before you explore other avenues, I would like you to see if a relational database is able to handle it. People have used relational databases for a long time and have managed data warehouses of up to a few terabytes. You may want to try the approaches below before you decide to move to Hadoop.
Data partitioning is the process of logically and/or physically dividing data into parts that are more easily maintained or accessed. Partitioning is supported by most of the popular open source relational databases (MySQL Partitioning and Postgres Partitioning).
Try Database Sharding Approach For Relational Database
Database sharding can be used as a last resort when you hit the edge of relational database processing speed. This approach can be applied if you can logically separate the data onto different nodes and have few cross-node joins in your analysis. In web applications, a common approach is to shard by user, so that all information related to one user is stored on one node, which gives the best speed.
Sharding is not easy, and this scheme may not be suitable for you if you have a lot of complex relations and there is no easy way to separate the data onto different nodes. The purpose of sharding may be defeated if your application requires a lot of cross-node joins.
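The user-based sharding described above needs a rule that always sends the same user to the same node. One simple way to do that, shown here as a Python sketch and not as the only or best scheme, is to take a stable hash of the user id modulo the number of shards. The node names and the function name shard_for_user are hypothetical.

```python
import hashlib

# Hypothetical shard names; in practice these would map to connection settings.
SHARDS = ["db-node-0", "db-node-1", "db-node-2", "db-node-3"]


def shard_for_user(user_id, shards=SHARDS):
    """Pick a shard deterministically from a stable hash of the user id."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


if __name__ == "__main__":
    for user_id in (1, 2, 3, "alice"):
        print(user_id, "->", shard_for_user(user_id))
```

A cryptographic hash is used instead of Python's built-in hash() because the built-in is randomized per process for strings, so it would not route a user consistently across restarts. Note that with plain modulo hashing, changing the number of shards remaps most users, which is one of the reasons sharding is not easy.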
I have been asked by top management in different companies to choose Hadoop as an option for doing something. It has always been difficult to convince them otherwise; however, when I bring this information to them, they are forced to think twice. I was fortunate enough to save some money for the companies I worked for.
You may find that you have tried every possible option for scaling up your relational database. Maybe this is the time to start looking into setting up a Hadoop cluster.
To start with, you may want to use the VM images provided by Cloudera. They are really handy for doing a quick proof of concept on Hadoop with your existing infrastructure.
What is your experience with big data? Please share it with us in the comments section.