Using big data technologies for your business is very attractive, and Apache Hadoop makes it even more appealing nowadays.
Hadoop is a massively scalable data storage platform that serves as the foundation for many big data projects.
Hadoop is powerful; however, it has a steep learning curve and requires a significant investment of time and other resources from the company.
Applied the right way, Hadoop can be a real game-changer for your company; however, there are plenty of ways it can go wrong.
On the other hand, many businesses (unlike Google, Facebook, or Twitter) do not really have “Big Data” that needs a huge Hadoop cluster to analyze, yet the Hadoop buzzword is attracting them.
In this article, I suggest some non-MapReduce approaches that should be tried before making an investment in Hadoop. You may also want to check out other MapReduce implementations in Hadoop Alternatives – Part 2.
When To Use Hadoop
Here are four simple situations in which Hadoop is a good fit for big data analysis.
- The overall data size is too large and cannot be handled by current systems.
- The speed of data growth (growth velocity) is high and cannot be managed by current systems.
- Very complex data reporting and analysis across multiple structured and unstructured data sources.
- Data processing can be done offline (not in real time).
Use Cases For Hadoop: When To Use
- Large data size use cases are fairly obvious. Websites like Facebook, Twitter, and Google face a large volume of user visits, and their databases contain several terabytes of user data, including user information, files, images, and videos.
- Speed-of-data-growth use cases are common in online trading systems, where billions of transactions are processed within hours. Common examples are online trading websites and stock exchanges.
- Very complex data reporting and analysis – these use cases often apply to fraud analysis. The available structured data is often not sufficient for fraud analysis, so you also need a lot of unstructured data, which can be difficult to analyze using relational systems.
- There are many scenarios where data processing can be done offline and does not require user interaction. The biggest example is friend suggestions on social networks. You need not calculate a potential friend immediately; this data can be processed offline and aggregated. Facebook, Twitter, and LinkedIn all use this technique to suggest friends.
When Not To Use Hadoop
Here are some equally obvious use cases when not to use Hadoop for big data analysis. We discuss each of these in the latter part of the article.
- The overall data size can be handled by current systems.
- The speed of data growth (growth velocity) can be managed by current systems.
- Simple data reporting and analysis use cases.
Use Cases When Not To Use Hadoop
- Content websites, blogs, and small eCommerce stores need not use Hadoop and big data processing. Even the world's most-visited blogs and news websites may have no need to process big data if they are just serving static content.
- Some companies have years of data sitting in one database. The data size has grown large over time; however, the growth rate is not high. Such companies should focus on finding the best way to use relational systems before moving to Hadoop-like systems.
Know Your Data
Let’s discuss how we can avoid using Hadoop and make the best use of existing systems and resources.
Size Of Overall Data
Hadoop is designed to work efficiently on large data sets. Just to give an idea:
- Hadoop has a file system (HDFS) that prefers file sizes of several gigabytes. If your data files are only a few megabytes each, it is recommended that you combine (zip or tar) several files into one to get into the range of hundreds of megabytes or a few gigabytes.
- HDFS splits files and stores them in blocks of 64 MB, 128 MB, or larger.
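As a rough sketch of the file-combining idea, the snippet below bundles many small files into one compressed archive before it is loaded into HDFS. The directory layout and function name are hypothetical:

```python
import os
import tarfile

def bundle_small_files(src_dir, archive_path):
    """Combine many small files into a single tar archive,
    getting closer to the large file sizes HDFS prefers."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in sorted(os.listdir(src_dir)):
            # arcname keeps paths inside the archive relative
            tar.add(os.path.join(src_dir, name), arcname=name)
    return os.path.getsize(archive_path)
```

A job like this could run periodically, so that what lands in HDFS is a few large archives rather than thousands of tiny files.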
If your data set is relatively small, it may not be the best use of the giant Hadoop ecosystem. This requires you to start understanding your data, analyze what types of queries are needed, and see if your data is really BIG.
On the other hand, measuring data just by the size of your database can be deceptive, since your computations may be larger. At times you may be doing heavy mathematical computing, or analyzing the permutations of a small data set, which can be much larger than the actual data. The key, therefore, is to “Know your data, and know it well”.
Speed Of Data Growth (Growth Velocity)
You may have several terabytes of data in your data warehouse or various other data sources. However, one very important factor you must consider before you start creating a Hadoop Cluster is the growth of that data.
Ask your data analysts some simple questions, like:
- How fast is this data growing? Is it growing at a very fast pace?
- What will the size of the data be a few months or years from now?
Several companies have grown their data over years, not days or months. If your data growth is not really that fast, I would suggest considering archiving and purging (described later in this article) instead of jumping directly into creating a Hadoop cluster.
The complexity of Data Reporting and Analysis
At times the overall data may not be at the terabyte scale; however, the reporting needs may be so complex that they require big data processing. Most heavy mathematical calculations fit this use case, but there can also be complex business reports that require big data analysis. These use cases are limited; however, you should be able to spot them easily.
How To Reduce Your Data
If you think you really have large data, you may want to consider some of the approaches below for reducing it to a relatively manageable size. The following options have been used successfully in the industry for several years.
Consider Archiving Data
Data archiving is the process of moving stale data to separate data storage for long-term retention (if required).
This requires a really good understanding of your data and of your applications' use cases. For example, eCommerce companies that do big data processing keep the last three months of order details in the live database, while older order information is kept in separate storage.
This approach can also be applied to your data warehouse. Keep recent data on fast storage for quicker reporting and querying, and move less frequently used data to a different storage device.
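A minimal sketch of the archiving idea, using SQLite for illustration; the orders table, its order_date column, and the archive table name are all hypothetical:

```python
import sqlite3

def archive_old_orders(conn, cutoff_date):
    """Move orders older than cutoff_date from the live table
    to a separate archive table (schema is hypothetical)."""
    cur = conn.cursor()
    # Create the archive table with the same columns, if missing
    cur.execute(
        "CREATE TABLE IF NOT EXISTS orders_archive AS "
        "SELECT * FROM orders WHERE 0"
    )
    cur.execute(
        "INSERT INTO orders_archive SELECT * FROM orders "
        "WHERE order_date < ?", (cutoff_date,)
    )
    moved = cur.rowcount
    cur.execute("DELETE FROM orders WHERE order_date < ?", (cutoff_date,))
    conn.commit()
    return moved
```

In production the archive table would live in separate, cheaper storage rather than the same database, but the move-then-delete pattern is the same.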
Consider Purging Data
At times we are busy collecting data without really knowing how much of it we should keep. Storing a lot of data that may never be useful will make processing your recent data slower. Understand your business needs and see whether older data can be removed, with only the trends derived from it stored instead. This will not only save space but also help speed up analysis of recent data.
One common best practice is to have standard columns in your data warehouse, such as created_date, created_by, updated_date, and updated_by. Using these columns, create a daily or monthly cron job that purges rows older than the retention period you want to keep in your data warehouse. The purging logic may vary based on your domain, so give it some thought before implementing it.
If you are using an archiving tool, it can also be easily configured to purge data on the archive.
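The cron-driven purge might look something like the sketch below; the retention window, table name, and created_date column are assumptions to adapt to your own schema:

```python
import sqlite3
from datetime import date, timedelta

RETENTION_DAYS = 365  # assumption: keep one year of warehouse data

def purge_stale_rows(conn, table="events"):
    """Delete rows whose created_date falls outside the retention
    window. Table and column names are hypothetical."""
    cutoff = (date.today() - timedelta(days=RETENTION_DAYS)).isoformat()
    # Table name is interpolated for the sketch; validate it against
    # a known list in real code to avoid SQL injection.
    cur = conn.execute(
        f"DELETE FROM {table} WHERE created_date < ?", (cutoff,)
    )
    conn.commit()
    return cur.rowcount
```

Scheduled daily or monthly, a job like this keeps the warehouse bounded to the data you actually query.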
All Data Is Not Important
You may be tempted to keep all the data your business generates. Data can come from a variety of sources, e.g. log files, live transactions, vendor integrations, ETL jobs, marketing campaigns, etc. You should know that not all of it is business-critical or worth keeping in a data warehouse. Filter unwanted data at the source, before it even gets stored in your data warehouse. Scrutinize each and every column you store in your database tables and ask whether you really need it.
Be Mindful Of What You Want To Collect As Data
Let’s say you are in the online video editing business. Would you like to keep all the changes your users make to each video? This can be a huge volume. You may need to consider storing only metadata in cases where you feel your data warehouse cannot handle the full data. Video editing is a hypothetical example; however, the same reasoning applies to a lot of other information related to the data you are storing.
In general, if you have relational data, there is a chance you get it from multiple sources, and not all of it needs to be stored in your data warehouse.
Hire Analysts Who Understand The Business
By now you may have understood that knowing your data is really important for managing it efficiently. Trust me, this step will help you even more when you reach the point of thinking, “I have tried all these things, and it's time to move to a big data solution like Hadoop.”
Hadoop will be almost useless if your data analysts do not understand what to extract from it. Invest in people who understand the business. Encourage them to experiment and learn new ways to look at the same data. Figure out what low-hanging wins are possible with existing infrastructure.
Use Statistical Sampling For Decision Making
Statistical sampling is one of the oldest techniques used by researchers and mathematicians to draw reasonable conclusions from large data sets.
By taking a statistical sample, the volume can be cut down immensely. Instead of scanning billions or millions of data points, we only need to randomly pick a few thousand or a few hundred.
This technique will not provide exact results; however, it can be used to get a high-level understanding of a large data set.
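A minimal illustration of the idea, assuming for the demo that the data fits in a Python list; against a real warehouse you would sample at the query level instead:

```python
import random
import statistics

def estimate_mean(population, sample_size=1000, seed=42):
    """Estimate the population mean from a small random sample
    instead of scanning every data point."""
    random.seed(seed)  # fixed seed for a reproducible demo
    sample = random.sample(population, min(sample_size, len(population)))
    return statistics.mean(sample)
```

On a million data points, a sample of a thousand typically lands within a few percent of the true mean, which is plenty for the high-level understanding described above.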
Have You Really Hit The Edge Of Relational Database Processing?
Before you explore other avenues, I would like you to see whether a relational database can handle your workload. People have used relational databases for a long time and have managed data warehouses of up to a few terabytes. You may want to try the approaches below before deciding to move to Hadoop.
Try Data Partitioning Approach For Relational Database
Data partitioning is the process of logically and/or physically dividing data into parts that are more easily maintained or accessed. Partitioning is supported by most popular open-source relational databases (MySQL Partitioning and Postgres Partitioning).
Try Database Sharding Approach For Relational Database
Database sharding can be used as a last resort when you hit the limits of relational database processing speed. This approach applies when you can logically separate the data onto different nodes and your analysis requires few cross-node joins. In web applications, a common approach is to shard by user, so that all information related to one user is stored on one node, ensuring the best speed.
Sharding is not easy, and this scheme may not suit you if you have many complex relations and no easy way to separate data across nodes. The purpose of sharding is defeated if your application requires a lot of cross-node joins.
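A sketch of user-based shard routing under the scheme described above; the node names are hypothetical:

```python
import hashlib

NODES = ["shard-0", "shard-1", "shard-2", "shard-3"]  # hypothetical nodes

def shard_for_user(user_id: str) -> str:
    """Route all of one user's data to a single shard via a stable
    hash, so per-user queries never need cross-node joins."""
    # sha256 is deterministic across processes (unlike Python's
    # built-in hash(), which is salted per run)
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]
```

Because the mapping is deterministic, the same user always resolves to the same node; rebalancing when the node count changes is the hard part this sketch leaves out.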
I have been asked by top management at different companies to choose Hadoop for one project or another. It has always been difficult to convince them otherwise; however, when I bring this information to them, they are forced to think twice. I was fortunate enough to save some money for the companies I worked for.
You may find that you have tried all possible options for scaling up your relational database. Maybe this is the time you should start looking into setting up a Hadoop cluster.
To start with, you may want to use the VM images provided by Cloudera. They are really handy for doing a quick proof of concept with Hadoop alongside your existing infrastructure.
What is your experience with big data? Please share with us in the comments section.
Sachin, one other option to look at is the HPCC Systems platform. My personal experience is that I was able to download the VM image and get started with loading and transforming data in a few minutes. Thanks to their central deployment tool, setting up a 3-node Ubuntu cluster was really simple. The inherent parallelism and dataflow nature of the powerful ECL language removes the worry of trying to parallelize my jobs, as was the case in my experience with Hadoop MapReduce. In fact, I have to say ECL is somewhat similar to SQL in the sense that both are declarative data programming languages. So if you are a good SQL developer, ECL should be a breeze to understand and use. It is a mature platform and provides a data delivery engine together with a data transformation and linking system. The main advantages over other alternatives are the real-time delivery of data queries and the extremely powerful ECL programming model. More info at
HPCC sounds like an interesting project; is it open source too? I have not really spent much time on big data options other than Hadoop.
Hey, agree with all the points. Very insightful. Many firms are installing Hadoop because it's fashionable, even though they don't need it.