The Way Things Were
Previously, data was backed up to tape, and someone had to drive to a data center and load the data onto one of the available backup servers. During disaster recovery, you had to drive over to collect the tapes and then spend hours recovering lost data. That is still the case in some situations. However, real-time replication and the need to restore lost data faster have prompted a change in strategy. Data storage companies now offer snapshot technology that lets users easily go back in time and retrieve lost data.
Successful data storage and retrieval has been one of the major reasons people wanted access to distributed computing. Today’s corporations want commercial intelligence and access to flexible, predictable IT infrastructure, including service on demand with an eye on cost. With these factors in mind, cloud computing services have been making headway. Organizations do not want to throw out all their IT infrastructure and invest in new systems just because the business context has changed. They want to be able to scale their IT environment to meet current needs rather than let IT dictate how they do business. That is the appeal of cloud computing. However, data security seems to be one of the major concerns preventing many businesses from putting their data in the cloud.
Hadoop: When Cloud Computing Is Not Just About Storage
There is more to life in the cloud than storage and retrieval. Some companies want to access their data and run continuous analysis for business intelligence. You can invest in a brand-new clustered computing environment and employ database and systems specialists to manage the setup; that is what many companies do. But what do you do when you no longer need the data? You will have a few dead weights to get rid of.
At moments like these, you have the option of taking a more pragmatic approach and using a software framework like Hadoop within a cloud context. What does that imply?
Suppose your company has a lot of structured and unstructured data that has been lying around for years. You don’t want to throw it away because it might be useful, but you haven’t done anything with it because you don’t know how to take advantage of the information you have at hand. That is what Hadoop is designed to do: help you make sense of the data you have. Imagine the way Google uses its search algorithm to crawl the web and give users meaningful, relevant information; that is partly what can be achieved with Hadoop. In fact, Hadoop’s MapReduce framework was inspired by the MapReduce papers Google published, although Google itself did not build Hadoop.
Hadoop is designed to work in a distributed computing environment where big tasks are divided into smaller chunks. These smaller tasks are then distributed across multiple computers for faster execution. With this in mind, most companies would rather consolidate their computing workloads on rented, virtualized servers than buy and maintain that hardware themselves, in order to reduce cost. This is where cloud computing meets Hadoop. Companies can purchase multiple virtual machines from cloud vendors such as Amazon, through its EC2 service, or Microsoft Azure. These machines can be configured to run Hadoop in a clustered, parallel computing configuration, which increases speed of execution and improves efficiency. If the company requires better performance, it is simply a case of asking the service provider for more storage, CPU, RAM, or network bandwidth. These resources can be scaled up or down according to current business needs.
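The divide-and-distribute idea described above can be sketched in plain Python using the classic word-count example. This is an illustrative simulation of the MapReduce pattern, not actual Hadoop API code: in a real cluster, each chunk's map step would run on a different node, and Hadoop would handle the shuffle between them.

```python
# Minimal local simulation of the MapReduce pattern Hadoop uses:
# split a big job into chunks, map each chunk independently,
# shuffle the intermediate pairs by key, then reduce each group.
from collections import defaultdict

def map_chunk(chunk):
    """Map step: emit (word, 1) pairs for one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    """Shuffle step: group intermediate values by key across all chunks."""
    groups = defaultdict(list)
    for pairs in mapped_pairs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_group(key, values):
    """Reduce step: combine all values emitted for one key."""
    return key, sum(values)

def word_count(chunks):
    # On a real cluster, each map_chunk call would run on a separate node.
    mapped = [map_chunk(c) for c in chunks]
    grouped = shuffle(mapped)
    return dict(reduce_group(k, v) for k, v in grouped.items())

counts = word_count(["big data big cloud", "cloud data data"])
print(counts)  # {'big': 2, 'data': 3, 'cloud': 2}
```

Because the map step has no shared state between chunks, adding more machines (or virtual machines) lets more chunks be processed at once, which is exactly why the pattern scales so well in a cloud environment.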
Hadoop in the cloud is about dealing with terabytes of data, but the idea is not only for businesses with big data. Non-profit organizations that need a lot of computing power to run meaningful data analysis can also benefit from running Hadoop in a cloud configuration.
The Apache Hadoop core relies on MapReduce and the Hadoop Distributed File System (HDFS). The details are beyond the scope of this article, but together they offer a viable way to bring order to “Big Data” and derive improved analysis from it.