In yesterday’s blog post we learned what NoSQL is. In this article we will take a quick look at one of the four most important buzz words surrounding Big Data – Hadoop.
What is Hadoop?
Apache Hadoop is a free, open-source, Java-based software framework that offers a powerful distributed platform to store and manage Big Data. It is licensed under the Apache License 2.0. It runs applications on large clusters of commodity hardware and can process thousands of terabytes of data across thousands of nodes. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers. The major advantage of the Hadoop framework is that it provides reliability and high availability.
What are the core components of Hadoop?
There are two major components of the Hadoop framework, and each of them performs one of its important tasks.
Hadoop MapReduce is the method of splitting a large data problem into smaller chunks and distributing them to many different commodity servers. Each server has its own set of resources and processes its chunk locally. Once the commodity servers have processed the data, they send their results back to the main server, where the results are combined. This lets us process large data sets effectively and efficiently. (We will understand this in tomorrow’s blog post).
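The split, process-locally and combine idea can be sketched in a few lines of plain Python. This is only an illustration that runs on a single machine and does not use the Hadoop API; the function names (split_into_chunks, map_phase, shuffle, reduce_phase, word_count) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict


def split_into_chunks(lines, num_workers):
    """Split the input into one chunk per (hypothetical) worker server."""
    return [lines[i::num_workers] for i in range(num_workers)]


def map_phase(chunk):
    """Each worker emits a (word, 1) pair for every word in its own chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_outputs):
    """Group the values emitted by all workers by their key."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the grouped values into one final count per word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, num_workers=3):
    chunks = split_into_chunks(lines, num_workers)
    # On a real cluster each chunk would be processed on a different server;
    # here the chunks are processed one after another in a single process.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    data = ["big data is big", "hadoop stores big data", "hadoop processes data"]
    print(word_count(data))
```

Changing num_workers changes how the input is divided but not the final counts, which is the property that lets the work be spread over many servers.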
Hadoop Distributed File System (HDFS) is a virtual file system. There is a big difference between HDFS and most other file systems. When we move a file onto HDFS, it is automatically split into many pieces, called blocks. Each block is replicated (usually three copies) and stored on different servers for fault tolerance and high availability. (We will understand this in the day after tomorrow’s blog post).
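A minimal sketch of the splitting and replication idea follows. It is a toy model, not HDFS code: the function names, the node names, the tiny block size and the simple round-robin placement are all assumptions made for illustration, and a real cluster uses much larger blocks and its own placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is an assumption for this sketch, not the policy
    HDFS actually uses.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block
        for block, replicas in placement.items()
        if any(node not in failed_nodes for node in replicas)
    ]


if __name__ == "__main__":
    blocks = split_into_blocks(b"x" * 10, block_size=4)
    placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
    print(placement)
    # With three copies of every block, losing two nodes leaves each block readable.
    print(readable_blocks(placement, failed_nodes={"node1", "node2"}))
```

Because every block lives on three different nodes in this model, any two nodes can fail without a block becoming unreadable.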
Besides the above two core components, the Hadoop project also contains the following modules.
Hadoop Common: Common utilities for the other Hadoop modules
Hadoop YARN: A framework for job scheduling and cluster resource management
There are a few other projects related to Hadoop as well (like Pig and Hive), which we will gradually explore in later blog posts.
A Multi-node Hadoop Cluster Architecture
Now let us quickly see the architecture of a multi-node Hadoop cluster.
A small Hadoop cluster includes a single master node and multiple worker (or slave) nodes. As discussed earlier, the entire cluster contains two layers: the MapReduce layer and the HDFS layer. Each of these layers has its own relevant components. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node consists of a DataNode and TaskTracker. It is also possible for a worker node to be a data-only or a compute-only node. In fact, this flexibility is a key feature of Hadoop.
In this introductory blog post we will stop here while describing the architecture of Hadoop. In a future blog post of this 31-day series we will explore the various components of the Hadoop architecture in detail.
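To make the node roles concrete, here is a small sketch that models the cluster described above as plain Python data. The Node class, the build_small_cluster helper and the node names are hypothetical and exist only for this illustration; the daemon names and their grouping into the two layers come from the description above.

```python
from dataclasses import dataclass, field

# Which daemon belongs to which layer, following the description above.
MAPREDUCE_LAYER = {"JobTracker", "TaskTracker"}
HDFS_LAYER = {"NameNode", "DataNode"}


@dataclass
class Node:
    name: str
    daemons: set = field(default_factory=set)

    def layers(self):
        """Return the layers this node takes part in."""
        result = set()
        if self.daemons & MAPREDUCE_LAYER:
            result.add("MapReduce")
        if self.daemons & HDFS_LAYER:
            result.add("HDFS")
        return result


def build_small_cluster(num_workers):
    """One master running all four daemons plus ordinary worker nodes."""
    master = Node("master", {"JobTracker", "TaskTracker", "NameNode", "DataNode"})
    workers = [
        Node(f"worker{i}", {"DataNode", "TaskTracker"})
        for i in range(1, num_workers + 1)
    ]
    return [master] + workers


if __name__ == "__main__":
    cluster = build_small_cluster(3)
    cluster.append(Node("data-only", {"DataNode"}))        # storage only
    cluster.append(Node("compute-only", {"TaskTracker"}))  # processing only
    for node in cluster:
        print(node.name, sorted(node.daemons), sorted(node.layers()))
```

The last two nodes show the data-only and compute-only cases: each takes part in just one of the two layers.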
Why Use Hadoop?
There are many advantages of using Hadoop. Let me quickly list them over here:
Robust and Scalable – We can add new nodes as needed, as well as modify them.
Affordable and Cost Effective – We do not need any special hardware for running Hadoop. We can just use commodity servers.
Adaptive and Flexible – Hadoop is built to handle both structured and unstructured data.
Highly Available and Fault Tolerant – When a node fails, the Hadoop framework automatically fails over to another node.
Why is Hadoop named Hadoop?
Hadoop was created in 2005 by Doug Cutting and Mike Cafarella. Doug Cutting, who went on to develop it further at Yahoo, named Hadoop after his son’s toy elephant.
Tomorrow
In tomorrow’s blog post we will discuss the next buzz word – MapReduce.
Reference: Pinal Dave (http://blog.sqlauthority.com)