Origin of Hadoop:
Nowadays, the number of users of web applications and social networking sites is growing rapidly, and with it the volume of data being generated (terabytes and petabytes). To handle such large amounts of data, Doug Cutting, Cloudera’s Chief Architect, helped create Apache Hadoop out of necessity as data from the web exploded and grew far beyond the ability of traditional systems to handle it.
Hadoop is the solution found for processing and storing huge amounts of data (structured and unstructured: log files, pictures, audio files, communications records, email). It runs on inexpensive, industry-standard servers and handles both the processing and the storage of data using distributed parallel processing.
Hadoop is a framework for the storage and processing of large-scale data-sets on clusters of commodity hardware. It is written in Java and was originally developed by Doug Cutting.
Commodity hardware means using large numbers of readily available computing components (e.g., hard disks or memory modules) for parallel computing, to get the greatest amount of useful computation at low cost.
The Apache Hadoop framework is composed of the following modules:
- Hadoop Common – contains libraries and utilities needed by other Hadoop modules
- Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
- Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and scheduling users’ applications.
- Hadoop MapReduce – a programming model for large scale data processing.
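The MapReduce programming model can be illustrated with a small self-contained sketch of the classic word-count pattern: a map phase emits (key, value) pairs, and a reduce phase groups them by key and aggregates. This is plain Python standing in for Hadoop's Java API; the function names here are illustrative, not part of Hadoop itself:

```python
from collections import defaultdict

def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle + reduce step: group the pairs by key, then sum each group."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(map_phase(["big data", "big hadoop"]))
# counts == {"big": 2, "data": 1, "hadoop": 1}
```

In a real Hadoop job, the framework distributes the map and reduce steps across the cluster and performs the shuffle (grouping by key) between them; the logic per record, however, is exactly this simple.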
All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines, or even whole racks of machines) are common, and should therefore be handled automatically in software by the framework.
Apart from the core modules HDFS, YARN and MapReduce, the Apache Hadoop platform is now commonly considered to include a number of related projects as well:
- Apache Pig
- Apache Hive
- Apache HBase
- Apache Spark, and others.
Technical Knowledge Requirements: Working with Hadoop generally requires knowledge of Java, as the Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts. Though MapReduce jobs are commonly written in Java, any programming language can be used with “Hadoop Streaming” to implement the “map” and “reduce” parts of the user’s program.
Install / Deploy Hadoop: Hadoop can be installed in 3 modes:
- Standalone mode: To deploy Hadoop in standalone mode, we only need to set JAVA_HOME. In this mode there is no need to start the daemons or to format the NameNode, as data is saved on the local disk.
- Pseudo-distributed mode: In this mode all the daemons (NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker) run on a single machine.
- Distributed mode: In this mode the NameNode, JobTracker and (optionally) SecondaryNameNode daemons run on the master node, while the DataNode and TaskTracker daemons run on the slave nodes.
Note: Daemon → In multitasking computer operating systems, a daemon is a computer program that runs as a background process, rather than being under the direct control of an interactive user.
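For pseudo-distributed mode, the two key configuration files can be sketched as follows. The property names fs.defaultFS and dfs.replication are standard Hadoop configuration keys, but the hostname, port, and exact names should be checked against the documentation for your Hadoop version:

```xml
<!-- core-site.xml: point the default filesystem at a local HDFS daemon -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single machine can hold only one replica of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

After editing these files, the NameNode is typically formatted once (`hdfs namenode -format` in newer releases, `hadoop namenode -format` in older ones) before starting the daemons.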
Architecture of Hadoop:
Top-Level Interfaces: ETL tools, BI reporting and RDBMS integration fall in this layer.
Top-Level Abstractions: Pig, Hive and Sqoop fall in this layer.
Data Processing: MapReduce and HBase fall in this layer.
Storage: The HDFS cluster falls in this layer.