Hadoop – Pig Tool Introduction


What is the Pig tool? Why is it used in the Hadoop ecosystem?

First, let’s start with big data. Big data can be loosely defined as a very large volume of data, on the order of terabytes (10^12 bytes) and petabytes (10^15 bytes), collected from many sources and in many forms, e.g., documents, PDFs, CSV files, logs, sensor data, and social media. Handling data at this scale is a real challenge. To meet it, we need technology that processes data quickly and accurately, has a simple architecture, is reliable (fault tolerant, i.e., the data stays available even if hardware fails), and is cost-effective.

Hadoop was introduced to meet these big data challenges.

           “Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware”

Hadoop has two main components:

  1. HDFS (Hadoop Distributed File System) — For Data Storage
  2. MapReduce — For Data Processing

To process this much data, we have to write Java programs that run as MapReduce jobs. Writing such programs is a big challenge, especially when the goal is data analysis.
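To give a feel for what a MapReduce job asks the programmer to spell out, here is a minimal sketch of the map, shuffle, and reduce steps for a word count. It is plain Python running on one machine, not Hadoop code, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for this illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["pig eats data", "Pig eats anything"]))
```

In a real Hadoop job the mapper and reducer would be Java classes, and the framework would handle the shuffle across the cluster. The sketch only shows how much structure even a simple count requires.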

Writing MapReduce jobs is where we spend the majority of our time when interacting with a Hadoop cluster. A number of frameworks simplify writing MapReduce jobs, including Hive, Pig, and Scalding.

Hive and Pig provide query languages (HiveQL and Pig Latin, respectively) for interacting with your Hadoop data. Scalding is a Scala framework developed by Twitter that does not provide a query language but does provide an easy-to-use API.

               “Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin”

Pig Latin can be extended with UDFs (User Defined Functions), which the user can write in Java, Python, JavaScript, Ruby, or Groovy and then call directly from the language.
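As an illustration, a Python UDF might look like the sketch below. The function extract_domain and its schema are invented for this example. The import assumes Pig's pig_util helper module, which is what streaming Python UDFs use. The fallback decorator is there only so the file can be run and tested without Pig.

```python
try:
    # Assumption: available when Pig runs this file as a streaming Python UDF.
    from pig_util import outputSchema
except ImportError:
    # Fallback so the file can be imported and tested outside Pig.
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema  # record the declared schema on the function
            return func
        return decorator


@outputSchema("domain:chararray")
def extract_domain(email):
    """Return the lower-cased domain of an e-mail address, or None if malformed."""
    if email is None or "@" not in email:
        return None
    return email.rsplit("@", 1)[1].lower()


if __name__ == "__main__":
    print(extract_domain("Alice@Example.COM"))
```

Saved as, say, udfs.py (a hypothetical file name), such a file would be registered in a Pig Latin script with a REGISTER statement and then called like a built-in function. Check the Pig documentation for your version for the exact registration syntax.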

Pig was originally developed at Yahoo Research around 2006 for researchers to have an ad-hoc way of creating and executing map-reduce jobs on very large data sets. In 2007, it was moved into the Apache Software Foundation.

Why Pig?

It supports a variety of data types and lets you write custom operations as user-defined functions (UDFs). Thanks to its simple interface, complex operations such as joins and filters are easier to express. Pig is popular for performing query operations in Hadoop.

Like actual pigs, which eat almost anything, the Pig programming language is designed to handle any kind of data (structured, semi-structured, and unstructured), hence the name! Hive, by contrast, is geared mainly toward structured data.

Pig Components

Pig is made up of two components: the first is the language itself, which is called Pig Latin, and the second is a runtime environment where Pig Latin programs are executed.

We can execute a Pig Latin script on the local system as well as on a Hadoop cluster, so Pig has two execution modes. One is local mode (type “pig -x local” in the terminal), and the other is MapReduce (Hadoop cluster) mode (type “pig” or “pig -x mapreduce”). Hadoop and Pig must be installed and configured on your machine.
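If you launch Pig scripts from another program, the two modes above come down to one command-line flag. The helper below is a small sketch of that. build_pig_command is a made-up name, wordcount.pig is a hypothetical script, and only the two modes described in this article are accepted.

```python
def build_pig_command(script_path, mode="mapreduce"):
    """Build the argument list for running a Pig Latin script in the given mode."""
    if mode not in ("local", "mapreduce"):
        raise ValueError("mode must be 'local' or 'mapreduce', got %r" % mode)
    # "-x" selects the execution mode; the script path follows it.
    return ["pig", "-x", mode, script_path]


if __name__ == "__main__":
    # Print the command rather than running it, since Pig may not be installed here.
    print(" ".join(build_pig_command("wordcount.pig", mode="local")))
```

On a machine where Pig is installed, the returned list can be passed to subprocess.run to start the script.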
