Hadoop Terminology: Pig, Hive, HCatalog, HBase and Sqoop

Hadoop is the big boss when it comes to dealing with big data that runs into terabytes.
It typically serves two purposes:
  1. Storing humongous amounts of data: This is achieved by partitioning the data among several nodes.
    The block size in the Hadoop Distributed File System is also much larger (64 or 128 MB) than in typical file systems (often just 4 KB).

  2. Bringing computation to data: Traditionally, data is brought to the clients for computation.
    But data stored in Hadoop is so large that it is more efficient to do the opposite.
    This is done by writing MapReduce jobs which run close to the data stored in Hadoop.
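The MapReduce idea itself can be sketched in a few lines of plain Python (a hypothetical illustration of the programming model, not Hadoop API code): the map step emits (key, value) pairs, and the reduce step aggregates all the values that share a key, just as a Hadoop job does across many nodes.

```python
from collections import defaultdict

def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big cluster", "data node"]
print(reduce_phase(map_phase(lines)))
# {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```

In a real cluster the map tasks run in parallel on the nodes holding the data blocks, and the framework shuffles the pairs so each reducer sees all values for its keys.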

HDFS (Hadoop Distributed File System): HDFS is responsible for:
  1. Distributing the data across the nodes,
  2. Managing replication for redundancy and
  3. Administrative tasks like adding, removing and recovery of data nodes.

HCatalog: This is a tool that holds the location and metadata of data stored in HDFS.
This way it completely abstracts the HDFS details from other Hadoop clients like Pig and Hive.
It provides a table abstraction so that users need not be concerned with where or how their data is stored.

Hive provides SQL-like querying capabilities to view data stored in HDFS.
It has its own query language called HiveQL.
Beeswax is a tool used to interact with Hive; it takes queries from the user and submits them to Hive.
 SELECT * FROM person_table WHERE last_name = 'smith';
 DESCRIBE person_table;
 SELECT COUNT(*) FROM person_table;
Thus, Hive allows SQL programmers to become productive immediately.

Pig is a platform for running MapReduce jobs on Hadoop using its own scripting language, Pig Latin.
It also supports user-defined functions written in several languages, including Java.
 a = LOAD 'person_table' USING org.apache.hcatalog.pig.HCatLoader();
 b = FILTER a BY last_name == 'smith';
 c = GROUP b ALL;
 d = FOREACH c GENERATE AVG(b.age);
 DUMP d;

HBase is an open-source, non-relational, distributed database running on top of Hadoop.
Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through Java APIs as well as through REST, Avro or Thrift gateway APIs.

Sqoop is a command-line interface application for transferring data between relational databases and Hadoop.
It supports:
  1. Incremental loads of a single table,
  2. Free-form SQL queries, and
  3. Saved jobs which can be run multiple times to import updates made to a database since the last import.
Imports from Sqoop can be used to populate tables in Hive or HBase,
and exports can be used to move data from Hadoop into a relational database.
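The modes above might look like this on the command line (an illustrative sketch; the JDBC URL, database, table, and column names are placeholders, and exact options can vary between Sqoop versions):

```shell
# Incremental load of a single table (append rows where id > 100):
sqoop import --connect jdbc:mysql://dbhost/mydb --username dbuser \
    --table person_table --incremental append --check-column id --last-value 100

# Free-form SQL query import ($CONDITIONS is required by Sqoop for splitting work):
sqoop import --connect jdbc:mysql://dbhost/mydb --username dbuser \
    --query 'SELECT * FROM person_table WHERE $CONDITIONS' \
    --split-by id --target-dir /data/person

# Saved job that can be re-run to pick up updates since the last import:
sqoop job --create person_import -- import --connect jdbc:mysql://dbhost/mydb \
    --username dbuser --table person_table --incremental append \
    --check-column id --last-value 0
sqoop job --exec person_import

# Export data from Hadoop back into a relational table:
sqoop export --connect jdbc:mysql://dbhost/mydb --username dbuser \
    --table person_table --export-dir /data/person
```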

Difference between Pig and Hive

Pig is a scripting language for Hadoop developed at Yahoo! in 2006.
Hive is an SQL-like query language for Hadoop developed in parallel at Facebook.

Pig allows querying too, but its syntax is not SQL-like, so there is some learning curve.
Once you are comfortable with Pig, however, it provides more power than Hive.
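As an illustration of the difference, the average-age computation from the four-step Pig script above collapses into a single declarative HiveQL query (assuming the same person_table):

```sql
SELECT AVG(age)
FROM person_table
WHERE last_name = 'smith';
```

Hive decides the execution plan for you, while Pig lets you spell out (and inspect) each intermediate relation.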

Pig is procedural, so one can write small transformations step by step.
This also makes Pig easier to debug, because the result of each small step can be printed and inspected.

Hive is much more suitable for business analysts familiar with SQL, as they can quickly write queries and get results without fine-tuning their extraction logic. Pig is more suitable for software engineers writing complicated scripts that do not translate well into SQL queries.

Site Owner: Sachin Goyal