Hadoop Terminology: Pig, Hive, HCatalog, HBase and Sqoop
Hadoop is the big boss when it comes to dealing with big data that runs into terabytes.
It typically serves two purposes: storing data and processing it.
HDFS (Hadoop Distributed File System): HDFS is responsible for storing the data across the machines of the cluster.
HCatalog: This is a tool that holds the location and metadata of the data stored in HDFS.
This way it completely abstracts the HDFS details from other Hadoop clients like Pig and Hive.
It provides a table abstraction so that users need not be concerned with where or how their data is stored.
Hive provides SQL-like querying capabilities to view data stored in HDFS.
It has its own query language called HiveQL.
Beeswax is a tool used to interact with Hive. It takes queries from the user and passes them to Hive. For example:
    SELECT * FROM person_table WHERE last_name = "smith";
    DESCRIBE person_table;
    SELECT count(*) FROM person_table;
Thus, for SQL programmers, Hive provides a way to become productive immediately.
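To get a feel for what these three queries return without a Hadoop cluster, here is a minimal sketch that runs their closest equivalents against an in-memory SQLite database from Python. SQLite only stands in for Hive here. The column names other than last_name and all of the sample rows are assumptions made up for the illustration, and since SQLite has no DESCRIBE statement, PRAGMA table_info is used in its place.

```python
import sqlite3

# In-memory SQLite database standing in for Hive; the rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person_table (first_name TEXT, last_name TEXT, age INTEGER)"
)
conn.executemany(
    "INSERT INTO person_table VALUES (?, ?, ?)",
    [("Ann", "smith", 30), ("Bob", "smith", 40), ("Cara", "jones", 25)],
)

# Same filter as the first Hive query (single quotes are the portable
# way to write a string literal in SQL).
smiths = conn.execute(
    "SELECT * FROM person_table WHERE last_name = 'smith'"
).fetchall()

# SQLite has no DESCRIBE; PRAGMA table_info lists the columns instead.
# Each returned row is (cid, name, type, notnull, dflt_value, pk).
columns = [row[1] for row in conn.execute("PRAGMA table_info(person_table)")]

# Same row count as the third Hive query.
total = conn.execute("SELECT count(*) FROM person_table").fetchone()[0]

print(smiths)
print(columns)
print(total)
```

In Hive the same statements would be sent through a client such as Beeswax and executed against data in HDFS rather than a local database.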
Pig is a language used to run MapReduce jobs on Hadoop.
It supports user-defined functions written in several languages, including Java.
    a = LOAD 'person_table' USING org.apache.hcatalog.pig.HCatLoader();
    b = FILTER a BY last_name == 'smith';
    c = group b all;
    d = foreach c generate AVG(b.age);
    dump d;
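The Pig script above reads as a chain of small named steps. As a rough illustration of what each step computes, here is a plain-Python sketch of the same pipeline. It is not how Pig executes the script (Pig compiles it into MapReduce jobs), and the sample rows are assumptions made up for the example.

```python
# Made-up sample rows standing in for person_table.
person_table = [
    {"first_name": "Ann", "last_name": "smith", "age": 30},
    {"first_name": "Bob", "last_name": "smith", "age": 40},
    {"first_name": "Cara", "last_name": "jones", "age": 25},
]

# a = LOAD 'person_table' ...
a = person_table

# b = FILTER a BY last_name == 'smith';
b = [row for row in a if row["last_name"] == "smith"]

# c = group b all;  (a single group that holds every remaining row)
c = {"all": b}

# d = foreach c generate AVG(b.age);  (one average per group)
d = [
    sum(row["age"] for row in rows) / len(rows)
    for rows in c.values()
    if rows  # skip an empty group instead of dividing by zero
]

# dump d;
print(d)  # [35.0]
```

Each intermediate name (a, b, c, d) can be printed on its own, which mirrors how the result of an individual step in a Pig script can be dumped.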
HBase is an open source, non-relational, distributed database running on top of Hadoop.
Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through Java APIs as well as through REST, Avro or Thrift gateway APIs.
Sqoop is a command-line interface application for transferring data between relational databases and Hadoop.
Imports bring data from a relational database into Hadoop, and exports can be used to put data from Hadoop into a relational database.
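As a rough sketch of what the two directions look like on the command line, the Python snippet below only assembles and prints a Sqoop import and a Sqoop export command; it does not run them. The JDBC URL, user name, and HDFS directory are hypothetical placeholders, and person_table is reused from the earlier examples. The options shown (--connect, --username, --table, --target-dir, --export-dir) are standard Sqoop arguments.

```python
import shlex

# Hypothetical connection details; replace them with real ones.
JDBC_URL = "jdbc:mysql://db.example.com/people"
DB_USER = "etl_user"


def sqoop_import(table, target_dir):
    """Build a command that copies a relational table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", JDBC_URL,
        "--username", DB_USER,
        "--table", table,
        "--target-dir", target_dir,
    ]


def sqoop_export(table, export_dir):
    """Build a command that copies files from HDFS into a relational table."""
    return [
        "sqoop", "export",
        "--connect", JDBC_URL,
        "--username", DB_USER,
        "--table", table,
        "--export-dir", export_dir,
    ]


import_cmd = sqoop_import("person_table", "/data/person_table")
export_cmd = sqoop_export("person_table", "/data/person_table")

# Print the commands as they would be typed in a shell.
print(" ".join(shlex.quote(part) for part in import_cmd))
print(" ".join(shlex.quote(part) for part in export_cmd))
```

On a machine with Sqoop installed, the printed lines could be pasted into a shell, or the lists passed to subprocess.run.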
Difference between Pig and Hive
Pig is a scripting language for Hadoop developed at Yahoo! in 2006.
Hive is a SQL-like querying language for Hadoop developed in parallel at Facebook.
Pig allows querying too, but its syntax is not SQL-like, so there is some learning curve.
Once you are comfortable with Pig, though, it provides more power than Hive.
Pig is procedural, so one can write small transformations step by step.
Because of this, Pig is also easier to debug: the results of these small steps can be printed to track down issues.
Hive is much more suitable for business analysts familiar with SQL, as they can quickly write SQL and get away without very fine optimization of their extraction and querying. Pig is more suitable for software engineers writing very complicated scripts that do not lend themselves to being written as SQL queries.