Hive Introduction


      Hive is a data warehouse system to store structured data on Hadoop file system. It facilitates ad-hoc queries and the analysis of large datasets stored in Hadoop and provides an easy query to the data by executing Hadoop MapReduce plans. By default, Hive stores metadata in an embedded Apache Derby database, and other client/server databases like MySQL can optionally be used. (single user metadata stored into derby database and multiple users metadata stored into MySQL or other databases). Currently, there are four file formats supported in Hive, which are TEXTFILE, SEQUENCEFILE, ORC and RCFILE, initially developed by Facebook.

Data Warehouse: In computing, a data warehouse or enterprise data warehouse is a database used for reporting and data analysis. It is a central repository of data which is created by integrating data from one or more disparate sources.

Uses of Hive?

  1. The Apache Hive distributed storage.
  2. Hive provides the tools to enable easy data extract/transform/load (ETL)
  3. It provides the structure on a variety of data formats.
  4. By using Hive we can access to files stored in Hadoop Distributed File System (HDFS is used to querying and managing large datasets residing in) or in other data storage systems such as Apache HBase.

Limitations of Hive?

  • Hive is not designed for the Online transaction processing (OLTP ), it is only used for the Online analytical processing.
  • Hive supports overwriting or appending data, but not updates and deletes.
  • In a Hive Sub queries are not supported.

Difference between MYSQL and HIVE:

The main difference between RDBMs databases and Hive is specialization. While MySQL is general purpose database suited both for transactional processing (OLTP) and for analytics (OLAP), Hive is built for the analytics only. Technically the main difference is lack of update/delete functionality. Data can only by be added and selected. At the same time Hive is capable of processing data volumes which cannot be processed by MySQL or other conventional RDBMS (in shy budget). MPP (massive parallel processing) databases are closest to the Hive by their functionality – while they have full SQL support they are scalable up to hundreds of computers. Another serious different – is query language.

Hive does not support full SQL even in select because of its implementation. In my view, the main difference is lack of join for any condition other then equal. Hive query language sintax is also a bit different so you cannot connect report generation software right to Hive.

System Architecture and Components:

 MetastoreThe component that store the system catalog and meta data about tables, columns, partitions etc. Hive stores the schema of the Hive tables in a Hive Metastore. Metastore is used to hold all the information about the tables and partitions that are in the warehouse.By default, the metastore is run in the same process as the Hive service and the default metastore is DerBy Database.

DriverQuery compiler and execution engine converts SQL queries to a sequence of Hadoop MapReduce jobs.

  •      Query CompilerThe component that compiles HiveQL into a directed acyclic graph of map/reduce tasks.
  •     OptimizerConsists of a chain of transformations such that the operator DAG resulting from one transformation is passed as input to the next transformation Performs tasks like Column Pruning, Partition Pruning, Repartitioning of Data.
  •     Execution EngineThe component that executes the tasks produced by the compiler in proper dependency order. The execution engine interacts with the underlying Hadoop instance.

Hive Server: The component that provides a trift interface and a JDBC/ODBC server and provides a way of integrating Hive with other applications.

Client Components: Client component like Command Line Interface(CLI), the web UI and JDBC/ODBC driver.

Leave a Reply