{"id":123,"date":"2014-04-29T07:07:49","date_gmt":"2014-04-29T07:07:49","guid":{"rendered":"http:\/\/www.excelglobalsolution.com\/?p=123"},"modified":"2015-11-19T08:47:17","modified_gmt":"2015-11-19T08:47:17","slug":"hive-introduction","status":"publish","type":"post","link":"https:\/\/excelglobalsolution.com\/blogs\/?p=123","title":{"rendered":"Hive Introduction"},"content":{"rendered":"<p align=\"LEFT\"><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: medium;\"><strong><span style=\"color: #000000;\"><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: large;\"><b>Introduction:<\/b><\/span><\/span><\/span><\/strong><\/span><\/span><\/p>\n<p style=\"text-align: justify;\" align=\"LEFT\"><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: medium;\">\u00a0\u00a0\u00a0\u00a0\u00a0 Hive is a data warehouse system to store structured data on Hadoop file system. It facilitates ad-hoc queries and the analysis of large datasets stored in Hadoop and provides an easy way to query that data by executing Hadoop MapReduce plans.\u00a0<\/span><\/span><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: medium;\">By default, Hive stores metadata in an embedded Apache Derby database, and other client\/server databases like MySQL can optionally be used. 
(Embedded Derby suits a single user; for multiple users, the metadata is stored in MySQL or another database.)\u00a0<\/span><\/span><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: medium;\">Currently, Hive supports four file formats: TEXTFILE, SEQUENCEFILE, ORC and RCFILE. <\/span><\/span><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: medium;\">Hive was initially developed by Facebook.<\/span><\/span><\/p>\n<p style=\"text-align: justify;\" align=\"LEFT\"><strong><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: large;\"><b><span style=\"text-decoration: underline;\">Data Warehouse<\/span>:<\/b><b> <\/b><\/span><\/span><\/strong><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: medium;\">In computing, a data warehouse or enterprise data warehouse is a database used for reporting and data analysis. It is a central repository of data which is created by integrating data from one or more disparate sources.<\/span><\/span><\/p>\n<p align=\"LEFT\"><strong><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: large;\"><span style=\"text-decoration: underline;\"><b>Uses of Hive<\/b><\/span><\/span><\/span><\/strong><\/p>\n<ol>\n<li style=\"text-align: justify;\"><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: medium;\">Apache Hive works with large datasets residing in distributed storage.<\/span><\/span><\/li>\n<li style=\"text-align: justify;\"><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: medium;\">Hive provides tools to enable easy data extract\/transform\/load (ETL).<\/span><\/span><\/li>\n<li style=\"text-align: justify;\"><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: medium;\">It imposes structure on a variety of data formats.<\/span><\/span><\/li>\n<li style=\"text-align: justify;\"><span style=\"font-family: Times New Roman,serif;\"><span 
style=\"font-size: medium;\">Hive gives access to files stored in the Hadoop Distributed File System (HDFS), where large datasets reside, or in other data storage systems such as Apache HBase.<\/span><\/span><\/li>\n<\/ol>\n<p align=\"LEFT\"><strong><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: large;\"><span style=\"text-decoration: underline;\"><b>Limitations of Hive<\/b><\/span><\/span><\/span><\/strong><\/p>\n<ul>\n<li style=\"text-align: justify;\"><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: medium;\">Hive is not designed for online transaction processing (OLTP); it is used only for online analytical processing (OLAP).<\/span><\/span><\/li>\n<li style=\"text-align: justify;\"><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: medium;\">Hive supports overwriting or appending data, but not updates and deletes.<\/span><\/span><\/li>\n<li style=\"text-align: justify;\"><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: medium;\">Subquery support in Hive is limited (historically, subqueries were allowed only in the FROM clause).<\/span><\/span><\/li>\n<\/ul>\n<p align=\"LEFT\"><strong><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: large;\"><b><span style=\"text-decoration: underline;\">Difference between MySQL and Hive<\/span>:<\/b><\/span><\/span><\/strong><\/p>\n<p style=\"font-family: Times New Roman,serif; text-align: justify; font-size: medium;\" align=\"LEFT\">The main difference between relational databases (RDBMS) and Hive is specialization. While MySQL is a general-purpose database suited both for transactional processing (OLTP) and for analytics (OLAP), Hive is built for analytics only. Technically, the main difference is the lack of update\/delete functionality. Data can only be added and selected. At the same time, Hive can process data volumes that MySQL or other conventional RDBMSs cannot handle (on a limited budget). 
MPP (massively parallel processing) databases are closest to Hive in functionality: they have full SQL support and scale up to hundreds of computers. Another significant difference is the query language.<\/p>\n<p style=\"text-align: justify;\" align=\"JUSTIFY\"><span style=\"font-family: 'Times New Roman', serif;\"><span style=\"font-size: medium;\">Hive does not support full SQL, even in SELECT statements, because of its implementation. In my view, the main difference is the lack of joins on any condition other than equality. Hive query language syntax is also a bit different, so you cannot connect report generation software directly to Hive.<\/span><\/span><\/p>\n<p style=\"text-align: justify;\" align=\"LEFT\"><span style=\"font-family: Times New Roman,serif;\"><b style=\"font-size: large;\"><span style=\"text-decoration: underline;\">System Architecture and Components<\/span>:<\/b><\/span><\/p>\n<p style=\"text-align: justify;\">\u00a0<span style=\"color: #000000;\"><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: large;\"><span style=\"text-decoration: underline;\">Metastore<\/span>:\u00a0<\/span><\/span><\/span><span style=\"color: #000000;\"><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: medium;\">The component that stores the system catalog and metadata about tables, columns, partitions, etc. Hive stores the schema of Hive tables in the Hive Metastore. 
The Metastore holds all the information about the tables and partitions in the warehouse. By default, the metastore runs in the same process as the Hive service, and the default metastore database is Derby.<\/span><\/span><\/span><\/p>\n<p style=\"text-align: justify;\" align=\"LEFT\"><span style=\"color: #000000;\"><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: large;\"><span style=\"text-decoration: underline;\">Driver<\/span>:\u00a0<\/span><\/span><\/span><span style=\"color: #000000;\"><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: medium;\">The query compiler and execution engine, which convert SQL queries into a sequence of Hadoop MapReduce jobs.<\/span><\/span><\/span><\/p>\n<ul>\n<li style=\"text-align: justify;\"><span style=\"color: #000000;\">\u00a0\u00a0\u00a0\u00a0 <span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: large;\"><span style=\"text-decoration: underline;\">Query Compiler<\/span>:\u00a0<\/span><\/span><span style=\"font-family: 'Times New Roman',serif;\"><span style=\"font-size: medium;\">The component that compiles HiveQL into a directed acyclic graph of map\/reduce tasks.<\/span><\/span><\/span><\/li>\n<li style=\"text-align: justify;\"><span style=\"color: #000000;\"><span style=\"color: #000000;\"><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: large;\">\u00a0\u00a0\u00a0 <span style=\"text-decoration: underline;\">Optimizer<\/span>:\u00a0<\/span><\/span><\/span><\/span><span style=\"font-family: Times New Roman,serif; font-size: medium;\">Consists of a chain of transformations such that the operator DAG resulting from one transformation is passed as input to the next transformation. It performs tasks such as column pruning, partition pruning, and repartitioning of data.<\/span><\/li>\n<li style=\"text-align: justify;\"><span style=\"color: #000000;\"><span style=\"color: #000000;\"><span style=\"font-family: Times New 
Roman,serif;\"><span style=\"font-size: large;\">\u00a0 \u00a0\u00a0<span style=\"text-decoration: underline;\">Execution Engine<\/span>:\u00a0<\/span><\/span><\/span><\/span><span style=\"font-family: Times New Roman,serif; font-size: medium;\">The component that executes the tasks produced by the compiler in the proper dependency order. The execution engine interacts with the underlying Hadoop instance.<\/span><\/li>\n<\/ul>\n<p lang=\"en-IN\" style=\"text-align: justify;\"><span style=\"color: #000000;\"><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: large;\"><span style=\"font-size: large;\"><span style=\"text-decoration: underline;\">Hive Server<\/span>:<\/span>\u00a0<\/span><\/span><\/span><span style=\"font-family: Times New Roman,serif; font-size: medium;\">The component that provides a Thrift interface and a JDBC\/ODBC server, giving other applications a way to integrate with Hive.<\/span><\/p>\n<p lang=\"en-IN\" style=\"text-align: justify;\"><span style=\"color: #000000;\"><span style=\"font-family: Times New Roman,serif;\"><span style=\"font-size: large;\"><span style=\"text-decoration: underline;\"><span style=\"font-size: large;\">Client Components<\/span><\/span><span style=\"font-size: large;\">:<\/span> <span style=\"font-size: medium;\">Client components such as the command line interface (CLI), the web UI, and the JDBC\/ODBC driver.<\/span><\/span><\/span><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction: \u00a0\u00a0\u00a0\u00a0\u00a0 Hive is a data warehouse system to store structured data on Hadoop file system. 
It facilitates ad-hoc queries and the analysis of large datasets stored in Hadoop and &#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[49,3],"tags":[],"class_list":["post-123","post","type-post","status-publish","format-standard","hentry","category-hadoop","category-technology"],"_links":{"self":[{"href":"https:\/\/excelglobalsolution.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/123","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/excelglobalsolution.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/excelglobalsolution.com\/blogs\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/excelglobalsolution.com\/blogs\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/excelglobalsolution.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=123"}],"version-history":[{"count":11,"href":"https:\/\/excelglobalsolution.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/123\/revisions"}],"predecessor-version":[{"id":201,"href":"https:\/\/excelglobalsolution.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/123\/revisions\/201"}],"wp:attachment":[{"href":"https:\/\/excelglobalsolution.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=123"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/excelglobalsolution.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=123"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/excelglobalsolution.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=123"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}