Wednesday, March 10, 2010

Hadoop And Hive Configuration on Ubuntu Karmic

1. To Install Hadoop
=================

Setting up your Apt Repository

1. Add repository. Create a new file /etc/apt/sources.list.d/cloudera.list with the following contents, taking care to replace DISTRO with the name of your distribution (find it by running lsb_release -c); on Ubuntu Karmic this is karmic, as used in the examples below.

For the stable repository, use…

deb http://archive.cloudera.com/debian karmic-stable contrib
deb-src http://archive.cloudera.com/debian karmic-stable contrib

For the testing repository use…

deb http://archive.cloudera.com/debian karmic-testing contrib
deb-src http://archive.cloudera.com/debian karmic-testing contrib
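As a shortcut (a sketch that assumes the stable repository and that lsb_release is available), you can generate the file with the distribution codename filled in automatically:

echo "deb http://archive.cloudera.com/debian $(lsb_release -cs)-stable contrib" | sudo tee /etc/apt/sources.list.d/cloudera.list
echo "deb-src http://archive.cloudera.com/debian $(lsb_release -cs)-stable contrib" | sudo tee -a /etc/apt/sources.list.d/cloudera.list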

2. Add repository key (optional). Add the Cloudera Public GPG Key to your APT keyring by executing the following command:

curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -

This allows you to verify that you are downloading genuine packages.
3. Update APT package index. Simply run:

sudo apt-get update

4. Find and install packages. You may now find and install packages from the Cloudera repository using your favorite APT package manager (e.g., apt-get, aptitude, or dselect). For example:

apt-cache search hadoop
sudo apt-get install hadoop
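As a quick sanity check (assuming the package put the hadoop binary on your PATH), you can verify the installation with:

hadoop version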

2. To Install Hive
===============

Installing Hive is simple and only requires having Java 1.6 and Ant installed on your machine.
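A quick way to confirm both are present:

$ java -version
$ ant -version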

Hive is available via SVN at http://svn.apache.org/repos/asf/hadoop/hive/trunk. You can download it by running the following command.

$ svn co http://svn.apache.org/repos/asf/hadoop/hive/trunk hive

To build Hive, execute the following command in the base directory:

$ ant package

It will create the subdirectory build/dist with the following contents:

* README.txt: readme file.
* bin/: directory containing all the shell scripts
* lib/: directory containing all required jar files
* conf/: directory with configuration files
* examples/: directory with sample input and query files

The build/dist subdirectory should contain all the files necessary to run Hive. You can run it from there or copy it to a different location if you prefer.

In order to run Hive, you must have hadoop on your PATH or have set the environment variable HADOOP_HOME to the Hadoop installation directory.
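For example (the path below is only an assumption; point it at wherever Hadoop is actually installed on your machine):

$ export HADOOP_HOME=/usr/lib/hadoop
$ export PATH=$HADOOP_HOME/bin:$PATH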

Moreover, we strongly advise users to create the HDFS directories /tmp and /user/hive/warehouse (a.k.a. hive.metastore.warehouse.dir) and make them group-writable with chmod g+w before creating tables in Hive.
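One way to do this, assuming HADOOP_HOME is set as described above:

$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse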

To use the Hive command-line interface (CLI), go to the Hive home directory (the one with the contents of build/dist) and execute the following command:

$ bin/hive

Metadata is stored in an embedded Derby database whose disk storage location is determined by the Hive configuration variable named javax.jdo.option.ConnectionURL. By default (see conf/hive-default.xml), this location is ./metastore_db.
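For reference, the corresponding entry in conf/hive-default.xml looks roughly like this (the exact default value may differ slightly between Hive versions):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>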

Using Derby in embedded mode allows at most one user at a time. To configure Derby to run in server mode, look at HiveDerbyServerMode.

3. Setting up Hadoop/Hive to use MySQL as metastore
================================================

Many believe MySQL is a better choice for this purpose, so here I'm going to show how we can configure the cluster we created previously to use a MySQL server as the Hive metastore.

First we need to install MySQL. In this scenario, I'm going to install MySQL on our Master node, which is named centos1.



When logged in as the root user:

yum install mysql-server

Now make sure the MySQL server is started:

/etc/init.d/mysqld start
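Optionally (an extra step not strictly required here), you can have MySQL start automatically at boot on CentOS:

chkconfig mysqld on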

Next, I'm going to create a new MySQL user for hadoop/hive:

mysql
mysql> CREATE USER 'hadoop'@'centos1' IDENTIFIED BY 'hadoop';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'centos1' WITH GRANT OPTION;
mysql> exit

To make sure this new user can connect to the MySQL server, switch to the hadoop user and try to connect, as sketched below.
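A minimal check (assuming a local hadoop system account exists; the host name must match the one used in the GRANT above):

su - hadoop
mysql -h centos1 -u hadoop -p
mysql> exit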

We need to change the Hive configuration so it uses MySQL:

nano /hadoop/hive/conf/hive-site.xml

The new configuration values are:


<property>
  <name>hive.metastore.local</name>
  <value>true</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://centos1:3306/hive?createDatabaseIfNotExist=true</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hadoop</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hadoop</value>
</property>

Some of the above parameters do not match what we did to set up the Derby server in the previous post, so I decided to delete the jpox.properties file:

rm /hadoop/hive/conf/jpox.properties

Hive needs the MySQL JDBC driver, so we need to download it and copy it into the hive/lib folder:

cd /hadoop
wget http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.11.tar.gz/from/http://mysql.he.net/
tar -xvzf mysql-connector-java-5.1.11.tar.gz
cp mysql-connector-java-5.1.11/*.jar /hadoop/hive/lib

To make sure all settings are done correctly, we can do this:

cd /hadoop/hive
bin/hive
hive> show tables;
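As a further check (the test table name below is just an example; the hive database name comes from the ConnectionURL configured above), you can create a table in Hive and confirm that its metadata shows up in the MySQL metastore:

hive> CREATE TABLE test_metastore (id INT);
hive> exit;

mysql -h centos1 -u hadoop -p
mysql> USE hive;
mysql> SELECT TBL_NAME FROM TBLS;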