Tuesday, August 20, 2013

Hadoop

Hello Visitor,
First of all, thank you for visiting my blog.

I decided to learn Big Data in August 2013, and I'm using this blog to capture my notes and search results for my own reference.

Special thanks to Swamy Gurram for sharing his Big Data knowledge and, as always, for the motivation.

First link to start with Hadoop: http://hadoop.apache.org

Single Node Setup: http://hadoop.apache.org/docs/stable/single_node_setup.html

HDFS layer of Hadoop Video: http://www.youtube.com/watch?v=ziqx2hJY8Hg

Daemon: a computer program that runs as a background process, similar to a service in Windows. More details: http://en.wikipedia.org/wiki/Daemon_(computing)

A simple definition of HDFS: HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.

Install Hadoop on Ubuntu Linux (Single-Node Cluster):

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

http://www.youtube.com/watch?v=WN2tJk_oL6E

https://www.dropbox.com/s/05aurcp42asuktp/Chiu%20Hadoop%20Pig%20Install%20Instructions.docx

I was able to complete the installation within 4 hours (2-3 hours for Linux and 1 hour for Hadoop and Pig).

Installing Eclipse Juno 4.2 in Ubuntu 12.04 or in Ubuntu 12.10
http://akovid.blogspot.co.uk/2012/08/installing-eclipse-juno-42-in-ubuntu.html

Hadoop MapReduce Fundamentals (Highly technical)

1 of 5: http://www.youtube.com/watch?v=7FcMhTTG1Cs (30 mins)
2 of 5: http://www.youtube.com/watch?v=pDGLe4CsrhY (1 hr)
3 of 5: http://www.youtube.com/watch?v=9h_WLsmRfFM (1 hr) - Windows Azure based
4 of 5: http://www.youtube.com/watch?v=iiIDZTpdcuU (1 hr)
5 of 5: http://www.youtube.com/watch?v=1aen3JsxkuM (20 mins)

Data-Driven Documents (D3): http://d3js.org/ - great for graphical representation of data.

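Q: What is the difference between the hadoop fs and hadoop dfs commands?
A: hadoop fs is the generic file system shell: it works with any file system Hadoop supports, including the local OS file system and HDFS. hadoop dfs is meant for HDFS only; newer Hadoop releases deprecate it in favor of hdfs dfs.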
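
One way to picture the difference is that the generic shell chooses a file system from the URI scheme of the path. The sketch below only mimics that choice with a hypothetical helper, scheme_of; it does not call Hadoop. The address hdfs://localhost:54310 is an assumption for a single-node setup and may differ on your machine.

```shell
# Print the URI scheme that selects the file system for a path,
# or "default" when the path carries no scheme.
# scheme_of is a hypothetical helper, not a Hadoop command.
scheme_of() {
    case $1 in
        *://*) printf '%s\n' "${1%%://*}" ;;   # keep the text before "://"
        *)     echo "default" ;;               # resolved via the configured default
    esac
}

for p in file:///tmp hdfs://localhost:54310/user/hduser /user/hduser; do
    echo "$p -> $(scheme_of "$p")"
done
```

With Hadoop installed, hadoop fs -ls file:///tmp should list the local directory, while hadoop fs -ls /user/hduser goes to the default file system configured in core-site.xml (fs.default.name in Hadoop 1.x).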

Note: The simplest way to speed up a MapReduce job is usually to add more nodes (scaling out is the whole idea behind Hadoop).

Installations

VMWare ESXi Virtualization
http://www.youtube.com/watch?v=ba3qqJI6ML4 (ESXi, vSphere Client and virtual OS with explanation) - 40 mins.

http://www.youtube.com/watch?v=ZBl1Tf2A4lA (Just ESXi and vSphere Client) - 10 mins.

Java JDK 7 Installation:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer
java -version

Tomcat 7 Installation:

sudo apt-get update
sudo apt-get install tomcat7
sudo service tomcat7 stop
Set JAVA_HOME by editing the tomcat7 default start-up configuration file:
sudo nano /etc/default/tomcat7
JAVA_HOME=/usr/lib/jvm/jdk1.7.0_09 (or the path of the Java JDK available on your Linux machine)
For example, JAVA_HOME=/usr/lib/jvm/java-7-oracle (in the case of Oracle Java)

sudo service tomcat7 start
/usr/share/tomcat7/bin/version.sh
wget localhost:8080
Ref: http://hendrelouw73.wordpress.com/2012/11/14/how-to-install-apache-tomcat-7-0-30-on-ubuntu-12-10-linux/
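
If you are not sure which path to put in JAVA_HOME, one rough approach is to resolve the java binary on your PATH and strip the trailing bin directory. This is only a sketch: java_home_from_bin is a hypothetical helper, and it assumes a JDK 7 style layout where the binary sits under bin/ or jre/bin/.

```shell
# Derive a JDK root directory from the resolved path of a java binary.
# java_home_from_bin is a hypothetical helper, not part of Java or Tomcat.
java_home_from_bin() {
    java_bin=$1
    home=${java_bin%/bin/java}   # drop the trailing /bin/java
    home=${home%/jre}            # drop a trailing /jre if present
    printf '%s\n' "$home"
}

if command -v java >/dev/null 2>&1; then
    # readlink -f follows the /etc/alternatives symlinks used on Ubuntu
    resolved=$(readlink -f "$(command -v java)")
    echo "JAVA_HOME=$(java_home_from_bin "$resolved")"
else
    echo "java is not on PATH" >&2
fi
```

The printed value is a candidate for the JAVA_HOME line in /etc/default/tomcat7; check that the directory exists before using it.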


Hadoop Multi-Node-Cluster Installation (Thanks to Michael G. Noll - You made it so easy)

Installation of hadoop-on-ubuntu-linux-multi-node-cluster 

Eclipse Juno Installation:
http://akovid.blogspot.co.uk/2012/08/installing-eclipse-juno-42-in-ubuntu.html

Hive / HCatalog Installation:

$ java -version
$ hadoop version
$ wget http://www.gtlib.gatech.edu/pub/apache/hive/stable/hive-0.11.0-bin.tar.gz
$ jps
$ hadoop fs -mkdir /tmp
$ hadoop fs -mkdir /user/hive/warehouse
$ hadoop fs -chmod g+w /tmp
$ hadoop fs -chmod g+w /user/hive/warehouse
$ tar -xzvf hive-0.11.0-bin.tar.gz
$ mv hive-0.11.0-bin hive
$ cd hive
$ pwd
$ export HIVE_HOME=/home/hduser/hive
$ export PATH=$HIVE_HOME/bin:$PATH
$ hive
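
The two export lines above only last for the current terminal session. A possible way to make them permanent is to append them to a shell profile exactly once. The sketch below uses a hypothetical helper, append_once, and writes to a scratch file so it is safe to try; pointing it at ~/.bashrc is an assumption about where your shell reads its settings.

```shell
# Append a line to a file only if that exact line is not already present.
# append_once is a hypothetical helper, not part of Hive or Hadoop.
append_once() {
    line=$1 file=$2
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

target=$(mktemp)   # scratch file for the demo; use "$HOME/.bashrc" for real
append_once 'export HIVE_HOME=/home/hduser/hive' "$target"
append_once 'export PATH=$HIVE_HOME/bin:$PATH' "$target"
append_once 'export HIVE_HOME=/home/hduser/hive' "$target"   # second call adds nothing
cat "$target"
```

Running the same lines again leaves the file unchanged, so the script can be re-run without piling up duplicate exports.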


Problems during Hadoop learning
1. Hadoop: cannot use the jps command
Press Alt + F2 and run: gksudo gedit .bashrc
# Add Java bin/ directory to PATH
export PATH=$PATH:$JAVA_HOME/bin

Then restart Ubuntu so the change takes effect.
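
To try the same fix in the current terminal first, something like the following sketch may help. path_add is a hypothetical helper, and the fallback /usr/lib/jvm/java-7-oracle is an assumption; replace it with the JDK path on your machine.

```shell
# Add a directory to PATH only if it is not already listed.
# path_add is a hypothetical helper.
path_add() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

# Assumed fallback location; adjust to your own JDK directory.
JAVA_HOME=${JAVA_HOME:-/usr/lib/jvm/java-7-oracle}
path_add "$JAVA_HOME/bin"
export JAVA_HOME PATH

# jps ships in the JDK bin directory, so it should resolve now.
command -v jps >/dev/null 2>&1 || echo "jps still not found; check JAVA_HOME" >&2
```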

2. WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock (FSNamesystem.java:1639)
Solution: The HDFS format did not complete. It may have aborted with a message such as "/app/hadoop/tmp/dfs/name cannot delete". Make sure you answer with an uppercase Y (not a lowercase y) when the HDFS format asks for confirmation during the installation.
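
This error generally means the NameNode could not find a live DataNode, so a quick first check is whether all the daemons show up in the jps output. The sketch below assumes a Hadoop 1.x single-node setup with five daemons; missing_daemons is a hypothetical helper that reads jps-style "PID Name" lines.

```shell
# Report which expected Hadoop 1.x daemons are absent from jps output.
# missing_daemons is a hypothetical helper; it reads "PID Name" lines on stdin.
missing_daemons() {
    names=$(awk '{print $2}')   # keep only the process names
    for d in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # -x matches whole lines, so NameNode does not match SecondaryNameNode
        printf '%s\n' "$names" | grep -qx "$d" || printf '%s\n' "$d"
    done
}

if command -v jps >/dev/null 2>&1; then
    jps | missing_daemons
else
    echo "jps is not on PATH" >&2
fi
```

If DataNode appears in the list, the DataNode log under the Hadoop logs directory is a reasonable next place to look.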

3. Issues during Cloudera Hive / Impala execution: (Error: cause=Permission denied: user=root, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x).
Solution: run the command as the hdfs superuser and change the ownership on HDFS as below
sudo -u hdfs hadoop fs -chown cloudera:root /user/root

Note: the hdfs user has no password (blank password).

See also this link:
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201308.mbox/%3CCAORpBsgJG1d=FOoYpCxEvKo+Z+yWi5UPrjcbT5y50Q6Heffq9w@mail.gmail.com%3E
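
The error text itself explains the denial: the inode is owned by hdfs, belongs to group supergroup, and has mode drwxr-xr-x, so only the owner may write. The sketch below reproduces that reasoning with a hypothetical helper, can_write. It is a simplification: it ignores the HDFS superuser and assumes the user is not a member of the group.

```shell
# Decide whether a user may write, given the owner and a mode string
# such as drwxr-xr-x. can_write is a hypothetical helper.
can_write() {
    user=$1 owner=$2 mode=$3
    if [ "$user" = "$owner" ]; then
        bit=$(printf '%s' "$mode" | cut -c3)   # owner write bit
    else
        bit=$(printf '%s' "$mode" | cut -c9)   # "other" write bit
    fi
    [ "$bit" = "w" ]
}

# Values taken from the error message: user=root, owner=hdfs, mode=drwxr-xr-x
if can_write root hdfs drwxr-xr-x; then
    echo "root may write"
else
    echo "root is denied"
fi
```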



Online Courses

1. Hadoop Fundamentals
My first baby step towards Big Data. I finished the Hadoop Fundamentals course on the BigDataUniversity site with 13/20 marks :). It covers very basic topics on HDFS, Pig, Hive, JAQL, Hadoop administration and Flume. The course is short and sweet, and comes with lab practice videos.

http://bigdatauniversity.com/courses/course/view.php?id=516
