Java 8 Installation: http://tecadmin.net/install-java-8-on-centos-rhel-and-fedora/
$ cd /opt
$ wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u45-b14/jdk-8u45-linux-x64.tar.gz"
$ tar xzf jdk-8u45-linux-x64.tar.gz
$ wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.109/presto-server-0.109.tar.gz
$ tar -xvf presto-server-0.109.tar.gz
$ mv presto-server-0.109 /usr/local/presto
$ cd /usr/local/presto
$ mkdir /usr/local/presto/etc
Configuring Presto:
Node Properties:
$ vi etc/node.properties
# The following is a minimal etc/node.properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
$ mkdir /var/presto
$ mkdir /var/presto/data
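The placeholder node.id above must be unique for every node in the cluster. A minimal sketch of generating the file non-interactively, assuming uuidgen (from util-linux) is available; it writes into a scratch directory here, whereas on a real node the file goes to /usr/local/presto/etc:

```shell
# Sketch: generate a node.properties with a unique node.id.
# Assumes uuidgen (util-linux) is installed; a scratch directory
# stands in for /usr/local/presto/etc used in the steps above.
PRESTO_ETC=$(mktemp -d)
cat > "$PRESTO_ETC/node.properties" <<EOF
node.environment=production
node.id=$(uuidgen)
node.data-dir=/var/presto/data
EOF
cat "$PRESTO_ETC/node.properties"
```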
JVM Config:
$ vi etc/jvm.config
# The following provides a good starting point for creating etc/jvm.config:
-server
-Xmx16G
-XX:+UseConcMarkSweepGC
-XX:+ExplicitGCInvokesConcurrent
-XX:+AggressiveOpts
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
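The same file can be written non-interactively. Note that each line is handed to the JVM verbatim (it is not interpreted by a shell), and -Xmx16G assumes a machine with comfortably more than 16 GB of RAM; lower it to fit your box. A sketch, again using a scratch directory in place of /usr/local/presto/etc:

```shell
# Sketch: write etc/jvm.config without an editor.
# Each line becomes one JVM argument; -Xmx16G assumes >16 GB RAM.
PRESTO_ETC=$(mktemp -d)   # stand-in for /usr/local/presto/etc
cat > "$PRESTO_ETC/jvm.config" <<'EOF'
-server
-Xmx16G
-XX:+UseConcMarkSweepGC
-XX:+ExplicitGCInvokesConcurrent
-XX:+AggressiveOpts
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
EOF
wc -l < "$PRESTO_ETC/jvm.config"
```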
Config Properties:
$ vi etc/config.properties
# If you are setting up a single machine for testing that will act as both coordinator and worker, use this configuration:
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8084
task.max-memory=1GB
discovery-server.enabled=true
discovery.uri=http://localhost:8084
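The config above runs coordinator and worker in a single JVM, which is only for testing. On a multi-node cluster the workers get a different config.properties: coordinator=false, no discovery-server.enabled, and discovery.uri pointing at the coordinator. A sketch, where coordinator-host is a placeholder for your coordinator's hostname:

```shell
# Sketch: config.properties for a dedicated worker node.
# "coordinator-host" is a placeholder hostname; scratch dir stands in
# for /usr/local/presto/etc on the worker.
PRESTO_ETC=$(mktemp -d)
cat > "$PRESTO_ETC/config.properties" <<'EOF'
coordinator=false
http-server.http.port=8084
task.max-memory=1GB
discovery.uri=http://coordinator-host:8084
EOF
cat "$PRESTO_ETC/config.properties"
```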
Log Levels:
$ vi etc/log.properties
com.facebook.presto=INFO
Catalog Properties:
$ mkdir etc/catalog
Hive Connector:
$ vi etc/catalog/hive.properties
# Apache Hadoop 1.x: hive-hadoop1
# Apache Hadoop 2.x: hive-hadoop2
# Cloudera CDH 4: hive-cdh4
# Cloudera CDH 5: hive-cdh5
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
Running Presto:
$ bin/launcher start # To run as a daemon (in the background)
$ bin/launcher run   # To run in the foreground
$ cd /usr/local/presto/bin
$ wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.109/presto-cli-0.109-executable.jar
$ mv presto-cli-0.109-executable.jar presto
$ chmod +x presto
--------------
Hive Demo:
--------------
$ bin/presto --server localhost:8084 --catalog hive --schema default
Hive Tutorial: http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/
> SELECT year, max(runs) FROM batting GROUP BY year;
> SELECT a.year, a.player_id, a.runs FROM batting a JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b ON (a.year = b.year AND a.runs = b.runs);
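The self-join above can also be written with a window function, which Presto supports; a sketch assuming the same batting table from the tutorial (untested here):

```sql
-- Sketch: top run-scorer per year via rank() instead of a self-join,
-- assuming the batting table from the Hive tutorial above.
SELECT year, player_id, runs
FROM (
  SELECT year, player_id, runs,
         rank() OVER (PARTITION BY year ORDER BY runs DESC) AS rnk
  FROM batting
)
WHERE rnk = 1;
```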
===================================
Create Kafka Catalog:
https://prestodb.io/docs/current/connector/kafka-tutorial.html
$ vi etc/catalog/kafka.properties
connector.name=kafka
kafka.nodes=localhost:6667
kafka.table-names=test,twitter,tweets
kafka.hide-internal-columns=false
$ bin/presto --server localhost:8084 --catalog kafka --schema default
Queries:
> SELECT count(*) FROM tweets;
> SELECT DISTINCT json_extract_scalar(_message, '$.created_at') AS raw_date FROM tweets LIMIT 5;
> SELECT created_at, raw_date FROM ( SELECT created_at, json_extract_scalar(_message, '$.created_at') AS raw_date FROM tweets) GROUP BY 1, 2 LIMIT 5;
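The raw_date strings can be converted into real timestamps; the linked Kafka tutorial does this with parse_datetime and a Joda-style pattern. A sketch assuming Twitter's usual created_at format (e.g. 'Fri Jun 26 11:22:33 +0000 2015'):

```sql
-- Sketch: parse Twitter's created_at string into a timestamp,
-- assuming the format 'EEE MMM dd HH:mm:ss Z yyyy'.
SELECT parse_datetime(json_extract_scalar(_message, '$.created_at'),
                      'EEE MMM dd HH:mm:ss Z yyyy') AS created_ts
FROM tweets
LIMIT 5;
```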
Spark SQL vs Presto : http://www.qubole.com/blog/product/sql-on-hadoop-evaluation-by-pearson/