Friday, July 10, 2015

Presto Tutorial

Java 8 Installation: http://tecadmin.net/install-java-8-on-centos-rhel-and-fedora/

$ cd /opt

$ wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u45-b14/jdk-8u45-linux-x64.tar.gz"

$ tar xzf jdk-8u45-linux-x64.tar.gz
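The steps above unpack the JDK but do not put it on the PATH, and the Presto launcher needs a `java` binary it can find. A minimal sketch, assuming the archive unpacked to /opt/jdk1.8.0_45 (the usual directory name for this tarball); the helper name `use_jdk` is hypothetical and not part of the JDK or Presto:

```shell
# Point the current shell at an unpacked JDK.
# $1 = JDK directory (assumed default: /opt/jdk1.8.0_45).
use_jdk() {
    local jdk_dir="${1:-/opt/jdk1.8.0_45}"
    # Refuse to change the environment if there is no java binary there.
    if [ ! -x "$jdk_dir/bin/java" ]; then
        echo "no java binary under $jdk_dir" >&2
        return 1
    fi
    export JAVA_HOME="$jdk_dir"
    export PATH="$JAVA_HOME/bin:$PATH"
}
```

After sourcing this, `use_jdk` followed by `java -version` should report a 1.8.0_45 runtime. The setting only lasts for the current shell; adding the two export lines to a profile script makes it permanent.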

$ wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.109/presto-server-0.109.tar.gz

$ tar -xvf presto-server-0.109.tar.gz

$ mv presto-server-0.109 /usr/local/presto

$ cd /usr/local/presto

$ mkdir /usr/local/presto/etc

Configuring Presto:

Node Properties:

$ vi etc/node.properties

# The following is a minimal etc/node.properties. node.id must be unique for every node in the cluster.

node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data

$ mkdir -p /var/presto/data
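The all-f node.id above works on a single machine, but on a multi-node cluster each node needs its own id. A minimal sketch that writes the same three properties with a generated UUID; the function name `write_node_properties` and its argument defaults are assumptions for illustration:

```shell
# Write a minimal node.properties with a freshly generated node.id.
# $1 = Presto etc directory, $2 = data directory (defaults taken from this tutorial).
write_node_properties() {
    local etc_dir="${1:-/usr/local/presto/etc}"
    local data_dir="${2:-/var/presto/data}"
    local node_id
    # Prefer uuidgen; fall back to the Linux kernel's UUID source.
    if command -v uuidgen >/dev/null 2>&1; then
        node_id="$(uuidgen)"
    else
        node_id="$(cat /proc/sys/kernel/random/uuid)"
    fi
    mkdir -p "$etc_dir" "$data_dir"
    cat > "$etc_dir/node.properties" <<EOF
node.environment=production
node.id=$node_id
node.data-dir=$data_dir
EOF
}
```

Run it once per node and keep the resulting file: the id should stay the same across restarts and upgrades of that node.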

JVM Config:

$ vi etc/jvm.config

# The following provides a good starting point for creating etc/jvm.config:

-server
-Xmx16G
-XX:+UseConcMarkSweepGC
-XX:+ExplicitGCInvokesConcurrent
-XX:+AggressiveOpts
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p

Config Properties:

$ vi etc/config.properties

# For a single test machine that acts as both coordinator and worker, use this configuration:

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8084
task.max-memory=1GB
discovery-server.enabled=true
discovery.uri=http://localhost:8084
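In this single-node setup the coordinator runs the embedded discovery server, so the port in discovery.uri has to match http-server.http.port (8084 here, not the commonly used 8080). A small sanity check can catch a mismatch before starting the server. This is a sketch only; `check_discovery_port` is a hypothetical helper, and it assumes both properties appear one per line, exactly as written above:

```shell
# Compare the HTTP port with the port in discovery.uri.
# $1 = path to config.properties (default taken from this tutorial).
check_discovery_port() {
    local config="${1:-/usr/local/presto/etc/config.properties}"
    local http_port
    local discovery_port
    http_port="$(sed -n 's|^http-server\.http\.port=\([0-9][0-9]*\)[[:space:]]*$|\1|p' "$config")"
    discovery_port="$(sed -n 's|^discovery\.uri=.*:\([0-9][0-9]*\)/*[[:space:]]*$|\1|p' "$config")"
    if [ -n "$http_port" ] && [ "$http_port" = "$discovery_port" ]; then
        echo "ok: both use port $http_port"
    else
        echo "mismatch: http port '$http_port', discovery port '$discovery_port'" >&2
        return 1
    fi
}
```

The check does not apply to worker nodes in a larger cluster, where discovery.uri points at the coordinator's host rather than at localhost.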

Log Levels:

$ vi etc/log.properties

com.facebook.presto=INFO

Catalog Properties:

$ mkdir etc/catalog

Hive Connector:

$ vi etc/catalog/hive.properties


# Set connector.name to match your Hadoop distribution:
# Apache Hadoop 1.x: hive-hadoop1
# Apache Hadoop 2.x: hive-hadoop2
# Cloudera CDH 4: hive-cdh4
# Cloudera CDH 5: hive-cdh5

connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083



Running Presto:

$ bin/launcher start   # Run as a daemon (in the background)

$ bin/launcher run     # Run in the foreground
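When the server is started as a daemon, it takes a little while before it accepts queries. `bin/launcher status` reports whether the process is running; the sketch below goes one step further and polls the server's /v1/info HTTP endpoint until it answers. The function name `wait_for_presto`, the retry count, and the delay are assumptions, and it needs `curl` to be installed:

```shell
# Poll the Presto HTTP port until it answers or the attempts run out.
# $1 = base URL, $2 = number of attempts, $3 = seconds between attempts.
wait_for_presto() {
    local url="${1:-http://localhost:8084}/v1/info"
    local attempts="${2:-30}"
    local delay="${3:-2}"
    local i=0
    while [ "$i" -lt "$attempts" ]; do
        # --fail makes curl return non-zero on HTTP errors as well.
        if curl --silent --fail --max-time 5 "$url" >/dev/null 2>&1; then
            echo "Presto is answering at $url"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "Presto did not answer at $url after $attempts attempts" >&2
    return 1
}
```

A typical use would be `bin/launcher start && wait_for_presto` before opening the CLI. If it times out, the server log under the data directory (var/log/server.log) is the first place to look.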

Installing the Presto CLI:

$ cd /usr/local/presto/bin

$ wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.109/presto-cli-0.109-executable.jar

$ mv presto-cli-0.109-executable.jar presto

$ chmod +x presto

--------------
Hive Demo:
--------------

$ cd /usr/local/presto

$ bin/presto --server localhost:8084 --catalog hive --schema default

The batting table queried below is created in this Hive tutorial: http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/

> SELECT year, max(runs) FROM batting GROUP BY year;

> SELECT a.year, a.player_id, a.runs FROM batting a JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b ON (a.year = b.year AND a.runs = b.runs);
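The same queries can also be run without the interactive prompt, which is handy for scripts. A minimal sketch using the CLI's `--execute` option; the wrapper name `run_hive_query` and the `PRESTO_CLI` variable are hypothetical, and the server address, catalog, and schema are the ones used in this tutorial:

```shell
# Run one SQL statement against the hive catalog and print the result.
# $1 = SQL text. PRESTO_CLI may override the path to the CLI executable.
run_hive_query() {
    local cli="${PRESTO_CLI:-/usr/local/presto/bin/presto}"
    "$cli" --server localhost:8084 --catalog hive --schema default --execute "$1"
}
```

For example, `run_hive_query 'SELECT year, max(runs) FROM batting GROUP BY year'` prints the result rows to standard output, so they can be redirected to a file.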

===================================

Create Kafka Catalog:

https://prestodb.io/docs/current/connector/kafka-tutorial.html

$ vi etc/catalog/kafka.properties

connector.name=kafka
kafka.nodes=localhost:6667
kafka.table-names=test,twitter,tweets
kafka.hide-internal-columns=false


$ bin/presto --server localhost:8084 --catalog kafka --schema default

Queries:

> SELECT count(*) FROM tweets;

> SELECT DISTINCT json_extract_scalar(_message, '$.created_at') AS raw_date FROM tweets LIMIT 5;

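The next query needs a topic description file for tweets that maps a created_at column, as described in the Kafka tutorial linked above. Without one, the table exposes only the internal columns such as _message.

> SELECT created_at, raw_date FROM (SELECT created_at, json_extract_scalar(_message, '$.created_at') AS raw_date FROM tweets) GROUP BY 1, 2 LIMIT 5;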
