Presto
1. Server Installation
Official documentation: Deploying Presto
Download the latest release presto-server-0.247.tar.gz and extract it.
The following items need to be configured. See the official documentation for detailed explanations; this is just a quick single-machine setup.
Node Properties: environment configuration specific to each node
JVM Config: command-line options for the Java Virtual Machine
Config Properties: configuration for the Presto server
Catalog Properties: configuration for connectors (data sources)
1.1 Node Properties
File location: etc/node.properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
1.2 JVM Config
File location: etc/jvm.config
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
1.3 Config Properties
File location: etc/config.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://example.net:8080
1.4 Catalog Properties
File location: etc/catalog/xx.properties
See the official Connectors documentation for details.
1.4.1 Kudu Connector
File location: etc/catalog/kudu.properties
Contents:
connector.name=kudu
kudu.client.master-addresses=localhost
kudu.schema-emulation.enabled=true
kudu.schema-emulation.prefix=
Query examples
- Create a table
CREATE TABLE kudu.default.users (
user_id int WITH (primary_key = true),
first_name varchar,
last_name varchar
) WITH (
partition_by_hash_columns = ARRAY['user_id'],
partition_by_hash_buckets = 2
);
- View the table schema
DESCRIBE kudu.default.users;
- Insert data
INSERT INTO kudu.default.users VALUES (1, 'Donald', 'Duck'), (2, 'Mickey', 'Mouse');
- Query data
SELECT * FROM kudu.default.users;
1.5 Running
Run in the background
bin/launcher start
Run in the foreground
bin/launcher run
Stop
bin/launcher stop
1.6 Command-Line Client Installation
Download the latest client [jar](https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.247/presto-cli-0.247-executable.jar).
After downloading, make it executable with chmod +x.
Then run it. --catalog is the name of the connector to use:
./presto --server localhost:8080 --catalog kudu --schema default
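Once connected, a few metadata queries can confirm that the server and the kudu catalog are working (a minimal sanity check; catalog and schema names follow the example above):
-- list registered catalogs (kudu should appear)
SHOW CATALOGS;
-- list schemas and tables in the kudu catalog
SHOW SCHEMAS FROM kudu;
SHOW TABLES FROM kudu.default;
-- list the nodes known to the cluster
SELECT * FROM system.runtime.nodes;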
2. Common SQL
2.1 Computing the Median
SELECT id, approx_percentile(value, 0.5)
FROM (
VALUES
(1, 2),
(1, 3),
(1, 7),
(1, 8),
(1, 4),
(2, 20),
(2, 30),
(2, 40)
) AS t(id, value)
GROUP BY id;
2.2 Converting a String Date to a Date
Reference: Date and Time Functions and Operators
select from_iso8601_date('2021-03-03') + interval '2' day
date_diff('second',date_parse(bcc.live_start_date,'%Y-%m-%d %H:%i:%s'),date_parse(bcc.live_end_date,'%Y-%m-%d %H:%i:%s'))
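Wrapped into a complete query (a small sketch using literal values in place of the bcc.* columns above):
SELECT
  from_iso8601_date('2021-03-03') + interval '2' day AS plus_two_days,  -- 2021-03-05
  date_diff(
    'second',
    date_parse('2021-03-03 10:00:00', '%Y-%m-%d %H:%i:%s'),
    date_parse('2021-03-03 11:30:00', '%Y-%m-%d %H:%i:%s')
  ) AS duration_seconds;  -- 5400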
2.3 Exporting Data
sql="
select distinct username from odl_user_uc where length(username) >3 and not regexp_like(username,'\w+')
"
presto --server namenode1:8093 --catalog hive --schema default --execute "${sql}" --output-format CSV_HEADER > test
2.4 Columns to Rows (UNNEST / explode)
select
t8.*
,order_id
from tablexxx t8
CROSS JOIN UNNEST(split(first_case_order_id,',')) as order_id(order_id)
-- hive
SELECT student,score FROM tests LATERAL VIEW explode(scores)t AS score
-- presto
SELECT student, score
FROM tests
CROSS JOIN UNNEST(scores) AS t (score);
2.5 Date Arithmetic
date_diff('day',date('2021-01-01'),date('2022-01-10'))
to_iso8601(date_add('day',-1,date(substr(end_date,1,10)))) as end_date
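The same expressions as a runnable query (a sketch with literal dates; end_date above is assumed to be a string such as '2022-01-10 00:00:00'):
SELECT
  date_diff('day', date('2021-01-01'), date('2022-01-10')) AS day_diff,  -- 374
  to_iso8601(date_add('day', -1, date('2022-01-10'))) AS end_date;  -- '2022-01-09'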
2.6 Rows to Columns (ARRAY_AGG)
select
a.classscheduleid
,array_join(array_distinct(ARRAY_AGG(a.teacherid)),',') as teacherid
,array_join(array_distinct(ARRAY_AGG(b.realname)),',') as realname
,array_join(array_distinct(ARRAY_AGG(b.nationality)),',') as nationality
from odl_new_schedule_teachers_class a
left join odl_teacherpersonalinfo_class b on a.teacherid=b.teacherid
where a.isdeleted=false
and a.classscheduleid=6799756
group by a.classscheduleid
classscheduleid | teacherid | realname | nationality
-----------------+--------------------+---------------------+-------------
6799756 | 104490086,38422508 | username1,username2 | JP,CN
2.7 Converting a Unix Timestamp to a Date/Time
presto:default> SELECT date_format(from_unixtime(1612345678), '%Y-%m-%d') as date;
date
------------
2021-02-03
presto:default> SELECT date_format(from_unixtime(1612345678), '%Y-%m-%d %H:%i:%s') as datetime;
datetime
---------------------
2021-02-03 17:47:58
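The output depends on the session time zone (the values above correspond to UTC+8). The reverse direction, as a sketch, parses the string back and converts it to a Unix timestamp:
-- result depends on the session time zone, same as from_unixtime above
SELECT CAST(to_unixtime(date_parse('2021-02-03 17:47:58', '%Y-%m-%d %H:%i:%s')) AS bigint) AS ts;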
2.8 Avoiding Errors from Hive Incompatibility When Running Queries
These are usually metadata errors.
set session hive.collect_column_statistics_on_write = false;
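For example (a sketch; hive.default.hive_table is a hypothetical target table), disable statistics collection for the session before writing:
-- turn off column statistics collection for this session only
SET SESSION hive.collect_column_statistics_on_write = false;
-- then rerun the write that failed with a metadata error
INSERT INTO hive.default.hive_table
SELECT * FROM kudu.default.users;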
2.9 Parsing JSON
-- Suppose the JSON data looks like this
-- {"name": "John", "age": 30, "city": "New York"}
-- Query the JSON column in the table and extract the values of specific keys
SELECT
json_extract_scalar(json_column, '$.name') AS name,
json_extract_scalar(json_column, '$.age') AS age,
json_extract_scalar(json_column, '$.city') AS city
FROM
your_table;
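For nested objects or arrays, json_extract plus a CAST also works (a sketch; the address and tags fields are hypothetical additions to the JSON above):
-- e.g. {"name": "John", "address": {"city": "New York"}, "tags": ["a", "b"]}
SELECT
  json_extract_scalar(json_column, '$.address.city') AS city,
  CAST(json_extract(json_column, '$.tags') AS ARRAY(VARCHAR)) AS tags
FROM
  your_table;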