博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
hive高级查询(二)
阅读量:4160 次
发布时间:2019-05-26

本文共 5625 字,大约阅读时间需要 18 分钟。

文章目录

一.Hive聚合运算 - GROUP BY

GROUP BY用于分组

  • Hive基本内置聚合函数与GROUP BY一起使用
  • 如果没有指定GROUP BY子句,则默认聚合整个表
  • 除聚合函数外,所选的其他列也必须包含在GROUP BY中
  • GROUP BY支持使用CASE WHEN或表达式
select category, max(offervalue) from offers group by category;-- group by使用表达式select if(category > 4000, 'GOOD', 'BAD') as newcat,max(offervalue) from offers group by category if(category > 4000, 'GOOD', 'BAD');

Hive聚合运算 - HAVING

  • HAVING:对GROUP BY聚合结果的条件过滤
  • 可以避免在GROUP BY之后使用子查询
  • HAVING之后可以使用表达式,但不建议使用,会造成效率大大降低
-- having使用select sex_age.age from employee group by sex_age.age having count(*) <= 1;-- 使用子查询代替havingselect a.age from ( select count(*) as cnt, sex_age.age from employee group by sex_age.age ) a where a.cnt <= 1;

Hive聚合运算 - 基础聚合

基础聚合函数

  • max, min, count, sum, avg
  • max(distinct col1)、avg(col2)等
  • collect_set, collect_list:返回每个组列中的对象集/列表

注意事项

  • 一般与GROUP BY一起使用
  • 可应用于列或表达式
  • 对NULL的count聚合为0,即过滤了NULL

二.窗口函数 - 概述

窗口函数是一组特殊函数

  • 扫描多个输入行来计算每个输出值,为每行数据生成一行结果
  • 可以通过窗口函数来实现复杂的计算和聚合

语法

Function (arg1,..., arg n) OVER ([PARTITION BY <...>] [ORDER BY <....>] [
])
  • PARTITION BY类似于GROUP BY,未指定则按整个结果集
  • 只有指定ORDER BY子句之后才能进行窗口定义
  • 可同时使用多个窗口函数
  • 过滤窗口函数计算结果必须在外面一层
  • 按功能可划分为:排序,聚合,分析

1.窗口函数 - 排序

ROW_NUMBER()

  • 对所有数值输出不同的序号,序号唯一连续

RANK()

  • 对相同数值,输出相同的序号,下一个序号跳过(1,1,3)

DENSE_RANK()

  • 对相同数值,输出相同的序号,下一个序号连续(1,1,2)

NLITE(n)

  • 将有序的数据集合平均分配到n个桶中(若不能均分一般第一个桶数据会多些), 将桶号分配给每一行,根据桶号,选取前或后 n分之几的数据
  • 举例:若想查询订单记录的前1/3记录,可用NLITE(3)平均分成三份再套个查询语句使用where条件桶号=1即可实现

PERCENT_RANK()

  • (目前排名- 1)/(总行数- 1),值相对于一组值的百分比排名
-- 窗口函数 排序类SELECT name, dept_num, salary, ROW_NUMBER() OVER () AS row_num, RANK() OVER (PARTITION BY dept_num ORDER BY salary) AS rank, DENSE_RANK() OVER (PARTITION BY dept_num ORDER BY salary) AS dense_rank,PERCENT_RANK() OVER(PARTITION BY dept_num ORDER BY salary) AS percent_rank, NTILE(2) OVER(PARTITION BY dept_num ORDER BY salary) AS ntile FROM employee_contract ORDER BY dept_num, salary;

2. 窗口函数 - 聚合

COUNT()

计数,可以和DISTINCT一起用

SELECT COUNT(DISTINCT a) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)

SUM():求和

AVG():平均值
MAX()/MIN(): 最大/小值
从Hive 2.1.0开始在OVER子句中支持聚合函数

SELECT rank() OVER (ORDER BY sum(b))  FROM T GROUP BY a;
-- 窗口函数 聚合类SELECT name, dept_num, salary,COUNT(*) OVER (PARTITION BY dept_num) AS row_cnt,--COUNT(DISTINCT *) OVER (PARTITION BY dept_num) AS row_cnt_dis,SUM(salary) OVER(PARTITION BY dept_num ORDER BY dept_num) AS deptTotal,SUM(salary) OVER(ORDER BY dept_num) AS runningTotal1, SUM(salary) OVER(ORDER BY dept_num, name rows unbounded preceding) AS runningTotal2,AVG(salary) OVER(PARTITION BY dept_num) AS avgDept,MIN(salary) OVER(PARTITION BY dept_num) AS minDept,MAX(salary) OVER(PARTITION BY dept_num) AS maxDeptFROM employee_contractORDER BY dept_num, name;

3.窗口函数 - 分析

CUME_DIST

  • 小于等于当前值的行数/分组内总行数

LEAD/LAG(column,n)

  • 某一列进行往前/后第n行值(n可选,默认为1)
  • 这个函数很有用,可用于分析频率,比如lag(购买时间,1),就可知道每次购买时间的频率

FIRST_VALUE

  • 对该列到目前为止的首个值

LAST_VALUE

  • 到目前行为止的最后一个值
-- 窗口函数 分析类SELECT name, dept_num, salary,LEAD(salary, 2) OVER(PARTITION BY dept_num ORDER BY salary) AS lead,LAG(salary, 2, 0) OVER(PARTITION BY dept_num ORDER BY salary) AS lag,FIRST_VALUE(salary) OVER (PARTITION BY dept_num ORDER BY salary) AS first_value,LAST_VALUE(salary) OVER (PARTITION BY dept_num ORDER BY salary) AS last_value_default,LAST_VALUE(salary) OVER (PARTITION BY dept_num ORDER BY salary RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_valueFROM employee_contractORDER BY dept_num, salary;

4.窗口函数 - 窗口定义(必须使用order by)

窗口定义由[<window_clause>]子句描述

  • 用于进一步细分结果并应用分析函数

在这里插入图片描述

支持两类窗口定义

  • 行类型窗口
  • 范围类型窗口

RANK、NTILE、DENSE_RANK、CUME_DIST、PERCENT_RANK、LEAD、LAG和ROW_NUMBER函数不支持与窗口子句一起使用

行窗口:根据当前行之前或之后的行号确定的窗口

  • ROWS BETWEEN <start_expr> AND <end_expr>

<start_expr>可以为下列值

  • UNBOUNDED PRECEDING : 窗口起始位置(分组第一行)
  • CURRENT ROW:当前行
  • N PRECEDING/FOLLOWING:当前行之前/之后n行

<end_expr>可以为下列值

  • UNBOUNDED FOLLOWING : 窗口结束位置(分组最后一行)
  • CURRENT ROW:当前行
  • N PRECEDING/FOLLOWING:当前行之前/之后n行
SELECTname, dept_num AS dept, salary AS sal,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) win1,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 2 PRECEDING AND UNBOUNDED FOLLOWING) win2,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) win3,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING) win4,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 1 FOLLOWING AND 2 FOLLOWING) win5,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN CURRENT ROW AND CURRENT ROW) win6,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) win7,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) win8,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) win9,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) win10,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) win11,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS 2 PRECEDING) win12FROM employee_contract  ORDER BY dept, name;

范围窗口是取分组内的值在指定范围区间内的行

  • 该范围值/区间必须是数字或日期类型
  • 目前只支持一个ORDER BY列
SUM(close) RANGE BETWEEN 500 PRECEDING AND 1000 FOLLOWING-- 假设当前close值的行数为3000,语句将包含分区内范围从2500到4000的行
-- 示例SELECT name, dept_num AS dept, salary AS sal,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) win1,salary - 1000 as sal_r_start,salary as sal_r_end,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name RANGE BETWEEN 1000 PRECEDING AND CURRENT ROW) win13FROM employee_contract ORDER BY dept, name;

转载地址:http://pcjxi.baihongyu.com/

你可能感兴趣的文章
Java多线程之线程池Executor
查看>>
Java虚拟机之自动内存管理
查看>>
Java虚拟机之内存分配详解
查看>>
Java虚拟机之垃圾收集器
查看>>
Java虚拟机之常见参数配置
查看>>
Java虚拟机之性能监控工具
查看>>
性能优化之Java程序优化细节(珍藏版)
查看>>
SSO单点登录
查看>>
jmap错误:unknown CollectedHeap type : class sun.jvm.hotspot.gc_interface.CollectedHeap
查看>>
Intellij IDEA中查看System.gc()堆栈日志
查看>>
MySQL死锁的案例
查看>>
Ant环境安装(Windows10+CentOS7)
查看>>
CentOS安装MySQL客户端
查看>>
Docker启动报错:SELinux is not supported with the overlay2 graph driver
查看>>
CentOS7安装hadoop2.7.3-单机版
查看>>
深入理解HDFS原理
查看>>
Intellij IDEA连接Hadoop HDFS实现本地调试
查看>>
MongoDB数据类型
查看>>
MongoDB的java客户端连接池配置说明
查看>>
MongoDB在CentOS7下的yum方式和压缩包方式安装
查看>>