本文共 5625 字,大约阅读时间需要 18 分钟。
GROUP BY用于分组
除聚合函数外,所选的其他列也必须包含在GROUP BY中
select category, max(offervalue) from offers group by category;-- group by使用表达式select if(category > 4000, 'GOOD', 'BAD') as newcat,max(offervalue) from offers group by category if(category > 4000, 'GOOD', 'BAD');
Hive聚合运算 - HAVING
HAVING之后可以使用表达式,但不建议使用,会造成效率大大降低
-- having使用select sex_age.age from employee group by sex_age.age having count(*) <= 1;-- 使用子查询代替havingselect a.age from ( select count(*) as cnt, sex_age.age from employee group by sex_age.age ) a where a.cnt <= 1;
Hive聚合运算 - 基础聚合
基础聚合函数注意事项
对NULL的count聚合为0,即过滤了NULL
窗口函数是一组特殊函数
语法
Function (arg1,..., arg n) OVER ([PARTITION BY <...>] [ORDER BY <....>] [])
只有指定ORDER BY子句之后才能进行窗口定义
ROW_NUMBER()
序号唯一连续
RANK()
下一个序号跳过
(1,1,3)DENSE_RANK()
下一个序号连续
(1,1,2)NLITE(n)
PERCENT_RANK()
-- 窗口函数 排序类SELECT name, dept_num, salary, ROW_NUMBER() OVER () AS row_num, RANK() OVER (PARTITION BY dept_num ORDER BY salary) AS rank, DENSE_RANK() OVER (PARTITION BY dept_num ORDER BY salary) AS dense_rank,PERCENT_RANK() OVER(PARTITION BY dept_num ORDER BY salary) AS percent_rank, NTILE(2) OVER(PARTITION BY dept_num ORDER BY salary) AS ntile FROM employee_contract ORDER BY dept_num, salary;
COUNT()
计数,可以和DISTINCT一起用SELECT COUNT(DISTINCT a) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
SUM():求和
AVG():平均值 MAX()/MIN(): 最大/小值 从Hive 2.1.0开始在OVER子句中支持聚合函数SELECT rank() OVER (ORDER BY sum(b)) FROM T GROUP BY a;
-- 窗口函数 聚合类SELECT name, dept_num, salary,COUNT(*) OVER (PARTITION BY dept_num) AS row_cnt,--COUNT(DISTINCT *) OVER (PARTITION BY dept_num) AS row_cnt_dis,SUM(salary) OVER(PARTITION BY dept_num ORDER BY dept_num) AS deptTotal,SUM(salary) OVER(ORDER BY dept_num) AS runningTotal1, SUM(salary) OVER(ORDER BY dept_num, name rows unbounded preceding) AS runningTotal2,AVG(salary) OVER(PARTITION BY dept_num) AS avgDept,MIN(salary) OVER(PARTITION BY dept_num) AS minDept,MAX(salary) OVER(PARTITION BY dept_num) AS maxDeptFROM employee_contractORDER BY dept_num, name;
CUME_DIST
LEAD/LAG(column,n)
这个函数很有用,可用于分析频率,比如lag(购买时间,1),就可知道每次购买时间的频率
FIRST_VALUE
LAST_VALUE
-- 窗口函数 分析类SELECT name, dept_num, salary,LEAD(salary, 2) OVER(PARTITION BY dept_num ORDER BY salary) AS lead,LAG(salary, 2, 0) OVER(PARTITION BY dept_num ORDER BY salary) AS lag,FIRST_VALUE(salary) OVER (PARTITION BY dept_num ORDER BY salary) AS first_value,LAST_VALUE(salary) OVER (PARTITION BY dept_num ORDER BY salary) AS last_value_default,LAST_VALUE(salary) OVER (PARTITION BY dept_num ORDER BY salary RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_valueFROM employee_contractORDER BY dept_num, salary;
窗口定义由[<window_clause>]子句描述
支持两类窗口定义
RANK、NTILE、DENSE_RANK、CUME_DIST、PERCENT_RANK、LEAD、LAG和ROW_NUMBER函数不支持与窗口子句一起使用
行窗口:根据当前行之前或之后的行号确定的窗口<start_expr>可以为下列值
<end_expr>可以为下列值
SELECTname, dept_num AS dept, salary AS sal,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) win1,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 2 PRECEDING AND UNBOUNDED FOLLOWING) win2,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) win3,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING) win4,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 1 FOLLOWING AND 2 FOLLOWING) win5,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN CURRENT ROW AND CURRENT ROW) win6,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) win7,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) win8,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) win9,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) win10,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) win11,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS 2 PRECEDING) win12FROM employee_contract ORDER BY dept, name;
范围窗口是取分组内的值在指定范围区间内的行
SUM(close) RANGE BETWEEN 500 PRECEDING AND 1000 FOLLOWING-- 假设当前close值的行数为3000,语句将包含分区内范围从2500到4000的行
-- 示例SELECT name, dept_num AS dept, salary AS sal,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) win1,salary - 1000 as sal_r_start,salary as sal_r_end,MAX(salary) OVER (PARTITION BY dept_num ORDER BY name RANGE BETWEEN 1000 PRECEDING AND CURRENT ROW) win13FROM employee_contract ORDER BY dept, name;
转载地址:http://pcjxi.baihongyu.com/