Impala: count(distinct QUESTION_ID) vs. ndv(QUESTION_ID)

Date: 2020-12-29  Author: 张江英

In Impala, a single SELECT that applies count(distinct ...) to more than one column fails with an error. For example:

select C_DEPT2,
       count(distinct QUESTION_BUSI_ID) as wo_num,
       count(distinct CREATOR_ID) as creator_num
  from pdm.kudu_q_basic
 where substr(CREATE_DATE, 1, 7) = '2020-10'
 group by C_DEPT2

Error message:

ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT QUESTION_BUSI_ID); deviating function: count(DISTINCT CREATOR_ID)
Consider using NDV() instead of COUNT(DISTINCT) if estimated counts are acceptable. Enable the APPX_COUNT_DISTINCT query option to perform this rewrite automatically.

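There are two ways to work around this:

1. Approximate values (the larger the data volume, the larger the deviation tends to be):

(1) Before running the query, set the query option APPX_COUNT_DISTINCT to true: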

set APPX_COUNT_DISTINCT=true;
select C_DEPT2,
       count(distinct QUESTION_BUSI_ID) as wo_num,
       count(distinct CREATOR_ID) as creator_num
  from pdm.kudu_q_basic
 where substr(CREATE_DATE, 1, 7) = '2020-10'
 group by C_DEPT2
 order by C_DEPT2

(2) Replace count(distinct col) with the function ndv(col):

select C_DEPT2,
       ndv(QUESTION_BUSI_ID) as wo_num,
       ndv(CREATOR_ID) as creator_num
  from pdm.kudu_q_basic
 where substr(CREATE_DATE, 1, 7) = '2020-10'
 group by C_DEPT2
 order by C_DEPT2

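Note that with set APPX_COUNT_DISTINCT=true; in effect, count(distinct col) is automatically rewritten to ndv(col) and therefore returns an approximate value, so the two methods above produce the same results.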
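To get a feel for how far the approximation drifts on this data, the exact and approximate counts can be compared one column at a time. The following is a minimal sketch, not a benchmark: the aliases exact_num and approx_num are illustrative, and it relies on the fact that a statement with only one count(distinct ...) does not trigger the AnalysisException.

```sql
-- Sketch: compare the exact and approximate distinct counts for one column.
set APPX_COUNT_DISTINCT=false; -- keep count(distinct ...) exact for this comparison
select C_DEPT2,
       count(distinct CREATOR_ID) as exact_num,  -- exact distinct count
       ndv(CREATOR_ID)            as approx_num  -- approximate distinct count
  from pdm.kudu_q_basic
 where substr(CREATE_DATE, 1, 7) = '2020-10'
 group by C_DEPT2
 order by C_DEPT2;
```

Running the same statement with QUESTION_BUSI_ID in place of CREATOR_ID covers the other column.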

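2. Exact values. Split the query into one subquery per count(distinct col), then join the subqueries:

set APPX_COUNT_DISTINCT = false; -- set the option to false so that count(distinct col) is not rewritten to ndv(col)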
select a.C_DEPT2, a.wo_num, b.creator_num
  from (select C_DEPT2, count(distinct QUESTION_BUSI_ID) as wo_num
          from pdm.kudu_q_basic
         where substr(CREATE_DATE, 1, 7) = '2020-10'
         group by C_DEPT2) a
  left join (select C_DEPT2, count(distinct CREATOR_ID) as creator_num
               from pdm.kudu_q_basic
              where substr(CREATE_DATE, 1, 7) = '2020-10'
              group by C_DEPT2) b on a.C_DEPT2 = b.C_DEPT2
 order by a.C_DEPT2
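One caveat of joining on a.C_DEPT2 = b.C_DEPT2: if C_DEPT2 can be NULL, the NULL group produced by each subquery does not match itself under =, so creator_num comes back as NULL for that group. A possible variant, assuming an Impala version that supports the NULL-safe comparison IS NOT DISTINCT FROM (also written <=>), is sketched below. Whether C_DEPT2 actually contains NULLs in this table is not stated in the original.

```sql
-- Sketch: same exact-count query, with a NULL-safe join condition.
set APPX_COUNT_DISTINCT = false; -- keep count(distinct ...) exact
select a.C_DEPT2, a.wo_num, b.creator_num
  from (select C_DEPT2, count(distinct QUESTION_BUSI_ID) as wo_num
          from pdm.kudu_q_basic
         where substr(CREATE_DATE, 1, 7) = '2020-10'
         group by C_DEPT2) a
  left join (select C_DEPT2, count(distinct CREATOR_ID) as creator_num
               from pdm.kudu_q_basic
              where substr(CREATE_DATE, 1, 7) = '2020-10'
              group by C_DEPT2) b
    on a.C_DEPT2 is not distinct from b.C_DEPT2 -- treats NULL as equal to NULL
 order by a.C_DEPT2;
```

If C_DEPT2 is never NULL, this returns the same rows as the plain equality join.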

Verification:

select C_DEPT2, count(*)
  from pdm.kudu_q_basic -- the table contains no duplicate rows
 where substr(CREATE_DATE, 1, 7) = '2020-10'
 group by C_DEPT2
 order by C_DEPT2
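The row-count check is only meaningful if the table really has no duplicates. One way to confirm that assumption is sketched below, assuming QUESTION_BUSI_ID is the column expected to be unique; the aliases total_rows, non_null_ids and distinct_ids are illustrative. Only one count(distinct ...) appears, so the statement avoids the AnalysisException.

```sql
-- Sketch: check that QUESTION_BUSI_ID is unique and non-NULL for the month.
set APPX_COUNT_DISTINCT = false; -- keep count(distinct ...) exact
select count(*)                         as total_rows,   -- all rows in the month
       count(QUESTION_BUSI_ID)          as non_null_ids, -- rows with a non-NULL id
       count(distinct QUESTION_BUSI_ID) as distinct_ids  -- distinct non-NULL ids
  from pdm.kudu_q_basic
 where substr(CREATE_DATE, 1, 7) = '2020-10';
```

If the three numbers are equal, every row in the month has its own non-NULL QUESTION_BUSI_ID, and count(*) per C_DEPT2 should match the exact wo_num.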

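Summary: when a single SELECT with several count(distinct col) expressions fails in Impala, you can either run set APPX_COUNT_DISTINCT = true; or replace count(distinct col) with ndv(col). Both return approximate, not exact, values. To get exact values, compute each count(distinct col) in its own subquery and join the results, making sure that APPX_COUNT_DISTINCT = false; otherwise the expressions are rewritten to ndv(col) and the result is still approximate.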
