PostgreSQL游标与索引选择

时间:2022-09-29 foucus、人气:0

之前有写过一个案例，order by limit因为数据分布不均而选择了错误的索引，这是由于优化器没法判断数据的分布关系，默认认为数据分布是均匀的所导致的。

而除了limit，当我们在使用游标时也要注意有可能会出现类似的情况。而往往这类在存储过程中的SQL我们更难发现其选择了错误的执行计划，所以需要注意。

1、建测试表

bill=# create table tbl (id int, c1 int, c2 int, c3 int, c4 int);
CREATE TABLE

2、写入一批随机数据，ID从1到1000万。

bill=# insert into tbl select generate_series(1,10000000), random()*100, random()*100, random()*100, random()*100;  
INSERT 0 10000000

3、写入另一批100万条数据，c1,c2 与前面1000万的值不一样。

bill=# insert into tbl select generate_series(10000001,11000000), 200,200,200,200;  
INSERT 0 1000000

4、创建两个索引，也就是本文需要重点关注的，到底走哪个索引更划算

bill=# create index idx_tbl_1 on tbl(id);  
CREATE INDEX  
bill=# create index idx_tbl_2 on tbl(c1,c2,c3,c4);  
CREATE INDEX

5、收集统计信息

bill=# vacuum analyze tbl;  
VACUUM

6、查看下面SQL的执行计划，走了正确的索引

bill=# explain select * from tbl where c1=200 and c2=200 order by id;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Sort  (cost=72109.20..72344.16 rows=93984 width=20)
   Sort Key: id
   ->  Bitmap Heap Scan on tbl  (cost=1392.77..60811.81 rows=93984 width=20)
         Recheck Cond: ((c1 = 200) AND (c2 = 200))
         ->  Bitmap Index Scan on idx_tbl_2  (cost=0.00..1369.28 rows=93984 width=0)
               Index Cond: ((c1 = 200) AND (c2 = 200))
(6 rows)

7、而当我们在游标中使用该SQL时，会发现执行计划出现了偏差

bill=# begin;
BEGIN
bill=*# explain declare tt cursor for select * from tbl where c1=200 and c2=200 order by id;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Index Scan using idx_tbl_1 on tbl  (cost=0.43..329277.60 rows=93984 width=20)
   Filter: ((c1 = 200) AND (c2 = 200))
(2 rows)

为什么会出现这种情况呢，这其实是因为使用游标的SQL会根据cursor_tuple_fraction参数进行自动优化，而该参数默认是0.1，表示只检索前10%的行进行预估，这就和limit有点异曲同工的味道了。

因为对于这张表，优化器认为数据是均匀分布的，而实际上，数据分布是不均匀的，c1=200 and c2=200的记录在表的末端。当我们在游标中只检索了前10%的行，所以会得到一个错误的执行计划。

具体的细节我们可以在parsenodes.h和planner.c中看到：

当使用cursor或者SPI_PREPARE_CURSOR函数时，会设置CURSOR_OPT_FAST_PLAN标志位，然后就会根据cursor_tuple_fraction参数对SQL进行自动优化，所以对于一些数据分布不均的情况，可能就会

导致选择了错误的执行计划。
	/* Determine what fraction of the plan is likely to be scanned */
	if (cursorOptions & CURSOR_OPT_FAST_PLAN)
	{
		/*
		 * We have no real idea how many tuples the user will ultimately FETCH
		 * from a cursor, but it is often the case that he doesn't want 'em
		 * all, or would prefer a fast-start plan anyway so that he can
		 * process some of the tuples sooner.  Use a GUC parameter to decide
		 * what fraction to optimize for.
		 */
		tuple_fraction = cursor_tuple_fraction;
		/*
		 * We document cursor_tuple_fraction as simply being a fraction, which
		 * means the edge cases 0 and 1 have to be treated specially here.  We
		 * convert 1 to 0 ("all the tuples") and 0 to a very small fraction.
		 */
		if (tuple_fraction >= 1.0)
			tuple_fraction = 0.0;
		else if (tuple_fraction <= 0.0)
			tuple_fraction = 1e-10;
	}
	else
	{
		/* Default assumption is we need all the tuples */
		tuple_fraction = 0.0;
	}

加载全部内容