After covering basics of partitioning, let’s move on to some advanced partitioning concepts.
If you haven’t read the first part, then you may want to read it first Partitioning with PostgreSQL v11
In this article, I’ll cover the following topics:
- Sub partitioning (Partition of Partitions)
- Attaching/Detaching Partition
I’ll use process_partition table from previous article, with an added attribute:created_year.
Above structure shows, process_partition is the master table and it has 3 partitions, namely, process_partition_open, process_partition_in_progress and process_partition_done, which are partitioned based on the values of status attribute which can be OPEN, IN_PROGRESS and DONE.
As we start using the system, which uses the process_partition table, then over a span of time, the partition sizes would gradually increase. As DONE is the final state of a process, so eventually all processes will be dumped in process_partition_done partition.
This will lead to skewed distribution of data among the partitions as process_partition_done will have only incoming data. On the other hand, process_partition_open and process_partition_in_progress will not grow that drastically like process_partition_done because both of these partitions have moving data.
process_partition_done becomes the bottle-neck and it’s quite possible that whenever we query on rows with DONE status it will give bad performance.
As process_partition_done is growing very fast, at the same time we notice that, there exists a greater portion of data which is not queried and is just kept in the table for archival purpose. Now the problem is, the old data can’t be deleted but it’s not queried also, hence resulting in poor performance.
Solution: Partition the partitions
Wait, what !! Yes you heard it right, we’ll partition the partition, rather we can simply call it as sub-partition.
We’ll sub-partition the process_partition_done table into process_partition_done_2018, process_partition_done_2019 and process_partition_done_2020which are partitioned based on the values of created_year column which can be 2018, 2019 and 2020. This is how our structure will now look like :
By the above partitioning structure, I want to establish 2 things :
- You can sub-partition, i.e., you can partition a table which is already a partition of some other master table.
- And, flexibility that postgres offers, as you are not forced to sub-partition each and every partition of the master if you wish to sub-partition only one of the them.
Let’s start with creating our master table process_partition, as our structure is now ready.
CREATE TABLE process_partition (
id bigserial ,
name character varying(255),
status character varying(255) NOT NULL,
CONSTRAINT process_partition_pk_id PRIMARY KEY (id, status, created_year)
)PARTITION BY LIST(status);--Partitions based on status for values OPEN and IN_PROGRESSCREATE TABLE process_partition_in_progress PARTITION OF process_partition
FOR VALUES IN ('IN_PROGRESS');
CREATE TABLE process_partition_open PARTITION OF process_partition
FOR VALUES IN ('OPEN');
Note : created_year is also included in PRIMARY_KEY because it’ll be used to create sub-partition. Partitioning key needs to be present in all constraints like PRIMARY_KEY, UNIQUE etc.
We have created 2 partitions based for status OPEN and DONE. We’ll create the third partition for DONE status which will again have partition based on created_year .
CREATE TABLE process_partition_done (
id bigserial ,
name character varying(255),
status character varying(255) NOT NULL,
CONSTRAINT process_partition_done_pk_id PRIMARY KEY (id, status, created_year)
)PARTITION BY LIST(created_year);
--Partitions based on created_year for values 2018, 2019, 2020
CREATE TABLE process_partition_done_2018 PARTITION OF process_partition_done
FOR VALUES IN (2018);
CREATE TABLE process_partition_done_2019 PARTITION OF process_partition_done
FOR VALUES IN (2019);
CREATE TABLE process_partition_done_2020 PARTITION OF process_partition_done
FOR VALUES IN (2020);
In practical scenarios, instead of created_year , date attribute is present in tables. Here, we can use RANGE partition, i.e., the range of dates belonging to same year can be part of same partition.
process_partition_done table is created which will act as our third partition for the master table process_partition , but before it can start accepting data with DONE status, we need to ATTACH this table with master table.
ALTER TABLE process_partition ATTACH PARTITION process_partition_done FOR VALUES IN ('DONE');
Now, process_partition_done is attached master table and it’ll accept the rows with DONE status.
ALTER TABLE <table_name> ATTACH_PARTITION <partition_table_name> give us the capability to add partitions dynamically. This partition table can be with/without data.
Let’s insert some data into master table process_partition , such that all partitions and sub-partitions have data.
INSERT INTO process_partition (name, status, created_year) VALUES ('OPEN process', 'OPEN', 2018),
('OPEN process', 'OPEN', 2019),
('OPEN process', 'OPEN', 2020),
('IN_PROGRESS process', 'IN_PROGRESS', 2018),
('IN_PROGRESS process', 'IN_PROGRESS', 2019),
('IN_PROGRESS process', 'IN_PROGRESS', 2020),
('DONE process', 'DONE', 2018),
('DONE process', 'DONE', 2019),
('DONE process', 'DONE', 2020);
Above data will be split into OPEN and IN_PROGRESS partitions and in sub-partitions based on created_year for status DONE.
When we issue,EXPLAIN SELECT * FROM process_partition, we find that there’s no mention of process_partition_done, instead all sub-partitions of this partition are mentioned.
This is because when we create partition of any table then data is only held by the partition tables and not the master table. We can confirm this by checking the result of the following query: select * from only process_partition_done.
Also, we can reference each sub- partition of process_partition_done directly because PRIMARY_KEY of process_partition has (id, status, created_year). If we only give status='DONE' then we’ll have to scan all the 3 sub partitions but as we add a filter on created_year then it’ll directly point to only one sub- partition.
We observe that, when we use both status and created_year to filter the results then directly one of the sub-partitions can be referenced to fetch the results.
Let’s come back to our original question, why process_partition_done does not appear in query plan of query SELECT * FROM process_partition . As data of partition process_partition_done is contained in the sub-partitions and also PRIMARY_KEY of master table already includes all the columns to directly access sub-partition’s data, that’s why it looks like a flat hierarchy instead of 2 level hierarchy.
We have done sub-partitioning successfully.
Sometimes you will come across a scenario in which you want to query without partitioning key, that means you need to scan whole table instead you might want to ignore some obsolete data.
Seems like Postgres has a solution for it!
Now, let’s remove sub-partition which contains data for 2018. How does it help?
By doing so we can detach the unused old data which will increase our query performance as it’ll have to scan less data.
Let’s see how we can detach partition:
ALTER TABLE process_partition_done DETACH PARTITION process_partition_done_2018;
After detaching the partition if we explain and also execute query SELECT * FROM process_partition , we’ll notice that process_partition_done_2018 doesn’t show up in the query plan and also its data is not available (id = 7 doesn’t show up in the results).
If we still wish to query the detached partition table we can still directly access the table SELECT * FROM process_partition_done_2018 . However, this data won’t be available under process_partition master table. Later, if we again want to do some operations on the old data, then we can directly do it on process_partition_done_2018 table or again re-attach it any time as a partition to master table and use it.
ALTER TABLE process_partition_done ATTACH PARTITION process_partition_done_2018 FOR VALUES IN (2018);
After re-attaching the partition, things will start working as before. Just to verify it, let’s try inserting data in this partition.
INSERT INTO process_partition(name, status, created_year) VALUES('DONE process after detach', 'DONE', 2018);
I think flexibility that Postgres provides can help us tackle different scenarios. We can segregate data based on partition and also attach/detach partitions dynamically which can help us to optimise the query performance.
That’s it !!