Azure Synapse Analytics - Dedicated SQL Pool Indexing
Azure Synapse Analytics - Dedicated SQL Pool Indexing
Clustered index usually refers to clustered rowstore index.
Non-clustered index usually refers to non-clustered rowstore index.
Clustered (rowstore) Index
A clustered index sorts and physically stores the rows of data in a table based on the index's key values.
Because the data can only be sorted in one way, there can be only one clustered [rowstore | columnstore] index per table.
When a table has a clustered (rowstore) index, it is called a clustered table.
When a table has a clustered columnstore index, it is called a clustered columnstore table;
When a table has no clustered [rowstore | columnstore] index it is known as a heap.
Creating a clustered index on a table where the data is frequently accessed in a sorted manner can improve performance since the data is stored in key order.
Clustered index is effective for tables where frequent lookups are performed using a key,
ensuring that data retrieval is optimized by the physical ordering.
Non-Clustered (rowstore) Index:
A non-clustered index has a structure separate from the data rows.
It includes index key values and each key value entry has a pointer to the data row containing the key value.
This means that non-clustered indexes require additional storage because the index is stored separately from the actual data.
Non-clustered indexes are generally used for columns that are frequently accessed in queries but are not the primary key improving data retrival time without affecting the physical storage order of the underlying table.
Since they are separate from the data, a table can have multiple non-clustered indexes
Clustered Columnstore Index (CCI):
This index stores data in a columnar format, becoming the primary method for storing the entire table. It offers high levels of data compression and efficient query performance for large data sets.
Often used for data warehousing scenarios where large fact tables need to be queried for analytical reports.
Non-Clustered Columnstore Index (NCCI):
Acts as a secondary index on a rowstore table to provide columnstore capabilities. It is used alongside the primary rowstore index.
Applied in scenarios requiring real-time operational analytics, where both transactional processing and analytic queries run on the same dataset.
This type of index is particularly well-suited for scenarios where there's a need to balance the efficiency of analytical query processing (especially over several columns) with the operational needs of frequent updates.
The Non-Clustered Columnstore Index allows for fast analytical queries on large data sets while maintaining the base table in a more update-friendly rowstore format.
A Nonclustered Columnstore Index should be used when you need to improve analytical query performance on a table that primarily supports transactional operations (OLTP) with a rowstore format, enabling efficient analytics without altering the underlying table's transactional capabilities by not affecting the physical storage order of the underlying table.
Heap:
A heap is a table storage format that lacks a clustered [rowstore | columnstore] index.
In a heap, data rows are stored without a predefined order, meaning they are not sorted according to any specific criteria.
Heaps allow for the insertion of data into any available location within the table.
While heaps do not have the overhead of maintaining the sorted order of a clustered index, they can still possess non-clustered indexes.
These non-clustered indexes can improve access to data by maintaining pointers to the data rows, but they do not influence the physical storage order of the rows in the table.
Heaps are uitable for staging data or when the insertion speed is critical and the order of rows is not a concern.
For clustered columnstore tables, it's crucial to have a minimum of 1 million rows per distribution and partition to achieve optimal compression and performance.
Keep in mind that, by default, a dedicated SQL pool divides each table into 60 distributions. Therefore, when you add partitions to a table, they are in addition to the distributions.
If you have 36 monthly partitions for a sales fact table and 60 distributions, each month should ideally contain around 60 million rows (2.160 billion rows in total when all months are populated).
Fragmentation
It occurs when the logical ordering of the index data becomes different from its physical ordering, often due to frequent data modifications—such as inserts, updates, and deletes—that disperse the data across non-contiguous disk pages. This fragmentation leads to inefficient I/O during data queries, as the system may have to access data from disparate locations.
Defragmentation
It is the process of realigning the logical order of index data with the physical storage on disk, enhancing performance for operations like sequential scans.
Dealing with index fragmentation
Rowstore Indexes:
ALTER INDEX REBUILD: This command involves a complete reconstruction of the index. It is particularly useful for heavily fragmented indexes, where it reorganizes both the leaf and non-leaf levels of the index, improving the storage and retrieval efficiency of rows. This process is resource-intensive as it involves dropping the existing index and creating a new one, which can impact database performance during the operation.
ALTER INDEX REORGANIZE: This operation is more targeted and less intensive. It focuses specifically on defragmenting the index by reordering only the leaf-level pages. This reorganization helps in improving the efficiency of data access without the need for a complete rebuild. It is less resource-intensive and is always performed online, ensuring minimal impact on database availability. This makes it suitable for routine maintenance and addressing minor fragmentation issues.
Columnstore Indexes
ALTER INDEX REBUILD: Beyond defragmentation, this operation can also help to compress data more effectively in a columnstore index. It's used when there's significant fragmentation or after bulk deletions, as it can rebuild the entire index and optimize compression.
ALTER INDEX REORGANIZE: This operation specifically targets the rowgroups within the columnstore index. It merges smaller rowgroups into larger ones, which optimizes the storage and performance. It also cleans up deleted rows from rowgroups, which can further enhance efficiency. But it doesn't inherently recompress the data.
The difference in the rebuild operation reflects the underlying structure of the two types of indexes.
For rowstore, ALTER INDEX REBUILD reorganizes the rows and the B-tree* nodes, while for columnstore, it reorganizes and compresses columns, which can lead to a more profound impact on query performance for analytical workloads.
*A B-tree is a self-balancing tree data structure that maintains sorted data in a way that allows for efficient insertion, deletion, and lookup operations. Read more: https://use-the-index-luke.com/sql/anatomy/the-tree
ALTER INDEX REORGANIZE for rowstore indexes reorders leaf pages for contiguous storage, while for columnstore indexes, it merges and purges rowgroups for optimized storage and compression.
Source:
https://learn.microsoft.com/en-us/sql/relational-databases/indexes/indexes?view=sql-server-ver16
https://learn.microsoft.com/en-us/sql/relational-databases/indexes/reorganize-and-rebuild-indexes?view=sql-server-ver16