Azure Synapse Analytics - Slowly Changing Dimension (SCD)
Azure Synapse Analytics - Slowly Changing Dimension (SCD)
SCD stands for Slowly Changing Dimension. In data warehousing, a Slowly Changing Dimension represents how you manage changes in a dimensional model over time. There are several types of Slowly Changing Dimensions (SCD), each with different strategies to handle changes in dimension attributes. Here are the most common types of SCD:
Type 0 – Retain Original
Ignores any changes and retains the original values. The dimension attribute is never updated.
Type 1 – Overwrite
Updates the dimension attribute with the new value overwriting the original value, meaning that history is not preserved.
Type 2 – Add New Row
Keeps historical data by adding a new row in the dimension table for each change, preserving the existing rows to maintain history. Typically, a start and end date, along with a current indicator flag, are used to identify the validity period of each record.
Type 3 – Add New Attribute
Preserves limited history by adding a new attribute to the dimension table. It keeps the original value in one column and the current value in another column. This approach is suitable when you need to preserve the immediate previous value.
Type 4 – Add History Table
Uses a separate table to store historical changes, known as a history or audit table, which tracks the changes over time along with corresponding timestamps.
Type 5 – Mini-Dimension with Type 1 and Type 2
Type 5 SCD, often called the "Mini-Dimension" approach, segregates frequently changing attributes into a separate mini-dimension table, utilizing a Type 1 strategy (overwriting old data). The main dimension typically uses a Type 2 strategy, adding new rows to preserve historical changes. This method optimizes performance and simplifies queries by isolating rapidly changing data.
Type 6 – Hybrid
It is a combination of Types 1, 2, and 3, incorporating aspects of each to maintain history and allow for attribute overwriting. Typically, it might have current values overwritten like in Type 1, maintain a full history of rows like in Type 2, and have additional columns to store previous values like in Type 3.
Conversion to SCD type2
When designing a Slowly Changing Dimension (SCD) to keep a record of changes, typically a Type 2 SCD is used as it allows preserving the history of attribute changes. For Azure Synapse Analytics dedicated SQL pool, you would generally do this by creating a new row for each change, keeping the previous rows, and updating the effective dates to reflect the time range for which each row is valid.
Here’s a simplified example of how you might manage the SCD:
1. Initial State:
When you first insert a row, you might set the Effective Start Date to the current date, and the Effective End Date to a far future date (or null) to indicate that this is the current version of the row.
2. Upon Modification:
When the row data is modified:
You would end-date the current row by setting its Effective End Date to the current date.
Insert a new row with the changed data, setting its Effective Start Date to the current date and its Effective End Date to a far future date (or null).
Assign a new, unique surrogate key to the new row, ensuring that it can be distinctly identified and referenced independently of the previous version.
Considerations:
Ensure that the Surrogate Primary Key is unique for each row, possibly using an IDENTITY column or other unique value.
Regularly review old data, possibly moving very old versions of rows to a history table or other archival storage if they are no longer needed for current analysis, to manage table size.
Common SCD keys:
A surrogate key is an artificial, system-generated unique identifier for a record in a database table. It is used to uniquely identify each record in that table. Surrogate keys have no business meaning and are used primarily to serve as the primary key for the table.
Here are the main characteristics and features of surrogate keys:
System-Generated: Surrogate keys are typically auto-generated by the system. Commonly, databases might use auto-incrementing integer fields or globally unique identifiers (GUIDs) as surrogate keys.
Uniqueness: Each surrogate key value is unique, ensuring each record in the table can be distinctly identified.
No Business Meaning: A surrogate key doesn't convey any business semantics. It is simply a unique identifier.
Immutable: Once assigned, the value of a surrogate key for a specific record should not change over time.
Sequential (in many cases): Often, surrogate keys (especially when implemented as auto-incrementing integers) are sequential, but this is not a strict requirement.
Simplifies Relationship Management: In relational databases, surrogate keys are often used in foreign key relationships, as they provide a simple and consistent mechanism to link records across tables without relying on business-specific fields.
A business key, also known as a natural key, is a type of unique identifier that is used in business processes or systems to identify specific business objects or records. Unlike surrogate keys, which are often system-generated numbers without inherent business meaning, business keys have a recognizable meaning or value in the real world and are often used in daily business operations.
For instance, in a table storing product information, the ProductCode or SKU (Stock Keeping Unit) could be a business key because it's an identifier that might be used by the business to track, order, or reference a product. This identifier is meaningful in the context of the business.
Characteristics of business keys:
They have a real-world meaning or significance.
They are often used in business processes or systems to identify specific records.
Their values are typically not generated by the system but are determined by the nature or attributes of the data they represent.
Business keys can change, though ideally not frequently, due to business requirements or other reasons.
An audit key, in the context of databases and data warehousing, typically refers to a unique identifier associated with data audit records. These records track changes, actions, or events that occur within a system. An audit key allows organizations to maintain a trail of who did what and when, facilitating accountability, security, and compliance.
Features and characteristics of an audit key include:
Uniqueness: The audit key is unique for every individual audit record.
System-Generated: Similar to a surrogate key, an audit key is usually system-generated to ensure its uniqueness.
No Business Meaning: An audit key doesn't have inherent business meaning. It's used for tracking and reference within the auditing system.
Associated Metadata: An audit record with an audit key often contains metadata like the timestamp of the action, user ID of the person initiating the action, the type of action (insert, update, delete), and potentially old and new values for updated fields.
Immutable: Once generated, an audit key does not change. This immutability ensures a reliable audit trail.