Azure Stream Analytics - Analytics Functions
Azure Stream Analytics - Analytics Functions
SQL OVER
The OVER() clause in SQL, especially in the context of Azure Stream Analytics, is very versatile and powerful. It defines the window or range of data over which the function is computed. Within the OVER() clause, you can specify several elements that determine how the function should be applied to the set of rows. Here's a breakdown of the elements you can specify inside the OVER() clause:
1. PARTITION BY:
Purpose: Divides the result set into partitions to which the function is applied.
Use Case: If you have multiple users and you want to calculate a metric separately for each user.
2. ORDER BY:
Purpose: Specifies the order in which the rows are processed within each partition.
Use Case: If you want to calculate a running total, you would order by the column representing the sequence of events.
3. LIMIT DURATION:
Purpose: In Azure Stream Analytics, it specifies the duration of the window over which the function is computed.
Use Case: When you are interested in calculating a metric over a specific time period, like the last 1 hour, you would use LIMIT DURATION (hour, 1).
4. WHEN:
Purpose: Acts as a filter within the window to include only rows that meet the specified condition.
Use Case: If you are interested in rows with a specific attribute, like Event = 'start', within the window.
5. ROWS BETWEEN:
Purpose: Specifies the range of rows around the current row within the partition to be included in the window.
Use Case: When calculating moving averages or other rolling metrics.
Sample code explanation - LAG, OVER:
The code retrieves the current and previous readings of sensors within a 1-hour window, then calculates the difference (growth) between these readings for each sensor.
Sample Input Data:
Step 1: Applying LAG with 1-hour Limit on recordedTime
Step 2: Calculating Growth
The entry at 11:10 for SensorA has no previous reading within the 1-hour limit, hence the growth is NULL.
Sample code explanation - LAST, OVER:
The code calculates the time difference (in seconds) between a user's 'start' and 'end' interactions with a feature within a 1-hour window.
SQL execution can be visualized in a specific order, even if the code is written differently. It helps to understand how the clauses operate:
FROM clause determines the table or set of data you're working on.
WHERE clause filters the data.
SELECT clause operates on the remaining dataset and returns the final result.
For window functions like LAST with OVER, it is executed as part of the SELECT step, but it has access to a window of rows related to the current row, not just the current row itself. This window of rows is defined by the PARTITION BY and other windowing conditions inside the OVER clause.
Now, let's apply that understanding to the query in question:
FROM input TIMESTAMP BY Time: This is our data source. It contains both 'start' and 'end' events with their corresponding timestamps.
WHERE Event = 'end': This means we are only interested in rows where the Event column is 'end'. After this filter, only 'end' events remain in the dataset.
Inside the SELECT clause, for each 'end' event, we compute:
LAST(Time) OVER (PARTITION BY [user], feature LIMIT DURATION(hour, 1) WHEN Event = 'start'): For the current 'end' event, this goes back and finds the Time of the most recent 'start' event for the same user and feature within the past hour. If there isn't a corresponding 'start' event within the hour, it will return NULL.
DATEDIFF(second, <time of start event>, Time): Now, for each 'end' event, we have both the time of the most recent 'start' event and its own time. This computes the difference in seconds between them.
So, even though we filtered for only 'end' events using the WHERE clause, the window function inside the SELECT clause allows us to look back and find the corresponding 'start' event's time for each 'end' event.
In essence, the query is saying: "For each 'end' event, go back and find the last 'start' event for the same user and feature within the past hour, then compute the difference in seconds between them."
SQL - order of operation
In SQL, the logical order of operation (how the database processes the SQL statement) is different from the order in which you write clauses. Here's the general logical order of execution for SQL keywords:
FROM: Specifies the table or tables to retrieve data from.
JOIN: Combines rows from two or more tables based on a related column between them.
WHERE: Filters the results based on a condition.
GROUP BY: Groups rows that have the same values in specified columns.
HAVING: Filters the results after aggregation.
SELECT: Specifies which columns to display in the result.
DISTINCT: Removes duplicate rows from the result set. \\\???
ORDER BY: Sorts the result set in ascending or descending order.
LIMIT / OFFSET: Specifies the number of records to return (and where to start).
source:
https://learn.microsoft.com/en-us/stream-analytics-query/lag-azure-stream-analytics
https://learn.microsoft.com/en-us/stream-analytics-query/last-azure-stream-analytics