Window Aggregates over Data Streams

Window Aggregate Queries

 

Many applications need to process streams, for example, financial data analysis, network-traffic monitoring, telecommunication monitoring, and transportation-traffic data. Database researchers are building Data Stream Management systems (DSMS) so that applications can issue queries to get timely information from streams. Managing and processing streams gives rise to challenges that traditional database systems do not have.

 

An important class of queries over data streams is window aggregate queries. Consider an online auction system in which bids on auction items are streamed into a central auction processing system. The schema of each bid is: <auction-site, item-id, bid-price, timestamp>. Query 1 shows an example of a window aggregate query:

 

Query 1: “Count the number of bids for each auction site in the past 4 minutes and update that result every 1 minute.”

 

SELECT auction-site, count (*)

FROM bids [WATTR timestamp

                    RANGE 4 minutes

                    SLIDE 1 minute]

GROUP-BY auction-site

 

Query 1 is called a time-based sliding-window aggregate query, and it calculates the count of bids from each auction site for each 4-minute sub-stream that starts every 1 minute, with respect to the timestamp attribute of the data. In Query 1 shown above, we introduce a window specification with three parameters: RANGE specifies the window size, SLIDE specifies how the window moves, and WATTR specifies the windowing attribute on which that the RANGE and SLIDE parameters are defined. The window specification of Query 1 breaks the bid stream into window extents (4-minute overlapping sub-streams). There are other flavors of windows, such as tuple-based (or, count-based) window.  

Query Evaluation – WID

 

The intuition of our window query evaluation is to transform a window query into a GROUP-BY query, grouping by each window extent. Therefore, we introduce Window-ID concept into both window semantics and window query evaluation. Briefly, given a window aggregate query, each window extent is assigned an ID; and as each input tuple arrives, the WIDs of the window extent to which the tuple belongs are calculated and tagged to it as WID data attribute. Then, using WID attribute as an additional grouping attribute, a window aggregate query can be transformed into a GROUP-BY query.

 

For example, assume Query 1 starts at 12:00 PM, and use non-negative integers as WIDs, window 6–10 for Query 1 and their associated WIDs are as follows.

 

                                                                                                    window extent                    WID

                                                                        12:03 PM – 12:07 PM                6

                                                                        12:04 PM – 12:08 PM                7

                                                                        12:05 PM – 12:09 PM                8

                                                                        12:06 PM – 12:10 PM                9

                                                                        12:07 PM – 12:11 PM                10

 

Then, given an input tuple with timestamp 12:06:04 PM, it belongs to window 6 through 9.

 

A query plan using WID to evaluate Query 1 is shown below in Figure 1, in which the bucket operator tags WIDs to each input tuple. In addition, assuming that input tuples arrive in timestamp order, the bucket operator can generate punctuations to indicate end of window extents. For example, p1 is a punctuation indicating the end of window 6 for auction site 3. In our window query evaluation, punctuation is generally used as a mechanism to handle disorder in the input data.

                                

 

Optimization – Panes

 

Using panes is an optimization approach for evaluating sliding-window aggregate queries that reduces the required buffer size by sub-aggregating the input stream and reduces the computation cost by sharing sub-aggregates when computing window aggregates. This approach sub-aggregates the stream over non-overlapping sub-sub-streams, which we call panes; then it aggregates over the pane-aggregates to get window-aggregates. Figure 2 illustrates how panes are used to evaluate Query 1. The stream is separated into 1-minute panes; each 4-minute window is composed of four consecutive panes. In Figure 2, w1 – w5 are windows and w3 is composed of panes p3 – p6. Each pane contributes to four windows; for example, p5 contributes to w2 through w5. To evaluate Query 1, we calculate the count for each pane; the count for each window is computed by summing the count of the counts of the four panes that contribute to that window. Intuitively, panes benefit the evaluation of sliding-window aggregates as long as there are “enough” tuples per pane.

Papers

 

Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, and Peter A. Tucker.  Semantics and Evaluation Techniques for Window Aggregates in Data Streams.  To appear in SIGMOD 2005.

 

Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, and Peter A. Tucker.  No Pane, No Gain: Efficient Evaluation of Sliding-Window Aggregates over Data Streams. To appear in SIGMOD Record, March, 2005.

Jin Li, David Maier, Vassilis Papadimos, Peter Tucker and Kristin Tufte.  Evaluating window aggregate queries over streams. Technical Report, May 2004.