Overview of Query Retargeting
The use of pre-computed aggregate data is a technique to address performance challenges in Data Warehouse systems (DW). An aggregate retargeting mechanism redirects a query to an available aggregate table(s) when possible.
Performance is a critical issue in DW systems. Pre-computed aggregation is the most effective strategy to reduce the time needed to access large numbers of rows.
Aggregates usage has to be transparent and non-intrusive. The following transparency aspects are addressed by Lyftron:
- aggregate unawareness - data consumer is not aware that preaggregate was used during query execution
- execution engine unawareness - data consumer is not aware what execution engine hosts preaggregate used
- complexity unawareness - data consumer is not aware of preaggregate structure and how many preaggregates were used, if any
The adoption of aggregates should be completely transparent to DW users for two main reasons:
- users typically do not have the necessary technical skills to make the appropriate choices on how to conveniently use aggregates to improve performance,
- as decision requirements change constantly, aggregates may have to change accordingly (e.g. removal of obsolete aggregates).
An aggregate retargeting mechanism provides this transparency, by redirecting a query, if possible, to the available aggregate table(s) that can answer the query.
Objectives of Query Retargeting
Objectives are as follows:
- improve the performance of Data Warehouse and Data Marts
- improve architecture flexibility
- allow for a delay of optimization decisions until the performance requirements are fully known
- allow for ad-hoc optimizations for quick reaction to performance issues in production
How Query Retargeting works
An aggregate, also known as a summary table, is a materialized view that stores redundant, consolidated data. In Lyftron we prefer to use the term "target", because a materialized view does not have to be an aggregate. In fact, any materialized view can be used as target: for example, a wide table can be reduced in a view (using projection) to a few columns that are needed most. Such target can then be used to substitute a wide table in queries issued by users.
The adoption of aggregates requires the resolution of the following:
- finding and defining the best set of aggregates
- efficiently maintaining the aggregates when new data is shipped to the data warehouse
- efficiently computing these aggregate views from multiple sources
- exploring the aggregates to answer queries.
The current version of Lyftron leaves the finding, defining and maintaining aggregates to the user, usually a skilled data warehousing specialist. In essence, it is a user responsibility to define views that will be considered by the query rewrite engine as targets. For now, a semi-automated or fully-automated preaggregate candidate proposition is not supported.
Query retargeting in action
To illustrate the problem at hand, let us consider the following example. Consider tables below:
Suppose that a critical query is to report monthly sales by product:
The query can be formulated like this:
SELECT Month , ProdName , Sum(AmountSold) as MonthlySales FROM FactSales JOIN DimProducts ON FactSales.Prod_id = DimProducts.Prod_id GROUP BY Month, ProdName
Assume that the FactSales table is very large and expected query response time is below one second.
To use query retargeting we have to:
- define a view - target view name or virtual database it belongs to do not matter
- define caching of a view - either caching mode will do
- mark view as 'Query rewrite target'
- materialize the view by rebuilding the cache
For example, we have to define the following view:
CREATE VIEW v_summary_by_month AS SELECT Month , ProdName , Sum(AmountSold) as MonthlySales FROM FactSales JOIN DimProducts ON FactSales.Prod_id = DimProducts.Prod_id GROUP BY Month, ProdName
To make it work we also need to enable the caching of the view
v_summary_by_month and rebuild the cache (materialize the view).
When a user executes the query:
SELECT Month, ProdName, Sum(AmountSold) as Sales FROM FactSales JOIN DimProducts ON FactSales.Prod_id = DimProducts.Prod_id GROUP BY Month, ProdName
Lyftron will rewrite the query and retarget it to a materialized view with ready-made preaggregate. The resulting query would be similar to:
SELECT Month, ProdName, MonthlySales as Sales FROM v_summary_by_month
Response time will be much better thanks to utilization of the already prepared results.
More examples of supported variants of the query:
SELECT Month, ProdName, Sum(AmountSold) as Sales FROM FactSales JOIN DimProducts ON FactSales.Prod_id = DimProducts.Prod_id WHERE Month BETWEEN (200101 and 200102) GROUP BY Month, ProdName
SELECT Sum(AmountSold) FROM FactSales JOIN DimProducts ON FactSales.Prod_id = DimProducts.Prod_id WHERE Month >= 200101