Overview of Query Retargeting

The use of pre-computed aggregate data is a technique to address performance challenges in Data Warehouse systems (DW). An aggregate retargeting mechanism redirects a query to an available aggregate table(s) when possible.

Performance is a critical issue in DW systems. Pre-computed aggregation is the most effective strategy to reduce the time needed to access large numbers of rows.

Aggregates usage has to be transparent and non-intrusive. The following transparency aspects are addressed by Lyftron:

  • aggregate unawareness - data consumer is not aware that preaggregate was used during query execution
  • execution engine unawareness - data consumer is not aware what execution engine hosts preaggregate used
  • complexity unawareness - data consumer is not aware of preaggregate structure and how many preaggregates were used, if any

The adoption of aggregates should be completely transparent to DW users for two main reasons:

  • users typically do not have the necessary technical skills to make the appropriate choices on how to conveniently use aggregates to improve performance,
  • as decision requirements change constantly, aggregates may have to change accordingly (e.g. removal of obsolete aggregates).

An aggregate retargeting mechanism provides this transparency, by redirecting a query, if possible, to the available aggregate table(s) that can answer the query.

Objectives of Query Retargeting

Objectives are as follows:

  • improve the performance of Data Warehouse and Data Marts
  • improve architecture flexibility
  • allow for a delay of optimization decisions until the performance requirements are fully known
  • allow for ad-hoc optimizations for quick reaction to performance issues in production

How Query Retargeting works

An aggregate, also known as a summary table, is a materialized view that stores redundant, consolidated data. In Lyftron we prefer to use the term "target", because a materialized view does not have to be an aggregate. In fact, any materialized view can be used as target: for example, a wide table can be reduced in a view (using projection) to a few columns that are needed most. Such target can then be used to substitute a wide table in queries issued by users.

The adoption of aggregates requires the resolution of the following:

  • finding and defining the best set of aggregates
  • efficiently maintaining the aggregates when new data is shipped to the data warehouse
  • efficiently computing these aggregate views from multiple sources
  • exploring the aggregates to answer queries.

The current version of Lyftron leaves the finding, defining and maintaining aggregates to the user, usually a skilled data warehousing specialist. In essence, it is a user responsibility to define views that will be considered by the query rewrite engine as targets. For now, a semi-automated or fully-automated preaggregate candidate proposition is not supported.

Query retargeting in action

To illustrate the problem at hand, let us consider the following example. Consider tables below:

FactSales

Prod_Id Month AmountSold
P1 200101 500
P2 200102 500
P1 200101 1000
P2 200102 3000

DimProducts

Prod_Id ProdName
P1 Product 1
P2 Product 2

Suppose that a critical query is to report monthly sales by product:

Month ProdName AmountSold
200101 Product 1 1500
200102 Product 2 3500

The query can be formulated like this:

SELECT Month
     , ProdName
     , Sum(AmountSold) as MonthlySales
  FROM FactSales
  JOIN DimProducts
    ON FactSales.Prod_id = DimProducts.Prod_id
 GROUP
    BY Month, ProdName

Assume that the FactSales table is very large and expected query response time is below one second.

To use query retargeting we have to:

  • define a view - target view name or virtual database it belongs to do not matter
  • define caching of a view - either caching mode will do
  • mark view as 'Query rewrite target'
  • materialize the view by rebuilding the cache

For example, we have to define the following view:

CREATE VIEW v_summary_by_month 
AS 
    SELECT Month
         , ProdName
         , Sum(AmountSold) as MonthlySales
      FROM FactSales
      JOIN DimProducts
        ON FactSales.Prod_id = DimProducts.Prod_id
     GROUP
        BY Month, ProdName

To make it work we also need to enable the caching of the view v_summary_by_month and rebuild the cache (materialize the view).

When a user executes the query:

SELECT Month, ProdName, Sum(AmountSold) as Sales
  FROM FactSales
  JOIN DimProducts
    ON FactSales.Prod_id = DimProducts.Prod_id
 GROUP 
    BY Month, ProdName

Lyftron will rewrite the query and retarget it to a materialized view with ready-made preaggregate. The resulting query would be similar to:

SELECT Month, ProdName, MonthlySales as Sales
  FROM v_summary_by_month

Response time will be much better thanks to utilization of the already prepared results.

More examples of supported variants of the query:

SELECT Month, ProdName, Sum(AmountSold) as Sales
  FROM FactSales
  JOIN DimProducts
    ON FactSales.Prod_id = DimProducts.Prod_id
 WHERE Month BETWEEN (200101 and 200102)
 GROUP
    BY Month, ProdName
SELECT Sum(AmountSold)
  FROM FactSales
  JOIN DimProducts
    ON FactSales.Prod_id = DimProducts.Prod_id
 WHERE Month >= 200101