Tuesday, February 14, 2012

Summary Tables with MySQL

I was recently talking with a few people and the concept of summary tables came up as solutions for them. The issue is, they never thought of it. Hopefully this post will help them as well as others.

Summary Tables are not a new concept. They are exactly what they sound like, basically summary of existing data. Aggregated tables, materialized views and summary tables are very dominate in data warehouses. You can also take advantage of these concepts as part of your reporting systems as well.

So summary tables are real tables. So you can take advantage of indexes and etc with them. In the examples I am using, I consider them more of a summary table than aggregated tables . Depending on your application and needs it could grow into more of an aggregated tables and or materialized views situation.

How you separate your data and tables is dependent on your reporting and application needs.

The following is a high level example of how you can use summary tables.



Let us assume that you have a report that is populated with different website traffic aspects.
This could be a report that includes impressions, hits and leads gathered via your site from some other sources. Other in the company, of course, want to spend money where it best makes money. So they need updated information reliably and do not want to wait for it.

Currently you have table for your reporting tools and you calculate ROI per report request via SQL.

For example:
CREATE TABLE `some_report` (
`traffic_source_id` int(5) unsigned NOT NULL DEFAULT '0',
`ad_id` int(10) unsigned NOT NULL DEFAULT '0',
`affiliate_id` int(10) unsigned NOT NULL DEFAULT '0',
`affiliates_key` varchar(25) DEFAULT '',
`date_time` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`impressions` int(5) unsigned DEFAULT '0',
`hits` int(5) unsigned DEFAULT '0',
`leads` int(5) unsigned DEFAULT '0',
UNIQUE KEY `traffic` (`traffic_source_id`,`affiliate_id`,`ad_id`,`date_time`,`affiliates_key`),
KEY `date_time` (`date_time`)
) ENGINE=InnoDB ;

This table is populated hourly, so an insert could be similar to the query below. Your system executes queries like this thousands+ of time per day because it is based on different traffic sources as well as affiliate and their affiliate keys. So this table is going to grow dynamically, hours per day * traffic_source_id * affiliate_id * ad_id * affiliates_key.

For example:
INSERT INTO some_report VALUES (100,4343,9839,'SomeID',DATE_FORMAT(NOW() ,'%Y-%m-%d %H:00:00'),89832,44916,22458);
INSERT INTO some_report VALUES (100,4343,9839,'SomeID2',DATE_FORMAT(NOW() ,'%Y-%m-%d %H:00:00'),93332,34716,23438);
INSERT INTO some_report VALUES (100,2343,9839,'SomeID',DATE_FORMAT(NOW() ,'%Y-%m-%d %H:00:00'),543232,44316,458);
INSERT INTO some_report VALUES (100,2343,9839,'SomeID2',DATE_FORMAT(NOW() ,'%Y-%m-%d %H:00:00'),89832,24916,2458);
INSERT INTO some_report VALUES (100,5343,9839,'SomeID',DATE_FORMAT(NOW() ,'%Y-%m-%d %H:00:00'),1239832,2344916,2538);

Just for reference lets assume :
100 = Google.com
4343 = An ad words ad for MySQL
9839 = Some Marketing Company
SomeID = affiliates_key of some marketing company

You have reviewed your explains per query and your SQL query is the fastest you can get it. It just takes time to calculate all the ROIs when reports are spanning a lot of different entities. The web pages are taking to long and people are getting upset.

So you decide to change your tables and create a process to roll up your data across different summary tables using an ETL or Events or cron job based scripts. So your system populates the RETURN_VALUED per ad_id & affiliate_id. You do not know what the affiliate is paying out to other affiliates_keys but you are aware of what you spent and what the lead returns to you.

First start with our lowest common denominator. Here you have data per hour.

CREATE TABLE `some_report_hour` (
`traffic_source_id` int(5) unsigned NOT NULL DEFAULT '0',
`ad_id` int(10) unsigned NOT NULL DEFAULT '0',
`affiliate_id` int(10) unsigned NOT NULL DEFAULT '0',
`affiliates_key` varchar(25) DEFAULT '',
`date_time` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`impressions` int(5) unsigned DEFAULT '0',
`hits` int(5) unsigned DEFAULT '0',
`leads` int(5) unsigned DEFAULT '0',
`cost` decimal(9,2) DEFAULT '00.00',
`roi` decimal(7,2) DEFAULT '00.00',
UNIQUE KEY `traffic` (`traffic_source_id`,`affiliate_id`,`ad_id`,`date_time`,`affiliates_key`),
KEY `date_time` (`date_time`)
) ENGINE=InnoDB ;


Then every hour it populates more data into the new table via your ETL / Event / Stored Procedure / Cron job Script. It depends on how you want to handle and decide your values for

For example:
DELIMITER //
CREATE PROCEDURE some_report_hour_procedure(IN affiliate_id int(6) , RETURN_VALUED int(6) , COST decimal(5,3) )
BEGIN
INSERT INTO some_report_hour
SELECT traffic_source_id, ad_id, affiliate_id, affiliates_key, date_time, impressions, hits, leads, COST, ROUND((SUM(leads * RETURN_VALUED ) / COST),2)
FROM some_report r
WHERE r.date_time = DATE_FORMAT(NOW() ,'%Y-%m-%d %H:00:00')
GROUP BY traffic_source_id, ad_id, affiliate_id, affiliates_key;
END //
DELIMITER ;

CREATE EVENT some_report_hour_event
ON SCHEDULE
EVERY 1 HOUR
COMMENT ' just an example. Depends on you how you want to populate these values. This is for just 1 affiliate_id '
DO
CALL some_report_hour_procedure(9839,5,100000); # (affiliate_id , RETURN_VALUED , COST)

So the data is much like the following:

+-------------------+-------+--------------+----------------+---------------------+-------------+---------+-------+-----------+---------+
| traffic_source_id | ad_id | affiliate_id | affiliates_key | date_time | impressions | hits | leads | cost | roi |
+-------------------+-------+--------------+----------------+---------------------+-------------+---------+-------+-----------+---------+
| 100 | 2343 | 9839 | SomeID | 2012-02-14 16:00:00 | 543232 | 44316 | 458 | 100000.00 | 22.90 |
| 100 | 2343 | 9839 | SomeID2 | 2012-02-14 16:00:00 | 89832 | 24916 | 2458 | 100000.00 | 122.90 |
| 100 | 4343 | 9839 | SomeID | 2012-02-14 16:00:00 | 89832 | 44916 | 22458 | 100000.00 | 1122.91 |
| 100 | 4343 | 9839 | SomeID2 | 2012-02-14 16:00:00 | 93332 | 34716 | 23438 | 100000.00 | 1171.91 |
| 100 | 5343 | 9839 | SomeID | 2012-02-14 16:00:00 | 1239832 | 2344916 | 2538 | 100000.00 | 126.90 |
+-------------------+-------+--------------+----------------+---------------------+-------------+---------+-------+-----------+---------+


Nothing is dynamically different but you have removed the ROI from your query. It will be a simple select now. But now you can use your ETL/Event/Cron Job Script to keep the day table updated as well.

CREATE TABLE `some_report_day` (
`traffic_source_id` int(5) unsigned NOT NULL DEFAULT '0',
`ad_id` int(10) unsigned NOT NULL DEFAULT '0',
`affiliate_id` int(10) unsigned NOT NULL DEFAULT '0',
`affiliates_key` varchar(25) DEFAULT '',
`date_time` date NOT NULL DEFAULT '0000-00-00',
`impressions` int(5) unsigned DEFAULT '0',
`hits` int(5) unsigned DEFAULT '0',
`leads` int(5) unsigned DEFAULT '0',
`cost` decimal(9,2) DEFAULT '00.00',
`roi` decimal(7,2) DEFAULT '00.00',
UNIQUE KEY `traffic` (`traffic_source_id`,`affiliate_id`,`ad_id`,`date_time`,`affiliates_key`),
KEY `date_time` (`date_time`)
) ENGINE=InnoDB ;

Notice I left the field date_time labeled as date_time. This is assuming you might have your report presented via column names and you just adjust the FROM table on selects based on what type of report needed. This is of course dependent on your application.

Notice that your inserts are going to be “group by” to match your key.
For example:

CREATE EVENT some_report_day_event
ON SCHEDULE
EVERY 1 HOUR
COMMENT 'updates some_report_day hourly'
DO
REPLACE INTO some_report_day SELECT traffic_source_id, ad_id, affiliate_id, affiliates_key , DATE_FORMAT(NOW() ,'%Y-%m-%d') as date_time, SUM(impressions), SUM(hits), SUM(leads), SUM(cost), ROUND( SUM(leads) * 5 / SUM(cost) ,2) as ROI
FROM some_report_hour r
WHERE r.date_time BETWEEN DATE_FORMAT(NOW() ,'%Y-%m-%d 00:00:00') AND DATE_FORMAT(NOW() ,'%Y-%m-%d 23:59:59')
GROUP BY traffic_source_id, ad_id, affiliate_id, affiliates_key;

Now the reports that gather data per day and not per hour are going to be faster as they are going to be simple selects as well.

So the data is much like the following:

+-------------------+-------+--------------+----------------+------------+-------------+---------+-------+-----------+------+
| traffic_source_id | ad_id | affiliate_id | affiliates_key | date_time | impressions | hits | leads | cost | roi |
+-------------------+-------+--------------+----------------+------------+-------------+---------+-------+-----------+------+
| 100 | 2343 | 9839 | SomeID | 2012-02-14 | 543232 | 44316 | 458 | 100000.00 | 0.02 |
| 100 | 2343 | 9839 | SomeID2 | 2012-02-14 | 89832 | 24916 | 2458 | 100000.00 | 0.12 |
| 100 | 4343 | 9839 | SomeID | 2012-02-14 | 89832 | 44916 | 22458 | 100000.00 | 1.12 |
| 100 | 4343 | 9839 | SomeID2 | 2012-02-14 | 93332 | 34716 | 23438 | 100000.00 | 1.17 |
| 100 | 5343 | 9839 | SomeID | 2012-02-14 | 1239832 | 2344916 | 2538 | 100000.00 | 0.13 |
+-------------------+-------+--------------+----------------+------------+-------------+---------+-------+-----------+------+

You can continue this concept for monthly and yearly data.
You could also build summary tables broken down by affiliate_id and day for example.
It really depends on what your after. Just remember data is never deleted but just copied into summary table for ease of use and speed.

This is a very simple example but hopefully it gets you thinking and started on how you can summarize your data easily.