First of all, let us perform a quick analysis of file size. Although file size may not seem an important parameter on modern hard disks, it is a good indicator of how much data Excel needs to scan to produce the results of a PivotTable. The difference in file size between a standard PivotTable and PowerPivot matters when the computation is carried out by Excel itself. When you use SSAS PivotTables, the computation is carried out by the server, so you have no information about the OLAP cube size or its memory management. Moreover, with SSAS PivotTables the Excel file contains only the results of the query, not the original data.
To perform a comparison, we made a very simple test with a worksheet with 121,318 rows and
19 columns (see CH08-01-FileSizeClassical.xlsx on the companion DVD). We saved this data in a
standard Excel workbook, and then we loaded it into a PowerPivot table, saving the workbook
with a different name (see CH08-02-FileSizePowerPivot.xlsx).
The original Excel workbook occupies 17,170 kilobytes (KB), whereas the PowerPivot workbook occupies only 1,811 KB: a space saving of roughly 89 percent. So whenever you need to share a workbook with somebody else, through Microsoft SharePoint or a simple e-mail message, reducing its size in this way is very worthwhile.
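As a quick sanity check, the space saving can be computed directly from the two file sizes reported above:

```python
# File sizes from the test above, in KB (as printed in the chapter).
original_kb = 17_170     # standard Excel workbook
powerpivot_kb = 1_811    # same data loaded into a PowerPivot table

# Fraction of space saved by the PowerPivot version.
saving = 1 - powerpivot_kb / original_kb
print(f"Space saving: {saving:.1%}")
```

The result is just under 90 percent, confirming the order of magnitude of the saving.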
About Columnar Data Storage
The reason for the very compact PowerPivot file is that PowerPivot is a columnar data store with automatic deduplication: if a value appears more than once in a column, it is stored only once, and every other occurrence is replaced by a reference to that single copy. So if the product color Black occurs 1,000 times, it does not occupy 1,000 times the size of the string Black, but only one copy of the string plus 1,000 small references.
The rationale behind this deduplication algorithm is that databases analyzed through a PivotTable normally contain a lot of duplication. If a column contains many distinct values, it is not very useful in a PivotTable, because we cannot produce interesting aggregates over it. Apart from the key columns, all other attributes in a table should have a low number of distinct values to make them useful aggregators. Because PowerPivot performs data deduplication, it does not matter how many rows are present in the source table: if the number of distinct values in a column is low, the file size is small.
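The deduplication described above is essentially dictionary encoding. The following Python sketch illustrates the idea; it is a simplified illustration, not PowerPivot's actual storage format:

```python
def dictionary_encode(column):
    """Encode a column as (dictionary, references): each distinct value
    is stored once, and the column keeps only small integer references."""
    dictionary = []   # each distinct value, stored exactly once
    index = {}        # value -> its position in the dictionary
    references = []
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        references.append(index[value])
    return dictionary, references

# A product-color column with heavy duplication, as in the Black example.
colors = ["Black"] * 1000 + ["Red"] * 500 + ["Silver"] * 250
dictionary, references = dictionary_encode(colors)

print(len(colors))      # 1750 rows in the source column
print(len(dictionary))  # only 3 distinct values actually stored
```

Note that the dictionary size depends only on the number of distinct values, not on the number of rows, which is exactly why a low-cardinality column compresses so well no matter how large the table is.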
We are not particularly concerned with computation speed because it is hard to measure reliably. Both the Excel and PowerPivot engines run at blazing speed and, from the user's point of view, the difference between 0.5 seconds and 0.05 seconds is not noticeable.