Compression v Enumeration
… And the winner is …
Over the holiday last month, I came across a well-written article on column-oriented databases. The author, a BI consultant named Bojan Ciric, wrote an excellent summary of column-oriented databases, one you might want to read here, if you’re new to the topic.
In the article, the author cautions that column-oriented databases pay a penalty for decompression each time data is read. The presumption seems to be that if data is compressed in size, there must concomitantly be a de-compression step.
This is not exactly the case with Sybase IQ. More importantly, this discussion provides an opportunity to delve a bit into the “how” of sophisticated column-oriented databases, using as the example Sybase IQ and how it achieves both storage space reduction and query performance improvement without incurring a noticeable penalty to decompress data.
Sybase IQ is the industry’s leading column-oriented database, and, as Mr. Ciric describes, stores and retrieves by column, permitting very selective retrieval in response to traditional SQL. But beneath the covers, there’s more to it. Sybase IQ enumerates column data into index structures prior to storage. This is different from compression, and more effective.
The resulting indices, when combined with a smart, index-aware SQL query processor, together yield dual benefits – a large reduction in the database storage footprint, and a net reduction in analytic and reporting query times, as compared to traditional row-based databases.
How’s this done?
Internal to Sybase IQ, and invisible to the query, most data types are enumerated, during load, insert or update, into one or more of a rich family of index structures. Sybase IQ provides indices optimized for low- and high-cardinality data, for use in aggregations, for use in “like” operations, and for specific data types such as date and time.
Invisibly, at least from the query’s perspective, enumeration reduces the size of the required structure quite dramatically – ranging from 10% to 90%, depending on data type, cardinality and width. This directly shrinks the storage footprint, hence the confusion about whether the result is being achieved by compression or, more accurately, by enumeration.
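Sybase IQ’s internal index formats are proprietary, but the core idea of enumeration can be sketched in a few lines of Python. Everything below – the column, the values, the names – is hypothetical and for illustration only: each distinct value in a column is assigned a small integer code once, and the column is then stored as those codes.

```python
# Hypothetical low-cardinality column: one status string per row.
column = ["approved", "pending", "approved", "rejected"] * 250_000

# Enumerate: assign each distinct value a small integer code, once.
codes = {v: i for i, v in enumerate(dict.fromkeys(column))}
lookup = list(codes)                        # code -> value

# Store the column as one byte per cell instead of a string per cell.
encoded = bytes(codes[v] for v in column)

# Reading a cell back is a direct array lookup, not a decompression
# pass over a compressed block.
assert lookup[encoded[2]] == "approved"
print(len(column), "cells stored in", len(encoded), "bytes")
```

Because each cell decodes independently with a single lookup, a query can read any slice of the enumerated column without first inflating the whole thing – which is the sense in which enumeration avoids a de-compression step.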
The benefits of enumeration go beyond reduced reliance on de-compression. Most importantly, the resulting data structures can be selectively retrieved – if a query predicate, whether explicit or deduced by the query engine, affects only a fraction of a column, then only a fraction of that column’s index need be retrieved.
For low-cardinality data, the most commonly used index structures consist of: a) a list of the distinct values occurring within the column, which points to b) a series of data structures identifying the occurrences of those values within the column. These create far smaller structures than the source data, using only enumeration.
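As a rough sketch (again, the actual IQ structures are proprietary), the two-part structure just described resembles a classic bitmap index: a list of distinct values, each pointing to a bitmap of the rows where that value occurs. The column and names here are invented for illustration.

```python
# Hypothetical low-cardinality column: a region code per row.
column = ["US", "EU", "EU", "APAC", "US", "EU", "US", "APAC"]

# a) the list of distinct values occurring in the column ...
values = sorted(set(column))

# b) ... each pointing to a bitmap marking the rows where it occurs.
bitmaps = {v: [1 if cell == v else 0 for cell in column] for v in values}

# A predicate such as WHERE region = 'EU' touches only one bitmap;
# the structures for the other values need not be read at all.
eu_rows = [i for i, bit in enumerate(bitmaps["EU"]) if bit]
print(eu_rows)  # -> [1, 2, 5]
```

This is also why selective retrieval falls out naturally: each value’s bitmap lives in its own pages, so a single-value predicate fetches only a fraction of the index.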
But what about high cardinality data, you ask?
For high cardinality data, whether textual or binary, Sybase IQ also enumerates these columns into index types that store these data efficiently, using compression to reduce storage size paired with efficient de-compression. [If you find yourself asking whether “efficient de-compression” is an oxymoron, rest assured it’s not.]
Did I say compression? Yes I did. Sybase IQ does use low-cost LZW compression for all pages written to storage, achieving an additional benefit. When compared to storage latency, LZW reduces the number of pages sufficiently that, in most cases, the cost of LZW de-compression is made up for by improvements in I/O performance.
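LZW itself is simple enough to sketch. The toy implementation below (byte-oriented, unbounded dictionary – not how IQ actually packages it) shows why its decompression is cheap: decoding is a single pass of table lookups and appends, with no searching.

```python
def lzw_compress(data: bytes) -> list[int]:
    # Dictionary starts with every single-byte sequence.
    table = {bytes([i]): i for i in range(256)}
    w, out = b"", []
    for b in data:
        wc = w + bytes([b])
        if wc in table:
            w = wc                      # extend the current match
        else:
            out.append(table[w])        # emit code for longest match
            table[wc] = len(table)      # learn the new sequence
            w = bytes([b])
    if w:
        out.append(table[w])
    return out

def lzw_decompress(codes: list[int]) -> bytes:
    table = {i: bytes([i]) for i in range(256)}
    w = table[codes[0]]
    out = [w]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):        # the classic cScSc edge case
            entry = w + w[:1]
        else:
            raise ValueError("bad LZW code")
        out.append(entry)
        table[len(table)] = w + entry[:1]
        w = entry
    return b"".join(out)

data = b"TOBEORNOTTOBEORTOBEORNOT"
packed = lzw_compress(data)
assert lzw_decompress(packed) == data
print(len(data), "bytes ->", len(packed), "codes")
```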
To be specific, we measure only about 4 milliseconds per 128 KB page to decompress on a typical modern CPU core using LZW. Yet, by doing so, we reduce the number of pages stored and retrieved such that the storage latency avoided, by shrinking the data volume stored, more than makes up for the cost of the de-compression step.
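The trade-off is easy to check with back-of-the-envelope arithmetic. The 4 ms decompression figure is the measurement above; the per-page I/O cost and the compression ratio are illustrative assumptions, not measured values.

```python
decompress_ms = 4.0   # measured: CPU cost to decompress one 128 KB page
io_ms = 10.0          # assumed: cost to fetch one page from storage
ratio = 0.5           # assumed: LZW halves the number of pages

pages = 1000
cost_uncompressed = pages * io_ms                        # fetch only
cost_compressed = int(pages * ratio) * (io_ms + decompress_ms)

print(cost_uncompressed, cost_compressed)  # -> 10000.0 7000.0
```

Under these assumptions, reading the compressed pages wins despite the CPU spent decompressing – and the margin grows as storage gets slower relative to the CPU.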
The takeaway points:
Sybase IQ combines enumeration and an intelligent query processor with modest compression to achieve multiple benefits, without a “visible” de-compression cost:
- Enumeration, like compression, can greatly reduce the size of stored data, but without concomitant de-compression costs.
- Through enumeration, index structures are created that are directly accessible by the query processor and optimized to provide it with the fastest-possible query plans.
- The indexes into which column cells are enumerated can, by their design, permit selective retrieval, fetching only the pages and parts of the index that are relevant to the particular query.
- Compression, while used, employs only a low-cost LZW algorithm, applied to all pages, at a cost in CPU time that is more than made up for by the reduction in overall I/O operations.
So, in reality, Mr. Ciric is right: there is a cost due to de-compression in Sybase IQ and other column databases, but that cost is typically so small as to be washed out by the I/O latency improvements it makes possible. It is enumeration, not compression, that dramatically reduces overall I/O, makes selective retrieval possible, speeds query processing and slashes the storage footprint.
Until next time,