08 Aug 2024
CMU 15-445 notes: Storage Models & Compression
This is a personal note on Lecture 5 of CMU 15-445.
1. Database workloads
1.1. OLTP (Online Transaction Processing)
- Characterized by fast, repetitive, simple queries that operate on a small amount of data, e.g., a user adds an item to their Amazon cart and pays.
- Usually more writes than reads.
1.2. OLAP (Online Analytical Processing)
- Characterized by complex read queries on large data.
- E.g., compute the most popular item in a period.
1.3. HTAP (Hybrid)
- OLTP + OLAP.
2. Storage models
- Different ways to store tuples in pages.
2.1. N-ary Storage Model (NSM)
- Store all attributes for a single tuple contiguously in a single page, e.g., slotted pages.
- Pros: good for queries that need the entire tuple, e.g., OLTP.
- Cons: inefficient for scans over large amounts of data that only touch a few attributes, e.g., OLAP.
2.2. Decomposition Storage Model (DSM)
- Store each attribute for all tuples contiguously in a block of data, i.e., column store.
- Pros: save I/O; better data compression; ideal for bulk single attribute queries like OLAP.
- Cons: slow for point queries due to tuple splitting, e.g., OLTP.
- Two common ways to stitch tuples back together (see the sketch after this list):
- Most common: use fixed-length offsets, i.e., the value in a given column belongs to the same tuple as the value in any other column at the same offset.
- Less common: use embedded tuple ids, i.e., each value is stored with its tuple id, and the DBMS maintains a mapping to jump to any attribute for a given tuple id.
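A minimal Python sketch (hypothetical column data) of the fixed-length-offset approach: values at the same offset across columns form one tuple.

```python
# Toy column store: one Python list per attribute (hypothetical data).
id_col    = [101, 102, 103]
name_col  = ["ann", "bob", "cat"]
price_col = [9.50, 3.00, 7.25]

def fetch_tuple(offset):
    # Fixed-length offsets: the values at the same offset in every
    # column belong to the same tuple, so stitching is just indexing.
    return (id_col[offset], name_col[offset], price_col[offset])

assert fetch_tuple(1) == (102, "bob", 3.00)
```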
2.3. Partition Attributes Across (PAX)
- Tuples are horizontally partitioned into row groups; within each row group, attributes are stored column by column (a mini column store).
- A PAX file has a global header containing a directory with each row group’s offset; each row group maintains its own header with content metadata.
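A toy sketch of the PAX idea (illustrative only, not the actual on-disk format): partition tuples into row groups, then store each row group column by column.

```python
# Toy PAX-style layout (real PAX files also carry a global directory of
# row-group offsets and a per-group header, omitted here).
ROW_GROUP_SIZE = 2  # tuples per row group, tiny for illustration

tuples = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]

row_groups = []
for start in range(0, len(tuples), ROW_GROUP_SIZE):
    group = tuples[start:start + ROW_GROUP_SIZE]
    row_groups.append({
        "num_tuples": len(group),
        # Within a row group, values are stored per column.
        "columns": [list(col) for col in zip(*group)],
    })

assert row_groups[0]["columns"] == [[1, 2], ["a", "b"]]
```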
3. Database compression
- Disk I/O is typically the main bottleneck; read-only analytical workloads are popular; compressing data in advance trades CPU cycles for higher effective I/O throughput.
- Real-world data sets have the following properties for compression:
- Highly skewed distributions for attribute values.
- High correlation between attributes of the same tuple, e.g., zip code to city.
- Requirements on the database compression:
- Fixed-length values only, to preserve word alignment; variable-length data is stored in separate mappings.
- Postpone decompression as long as possible during query execution, i.e., late materialization.
- Lossless; any lossy compression can only be performed at the application level.
3.1. Compression granularity
- Block level: compress a block of tuples of the same table.
- Tuple level: compress each tuple (NSM only).
- Attribute level: compress one or multiple values within one tuple.
- Columnar level: compress one or multiple columns across multiple tuples (DSM only).
3.2. Naive compression
- Engineers often use a general-purpose compression algorithm with a lower compression ratio in exchange for faster compression/decompression.
- E.g., compress disk pages, pad them to a power-of-two number of KBs, and store them in the buffer pool.
- Why small chunks: the DBMS must decompress the data before every read/write, so the scope of each compression unit has to be limited.
- Does not consider the high-level data semantics, thus cannot utilize late materialization.
4. Columnar compression
- Works best with OLAP; may need additional support for writes.
4.1. Dictionary encoding
- The most common database compression scheme; supports late materialization.
- Replace frequent value patterns with smaller codes, and use a dictionary to map codes to their original values.
- Needs to support fast encoding and decoding, so hash functions are ruled out (they cannot be decoded back to the original values).
- Needs to support order-preserving encodings, i.e., codes sort in the same order as the original values, to support range queries.
- E.g., for a `SELECT DISTINCT` query with pattern matching, the DBMS only needs to scan the encoding dictionary (without `DISTINCT` it still needs to scan the whole column).
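A minimal sketch of an order-preserving dictionary (hypothetical data): codes are assigned in the sorted order of the values, so comparisons on codes agree with comparisons on the originals.

```python
values = ["seattle", "boston", "seattle", "austin", "boston"]

# Assign codes in sorted value order to make the encoding order-preserving.
sorted_uniques = sorted(set(values))   # ["austin", "boston", "seattle"]
encode = {v: i for i, v in enumerate(sorted_uniques)}
decode = sorted_uniques                # code -> value by list index

codes = [encode[v] for v in values]    # [2, 1, 2, 0, 1], a compact column

# Order is preserved, so range predicates can be evaluated on codes alone.
assert (encode["boston"] < encode["seattle"]) == ("boston" < "seattle")
assert [decode[c] for c in codes] == values
```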
4.2. Run-Length encoding (RLE)
- Compress runs (consecutive instances) of the same value in a column into (value, start offset, run length) triplets.
- Need to cluster identical values (e.g., by sorting the column) to maximize compression.
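A minimal RLE sketch producing the triplet form above (hypothetical data):

```python
def rle_encode(column):
    # Each maximal run of identical values becomes one
    # (value, start offset, run length) triplet.
    runs, offset = [], 0
    while offset < len(column):
        length = 1
        while offset + length < len(column) and column[offset + length] == column[offset]:
            length += 1
        runs.append((column[offset], offset, length))
        offset += length
    return runs

assert rle_encode(["a", "a", "a", "b", "b", "a"]) == [
    ("a", 0, 3), ("b", 3, 2), ("a", 5, 1),
]
```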
4.3. Bit-packing encoding
- Use fewer bits to store an attribute, e.g., store an int64 column as int8 when all observed values fit.
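A minimal bit-packing sketch, assuming every value fits in 4 bits (a real system would pick the width from the data):

```python
def pack4(values):
    # Pack two 4-bit values per byte; assumes 0 <= v <= 15 for all values.
    packed = bytearray()
    for i in range(0, len(values), 2):
        hi = values[i] << 4
        lo = values[i + 1] if i + 1 < len(values) else 0
        packed.append(hi | lo)
    return bytes(packed)

def unpack4(packed, count):
    out = []
    for byte in packed:
        out.extend((byte >> 4, byte & 0x0F))
    return out[:count]  # drop the padding nibble if count was odd

vals = [3, 7, 15, 0, 9]
assert unpack4(pack4(vals), len(vals)) == vals
```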
4.4. Mostly encoding
- A variant of bit-packing: use a special marker for values that exceed the reduced bit size and maintain a look-up table to store them.
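A minimal sketch of one possible mostly-encoding scheme (the marker value and one-byte width are assumptions): common values stay in-line, outliers go to a look-up table.

```python
MARKER = 255  # sentinel meaning "look this offset up in the exception table"

def mostly_encode(values):
    packed, exceptions = [], {}
    for offset, v in enumerate(values):
        if 0 <= v < MARKER:
            packed.append(v)          # common case: fits in one byte
        else:
            packed.append(MARKER)     # rare case: store in look-up table
            exceptions[offset] = v
    return packed, exceptions

def mostly_decode(packed, exceptions):
    return [exceptions[i] if b == MARKER else b for i, b in enumerate(packed)]

vals = [1, 2, 300_000, 4]
assert mostly_decode(*mostly_encode(vals)) == vals
```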
4.5. Bitmap (One-hot) encoding
- Store one bitmap per distinct value, where the i-th bit indicates whether the i-th tuple holds that value; only practical if the value cardinality is low.
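A minimal bitmap-encoding sketch (hypothetical low-cardinality column):

```python
column = ["M", "F", "F", "M", "F"]

# One bitmap per distinct value; bit i is set iff tuple i has that value.
bitmaps = {}
for i, v in enumerate(column):
    bitmaps.setdefault(v, [0] * len(column))[i] = 1

assert bitmaps == {"M": [1, 0, 0, 1, 0], "F": [0, 1, 1, 0, 1]}
# A predicate like sex = 'F' now reduces to scanning a single bitmap.
```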
4.6. Delta encoding
- Record the difference between values; the base value can be stored in-line or in a separate look-up table.
- Can be combined with RLE (runs of identical deltas compress well).
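A minimal delta-encoding sketch with the base value stored in-line (hypothetical sensor readings):

```python
def delta_encode(values):
    base = values[0]  # base stored in-line; a look-up table also works
    deltas = [values[i] - values[i - 1] for i in range(1, len(values))]
    return base, deltas

def delta_decode(base, deltas):
    out = [base]
    for d in deltas:
        out.append(out[-1] + d)
    return out

temps = [99, 100, 101, 101, 100]    # e.g., readings taken once a minute
base, deltas = delta_encode(temps)  # base=99, deltas=[1, 1, 0, -1]
assert delta_decode(base, deltas) == temps
# The small deltas then bit-pack or RLE much better than the raw values.
```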
4.7. Incremental encoding
- Common prefixes or suffixes and their lengths are recorded to avoid duplication.
- Need to sort the data first.
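A minimal prefix-compression sketch over sorted strings, storing the shared-prefix length plus the new suffix:

```python
def prefix_encode(sorted_values):
    encoded, prev = [], ""
    for v in sorted_values:
        # Length of the prefix shared with the previous value.
        common = 0
        while common < min(len(prev), len(v)) and prev[common] == v[common]:
            common += 1
        encoded.append((common, v[common:]))  # (shared length, new suffix)
        prev = v
    return encoded

def prefix_decode(encoded):
    out, prev = [], ""
    for common, suffix in encoded:
        prev = prev[:common] + suffix
        out.append(prev)
    return out

words = ["rob", "robbed", "robbing", "robot"]  # already sorted
assert prefix_decode(prefix_encode(words)) == words
```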