The task is to compress about 100 messages of up to 128 ASCII
characters in up to 15 languages for a small embedded device. The
messages use the code points 32-127 from the ASCII ranges, and some
(about 64) of the other code points can be used to represent shared
substrings. This process is recursive, i.e. with \x81 mapping to
\x82\x82 and \x82 mapping to aa, the string \x81 should expand
to aaaa.
When computing the space savings of a given replacement, the storage
of the mapping from the character value to the substring should be
taken in to account: this is the length of the substring and one extra
byte. Hence, replacing
Note that the code described in this section is deleted (still in the git history, of course).
I used this problem as an opportunity to try out Haskell and made a
first naive/greedy implementation. Unfortunately, the code in
opt5.hs takes more than 8min just to compute the list of all
substrings contained in the 132 messages using less than 2000
characters in total. Compiling with ghc --make -O2 -rtsopts opt5.hs
then running with ./opt5 +RTS -sstderr yields these statistics:
369,927,498,008 bytes allocated in the heap
2,064,371,464 bytes copied during GC
3,285,304 bytes maximum residency (437 sample(s))
192,016 bytes maximum slop
11 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 708519 colls, 0 par 13.56s 13.10s 0.0000s 0.4669s
Gen 1 437 colls, 0 par 2.18s 2.19s 0.0050s 0.2842s
INIT time 0.00s ( 0.00s elapsed)
MUT time 481.24s (486.90s elapsed)
GC time 15.74s ( 15.29s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 496.99s (502.19s elapsed)
%GC time 3.2% (3.0% elapsed)
Alloc rate 768,691,720 bytes per MUT second
Productivity 96.8% of total user, 95.8% of total elapsed
The compression would require finding the most promising substitution, replacing it in the messages, then regenerating the substrings list. This is going to be unbearably slow.
To run it, you need to uncomment the main function at the bottom of
opt5.hs and provide a strings.txt.
- Better memory efficiency by using ByteString instead of String and representing the substrings as parts of the original input text.
- Try to reduce unnecessary work by using a table of two-letter frequencies as upper bound of how often a substring containing these can occur.
The input (a list of String) is transformed into one big
ByteString by pasting them together, separated by a =NUL= byte.
Each character of the input is verified to be inside a restricted
range to avoid a clash with the separator character or the characters
used to replace substrings.
The substring count must avoid to count overlapping substrings as two
occurences since replacing one of the substrings will prevent to
replace the other (examples: aaaa contains the substring aa only
twice, not three times or ababa contains aba only once).
First a table of
Any substring of 3 letters or more is called an EnnGram. A map is built where each EnnGram is a key and the value is the count of occurences. Starting from a character, EnnGrams are collected using the successive encountered Digrams as upper bounds of how many times a given EnnGram might appear in the text. As soon as the upper bound drops to 1 or below, the process stops, then restarts from that place.
The Digrams and their counts and the EnnGrams and their counts are merged into a list of Candidates (a product type of a substring and its occurence count), sorted by their expected gain when replaced by a single byte.
Next, as many Candidates as possible are popped off the list, making sure that each new Candidate popped off doesn’t overlap with those already used (as this overlap could invalidate the count and hence the expected compression gain).
Then all selected Candidates are replaced and the counting process restarts.
This section is to keep notes about the expected orders of magnitude involved.
-
$M$ messages, with$M <= 100 * 15$ - each message can be up to
$L$ long, with$L < 128$ - we are looking for at most
$S$ substrings with$S ≈ 64$ .
An upper bound for the number of substrings in one message is $$ (L-1) + (L-2) + \ldots + 1 = \frac{L(L-1)}{2} $$ The total number of substrings (assuming they are all distinct) is thus bounded by $$ M\frac{L(L-1)}{2} < 12192000 $$
Thus it is impractical to keep track of the substring counts as arrays whose index represent somehow the substring.