Skip to content

pcrama/message-compiler

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

59 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Problem statement

The task is to compress about 100 messages of up to 128 ASCII characters in up to 15 languages for a small embedded device. The messages use the code points 32-127 from the ASCII ranges, and some (about 64) of the other code points can be used to represent shared substrings. This process is recursive, i.e. with \x81 mapping to \x82\x82 and \x82 mapping to aa, the string \x81 should expand to aaaa.

When computing the space savings of a given replacement, the storage of the mapping from the character value to the substring should be taken in to account: this is the length of the substring and one extra byte. Hence, replacing $n$ occurrences of a substring $l$ long yields a gain of $n l - n - (l + 1)$. ̇

First attempt

Note that the code described in this section is deleted (still in the git history, of course).

I used this problem as an opportunity to try out Haskell and made a first naive/greedy implementation. Unfortunately, the code in opt5.hs takes more than 8min just to compute the list of all substrings contained in the 132 messages using less than 2000 characters in total. Compiling with ghc --make -O2 -rtsopts opt5.hs then running with ./opt5 +RTS -sstderr yields these statistics:

369,927,498,008 bytes allocated in the heap
  2,064,371,464 bytes copied during GC
      3,285,304 bytes maximum residency (437 sample(s))
        192,016 bytes maximum slop
             11 MB total memory in use (0 MB lost due to fragmentation)

                                   Tot time (elapsed)  Avg pause  Max pause
 Gen  0     708519 colls,     0 par   13.56s   13.10s     0.0000s    0.4669s
 Gen  1       437 colls,     0 par    2.18s    2.19s     0.0050s    0.2842s

 INIT    time    0.00s  (  0.00s elapsed)
 MUT     time  481.24s  (486.90s elapsed)
 GC      time   15.74s  ( 15.29s elapsed)
 EXIT    time    0.00s  (  0.00s elapsed)
 Total   time  496.99s  (502.19s elapsed)

 %GC     time       3.2%  (3.0% elapsed)

 Alloc rate    768,691,720 bytes per MUT second

 Productivity  96.8% of total user, 95.8% of total elapsed

The compression would require finding the most promising substitution, replacing it in the messages, then regenerating the substrings list. This is going to be unbearably slow.

To run it, you need to uncomment the main function at the bottom of opt5.hs and provide a strings.txt.

Second attempt

  • Better memory efficiency by using ByteString instead of String and representing the substrings as parts of the original input text.
  • Try to reduce unnecessary work by using a table of two-letter frequencies as upper bound of how often a substring containing these can occur.

How it works

Validating input

The input (a list of String) is transformed into one big ByteString by pasting them together, separated by a =NUL= byte. Each character of the input is verified to be inside a restricted range to avoid a clash with the separator character or the characters used to replace substrings.

Substring counts

The substring count must avoid to count overlapping substrings as two occurences since replacing one of the substrings will prevent to replace the other (examples: aaaa contains the substring aa only twice, not three times or ababa contains aba only once).

Digram table

First a table of $256 ⋅ 256 = 65536$ entries is built to count the occurences of each digram (2 letter grouping). This phase is also used to make sure that no substrings straddling (e.g. the last character of a string and the first character of the next) strings are included as candidate for completion by the next step. For this, any digram containing the separator has its count forced to 0.

EnnGram map

Any substring of 3 letters or more is called an EnnGram. A map is built where each EnnGram is a key and the value is the count of occurences. Starting from a character, EnnGrams are collected using the successive encountered Digrams as upper bounds of how many times a given EnnGram might appear in the text. As soon as the upper bound drops to 1 or below, the process stops, then restarts from that place.

Candidates

The Digrams and their counts and the EnnGrams and their counts are merged into a list of Candidates (a product type of a substring and its occurence count), sorted by their expected gain when replaced by a single byte.

Greedy algorithm

Next, as many Candidates as possible are popped off the list, making sure that each new Candidate popped off doesn’t overlap with those already used (as this overlap could invalidate the count and hence the expected compression gain).

Then all selected Candidates are replaced and the counting process restarts.

Space or Time Complexity

This section is to keep notes about the expected orders of magnitude involved.

Assumptions

  • $M$ messages, with $M <= 100 * 15$
  • each message can be up to $L$ long, with $L < 128$
  • we are looking for at most $S$ substrings with $S ≈ 64$.

Number of substrings

An upper bound for the number of substrings in one message is $$ (L-1) + (L-2) + \ldots + 1 = \frac{L(L-1)}{2} $$ The total number of substrings (assuming they are all distinct) is thus bounded by $$ M\frac{L(L-1)}{2} < 12192000 $$

Thus it is impractical to keep track of the substring counts as arrays whose index represent somehow the substring.

About

Proof of concept looking at replacing common substrings with unused ASCII code points

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published