Basic Theory
Run Length Encoding (RLE) is a very simple compression algorithm. It is efficient for data that has chunks that are repeated.
Data: ABCDEEEEEEEFGHIJKLMNOP
^^^^^^^ (the letter E occurs 7 times in a row)
Data RLE encoded: ABCDE$6FGHIJKLMNOPRLE works by translating the extra E's into a repeat code. The repeat code consists of a reserved character, $, followed by the number of extra E's, six.
Space Savings
The point of data compression is to store the same information in fewer bytes. The data, "EEEEEEE", is seven bytes long, but the RLE encoded version, "E$6", is only three bytes. Each sequence of identical bytes can be encoded into three bytes. Because our repeat number is stored in a byte, we can only specify 255 repeated characters.
Limitations
One character must be reserved as the count code. If that character is present in the data, we must reserve one of the repeat codes to represent it. One choice would be to choose the highest repeat count to represent the reserved character. In our example, that would be two bytes: $ and 255. These two bytes could then be followed by a repeat code, so sequences of the reserved character can be RLE'd just like any other.
Thus there is an expansion of one byte for every instance or sequence of the reserved character that must be encoded.
Optimizations
The following optimizations may be incorporated into an RLE encoder. They are listed in order of how hard it would be to program them.
Save the 3 lost bytes
If a sequence is encoded into 3 bytes, it makes sense to only encode sequences of length 4 or more. Thus:
E -> E EE -> EE EEE -> EEE EEEE -> E$3 E x 255 -> E$254 E x 256 -> E$254E E x 257 -> E$254EE E x 258 -> E$254EEE E x 259 -> E$254E$0
This also means that you will never generate a code with $0, $1, or $2. If we make $0 represent 3 repeats, then our maximum $254 would represent 257 repeats. This expands our maximum sequence length by three bytes. This saves space for each sequence that is between 256-260 bytes long. {{{Better method has $0 = three repeats E -> E EE -> EE EEE -> EEE EEEE -> E$0 E x 255 -> E$251 E x 256 -> E$252 E x 257 -> E$253 E x 258 -> E$254 E x 259 -> E$254E E x 260 -> E$254EE E x 261 -> E$254EEE E x 262 -> E$254E$0 }}}
Choose the best repeat number to represent the reserved character
So far, we have $255 representing the reserved character, $. Now imagine that our data has a sequence of length 255, but no sequences of length 254. We would have to encode that 255-char sequence as four bytes: A$254A
A better method would be to scan the data for a sequence length that does not occur. This would be used to represent the reserved character. This number could be stored at the beginning of the RLE encoded data so the decoder will know what it is. This increases the data size by one byte.
Choose the best reserved character
Because of the added byte per instance of the reserved character, it makes sense to choose a reserved character that occurs the fewest times (or sequences) in the data set. The reserved character could be placed as the first byte in the RLE encoded data. This way the decoder will know what it is. This is a single byte increase in the size of the data.
To choose the best reserved character, the encoder should walk through the data and count the number of bytes that would be required to encode each kind of character. This means setting up a set of 256 counters, one for each character. As the encoder walks through the data, it should treat each character as if it was the reserved character, and count the number of bytes that would be generated. After the data has been examined, the character with the smallest count should be used as the reserved character.
Reserved char count code add 2 (not 3)
A reserved char takes two bytes per single char encoding, so three chars takes six. To encode the three reserved characters as a sequence requires only four bytes. This means that for sequences of three reserved characters, we are using two extra bytes:
$ = $255 $$ = $255$255 $$$ = $255$255$255 $$$$ = $255$0 $ x 257 = $255$253 $ x 258 = $255$254 $ x 259 = $255$254$255 $ x 260 = $255$254$255$255 $ x 261 = $255$254$255$255$255 $ x 262 = $255$254$255$0
A better method has $0 for the reserved character be only two repeats. This also means that $254 for the reserved character is 256 repeats.
$ = $255 $$ = $255$255 $$$ = $255$0 $$$$ = $255$1 $ x 257 = $255$254 $ x 258 = $255$254$255 $ x 259 = $255$254$255$255 $ x 260 = $255$254$255$0 $ x 261 = $255$254$255$1 $ x 262 = $255$254$255$2
Depending on the characteristics of your data, a three byte sequence of the reserved character may be more likely than a 257 byte squence. If that is the case, then #0 should represent two repeats, and #254 should represent 256 repeats.
Scratch Space / Ideas
How's about this - RLE compression using modes. Keep not only runs of a single character, but runs of uncompressed characters, so less space is wasted outputting special control codes for the control code character.
