Program of the Day #16

In this program, you will implement an algorithm to compress and decompress text art (sometimes called ASCII art) images using a technique called Run-Length Encoding (RLE). RLE is a simple form of data compression that works particularly well for data with lots of consecutive repeated values - exactly what we find in many text art images.

Your program will work with text art that uses only two characters: spaces and the asterisk character *. This creates a simple binary image. Here’s an example of what such an image looks like:

    ******    
  **      **  
 *          * 
*  **    **  *
*            *
*  *      *  *
 *  ******  * 
  **      **  
    ******

Run-Length Encoding

Run-length encoding works by replacing sequences of the same character with a count and the character. For example, the sequence “AAABBBCCC” would be encoded as “3A3B3C”. This is especially efficient for data with many repeated characters in a row.
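
As a quick illustration of the general idea (not the format required for this assignment), the "AAABBBCCC" example can be reproduced with a short sketch. The function name rle_generic is hypothetical, and it relies on itertools.groupby from the standard library rather than the loop-based approach suggested in the hints further down.

```python
from itertools import groupby


def rle_generic(text):
    """Run-length encode text as count-then-character pairs."""
    # groupby yields one (character, run) pair per stretch of identical characters.
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(text))


print(rle_generic("AAABBBCCC"))  # 3A3B3C
```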

Your Tasks

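You will implement two functions, encode_rle and decode_rle. The sections below specify the encoding format and give implementation details and hints.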

RLE Format Specification

We will use the following format for RLE encoding:

  1. Each sequence of identical characters is represented by a number followed by the character, where the number indicates how many times the character is repeated.
  2. Encode each line independently, with newline characters separating lines in the encoded file. Each line in the input should correspond to one line in the encoded output and vice versa.

For example, if a line in the image contains two asterisks, three blank spaces, then four asterisks, like this:

**   ****

then the RLE encoding would be:

2*3 4*

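Here 2* means “2 asterisks”, the 3 followed by a space character means “3 spaces”, and 4* means “4 more asterisks”. Note that the space after the 3 is part of the encoding, not a separator.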
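
To make the correspondence concrete, here is a small sketch that builds both the encoded and the decoded form of this line from its three (count, character) pairs. The variable names are illustrative only. It is a check of the format, not an implementation of either required function.

```python
# The three runs in the example line, written as (count, character) pairs.
pairs = [(2, "*"), (3, " "), (4, "*")]

# Encoded form: each count immediately followed by its character.
encoded = "".join(f"{count}{char}" for count, char in pairs)

# Decoded form: each character repeated count times.
decoded = "".join(char * count for count, char in pairs)

print(repr(encoded))  # '2*3 4*'
print(repr(decoded))  # '**   ****'
```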

Implementation Details and Hints

Calculating Compression Ratio

The encode_rle function returns the compression ratio, which is the ratio of the length of the encoded file to the length of the original file. For example, a compression ratio of 0.5 would mean that the encoded file is half the size of the original.

To keep things simple, since we are encoding line by line, do not count newlines in the length of either the input file or the output file - the length of a file should be calculated as the sum of the lengths of its lines (without the newline character).
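
One possible way to compute this ratio is sketched below. The helper name compression_ratio and its parameters are hypothetical (the assignment only says that encode_rle returns the ratio), and the sketch assumes the original file has at least one non-newline character so the division is defined.

```python
def compression_ratio(raw_lines, encoded_lines):
    """Ratio of encoded length to original length, ignoring newlines."""
    # Strip only the newline, so trailing spaces in the image still count.
    raw_length = sum(len(line.rstrip("\n")) for line in raw_lines)
    encoded_length = sum(len(line.rstrip("\n")) for line in encoded_lines)
    return encoded_length / raw_length


# The line "**   ****" (9 characters) encodes to "2*3 4*" (6 characters).
print(compression_ratio(["**   ****\n"], ["2*3 4*\n"]))
```

For this single line the ratio is 6/9, or roughly 0.67.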

Helper Functions

The tests operate on the encode_rle and decode_rle functions only. However, I have included optional headers for two recommended helper functions, encode_line and decode_line. You may want to implement and manually test these functions first; this will make encode_rle and decode_rle simpler to write.

Parsing Repeated Characters and Digits

The encode function needs to find the length of a sequence of repeated characters, while the decoding function needs to read a sequence of pairs of (number, character) from the encoded file. If you need a hint about how to think about this, here’s my recommendation:

Encoding: look at the first character and start a counter at 1. While the next character matches whatever that first character was, keep increasing the counter by 1. When done, the counter stores the number of repeated characters. Added to the index where the sequence started, it also gives the index in the string where the next sequence to process begins.
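
The counting step described above might look like the following sketch. The helper name count_run and its start parameter are assumptions made for illustration. They are not part of the provided starter code.

```python
def count_run(line, start):
    """Return the length of the run of identical characters beginning at line[start]."""
    first = line[start]
    count = 1
    # Keep extending the run while we stay in bounds and the character matches.
    while start + count < len(line) and line[start + count] == first:
        count += 1
    return count


line = "**   ****"
print(count_run(line, 0))  # 2
print(count_run(line, 2))  # 3
print(count_run(line, 5))  # 4
```

Notice that start + count is the index where the next run begins, which is what lets a caller step through the whole line one run at a time.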

Decoding: to read the number, we can do something similar to the encoding approach, except instead of each character matching the first, we need each character to be a digit. Python strings have an isdigit() method (e.g., '9'.isdigit() returns True) that’s useful here. Once you’ve found where the digits stop, you know how much of the string should be converted to an integer.
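
A sketch of reading a single pair this way is shown below. The helper name read_pair is hypothetical, and the sketch assumes the encoded line is well formed (every pair has at least one digit followed by exactly one character).

```python
def read_pair(encoded, start):
    """Read one (count, character) pair beginning at encoded[start].

    Returns (count, character, index where the next pair begins).
    """
    end = start
    # Advance past every digit; counts may be more than one digit long.
    while end < len(encoded) and encoded[end].isdigit():
        end += 1
    count = int(encoded[start:end])
    char = encoded[end]
    return count, char, end + 1


print(read_pair("2*3 4*", 0))  # (2, '*', 2)
print(read_pair("2*3 4*", 2))  # (3, ' ', 4)
print(read_pair("12*", 0))     # (12, '*', 3)
```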

Testing

The first time you run the test file, it will create a folder called P16_img, and populate it with a number of test images. There are six example input images, named P16_img?_raw.txt, and their corresponding encoded files are P16_img?_rle.txt, where ? is the image number from 0 through 5. The test program also writes additional output files that it uses for testing to this directory.

Other Practice Problems

  1. Implement the following function:

    def split_address(addr_line):
        """ Split the postal address in addr_line into its
        component pieces. Return a tuple of strings containing:
            (number, street, city, state, zip).
        Precondition: the address matches the following format:
            "<number> <street>, <city> <state> <zip>"
        Example: split_address("516 High St, Bellingham WA 98225")
        => ("516", "High St", "Bellingham", "WA", "98225") """
  2. Download lyrics.txt. Write a program that counts and prints the number of unique lines in the file. Be sure that the text file and your Python program are saved in the same directory.

  3. Implement the following function:

    def grep(string, filename):
        """ Print all lines of the file filename that contain the given string.
        Precondition: the file exists. """
  4. Implement the following function:

    def spellcheck(in_filename, out_filename, wordlist):
        """ Write a spellchecked version of in_filename to
        out_filename. For each word in the input file, write
        it as-is to the output file if it is in the wordlist;
        otherwise, write it to the output file in ALLCAPS to
        indicate that it's not in the wordlist. """