Spring 2023
In this lab, you will implement Huffman Coding. As in Lab 6, no skeleton is provided. In addition, you are expected to make good choices of data structures to implement the encoding, decoding, and tree construction efficiently.
You will complete this lab in pairs. You will begin work on this in class (held in CF 420) on Monday 5/22, continue working during your lab, and complete it by the usual deadline of Sunday night.
The Github Classroom link for Lab 7 is available in the Lab 7 assignment on Canvas. The workflow for group assignments is a little different:
First, find out who your partner is on Canvas by checking the Lab 7 Groups in the People tab; the groups are named {section}_{num}, where {section} is your section number and {num} is your group number within your section.
The first member of your pair to accept the Github Classroom invite should create a team with the following name: the WWU usernames of the two teammates, ordered lexicographically and separated by an underscore. For example, if Josh (one of our TAs) and I were working on this lab together, our team would be named kovacj_wehrwes.
The second member of the pair to accept the Github Classroom invite should find the team created by the first member and join it.
There is no skeleton code, so your repository will start out empty, and as in Lab 6, you’ll start by creating a fresh Gradle project to work in. Please see the Lab 6 handout for a refresher on how to create a Gradle project. Name your project lab7, use lab7 as the package name, and put your main program in Huffman.java.
Building a working implementation of Huffman Coding involves implementing the following operations:
Count frequencies. Given an input String, calculate the frequency (i.e., number of occurrences) of each character in the string.
Build the tree. Given the frequencies from part (1), build a Huffman Coding Tree.
Decode. Given a coding tree and an encoded bitstring, decode it into the original input string.
Encode. Given your coding tree and an input string, encode the string into its compressed binary representation.
Main Program. Finally, you’ll need to write a main program that demonstrates the above steps in action.
Your task is to implement each of the above steps as efficiently as possible. This requires thinking carefully about which data structures to use and/or design for each task. Start by thinking through each algorithm (you may find it helpful to write pseudocode) and determining what data structures will allow you to complete them most efficiently.
Your approach should meet the following asymptotic efficiency targets:
If your data structure choices do not meet the above targets, don’t proceed to implementation; continue thinking about how to hit these efficiency targets and ask me or the TA for help if you get stuck.
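As an illustration of the kind of data structure reasoning involved, here is a minimal sketch of the first operation (counting frequencies) using java.util.HashMap. The class and method names are hypothetical, and this is only one reasonable choice: a hash map gives expected constant-time updates per character, so the whole pass is expected linear in the input length.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; the name and structure are illustrative only.
class FrequencySketch {
    /** Returns a map from each character in input to its number of occurrences. */
    static Map<Character, Integer> countFrequencies(String input) {
        Map<Character, Integer> counts = new HashMap<>();
        for (int i = 0; i < input.length(); i++) {
            // merge stores 1 if the character is absent, otherwise adds 1 to its count
            counts.merge(input.charAt(i), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Iteration order of a HashMap is unspecified, so the printed order may vary.
        System.out.println(countFrequencies("feed"));
    }
}
```

For the input feed, the resulting map holds a count of 2 for e and 1 each for f and d.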
Huffman.java should have a main method that runs when you type gradle run from the project root directory. The behavior of the main program is as follows:
The program takes one command-line argument that specifies a filename.
The program reads the contents of that file as the input string.
The program builds a Huffman Coding Tree for that input, and encodes the string using the constructed tree.
If the length of the input string is less than 100 characters, print the following three things, one per line: the input string, the encoded bitstring, and the decoded string.
Regardless of the length of the input string, print the following two things, one per line:
A boolean that confirms programmatically (i.e., using the equals method) that the input and the decoded output are the same.
The compression ratio, calculated as length(encoded bitstring) / length(input) / 8.0 *
* We divide by 8 to show what the compression ratio would be if we stored the encoded string as a true bitstring (1 bit per 0 or 1), rather than a String (1 byte = 8 bits per 0 or 1); see Representing Inputs and Bitstrings, below.
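The ratio formula can be sketched as follows. The class and method names are hypothetical; the one detail worth noticing is the cast to double, since dividing two int lengths in Java would otherwise truncate before the division by 8.0 happens.

```java
// Hypothetical helper class; the name is illustrative only.
class RatioSketch {
    /**
     * Returns length(encoded) / length(input) / 8.0, where encoded is a
     * String of '0' and '1' characters. Assumes input is non-empty.
     */
    static double compressionRatio(String input, String encoded) {
        // Cast first so the division is done in floating point, not int arithmetic.
        return encoded.length() / (double) input.length() / 8.0;
    }

    public static void main(String[] args) {
        // 6 encoded bits for 4 input characters: 6 / 4 / 8 = 0.1875
        System.out.println(compressionRatio("feed", "110010")); // prints 0.1875
    }
}
```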
Here are a few sample invocations:
$ cat example0.txt
feed
$ gradle run -q --args "example0.txt"
Input string: feed
Encoded string: 110010
Decoded string: feed
Decoded equals input: true
Compression ratio: 0.1875
$
$ cat example1.txt
beef feed fed calf
$ gradle run -q --args "example1.txt"
Input string: beef feed fed calf
Encoded string: 0011111101101011111000101011100010110011000001001
Decoded string: beef feed fed calf
Decoded equals input: true
Compression ratio: 0.3402777777777778
$
$ gradle run -q --args "GreatExpectations.txt"
Decoded equals input: true
Compression ratio: 0.5672382983174206
$
Your bitstrings won’t necessarily be identical to mine, because they depend on tie-breaking choices in your code. As far as I know, since these are optimal codes, the compression ratios should match. Your output does not need to be formatted identically to mine, but it should follow the guidelines above, including printing one thing per line and in the correct order.
You can find the GreatExpectations.txt file used in the last run here.
Ideally you’d implement the above tasks in the order 1, 3, 4, 2, 5. However, since I haven’t written extensive test suites for you this time, you can’t test encoding or decoding until you have a tree. For this reason, I recommend implementing the tasks in the order listed, except that you should (at a minimum) start building up your Main Program code as you go so that you can test each piece.
You don’t need to write rigorous unit tests, but you should convince yourself that each step works before moving on to the next. You don’t want to write code for all four steps and then find out that “it doesn’t work”; this will leave you with a lot of code where the bug(s) might be. In practice, this probably means printing out the results of a given step and comparing them to what you expect to see on a few different inputs (ideally not all trivially small). Inventing a few well-crafted test inputs is probably worth your while.
It’s up to you to come up with a sensible structure for your project; the only requirement is that your main program lives in Huffman.java.
You can (and should) make use of any data structures from the Java Collections framework, and/or any data structures that we have implemented so far in this class.
To use data structures from prior projects, we can create a .jar file that packages up the classes from the project and include it as a dependency in our lab7 project. Here are the steps for doing this; I’ll explain the process using A3, but it works similarly for A2.
In the project you want to use (in this case, A3), edit lib/build.gradle and add the line id 'java-library' inside the plugins block.
Run gradle build. This will generate a .jar file in lib/build/libs.
Copy this file into your lab7 project into app/libs (if the libs directory doesn’t exist, create it). The name of the jar file doesn’t really matter; I renamed mine heap.jar, and I’ll assume that’s your jar file’s name in the next step.
Edit app/build.gradle in your lab7 project; inside the dependencies block, add the following line:
implementation files('libs/heap.jar')
In your code, you should now be able to put import heap.Heap among any other imports needed, and make use of the A3 Heap in your code.
If you want to be sure that your dependencies are correct, you are welcome to use either the built-in Java data structures, or download my solution .jar files using the links below and follow only steps #3 and #4 above.
The corresponding Java collections are java.util.PriorityQueue (this has a somewhat different interface from ours, in that the priority is determined by the compareTo method instead of by a separate priority value), java.util.HashMap, and java.util.TreeSet or java.util.TreeMap.
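To make the compareTo point concrete, here is a small sketch of java.util.PriorityQueue ordering elements by their own compareTo method. The Item class is hypothetical (it is not the A3 Heap interface and not a required design); it simply pairs a symbol with a frequency so that the queue releases the lowest frequency first.

```java
import java.util.PriorityQueue;

// Hypothetical demo class; names are illustrative only.
class PriorityQueueSketch {
    /** A symbol paired with its frequency; ordered by frequency alone. */
    static class Item implements Comparable<Item> {
        final char symbol;
        final int frequency;

        Item(char symbol, int frequency) {
            this.symbol = symbol;
            this.frequency = frequency;
        }

        @Override
        public int compareTo(Item other) {
            // PriorityQueue is a min-heap with respect to this ordering.
            return Integer.compare(frequency, other.frequency);
        }
    }

    /** Returns the symbols in the order the queue releases them. */
    static String pollOrder(Item... items) {
        PriorityQueue<Item> queue = new PriorityQueue<>();
        for (Item item : items) {
            queue.add(item);
        }
        StringBuilder order = new StringBuilder();
        while (!queue.isEmpty()) {
            order.append(queue.poll().symbol);
        }
        return order.toString();
    }

    public static void main(String[] args) {
        // Frequencies 2, 1, 3 come out in increasing order: f, e, d
        System.out.println(pollOrder(new Item('e', 2), new Item('f', 1), new Item('d', 3))); // prints fed
    }
}
```

If two items have equal frequency, PriorityQueue makes no promise about which comes out first, which is fine for this lab because tie-breaking is left up to you.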
Usually we use techniques like this to compress data before storing it in a file or sending it over a network. To properly be able to reconstruct the original input, you’d need to store not just the encoded bitstring but also a representation of the coding tree you used to construct it.
To keep things a little simpler, we will not worry about doing this. This means that when we calculate the compression ratio, we are being generous to ourselves and actually giving a lower bound on the compression ratio that would be achieved if the tree were stored as well. For long inputs, the size of the tree becomes small compared to the length of the encoded string.
In this lab, we’ll use characters (or length-1 strings) as the symbols that make up our input string. As mentioned in the lecture video, we could make other choices here if we wanted our algorithm to work on non-String inputs.
In a real-world implementation, we would want to represent our bitstrings as efficiently as possible. The right way to do this in Java is probably to use a BitSet. However, to keep things simple, I recommend simply storing the bitstring as a String containing only zeros and ones. This incurs a storage cost of 8x, because a String will store each 0 or 1 as a byte (8 bits) where an efficient representation would store each 0 or 1 in a single bit.
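For the curious, here is a rough sketch of what the BitSet alternative could look like; it is not needed for this lab. The class and method names are hypothetical. One wrinkle it illustrates: java.util.BitSet does not remember trailing zero bits (its length() is one past the highest set bit), so the number of bits has to be tracked separately.

```java
import java.util.BitSet;

// Hypothetical demo class; names are illustrative only.
class BitSetSketch {
    /** Packs a String of '0' and '1' characters into a BitSet, one bit per character. */
    static BitSet pack(String bits) {
        BitSet packed = new BitSet(bits.length());
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                packed.set(i);
            }
        }
        return packed;
    }

    /** Rebuilds the String form; bitCount is required because BitSet drops trailing zeros. */
    static String unpack(BitSet packed, int bitCount) {
        StringBuilder bits = new StringBuilder(bitCount);
        for (int i = 0; i < bitCount; i++) {
            bits.append(packed.get(i) ? '1' : '0');
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        String encoded = "110010";
        System.out.println(unpack(pack(encoded), encoded.length())); // prints 110010
    }
}
```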
One efficiency gotcha with Strings comes from the fact that they are immutable; this means concatenation involves allocating a new string and copying both strings’ contents into the new memory. If you’re building a string by repeatedly adding small pieces to it, this is quite inefficient. Fortunately, there’s a StringBuilder class that allows you to collect all the pieces first, then perform just one O(length) operation at the end to concatenate them all.
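A minimal sketch of the StringBuilder pattern follows. The class and method names are hypothetical, and the code words in main are just example values chosen to reproduce the feed bitstring from the sample run; your tree may assign different ones.

```java
// Hypothetical demo class; names are illustrative only.
class StringBuilderSketch {
    /** Concatenates the pieces with a single builder instead of repeated String +=. */
    static String joinCodes(String[] codes) {
        StringBuilder builder = new StringBuilder();
        for (String code : codes) {
            // append is amortized O(length of code); no full copy per piece
            builder.append(code);
        }
        // One O(total length) conversion at the end
        return builder.toString();
    }

    public static void main(String[] args) {
        // Example code words for f, e, e, d (an assumption, not a required assignment)
        System.out.println(joinCodes(new String[] {"11", "0", "0", "10"})); // prints 110010
    }
}
```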
Convincing yourself that you’ve built a correct tree is probably the trickiest testing task. Feel free to look at the AVL.printTree() method from the A2 codebase for ideas on how to print out a readable representation of a tree.
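If you don’t have the A2 code handy, here is one possible way to render a tree sideways, with the right subtree printed above the node and the left subtree below it. This is my own sketch and not the AVL.printTree() method; the Node class, its fields, and the output format are all assumptions made for illustration.

```java
// Hypothetical demo class; the Node layout is an assumption, not a required design.
class TreePrintSketch {
    static class Node {
        final Character symbol; // null for internal nodes
        final int frequency;
        final Node left;
        final Node right;

        Node(Character symbol, int frequency, Node left, Node right) {
            this.symbol = symbol;
            this.frequency = frequency;
            this.left = left;
            this.right = right;
        }
    }

    /** Renders the subtree rooted at node, indenting four spaces per level of depth. */
    static String render(Node node, int depth) {
        if (node == null) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        out.append(render(node.right, depth + 1)); // right subtree appears first
        for (int i = 0; i < depth; i++) {
            out.append("    ");
        }
        String label = (node.symbol == null) ? "*" : "'" + node.symbol + "'";
        out.append(label).append(':').append(node.frequency).append('\n');
        out.append(render(node.left, depth + 1)); // left subtree appears last
        return out.toString();
    }

    public static void main(String[] args) {
        // One possible tree for the input "feed"
        Node inner = new Node(null, 2, new Node('d', 1, null, null), new Node('f', 1, null, null));
        Node root = new Node(null, 4, new Node('e', 2, null, null), inner);
        System.out.print(render(root, 0));
    }
}
```

Tilting your head to the left while reading the output shows the tree in its usual orientation, with each node labeled by its symbol (or * for an internal node) and its frequency.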
You do not need to follow any specific rules to decide which node should be the left vs. right child when merging two nodes. Similarly, you do not need to follow any specific rules to break ties among equal-frequency trees. You will end up with an optimal tree either way.
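One way to check your encoder against this claim: the total length of the encoded bitstring equals the sum of the weights of all merged (internal) nodes, and that sum depends only on the frequencies, not on left/right placement. The sketch below computes that expected length from frequencies alone, without building a tree. The class and method names are hypothetical, and it assumes at least two distinct symbols (with a single symbol there are no merges and it returns 0).

```java
import java.util.PriorityQueue;

// Hypothetical checking aid; names are illustrative only.
class OptimalLengthSketch {
    /** Returns the number of bits an optimal Huffman code uses for these frequencies. */
    static long optimalEncodedLength(int[] frequencies) {
        PriorityQueue<Long> weights = new PriorityQueue<>();
        for (int frequency : frequencies) {
            weights.add((long) frequency);
        }
        long totalBits = 0;
        while (weights.size() > 1) {
            // Merge the two lightest weights; every symbol beneath gains one bit of depth.
            long merged = weights.poll() + weights.poll();
            totalBits += merged;
            weights.add(merged);
        }
        return totalBits;
    }

    public static void main(String[] args) {
        // Frequencies for "feed": f=1, d=1, e=2
        System.out.println(optimalEncodedLength(new int[] {1, 1, 2})); // prints 6
    }
}
```

The result of 6 matches the length of the encoded string in the feed sample run, and the frequencies of beef feed fed calf give 49, matching the second sample.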
Submit your project using Github. You should include in your repository any input files you created in the course of testing your program. Finally, fill out the Lab 7 Survey on Canvas to report hours worked and how it went.
This lab is worth 10 points, 2 points for each of the 5 tasks completed. If you are not able to get working code for all 5 tasks, your main method should include - to the extent possible - code that demonstrates the parts that are working correctly. For example, if you completed Step 1 but you don’t have a working encoding/decoding pipeline, you could print a table of character frequencies to help convince me that the frequency counting code works. If your program does not demonstrate code for a task, you cannot earn more than 1/2 for that task.
As usual, deductions will be made for style issues. In particular, since there is no skeleton code, make sure that you have included proper specifications for all of your methods.