Assignment 5: Indexing text files

This assignment is inspired by a real job interview question. Your task is to write a program that prints an index of all of the distinct words found in a set of text files, including the line numbers in each file that each word is found on.

In its original context, the interviewee would have a couple of hours to solve this problem from scratch; the only standard-library classes they could use were String and FileReader. The time complexity of their solution had to be sub-quadratic in the total number of words for a wide range of inputs (including alphabetized files). Finally, their solution was expected to work for the proctor the first time it was run.

Inexperienced programmers, worried about the time limit, might try to start by writing lots of custom data structures and postponing testing until the end; this does not go well. Experienced programmers instead start with a prototype built from standard-library collections and functions, test it right away, and then gradually replace individual components with their own code while keeping the program working. In this assignment we will model that approach.

Learning objectives

Compose library implementations of ADTs to solve a problem.
Implement iterative algorithms that process indexed sequences, guided by precise specifications and loop invariants.
Design an effective hash function for strings.
Implement the Map ADT using a hash table with linear probing.
Use binary search on a fixed key set as an alternative to a dynamic Map.
Leverage object-oriented features to refactor program modules while maintaining a functional solution.

Recommended timeline

Day 1: Complete CollectionsIndexerApp and verify with end-to-end testing.
Day 2: Adapt solution to implement Indexer.writeIndex() and DictIndexer.index(). Implement and test Sorting.deduplicateSorted(). Verify by testing JavaDictIndexer.
Day 3: Implement and test sortDistinct() and related methods in Sorting.
Day 4: Implement and test ProbingStringDict. Verify ProbingDictIndexer.
Day 5: Implement and verify TwoPassIndexer.

Note that this timeline does not include the challenge extensions. If you plan on tackling these, be sure to leave an extra couple of days at the end of your schedule.

Assignment overview

Our goal is to create a program that, given a list of filenames as program arguments, prints an index of the distinct words found in those files.

As an example, suppose we had the following three text files:

wegmans.txt

banana

avocado

aldi.txt

avocado banana
Banana
banana banana

tops.txt


avocado

If we ran our program with arguments wegmans.txt aldi.txt tops.txt wegmans.txt, then we would expect it to print the following output:

Output of “java cs2110.IndexerApp wegmans.txt aldi.txt tops.txt wegmans.txt”

AVOCADO
	aldi.txt 1
	tops.txt 2
	wegmans.txt 3
BANANA
	aldi.txt 1 2 3
	wegmans.txt 1

Note the following features:

In the input files, words are separated by whitespace, and lines are numbered starting from 1.
In the index, words are converted to upper case (making the index case-insensitive).
In the index, words are printed in alphabetical (lexicographic) order.
Under each word, the names of the files that word is found in are printed, one per line in alphabetical order and indented with a tab (\t) character. Filenames that appear multiple times in the program arguments are only printed once.
After each filename, the line numbers on which the word occurs are printed, separated by spaces. If a word occurs multiple times on the same line, the line number is only printed once.

Additionally, our program needs to satisfy a performance requirement: its time complexity must be smaller than $O(N^2)$, where $N$ is the total number of words across all of the input files. We don’t need to guarantee this in the absolute worst case, but we should expect it for “typical” inputs (including files whose contents are alphabetized). In practice, this means that hash tables with a good hash function are okay, as is QuickSort with a reasonable pivot selection heuristic, but binary search trees will cause problems if they are not actively balanced. (Algorithms like Merge Sort, Heap Sort, and repeated binary search are suitable without any qualifications.)

Assignment flow

At a very high level, this assignment will evolve a solution through the following phases:

A prototype (likely a single main() method) that combines Java’s collection classes in order to represent the desired output. (TODO 1)
Introducing user-defined types to break up the solution into testable phases and to reduce dependence on library classes for ADTs that are less performance-critical. Wrapping Java classes for the remaining ADTs keeps things working. (TODOs 2-5)
Implementing custom data structures and algorithms for the remainder of the solution: a probing hash table class and a sorting function. (TODOs 6-17)
Thinking outside of the box to leverage simpler data structures and algorithms in a way that doesn’t quite fit more typical ADT usage. (TODO 18)
(Challenge extensions) Exploring other data structures and solution approaches to evaluate their performance and simplicity.

Please work on these phases one at a time, in order. The goal is for you to experience the refinement of your solution through each phase.

Part 1: A prototype using Java collections

Files: CollectionsIndexerApp.java
Tests: Handout example
Other relevant classes: Java Collections API, Java I/O classes (FileReader, Scanner)

The index we are looking for can be modeled as a composition of the basic ADTs we have discussed in this course, including Dictionaries, Lists, and Sets. Java even has SortedMap and SortedSet types that guarantee traversal order, and the List interface provides a convenient sort() method. Additionally, reading text files line-by-line and word-by-word is such a common task that Scanners can do this without any additional configuration. Strings can even make their letters uppercase for you.

So let’s dive right in and quickly write an indexer application using these built-in tools. Complete TODO 1 in CollectionsIndexerApp.java, and do this before proceeding with the rest of the assignment. You may use any classes and functions from java.util and java.io that you want, and indeed, you should leverage them as much as possible. But you may not use any of the types or functions from the rest of this assignment project, nor may you define additional classes in this file—your CollectionsIndexerApp.main() must stand alone. You are free to ignore (that is, to propagate) any IOExceptions arising from opening or reading the files. However, you should still ensure all opened files are closed (e.g., with a “try-with-resources” statement).

Test your application by running it with arguments wegmans.txt aldi.txt tops.txt wegmans.txt (these files are included in the project), and confirm that its output matches the expected output above. Congratulations—you have a working prototype! Take the time to analyze the time complexity of your solution, then respond to reflection question #1. Note that Java’s collections classes and sorting functions all specify the asymptotic complexity of their operations in their API documentation.

Part 2: Introducing modularity

Files: Indexer.java, DictIndexer.java, Sorting.java (partial)
Tests: JavaDictIndexerTest, SortingTest (partial)
Other relevant classes: WordOccurrences, SourceLines, IndexedSeq, StringDict, JavaDictIndexer, WordScanner (optional)

The previous task gave you a pure “client” experience. Composing built-in collection classes in a single method can be a convenient way to get something working quickly, but this lack of modularity has its downsides. You can’t write unit tests when everything is done in one function, and you can’t substitute components unless they align with Java’s interfaces. It can also be difficult to interpret the elements of different collections when they are nested too deeply. So let’s introduce some domain-specific, user-defined types to help break up the task. This will make it easier to replace components one at a time with versions we write ourselves.

Let’s start by separating the “input” and “output” features of this task. We’ll need a data type to represent the state of an index in memory, independent of how it might be printed. What we want is a sequence of words, each associated with filenames and line numbers. If the sequence is sorted alphabetically by word, and the filenames and line numbers for each word are also sorted, then the task of printing the index should be straightforward.

Let’s work inside-out. We can imagine a SourceLines type that associates a list of distinct line numbers with a particular filename. (Note: we’ll sometimes use the word “source” instead of “file”; you can treat them as synonyms for this assignment.)

UML class diagram of SourceLines class — Class diagram of our `SourceLines` type.

The addLine() method will have a non-trivial precondition: line numbers must be added in non-decreasing order. This is reasonable, since clients will need to read a file one line at a time anyway, top-to-bottom. And this precondition will make it easy for SourceLines to deduplicate line numbers—if the line number being added is the same as the previous line number, it is ignored. This should make is straightforward to meet the postcondition of lines(), which promises that the numbers it returns will be distinct and in ascending order.

Each distinct word found in our input files needs to be associated with one or more SourceLines objects. We’ll model this association with a WordOccurrences type. Similar to the situation with SourceLines, its addOccurrence() method requires that we add all occurrences from one file, in line number order, before adding any occurrences from another file. This again seems reasonable, since a client will probably only read one file at a time, from top to bottom. This means that WordOccurrences never needs to do any dictionary-like lookups on the source name—a new occurrence will either come from the last-used source or from a new one. Since a Stack’s peek() method is much simpler to implement than a Dictionary’s get() method, this should reduce complexity during implementation and analysis.

UML class diagram of WordOccurrences and SourceLines classes — Class diagram of our `WordOccurrences` type, showing that it is composed of one or more `SourceLines` objects.

To keep this assignment focused, we’ve implemented both of these abstractions for you.

Now we can think of an “index” as being an alphabetized sequence of WordOccurrences objects, which we represent with the type Iterable<WordOccurrences>. So an “indexer” is a type with two behaviors: it can produce an index given a sequence of filenames, and, given an index, it can print it out to a PrintWriter. We’ll formalize this functionality with the abstract class Indexer, which has an abstract index() method and a helper function writeIndex(). It’s writeSourcesIndex() performs the end-to-end task, including sorting and deduplicating input filenames before calling index().

2a. Indexer output

Start by adapting the output portion of your CollectionsIndexerApp solution to use these new types. Implement Indexer.writeIndex() (TODO 2), then test it using the provided testWriteIndexHandoutExample() testcase (run the version for JavaDictIndexerTest).

Custom collection interfaces

Many students probably approached CollectionsIndexerApp by declaring some kind of Map (aka Dictionary) from “words” to “information about where those words occurred”. With our new abstractions, we can write this as Map<String, WordOccurrences>. Your solution may have also used a List in one or more places, since they’re easy to sort. Our goal is to produce a solution without using any Java collections, so we’ll need replacements for these types. Since Java’s List and Map interfaces declare a lot of methods, we’ve simplified things by declaring simplified versions of our own: IndexedSeq and StringDict. IndexedSeq should look familiar from A3, but here, we’ll be assuming an array-backed implementation with fast get() and set() methods (so it’s fine to call get(i) in a loop). StringDict declares basic Dictionary operations assuming that keys are always Strings; note that it does not guarantee that values will be iterated in sorted-key order.

When defining custom types that are similar to built-in collections, it’s a good habit to create an implementation of that type that is backed by a built-in collection. This way, you have a high-confidence point of comparison to test against as you work on your custom version. Here, JavaIndexedSeq and JavaStringMap serve this purpose. You’ll test your indexer using these classes to start with, before switching to your own implementations.

Again, to keep this assignment focused, we’ve provided an implementation of IndexedSeq for you in DynamicArrayIndexedSeq. The only missing functionality relates to sorting. We have also provided a WordScanner class, which you might find easier than using Java’s I/O classes directly (but it’s up to you whether or not you want to use this going forward).

2b. Dictionary-based indexers

Consider solving the indexer problem using a StringDict<WordOccurrences>. That is, for each word you encounter in a source, you’ll update its corresponding WordOccurrences object (creating a new one if necessary). To get all of the WordOccurrences in alphabetical order, you can add them to an IndexedSeq, then call sortDistinct() on the sequence.

Implement this solution approach in the DictIndexer class (TODO 3). You’ll note that the class has two abstract methods for creating StringDicts and IndexedSeqs (the latter was declared in Indexer); use those whenever you need to make a new collection.

The JavaDictIndexer subclass uses our Java-collections-backed versions of these types, so you’ll quickly be able to test your index() implementation. But there’s one missing piece: JavaIndexedSeq is relying on Sorting.deduplicateSorted() to implement its sortDistinct() method, since Java doesn’t have a convenient deduplication method of its own. Implement Sorting.deduplicateSorted() (TODO 4) along with appropriate test cases in SortingTest (TODO 5).

You should now be able to run JavaDictIndexerTest to confirm that it works just as well as your original CollectionsIndexerApp solution.

UML class diagram of Indexer subtypes and test suites — Class diagram of implementations of the `Indexer` interface and associated unit test suites (method return types have been elided). Classes colored gold should not be modified.

Part 3: Replacing Java library functionality

Files: Sorting.java, ProbingStringDict.java
Tests: SortingTest, StringDictTest, ProbingDictIndexerTest
Other relevant classes: IndexedSeq, IPair, StringDict, ProbingDictIndexer

You have successfully refactored your solution to be more modular and to take advantage of some domain-specific types. It is easier to test, and hopefully easier to read as well. But we’re still relying on Java’s standard library for the “heavy lifting.” The next step is to write your own Dictionary class and sorting functions. You may approach these two tasks in either order, depending on whether you want to practice loop invariants or hashing first.

3a. Deduplicating QuickSort

We want a function that will sort all of the unique elements in an IndexedSeq, truncating the sequence so that it only contains one copy of each distinct value. We could do this in two steps, leveraging the deduplicateSorted() function you wrote earlier, but perhaps we could speed things up by eliminating duplicates during the sorting process itself. You will write a variation on “3-way QuickSort” to achieve this.

As a reminder, the basic procedure of QuickSort is:

Choose a pivot
Partition the sequence about the pivot
Sort the left partition
Sort the right partition

In 3-way QuickSort, the “partition” step keeps track of the first and last index of all elements equal to the pivot (using the “Dutch national flag” algorithm). In our variant, we will eliminate all elements equal to the pivot, which may necessitate shifting the right partition to the left while sorting to fill in the gap.

To meet our performance requirement, we cannot simply choose the first or last element of the sequence as the pivot (since that would lead to $O(N^2)$ time complexity for input that is already sorted). A common heuristic is to use the median of the first, last, and middle value of the array. It’s time to dig out your med3() function from Assignment 1! Re-implement med3() (TODO 6), adapting it to take Comparable arguments. You can test it using SortingTest.Med3Test (which we imported from A1 for you).

Next, you’ll need to understand the specification for partitionShiftedDiscardPivot(). Read it carefully, referring to the following array diagrams to check your understanding (here, “X” indicates “unspecified” or “garbage”—these values shouldn’t be read, and it doesn’t matter what they are upon return; in contrast, “?” indicates the values to be partitioned):

Precondition:

	dst	begin	end
items:	X	?

Postcondition:

	dst	i	j	end
items:	< pivot	X	> pivot

Implement and test partitionShiftedDiscardPivot() (TODOs 7-8). Your loop will introduce at least one variable in addition to the parameters and return values shown above; be sure to include a loop invariant comment giving that variable a precise interpretation. Remember that your tests must not assert anything about unspecified elements.

Finally, turn your attention to quicksortDistinct(). Read its specification carefully, checking your understanding against the following array diagrams (here, p denotes your chosen pivot):

Precondition:

	dst	begin	end
items:	X	?

Postcondition:

	dst			ret	end
items:	< p, sorted	p	> p, sorted	X

Implement quicksortDistinct() recursively, making effective use of med3() and partitionShiftedDiscardPivot(), and add thorough testing to SortingTest (TODOs 9-10). You may add additional private helper methods in Sorting if you like.

At this point, DynamicArrayIndexedSeq should be a fully-functional replacement for JavaIndexedSeq. Your unit tests should give you confidence in your work, and you’ll see it in action to solve the main task after completing the next part as well.

3b. Hash tables with linear probing

To replace JavaStringDict, you’ll implement a Dictionary class based on a hash table, resolving collisions with linear probing. We have provided an outline of this class in ProbingStringDict. Hash tables tend to be very bug-prone, so you’ll start off with testing.

Conveniently, we already have a working implementation of our StringDict interface in JavaStringDict. Add thorough test cases in StringDictTest (TODO 11), then verify that the JavaStringDictTest suite passes all of them. Use the makeDict() (or fromMap()) helper to initialize your dictionaries under test so that this suite can be used to test multiple implementations.

When you are confident in your test coverage, turn your attention to ProbingStringDict. Note that, rather than relying on Java’s hashCode() method, you need to implement your own hash() function for strings (TODO 12). This is your chance to get creative! You can combine the characters of the key however you want, so long as the end result is reasonably resistant to collisions.

Next, study the class invariant for ProbingStringDict carefully, then complete TODOs 13-17 to implement the class’s functionality. Use ProbingStringDictTest to evaluate your progress against the test cases you wrote earlier. You will likely encounter bugs. Have patience, and draw diagrams to trace what your code is doing. Note that we have loosened the encapsulation here by marking fields and helper methods as protected, which means you can examine them from additional test cases defined in ProbingStringDictTest (not in the StringDictTest superclass) if this helps.

At this point, ProbingStringDict should be a fully-functional replacement for JavaStringDict.

3c. Putting it all together

Once both Sorting and ProbingStringDict are completed, run ProbingDictIndexerTest to ensure that your DictIndexer works with these new implementations. Congratulations—you’ve solved the problem using a minimum of Java library code!

Part 4

Files: TwoPassIndexer.java
Tests: TwoPassIndexerTest
Other relevant classes: Sorting

Hash tables are an extremely versatile data structure, but they might make you nervous because of their poor worst-case time complexity. What would happen if an unlucky (or malicious) input file contained lots of words whose hash codes collided? Balanced trees provide better guarantees for dictionary operations, but that’s a topic for CS 3110. Perhaps there is a way to solve this problem without using typical dictionary data structures?

Observe that, if you knew all of the words ahead of time, you could put them all in a sorted, deduplicated list. Then you could use binary search to find the index of any word. If you had a second list of WordOccurrences objects, you could then use that index to quickly look up the corresponding object. This pair of lists acts like a Dictionary with guaranteed $O(\lg N)$ lookup time; the catch is that you can’t add any new keys.

But no one said you couldn’t read the input files more than once. Consider a “two-pass” approach: the first time you read the files, you just collect all of the words in a DynamicArrayIndexedSeq. Then, after sorting and deduplicating, you read the files a second time to populate the index. Implement this approach in TwoPassIndexer (TODO 18).

Challenge extensions

Alternative `StringDict` implementation (2 points)

A “prefix tree”, also known as a “trie”, is a data structure well suited for Sets of strings and Dictionaries with string keys. It provides guaranteed fast access to keys, at the expense of space overhead. The idea is that each node in the tree has a potential child for each letter in the alphabet (stored in an array). Each leaf node, as well as some interior nodes, corresponds to a key, whose letters will be spelled out by tracing a path from the root to the node. Navigating to a key’s node therefore involves recursively choosing a child based on each successive letter of the key.

For the basic version we have described, it is important to restrict the alphabet used by potential keys (aside: there are 65,536 possible char values, and allocating that many array elements in every trie node would be cost-prohibitive). For this portion of the assignment, you may assume that we will only test your TrieStringDict with words whose upper-case characters are in the range 'A'-'Z' or '0'-'9'. Each of your nodes should therefore contain an array of 36 potential children, along with a way to indicate whether the node corresponds to a key, and, if it does, the value associated with that key.

Implement a generic class TrieStringDict<V> implementing StringDict<V>; all of its code should be written in TrieStringDict.java. A node’s children must be stored in an array of length 36. You may not use any external classes besides String, DynamicArrayIndexedSeq, and NoSuchElementException (and obviously the StringDict and Iterable interfaces). You may declare nested classes (e.g., for a “node” or “iterator” type). Your TrieStringDict must have a no-argument constructor creating an empty dict.

Finding associated words (3 points)

Some pairs of words are likely to be found near each other in documents. For example, the word “NIAGARA” is likely to be followed by the words “FALLS” and “GRAPES”. Given an index produced by an Indexer, we have the information we’d need to search for words that appear on the same line multiple times across a collection of input files.

Implement the following function in a class named WordAssociates:

/**
 * Return which words are found on the same line (in the same source) as the
 * word `word` at least `threshold` times across the sources indexed in `index`.
 */
static Iterable<String> associatedWords(Iterable<WordOccurrences> index, String word, int threshold);

As an example, if index is a reference to an index built from the seven text files in the menus folder of the release code, then associatedWords(index,"CHEF'S",3) should return an Iterable<String> over four elements: “-” (used as formatting punctuation in these files), “CHOICE”, “SIDES”, and TABLE".

Do not use any classes or functions except for DynamicArrayIndexedSeq and those of the function’s parameters. You may declare helper functions (this is advised), so long as they are well specified.

You are encouraged to write test cases to verify the functionality of your associatedWords() function, but these tests do not need to be submitted.

Submission

In “reflection.txt”, estimate the amount of time you spent on this assignment, and answer the verification and reflection questions. Then submit the following files to Gradescope:

reflection.txt
CollectionsIndexerApp.java
DictIndexer.java
Indexer.java
ProbingStringDict.java
Sorting.java
TwoPassIndexer.java
SortingTest.java
StringDictTest.java
TrieStringDict.java* (if completing challenge extension)
WordAssociates.java* (if completing challenge extension)

Some Gradescope Submission Reminders

Only one member of each group should create a submission in Gradescope. The person who submits should add their partner to the submission group by clicking on the submission and then selecting “View or edit group” in the right menu.
You may submit to Gradescope multiple times, but only your most recent submission will be graded. Note that you must re-upload all files each time you submit, and you must ensure that you re-form your submission group for your final submission.
While Gradescope allows you to upload zip files of your entire IntelliJ project, we recommend that you individually upload the files enumerated by the list in the preceding section to build your submission. While we have attempted to make the autograders robust to different submission folder structures, we only guarantee proper behavior when the files are uploaded in this way. The smoketester should confirm that you have uploaded all of the required files and alter you to any that are missing/extra.

Collaboration Policy

On this assignment you may work together with one partner. Having a partner is not needed to complete the assignment: it is definitely do-able by one person. Nonetheless, working with another person is useful because it gives you a chance to bounce ideas off each other and to get their help with fixing faults in your shared code. If you do intend to work with a partner, you must review the syllabus policies pertaining to partners under “programming assignments” and “academic integrity.” Remember to form a group with your partner when you submit to Gradescope.

As before, you may talk with others besides your partner to discuss Java syntax, debugging tips, or navigating the IntelliJ IDE, but you should refrain from discussing algorithms that might be used to solve the problems, and you must never show your in-progress or completed code to another student who is not your partner. Consulting hours are the best way to get individualized assistance at the source code level if you get stuck diagnosing an issue on your own.

Frequently asked questions

If needed, there will be a pinned post on Ed where we will collect any clarifications for this assignment. Please review it before asking a new question in case your concern has already been addressed. You should also review the FAQ before submitting to see whether there are any new ideas that might help you improve your solution.