Practice Set 3 (Answers)

Released: March 30, 2015

Question 1. (20 min) Write a function which reads a text file and finds the ten most occurring words in the file.

Solution 1.

The approach is relatively straightforward: read in the file one line at a time. For each line, process each word by either adding an entry into the dictionary or incrementing the count.

Once you have finished processing the file, we have two options:

  1. The solution I use below is to sort the list and splice out the entries that we want.

  2. An alternative would be to use a heap (Python has a built-in Queue library, which contains a PriorityQueue class). There is a bit of a performance overhead in this particular application (since Queue is synchronized and can handle concurrent operations), but in terms of asymptotic performance, it is the same.

Using the first of the two proposals above, here is some Python code:

            
def find_frequent_words(filename, n):
    """
    find_frequent_words() finds the n most frequent words present in the file.
    @param filename is the name of the file to be opened; assumed to be valid
    @param n is an integer corresponding to the number of words to return; assumed to be non-negative
    @return a list of length n containing the words
    """
    word_counts = process_file(filename)    # Returns a map of words to keys.
    sorted_words = sorted(word_counts.items(), key=lambda entry: -entry[1])

    if len(sorted_words) < n:
        return sorted_words
    else:
        return sorted_words[:n]

def process_file(filename):
    """
    process_file() reads contents of file and counts the number of times each word appears in the file.
    @param filename is the name of the file to be opened; assumed to be valid.
    @return a dictionary mapping word to associated count
    """
    counts_by_word = {}
    with open(filename, 'r') as file:
        for line in file:
            for word in line.split():
                token = word.lower()
                if token in counts_by_word:
                    counts_by_word[token] += 1
                else:
                    counts_by_word[token] = 1
    return counts_by_word

            
        

This is one area in which Python really shines: dealing with files. The overhead needed to read in files with Java is rather significant. In class, we have been using the Princeton standard library to help simplify file I/O. In general, assume you don't have access to that library. See this tutorial for a quick example of how to deal with file input using only Java's standard libraries.

Question 2. (30 min) Write a function which takes as input a tree and returns a list of all duplicate subtrees. For example:

Input:    1         Output: [ 2  ,  3  ]
         / \                [  \       ]
        2   2               [   3      ]
         \   \
          3   3    
        

Solution 2.

In the interest of time, I won't provide code for this but I will walk through a solution. The basic idea I came up with is to precompute a hash of the subtree at each node, and to store this information in a map (key being the node, value being the hash). The challenge comes from computing this hash efficiently: if we started traversing the tree from the root, we would need to recursively visit each of its descendants. This is clearly inefficient! Instead, we would do better by traversing the tree from the leaves up, and computng the hash at a node by using its value and the hashes of its children.

After this preprocessing step, traverse the tree beginning at the root. As we traverse the tree, look up the hash of the node. Maintain a set of hashes that we have seen so far, and whenever you pull up a hash that is already in the set, add the corresponding subtree to the result.

Using this method, the preprocessing step runs in \(O(n)\) time and the traversal also runs in \(O(n)\) time, so the overall runtime is \(O(n)\).

< Back to home