Sunday, December 19, 2010

Member SRM 491 - Div1 600, PrefixTree

At first it was easy to get misguided by the examples and think that maybe sorting each of words and then building a trie would work, but that is not the case. Even one of the examples actually fails with this solution.

Anyway, once I noticed that was not going to work, I just took a look to the constraints. There can only be at most 16 words in the input, which suggested me that the intended solution is likely exponential.

I think the main trick is to notice the relation between intersection and the optimal solution. For example, with two words "aaaxy" and "yaaz" , there is a intersection "aay" (order does not matter). It should be easy to see that the optimal trie would be formed as long as we make sure that the intersection is a prefix of both words: For example, "aayxa" , "aayz". Note that the order inside the intersection does not matter and neither do the suffixes, for example "ayaax" and "ayaz" will also give optimal prefix trees.

From previous paragraph, we can assume that if we had plenty of words in a trie, and the trie was optimal, then the total intersection of all the words must be a common prefix between them. (Note that such condition is necessary but not sufficient). So, let us for example say that we are merging two different optimal tries of words, each generate from a different list of words. For all words in list A the common prefix will be equal to their intersection and the same is true for list B. We want to merge both tries such that the final trie will be optimal. For this list to be optimal, we would need all of the words in both A and B to have a common prefix equal to the intersection of all words in A and B. This is actually possible, because if A1,A2,...An were the elements of A, and B1,B2,...,Bm the elements of B, then we can assume that the common prefix among A will consist of the characters in (A1 ∩ A2 ∩ ... ∩ An) and the common prefix among elements of B will consists of the characters in (B1 ∩ B2 ∩ ... ∩ Bm). The total intersection is: (A1 ∩ A2 ∩ ... ∩ An ∩ B2 ∩ ... ∩ Bm). The key in here is that (A1 ∩ A2 ∩ ... ∩ An ∩ B2 ∩ ... ∩ Bm) is a subset of (A1 ∩ A2 ∩ ... ∩ An) and also a subset of (B2 ∩ ... ∩ Bm), this is a known fact of set theory: ( (A ∩ B) c B ) .

Because for optimal tries we want the prefixes to consist of the characters in the intersection, but the order does not matter, we can assume that the prefixes of A1,A2,...An all start with the characters that are in (A1 ∩ A2 ∩ ... ∩ An ∩ B2 ∩ ... ∩ Bm). We can do something similar for the elements in B. Then we can say that all the elements in both A and B will have a prefix in common that consists of the characters in set (A1 ∩ A2 ∩ ... ∩ An ∩ B2 ∩ ... ∩ Bm). This common prefix will become a common path in both tries. So, we can combine the tries to form a single one. The size of the merged trie would be: (Size of trie A) + (Size of trie B) - (Size of nodes in common). The size of the nodes in common is equal to the size of (A1 ∩ A2 ∩ ... ∩ An ∩ B2 ∩ ... ∩ Bm) plus 1. (Because all the prefix trees have a root node that represents an empty string).


Did you see that? This means that in order to know the minimum trie size of a trie that combines two tries from lists A and B, we only need the optimal trie size for the words in A, the optimal trie size for the words in B, and the size of the intersection between all words in A and B. So, if we are given a list of words, we just need to somehow split it into two parts A and B, then find the optimal trie sizes for A and B and use that procedure. Correctly guessing which parts A and B to pick is difficult though. But we do not need to find a way to guess it, we can just try all possible pairs A and B that partition the set of words in two parts, and then pick the pair that would yield the minimum size after merging the tries.

The last paragraph implied that we have a recursion. We need a function trieMin() that takes a set of words, and returns its minimum trie size. What it will do is try all subsets A, find the complement B, call trieMin(A) and trieMin(B) to find out the optimal trie sizes of A and B and then pick the minimum total size. You should note that no matter which two complementary sets A and B we pick, the total intersection is always the total intersection among all the words in the set originally given to trieMin.

trieMin will always take a subset of the original words array in the input. That means that if its size was N, then there are at most 2N subsets this function can take. With N=16, we can just use memoization for this function, so it never evaluates the same call twice. Note that A and B will always be smaller than the argument set, so the recursion would be acyclic (which is necessary for the memoization to work).

Inside the trieMin function, we need to iterate through all possible subsets of the given set. If we iterate through all possible subsets of each subset of a set of size N, the total complexity is O(3N). As a quick proof, try counting the number of steps. Begin by counting the number of steps caused by subsets of size 0. That is: C(N,0) * 20 . (Because there are C(N,0) subsets of size 0, and for each of them you'll need 20 steps. For the subsets of size 1, you'll need: C(N,1) * 21, and we can repeat:


C(N,0) * 20 + C(N,1) * 21 + ... C(N,N-1) * 2N-1 + C(N,N) * 2N


Let us just use a little imagination, and add a couple of powers of 1:

C(N,0) * 20 * 1N + C(N,1) * 21 * 1N + ... C(N,N-1) * 2N-1 * 11 + C(N,N) * 2N * 10

If we know anything about The binomial coefficient then we can tell that the previous sum is equal to:

(2 + 1)N = 3 N.

Therefore, the complexity of our algorithm is O(N3) which is good to run in time for N=16. But we still need to actually implement it. First of all, we need a way to find the size of the intersection of the characters in a list of words, note that words may contain the same character more than once, so you need a special data structure to represent a set and the intersection. Then we need to represent a subset of the list of words and also to iterate through all the subsets of a subset of the list. In both cases, bit operations are very useful. We can represent a set by a bitmask integer such that if the i-th bit is 1, the subset includes element i. In order to iterate through all the subsets A of a set, just use the ( i = (i - 1) & superset ) trick explained in bmerry's tutorial . And to get the complementary subset B , just negate A, and intersect it with the original bitmask.

    int n;
//our set structure is simply a size [26] array that holds the count
// of each character in the word.
int swords[16][26];

// The memoization array...
int mem[1<<16];
// Returns the optimal trie size for the subset of
// words represented by the bits in mask.
int trieMin(int mask) {
int & res = mem[mask];
if(res==-1) {
res = 851; //largest trie size is smaller than 17*50

//get intersection size...
int inters[26];
fill(inters,inters+26,50);
int t=0;
for (int i=0; i<n; i++) {
if( mask&(1<<i)) {
for (int j=0; j<26; j++) {
inters[j] = std::min(inters[j], swords[i][j]);
}
t++;
}
}
// (to intersect two sets, just get the minimum count for
// each character). The sum of these counts is the size
// of the intersection.
int isize = accumulate(inters,inters+26,0)+1;
// +1 because of the root node which also belongs to the
// intersection.

if(t==1) {
// only one word, its size is the optimal trie size.
res = isize;
} else {
// Try all possible subsets and pair them with their
// complements.
for (int sub=mask-1; sub>0; sub=(sub-1)&mask) {
int nsub = mask &(~sub);
int A = trieMin(sub); //optimal trie size for the subset
int B = trieMin(nsub); //... for its complement.

// total size when merging these tries is A+B - isize.
// keep the minimum one.
res = std::min(res, A+B - isize);
}
}
}
return res;
}
int getNumNodes(vector <string> words)
{
n = words.size();
//convert words to the set format:
for(int i=0; i<words.size(); i++) {
fill(swords[i],swords[i]+26,0);
for (int j=0; j<words[i].size(); j++) {
swords[i][words[i][j]-'a']++;
}
}
// initialize the mem array.
memset(mem,-1,sizeof(mem));

// recurse...
return trieMin( (1<<n)-1 );
}

2 comments :

aksonov said...

Great edutorial! One note - i've got stackoverflow error for Java equivalent, so i had to prefill memoization array from 1..(1<<n) first and then return result. This way no deep recursion will be.

vexorian said...

Thanks, it is odd they didn't pick 15 as a constraint for this problem.