Algorithmic Thinking: A Problem-Based Introduction Homepage for the book Algorithmic Thinking: A Problem-Based Introduction

Solution to the Other COCI Programming Competition Puzzle

Hi everyone,

Let’s solve the COCI (Croatian Open Competition in Informatics) programming puzzle that I gave at the end of Last week’s post. Give it a try if you haven’t already done so…

This problem is a …

a …

a …

a hash table problem! In Chapter 1 of my book Algorithmic Thinking, I offered a clue for when hash tables might apply: it’s when you’re getting bogged down in repeated searches. (You can read Chapter 1 of the book for free.) I’ll move quickly here; I encourage you to read much more about hash functions and hash table design in Chapter 1.

I think it’s a bit tricky to see why this is a problem that requires lots of repeated searches. So let’s first step back and solve this thing without hash tables and see what happens.

A First Solution

We’re looking for all pairs of strings where the first string is a substring of the second string. (This corresponds to a password that’s a substring of another password.) We can implement this using two nested for-loops to consider each pair of strings in turn. For the substring check, we can use the C strstr function, which searches its first parameter for an occurrence of its second parameter.

Here’s my code.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_passwords 20000
#define MAX_LENGTH 10

int main(void) {
  static char passwords[MAX_passwords][MAX_LENGTH + 1];
  int n, total, i, j;

  scanf("%d", &n);
  for (i = 0; i < n; i++)
    scanf("%s", passwords[i]);

  total = 0;
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      if (i != j && strstr(passwords[j], passwords[i]))
        total++;

  printf("%d\n", total);
  return 0;
}

The i != j check (19) is important here. It prevents us from spuriously counting the fact that any user can log-in as themselves.

There are two nested for-loops here, both of which depend on n (the number of passwords), so we’re looking at an $O(n^2)$ algorithm. This means that we’re doing about $n^2$ string searches. We can have up to 20000 passwords, so our program could do up to $20000^2 = 400000000$ string searches. The problem has a tight timeline of just 1 second. We’re not going to be able to do even close to 400 million string searches in that time. (As a very loose estimate, we might be able to do 100 million basic operations per second. Not only might we have to do more than this number of calls to strstr, but what strstr does is certainly not a basic operation, because it might have to try a lot of starting points before it finds a complete match.) Still, if you submit what we have, you should pass some test cases in time, giving us some confidence that we’re doing the right thing. We just need to do it faster.

All of Those String Searches…

Consider this test case from the problem description:

5
mir
mirta
ta
ir
t

In how many passwords is mir a substring? To answer that, we consider each other string, performing a strstr with it. Is mir a substring of mirta? Yes – and strstr will tell us so. Is mir a substring of ta? No, and strstr will tell us not. We continue, checking whether mir is a substring of each other string.

And that’s where we lose, because this is too much searching. We’re wasting too much time searching for mir over and over. After that, we do the same whole lot of searching for mirta, then a whole lot of searching for ta and so on.

Suppose instead that we knew in advance the number of other substrings in which mir showed up… like someone just told us, “hey, the answer is 1.” Then we wouldn’t have to do all of those string searches; we could just look up the answer straight away.

That’s precisely what a hash table is going to allow us to do. We’re going to populate it with the number of times that each substring of each password shows up. Then, we’re going to just look up those answers, dodging the massive number of string searches that we were doing before.

Hash Table Time!

We’ll start with an empty hash table. Then, we’ll go through each password, adding each of its substrings to the hash table. The hash table will keep track of the total number of occurrences of each substring.

For mir, we’ll add an occurrence of each of these strings to the hash table:

For mirta, we’ll add an occurrence of each of these strings:

Some of these strings, like m and mi, now have counts of 2.

For ta, we’ll add an occurrence of each of these strings:

For ir, we’ll add an occurrence of each of these strings:

And, finally, for t, we add an occurrence of just:

There we go! That’s our hash table. Here’s a summary of all of its strings and counts:

string count
a 2
i 3
ir 3
irt 1
irta 1
m 2
mi 2
mir 2
mirt 1
mirta 1
r 3
rt 1
rta 1
t 3
ta 2

In how many passwords is mir a substring? Check the hash table: it’s 2! Except that one of them is from mir itself, so we don’t want to count that one. So we have just 1 other password in which mir is a substring. Our total so far is 1.

How about mirta: in how many other passwords is mirta a substring? The hash table stores 1 for mirta. We subtract 1 to conclude that we have 0 relevant strings here. So our total is still 1.

Continuing: ta is a substring in 1 other password, ir is a substring in 2 other passwords, and t is a substring in 2 other passwords. Including our previous total of 1, we have a grand total of $1+1+2+2=6$. Our answer is 6!

Incidentally, you might notice that there’s a lot of garbage in this hash table, in the form of strings that we’ll never look up. For example, the string a. No one has a password of a, so we don’t care how many occurrences of a there are. That’s OK though: we’ll just fill up the hash table, and then use only what we need.

A Hash Table Solution

Now let’s implement this strategy using hash tables.

Here’s my code.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_passwords 20000
#define MAX_LENGTH 10

typedef struct password_node {
  char *password;
  int total;
  struct password_node *next;
} password_node;


// A simple hash function (suffices for this problem)

int hash(char *password) {
  int i = 0;
  int result = 1;
  while (password[i] != '\0') {
    result = (result * 26 + password[i]) % MAX_passwords;
    i++;
  }
  return result;
}


// Return hash table node for password; add it if it isn't already there

password_node *find(password_node *hash_table[], char *password) {
  int password_code;
  password_node *passwordptr;
  password_code = hash(password);
  passwordptr = hash_table[password_code];
  while (passwordptr) {
    if (strcmp(passwordptr->password, password) == 0)
      return passwordptr;
    passwordptr = passwordptr->next;
  }

  // Password isn't in hash table. Add it
  password_node *node = malloc(sizeof(password_node)); // elided error checking
  node->password = malloc(MAX_LENGTH + 1);
  strcpy(node->password, password);
  node->total = 0;
  node->next = hash_table[password_code];
  hash_table[password_code] = node;
  return node;
}

// Return true iff search is in first num_elements of array

int in_array(char array[][MAX_LENGTH + 1], int num_elements, char *search) {
  int i;
  for (i = 0; i < num_elements; i++)
    if (strcmp(array[i], search) == 0)
      return 1;
  return 0;
}


int main(void) {
  static password_node *hash_table[MAX_passwords] = {NULL};
  static char passwords[MAX_passwords][MAX_LENGTH + 1];
  char temp[MAX_LENGTH + 1];
  char substrings[MAX_LENGTH * MAX_LENGTH][MAX_LENGTH + 1];
  int n, total, i, len, start, end, num_substrings;

  scanf("%d", &n);
  for (i = 0; i < n; i++) {
    scanf("%s", passwords[i]);
    len = strlen(passwords[i]);
    num_substrings = 0;

    // Loop through endpoints of all possible substrings
    for (start = 0; start < len; start++)
      for (end = start; end < len; end++) {
        strncpy(temp, &passwords[i][start], end - start + 1);
        temp[end - start + 1] = '\0';
        if (!in_array(substrings, num_substrings, temp)) {
          strcpy(substrings[num_substrings], temp);
          num_substrings++;
          password_node *node = find(hash_table, temp);
          node->total++;
        }
      }
  }

  // Hash table is ready. Now add up occurrences of each password
  total = 0;
  for (i = 0; i < n; i++)
    total = total + find(hash_table, passwords[i])->total - 1;

  printf("%d\n", total);
  return 0;
}

In the book, I used a wowzer of a hash function written by Bob Jenkins that blends the keys and distributes them quite well. Here, we can get away with a simpler hash function (15) that for each character multiplies the current result by 26 and then adds the new character. My code using this hash function passes the time limit, so let’s stick with this.

Next, we have a find function (28) that searches the hash table for a password. If it finds it (36), it returns the associated node. If it doesn’t find it (41), then it means that this is the first time we’re seeing this password, so we add it with a total count of 0 (the code in the main function will make sure to increment this to 1).

What’s up with that in_array function – what’s it for? Well, consider a password like aaa. The substring a shows up 3 times in there, but we only want to count it once. If we counted it three times, we’d be saying that a was a substring of three passwords, when of course that isn’t true. So we use in_array to check whether we’ve already considered a given substring of the current password.

Those are all of the pieces we need for the main function (62), so let’s talk about main next.

In addition to the hash table itself (63), I also have an array to keep track of all of the passwords (64). We also need an array to keep track of the substrings of the current password (66), so that we can correctly call in_array. How big should the substrings array be – how many substrings might it have to hold? Well, each password is at most 10 characters, there are 10 places where a substring can start, and there are 10 places where a substring can end. So a safe upper bound here is to use MAX_LENGTH * MAX_LENGTH or 100.

The rest of the code works in two phases. First, we add all substrings of all passwords to the hash table (75). Once that’s done, we go through each password, adding up the number of occurrences in the hash table (89). Remember that we need to subtract 1 each time (92) so that we don’t count the fact that every password is a substring of itself.

If you submit this solution, you should see that it passes all test cases within the time limit. But…

Why Is This Faster?

Remember that we’re trying to do better than $O(n^2)$ time here. You might be skeptical at this point. Because, check it out: we have not only two nested loops, but three (70)(76)(77)! Three! Have e somehow made things even worse than before?

No. Because only one of these loops depends on $n$. The other two are puny loops that depend only on the length of the current password. And each password is only at most 10 characters. Focusing only on $n$, we have an expected $O(n)$ algorithm here: we take $O(n)$ steps to populate the hash table and another $O(n)$ steps to go through the passwords and look up their totals.

Thank You!

I have another hash table problem for you, if you’d like. I won’t go through it in next week’s post – I have something else planned – but I’d encourage you to give it a try for extra practice. Here it is: the Cakey McCakeFace problem from ICPC (International Collegiate Programming Contest) SWERC 2017.

Please feel free to contact me if you have further ideas, requests, etc. I hope these posts are helpful and I’d like to continue to make them better.

I also have a newsletter where I send occasional updates. Please see my Contact and Newsletter page if you’d like to subscribe.