Given a list of strings where no string is a substring of another, find the shortest string that contains each string in the list as a substring.

For example,

Input:  [CATGC, CTAAGT, GCTA, TTCA, ATGCATC]
Output: The shortest superstring is GCTAAGTTCATGCATC
 
GCTAAGTTCATGCATC is the shortest possible string such that it contains every string in the input list as its substring.
 
GCTAAGTTCATGCATC
GCTAAGTTCATGCATC
GCTAAGTTCATGCATC
GCTAAGTTCATGCATC
GCTAAGTTCATGCATC

The shortest superstring problem is NP-Hard. But the following greedy approach to this problem can result in a “near-optimal” solution.

Input:  A set of strings S
 
T = S
while |T| > 1 do
    Let a and b be the most overlapping strings of T
    Replace a and b with the string obtained by overlapping and b
 
T contains the shortest superstring of S

For example,

S = T = {CATGC, CTAAGT, GCTA, TTCA, ATGCATC}
T = {CATGCATC, CTAAGT, GCTA, TTCA}
T = {CATGCATC, GCTAAGT, TTCA}
T = {TTCATGCATC, GCTAAGT}
T = {GCTAAGTTCATGCATC}

Now how to find the most overlapping strings of T?

The above greedy algorithm looks simple, but the real difficulty lies in finding the most overlapping strings in a given set of strings. Following is the naive algorithm that does that:

Check maximum overlap for each pair of strings s1 and s2 by:

  1. Checking if the suffix of s1 matches with a prefix of s2 by comparing the last i character in s1 with the first i character in s2.
  2. Checking if the prefix of s1 matches with a suffix of s2 by comparing the first i character in s1 with the last i character in s2.
  3. Return s1 and s2 involved in the maximum overlap.

Following is the C++ and Java program that demonstrates it:

C++


Download  Run Code

Output:

The shortest superstring is GCTAAGTTCATGCATC

Java


Download  Run Code

Output:

The shortest superstring is GCTAAGTTCATGCATC

Author: Aditya Goel