$\begingroup$

As Lopsy states, your problem is NP-hard, so you shouldn't expect an efficient algorithm for your problem.

Your problem is an instance of the maximum coverage problem (which is also NP-hard). Recognizing this may help you find algorithms that are useful in your setting.

There are standard algorithms for the maximum coverage problem. There's a natural greedy algorithm: repeatedly pick the set that covers the most still-uncovered elements. It achieves an approximation ratio of $1 - 1/e \approx 0.63$.
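To make the greedy algorithm concrete, here is a minimal sketch (the function name and representation of the input as a list of Python sets are my own choices, not anything standard):

```python
def greedy_max_coverage(sets, k):
    """Greedy heuristic for maximum coverage: repeatedly pick the set
    that covers the most still-uncovered elements, up to k picks.
    Guarantees a (1 - 1/e) fraction of the optimal coverage."""
    covered = set()
    chosen = []
    for _ in range(k):
        best = max(sets, key=lambda s: len(s - covered))
        if not best - covered:  # nothing new can be covered; stop early
            break
        chosen.append(best)
        covered |= best
    return chosen, covered
```

For example, with `sets = [{1,2,3}, {3,4}, {4,5}, {1,2}]` and `k = 2`, the greedy algorithm first takes `{1,2,3}` (3 new elements), then `{4,5}` (2 new elements), covering all five elements.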

Perhaps a better approach is to formulate this as an instance of integer linear programming (ILP), and then throw an off-the-shelf ILP solver at it (e.g., lp_solve, CPLEX). See Wikipedia for a description of how to do that. For your English example (alphabet size 26), this might well give you an optimal solution, but for the Chinese-language problem, you shouldn't expect it to terminate with the optimal solution within your lifetime. Instead, I suggest you let it run for a bounded time and take the best solution it can find within that bound. Standard ILP solvers let you specify a time limit: e.g., run for 5 minutes and then return the best solution found so far.
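For reference, the standard ILP formulation of maximum coverage (along the lines of the one on the Wikipedia page) looks like the following, where $x_i \in \{0,1\}$ indicates whether set $S_i$ is selected, $y_j \in \{0,1\}$ indicates whether element $e_j$ is covered, and $k$ is the budget on the number of sets:

$$\begin{aligned}
\text{maximize} \quad & \sum_j y_j \\
\text{subject to} \quad & \sum_i x_i \le k \\
& y_j \le \sum_{i \,:\, e_j \in S_i} x_i \quad \text{for all } j \\
& x_i, y_j \in \{0,1\}.
\end{aligned}$$

The constraint on $y_j$ ensures an element can only count as covered if at least one selected set contains it.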

You will probably want to apply the following simple optimization. Replace each word with the set of letters it contains (so order doesn't matter and repetitions are ignored). Then de-dup these sets, recording a weight for each set that counts how many words map to it. Thus, GRIN and RING map to the same set, {G,I,N,R}, with weight 2. This reduces the problem size and might make the ILP solver run faster.
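This de-dup step is a one-liner with a multiset counter (the helper name `dedup_words` is my own):

```python
from collections import Counter

def dedup_words(words):
    """Collapse each word to the (frozen) set of letters it contains and
    count how many words map to each set; the count is the set's weight."""
    return Counter(frozenset(w) for w in words)
```

Here `dedup_words(["GRIN", "RING"])` yields a single entry, `frozenset({'G','I','N','R'})`, with weight 2.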

Depending upon your application, you might also want to weight the words by their frequency of occurrence in natural language and then solve the weighted maximum coverage problem, so that you place an extra emphasis on covering common words and a lower priority on covering rare words.
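The greedy heuristic carries over to the weighted variant almost unchanged: the gain of a set becomes the total weight of the still-uncovered elements it would add. A sketch, assuming `weight` is a dict mapping each element to its frequency weight (names are mine, not standard):

```python
def greedy_weighted_coverage(sets, weight, k):
    """Greedy heuristic for weighted maximum coverage: a set's gain is
    the total weight of the still-uncovered elements it would add."""
    covered = set()
    chosen = []
    for _ in range(k):
        best = max(sets, key=lambda s: sum(weight[e] for e in s - covered))
        if sum(weight[e] for e in best - covered) == 0:  # no further gain
            break
        chosen.append(best)
        covered |= best
    return chosen
```

For instance, with `weight = {'a': 1, 'b': 1, 'c': 10}` and a budget of one set, the algorithm prefers `{'c'}` over `{'a','b'}` even though the latter covers more elements.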

You can also search the literature for other algorithms for the maximum coverage problem, but I suspect the pragmatic answer is going to be to use ILP with a reasonable time bound.