Archive

Posts Tagged ‘RuleBasedCollator’

String Sorts

January 30, 2010 7 comments

We can implement a simple case-sensitive String sort by using the java.util.Arrays.sort() static method. Since there are a number of ways of sorting, we probably want a Class specifically for doing this, called something fairly self-evident like StringSort. An initial example of a method we might usefully add to this Class at the outset is given below:

public String simpleSort(String sToSort)
{
String[] sortable = sToSort.split(" ");
Arrays.sort(sortable);
return Arrays.toString(sortable);
}

It will return the String you pass it split on whitespace tokens in alphabetical order, with the exception that words in upper case will precede those in lower case so that a String which looks like:

String sx = "boy archer a baby clock Charlie chance charlie";

will, when sorted, return this:

[Charlie, a, archer, baby, boy, chance, charlie, clock]

We fairly obviously don’t want the nice array formatting coming back so we can improve this with a private method in the Class to clean this up:

// removes [ and ] enclosures from array toString return and the commas
private String topAndTail(String toDo)
{
toDo = toDo.trim();
toDo = toDo.replaceFirst("\\[", "");
toDo = toDo.replaceAll(",", "");
toDo = toDo.substring(0, toDo.length() - 1);
return toDo;
}

This is likely to be called fairly frequently so we will sooner or later be looking at doing this sort of thing from a StringBuffer perspective for the usual efficiency reasons, but for the moment we’ll live with the overhead.

Then all we have to do is modify our initial method so it looks

public String simpleSort(String sToSort)
{
String[] sortable = sToSort.split(" ");
Arrays.sort(sortable);
String xx = Arrays.toString(sortable);
return topAndTail(xx);
}

We’d invoke this typically as follows:

StringSort ss = new StringSort();
String sx = "boy archer a baby clock Charlie chance charlie";
sx = ss.simpleSort(sx);
System.out.println(sx);

And it will return you a print out of the String in the shape you want it in:

Charlie a archer baby boy chance charlie clock

Now obviously it would be rather useful to be able to do this in a case-insensitive fashion and so we can add another method to our class and reuse the topAndTail method to do the cleanup:

public String insensitiveSort(String sToSort)
{
String[] sortable = sToSort.split(" ");
Arrays.sort(sortable, String.CASE_INSENSITIVE_ORDER);
String xx = Arrays.toString(sortable);
return topAndTail(xx);
}

which when run on the previous String will give us an output like this:

a archer baby boy chance Charlie charlie clock

This is all good when the user is using the default locale language. However what if we need to order a String composed of Spanish or French words where the sort order is dependant on cedillas, accents, etc? This again is easy enough to do, this time though we need to make use of the Collator class to order by locale.

public String intnlSort(String sToSort, String sLocale)
{
String[] sortable = sToSort.split(" ");
List l = Arrays.asList(sortable);
Collator collator = Collator.getInstance(new Locale(sLocale));
Collections.sort(l, collator);
String[] sl = (String[]) l.toArray(new String[0]);
String xx = Arrays.toString(sl);
return topAndTail(xx);
}

Invoking this method as follows by passing the French locale identifier “fr” as the second argument:

String words = "café cafeteria cafe";
words = ss.intnlSort(words, "fr");

will return this correctly (from a French perspective at least) as

cafe café cafeteria

and not as

cafe cafeteria café

which the locale-unaware simpleSort and insensitiveSort will do.

It’s also possible to set your own sort order using the RuleBasedCollator.This class is a concrete subclass of Collator which can be used for string collation. An instance of this class is typically returned by the getInstance method of Collator with rules for the requested locale. An instance of this class can also be created manually with any desired rules.

public String customSort(String sToSort, String sortRule) throws ParseException
{
String[] sortable = sToSort.split(" ");
List l = Arrays.asList(sortable);
RuleBasedCollator collator = new RuleBasedCollator(sortRule);
Collections.sort(l, collator);
String[] sl = (String[]) l.toArray(new String[0]);
String xx = Arrays.toString(sl);
return topAndTail(xx);
}

A call to this would look something like:

String aa = "at bath all some time";
String rule ="< t, T < s, S < b, A";

try
{
aa = ss.customSort(aa, rule);
}
catch (Exception e)
{
System.out.println("Parse error in Rule String");
}

And will return you a String sorted as follows

time some bath at all

If the exception is thrown, there’s almost certainly a problem with the way you formatted your rule String. Formatting the rule String can be tricky, you just need to make sure you get it right. The first element is your start point, the highest ranked lower-case character of your intended sort order, preceded by a single < assertion.

In this case we want “t” as the front-runner. In the rule it’s succeeded by a second comparative element where the same character is expressed in upper-case and compared to the lower-case alphabetic representation of the character it should precede in the sort, and so on, i.e.

T < s

The final element should be the upper case representation of the character to be lowest in the sort order, in this case A.

Modifying the rule slightly so that B is the lowest end representation as follows

String rule ="< t, T < s, S < a, B";

will produce a different result unsuprisingly, one in which “a” precedes “b”:

time some at all bath

Of course in reality your rule will be much more complex than this. Let's say you want all the vowels at the front of your sort in alphabetical order:

String rule = “< a, A < e, E < i, I < o, O < u, U < b, B”;

To test this we could do something like:

StringSort ss = new StringSort();
String rule = "< a, A < e, E < i, I < o, O < u, U < b, B";
String aa = "earth ebola bird ebcidic eon aeternitas aether bet bath ";
try
{
aa = ss.customSort(aa, rule);
}
catch (Exception e)
{
System.out.println("Parse error in Rule");
}
System.out.println( aa);

Which will give you an output which looks like:

aeternitas aether earth eon ebola ebcidic bath bet bird

You’ll notice that the sort persists throughout the words themselves e.g. “ebola” precedes “ebcidic” since we have assigned all vowels to have precedence over consonants and in the third letter the vowel o will precede the consonant c.

This is just a brief glimpse into the intricacies of String sorting. There are many other ways of performing String sorts and different types of sorts which can be executed.