Archive

Posts Tagged ‘regexps’

Fast (StringBuffer-centric) location of substrings

February 20, 2010 Leave a comment

Over the past few days I’ve extended and refactored my StringBuffer-oriented fast String search classes a little. I’ve got to say that the performance improvement over regexps in dissecting large String objects (and we’re talking gigs of Strings that I’m playing with here) is paying significant dividends. Also, constant refactoring is a paradigm I’ve paid much attention to over the years, there’s no such thing in my book as “perfect code”, there is only code which exposes itself as being in less need of improvement than other code. One piece of code that was high on my hit list was the locateStringContents method, which was ripe for extension to be used to locate substrings and not just single chars. The first thing that struck me on revisiting the method in question was that I had done something which I ought not to have done, which was to call a function unnecessarily in a for loop (here’s the code down to the offending line)

public int[] locateStringContents(StringBuffer sb, StringBuffer sbCheckable)   {
    List l = new ArrayList();
    for (int i = 0; i < sb.length(); i++)
[..]

So I fixed that so that it now looks like this (and I had compounded the error in an inner loop with a similar call to sbCheckable.length so I fixed that at the point where it would only be called once at the same time).

public int[] locateStringContents(StringBuffer sb, StringBuffer sbCheckable)    {
    List l = new ArrayList();
    int iCheckSize = sbCheckable.length();
    int iSbSize = sb.length();
    for (int i = 0; i < iSbSize; i++)
[..]

These look like small and superficial changes but when iterated over several hundred million times the time-savings can be significant and it was important to pick up whilst optimising my code to make it bleed. I’m now systematically doing similar clean-ups on a number of methods which contain similar lapses of coding. It’s all too easily done, particularly when working against tight timescales, and subsequent revisitation of code and classes which are substantial consumers of processor resources can often pay big dividends.

After the cleanup I turned my attention to what needed to be done to do some fast substring location. I approached this initially from a linear point of view with a nested looping construct which matched character for character but the end result was frankly horrible to look at as a piece of code and moreover inefficient by comparison to my subsequent approach to the problem which was to deal with the problem pragmatically and eradicate the complex jungle of nested loops and conditional evaluations with a boolean method, containsMatch, which receives a slice of the right amount of String to examine on matching the first char and then evaluates the slice by comparison to the comparator sequence.


/**
*Method to expose all int start positions of a substring within a greater String, returning an int array containing their
* locations
**/
public int[] locateSubstrings(StringBuffer sb, StringBuffer sbCheckable)
{
List l = new ArrayList();
int iCheckSize = sbCheckable.length();
int iSbSize = sb.length() - iCheckSize; // since we don't want array overruns

for (int i = 0; i < iSbSize; i++)
{
if (containsMatch(sb.subSequence(i, i + iCheckSize),sbCheckable))
{
l.add(i);
continue;
}
}
return toIntArray(l);

 }

/**
*Method to determine whether a sliced CharSequence cs, derived from a String to evaluate, is equivalent to a comparative StringBuffer
**/
public boolean containsMatch(CharSequence cs, StringBuffer sCheckable)
{
int csLength = cs.length();
for (int i = 0; i < csLength; i++)
{
if (cs.charAt(i) != sCheckable.charAt(i)) { return false;} // fail if any one of the chars does not match
}
return true; // didn't fail ergo must be a match
}

public int[] toIntArray(List integerList) {
int iSize = integerList.size();
int[] intArray = new int[iSize];

for (int i = 0; i < iSize; i++) {
intArray[i] = integerList.get(i);
}
return intArray;
}
Advertisements