Improved method for removing duplicate white space

March 23, 2011

On the principle that constant refactoring is a good thing, I revisited my method for removing duplicate white space from Strings / StringBuffers. The result was very positive: a much cleaner and more streamlined method.

private StringBuffer rmDuplicateWS(StringBuffer sb)
{
    int currentPos = 0;
    char ws = ' ';
    // trim the leading whitespace (the length check guards against empty or all-space input)

    while(sb.length() > 0 && sb.charAt(0) == ws)
    {
        sb.deleteCharAt(0);
    }
    // now trim the trailing whitespace

    while(sb.length() > 0 && sb.charAt(sb.length() - 1) == ws)
    {
        sb.deleteCharAt(sb.length() - 1);
    }
    // loop until we reach the last character, deleting duplicate ws instances

    while(currentPos < sb.length() - 1)
    {
        if((sb.charAt(currentPos) == ws) && (sb.charAt(currentPos + 1) == ws))
        {sb.deleteCharAt(currentPos);} // stay put: the next char has shifted into this position
        else
        {currentPos++;}
    }

    return sb;
}
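
To try the method out, here is a minimal usage sketch. The class name RmDuplicateWsDemo and the sample input are assumptions made for illustration only. The method body follows the one above, made static and with explicit length checks so that empty or all-space input is safe.

```java
// Hypothetical harness class; only rmDuplicateWS itself comes from the post.
final class RmDuplicateWsDemo {

    // Follows the method above, made static so main can call it directly.
    static StringBuffer rmDuplicateWS(StringBuffer sb) {
        char ws = ' ';
        // trim leading spaces; the length check keeps empty input safe
        while (sb.length() > 0 && sb.charAt(0) == ws) {
            sb.deleteCharAt(0);
        }
        // trim trailing spaces
        while (sb.length() > 0 && sb.charAt(sb.length() - 1) == ws) {
            sb.deleteCharAt(sb.length() - 1);
        }
        // collapse runs of spaces, stopping one short of the last character
        int currentPos = 0;
        while (currentPos < sb.length() - 1) {
            if (sb.charAt(currentPos) == ws && sb.charAt(currentPos + 1) == ws) {
                sb.deleteCharAt(currentPos);
            } else {
                currentPos++;
            }
        }
        return sb;
    }

    public static void main(String[] args) {
        StringBuffer sb = new StringBuffer("  Fred   Smith  ");
        System.out.println("[" + rmDuplicateWS(sb) + "]"); // prints [Fred Smith]
    }
}
```

Note that the StringBuffer passed in is modified in place, so the caller's reference sees the cleaned content as well.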

Cleaning duplicate white space in Strings

January 22, 2010

Please note that an improved and refactored version of this method appears in the later post “Improved method for removing duplicate white space” (March 23, 2011).

On the subject of Strings, let’s look at a simple thing we need to do with them from time to time. One of the banes of my life is handling Strings that contain formatting irregularities. These can cause all sorts of problems once they’ve been tucked away and forgotten about, say in a database, where a simple SQL select such as the one that follows won’t work:

SELECT name FROM contacts WHERE fullname = 'Fred Smith';

If the gap between “Fred” and “Smith” in the stored value is two spaces and not one, as one would expect, this query will return no rows. The trick might be to normalise white space with a method like this one:

public String cleanSpace(String sString)
{
    sString = sString.trim();
    while(sString.contains("  ")) // two white spaces
    {
        sString = sString.replaceAll("  ", " "); // replace each pair of spaces with a single space
    }
    return sString;
}

What we’re doing here is cleaning up leading and trailing spaces with the String.trim() method, then looping through the String, replacing double instances of white space with a single instance until no more exist. At that point the while loop terminates and the method returns the cleaned-up String. It works, but it is not efficient, as will be explained.
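
As an aside, the same normalisation can be written without an explicit loop by letting a regular expression match a whole run of spaces at once. This is a sketch of an alternative technique, not the method from this post; the class name RegexCleanSpace and the method name cleanSpaceRegex are made up for illustration.

```java
// Hypothetical helper class, not part of the original post.
final class RegexCleanSpace {

    static String cleanSpaceRegex(String s) {
        // " {2,}" matches any run of two or more space characters,
        // so a single replaceAll call collapses every run to one space
        return s.trim().replaceAll(" {2,}", " ");
    }

    public static void main(String[] args) {
        System.out.println(cleanSpaceRegex("  Fred    Smith ")); // prints Fred Smith
    }
}
```

This still creates new String objects along the way, so it is tidier rather than necessarily faster.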

Once created, a String object is immutable and cannot be changed. Reassigning a value to a String variable does not change the existing object; instead, behind the scenes, an entirely new String object is created, and more memory is consumed each time that happens. This is obviously inefficient and can be very expensive in terms of time and processing if you’re not just doing a line or two of text but possibly the complete works of William Shakespeare.
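
A small sketch makes the immutability point concrete. The class and method names here are invented for the example; the only thing it relies on is that String.replace returns a new object when it finds something to replace.

```java
// Hypothetical demo class, for illustration only.
final class ImmutabilityDemo {

    // Returns { the untouched original, the newly created String }.
    static String[] reassign(String original) {
        String alias = original;                // second reference to the same object
        original = original.replace("  ", " "); // builds a NEW String; the old one is unchanged
        return new String[] { alias, original };
    }

    public static void main(String[] args) {
        String[] result = reassign("Fred  Smith");
        System.out.println("[" + result[0] + "]");   // prints [Fred  Smith]
        System.out.println("[" + result[1] + "]");   // prints [Fred Smith]
        System.out.println(result[0] == result[1]);  // prints false: two distinct objects
    }
}
```

Every pass of the while loop in cleanSpace() does the same thing: it leaves the old String for the garbage collector and allocates a fresh one.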

StringBuffer comes to the rescue. Since StringBuffer uses an internal character array, spurious new String objects are not created every time it goes around the loop. We can replace the method cleanSpace() above with the class that follows, whose public entry point, cleanSpace(String sString), exposes the functionality of the private methods. The implementation is fairly complex and not for the faint of heart, as it involves a certain amount of jiggery-pokery with array arithmetic. But if you do have the stomach for it, it can improve performance dramatically where you have big chunks of String to parse out duplicate white space.
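
For comparison, the idea of working on one mutable buffer can also be sketched as a single pass that appends each wanted character exactly once. This is an alternative sketch and not the class from this post; the names SinglePassCleaner and collapseSpaces are assumptions, and it uses StringBuilder, the unsynchronised sibling of StringBuffer.

```java
// Hypothetical alternative, for illustration only.
final class SinglePassCleaner {

    static String collapseSpaces(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // skip a space if nothing has been written yet (leading space)
            // or if the last character written was already a space (duplicate)
            if (c == ' ' && (out.length() == 0 || out.charAt(out.length() - 1) == ' ')) {
                continue;
            }
            out.append(c);
        }
        // at most one trailing space can survive the loop; drop it
        if (out.length() > 0 && out.charAt(out.length() - 1) == ' ') {
            out.setLength(out.length() - 1);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + collapseSpaces("  Fred   Smith  ") + "]"); // prints [Fred Smith]
    }
}
```

The design choice here is to copy forward rather than delete in place: each deleteCharAt call has to shift the rest of the internal array, whereas appending touches each character once.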

import java.util.ArrayList;
import java.util.List;

public class FastStringTool {

public String cleanSpace(String sString)
{
if (sString.contains("  "))
    {
    return sbCleanse(sString);
    }
    return sString.trim();
}

/**
* StringBuffer replacement for String.trim() with the addition that it removes duplicate whitespace internally
* @param s
* @return
*/
private String sbCleanse(String s)
{
StringBuffer sb = new StringBuffer(s);
// trim the leading whitespace
int a = findFirstNonWs(sb);
if (a > 0)
{
sb = cleanse(sb, 0, a);
}

// now trim the trailing whitespace
int b = findLastNonWs(sb);

if (b > 0)
{
sb = cleanse(sb, b, sb.length());
}
// now we'll clean up the duplicate whitespace inside the string
int[] k = identifyWS( sb);
sb = cleanseDuplicateWS( sb, k);
return sb.toString();
}

/**
* Deletes everything in the StringBuffer from start st to end en - used on start and end blocks for trimming
* @param sb StringBuffer to clean
* @param st start point
* @param en end point
* @return cleaned StringBuffer
*/
private StringBuffer cleanse(StringBuffer sb, int st, int en)
{
sb = sb.replace(st, en, "");
return sb;
}

/**
* Function to remove duplicate white space from a String via the StringBuffer
* @param sb - StringBuffer to clean up
* @param candidates - an array of ints containing all the whitespace identified by position in the StringBuffer
* @return the cleaned StringBuffer
*/
private StringBuffer cleanseDuplicateWS(StringBuffer sb, int[] candidates)
{

int iVal = 0;
int iVal2 = 0;
int decr = 0; // decremental value indicator

// for each whitespace identified
for(int i = 0; i < candidates.length; i++)
{

iVal = candidates[i]; // get the candidate item to check
if ( i + 1 < candidates.length) // if there are more items in the array
{
iVal2 = candidates[ i + 1] - 1; // subtract one from the next position so adjacent spaces compare equal
}
else
{
// if there's nothing to compare to we're done so we return the cleaned StringBuffer
return sb;
}
// if the next item matches we'll delete the next item from the StringBuffer
if (iVal == iVal2)
{
// position of the next identified white space char, less the number of
// chars which have already been removed, so the index is up to date
int repl = candidates[i + 1 ] - decr;
sb = sb.deleteCharAt(repl);
decr += 1; // increase the number of characters which have already been removed

}
}
// all done, return the cleaned stringbuffer
return sb;
}

/**
* Returns an array containing positions of all whitespace in a StringBuffer
* @param sb - the StringBuffer to question
* @return
*/
private int[] identifyWS(StringBuffer sb)
{
List<Integer> l = new ArrayList<Integer>();
for (int i = 0; i < sb.length(); i++)
{
if (sb.charAt(i) == ' ')
{
l.add(i);
}
}
return getIntArray(l);
}

/**
* Finds the position of the first non-whitespace char, or returns -1 if there is none
*
*/
private int findFirstNonWs(StringBuffer sb)
{
for (int i = 0; i < sb.length(); i++)
{
if (sb.charAt(i) != ' ')
{
return i;
}
}
return -1;
}

/**
* Finds the last non-whitespace char in a StringBuffer and returns the position just after it,
* which is where any trailing whitespace starts
* @param sb StringBuffer to manage
* @return position just after the last non-ws char, or -1 if there is none
*/
private int findLastNonWs(StringBuffer sb)
{
// scan backwards from the end rather than reversing the buffer twice
for (int i = sb.length() - 1; i >= 0; i--)
{
if (sb.charAt(i) != ' ')
{
return i + 1;
}
}
return -1;
}

/**
* Converts a List of Integer objects to an array of integer primitives
*
* @param integerList the integer list
*
* @return an array of integer primitives
*/
public int[] getIntArray(List<Integer> integerList) {
int[] intArray = new int[ integerList.size() ];
for (int i = 0; i < integerList.size(); i++) {
intArray[i] = integerList.get(i); // auto-unboxed from Integer to int
}
return intArray;
}

}

The last method, getIntArray(), of course doesn’t really belong in here, and you would typically make it public in an IntArrayUtils class, but I bolted it in here so you could see it. At the time of writing, a loop like this is really the only practical way to get a primitive array back from a Java List.
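
For readers on Java 8 or later, the same conversion can be sketched with the streams API, which did not exist when this post was written. The class name IntArrayUtils is borrowed from the suggestion above and the method name toIntArray is made up; neither is an existing library class.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical utility class, assuming Java 8 or later.
final class IntArrayUtils {

    static int[] toIntArray(List<Integer> values) {
        // mapToInt unboxes each Integer; toArray collects the ints into an int[]
        return values.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        int[] positions = toIntArray(Arrays.asList(1, 4, 5));
        System.out.println(Arrays.toString(positions)); // prints [1, 4, 5]
    }
}
```

A null element in the list would cause a NullPointerException during unboxing, just as it would in the loop version.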
