Archive

Posts Tagged ‘refactoring’

Improved method for removing duplicate white space

March 23, 2011 1 comment

On the principle that constant refactoring is a good thing, I revisited my method for removing duplicate white space from Strings / StringBuffers. The result was extremely positive: a much cleaner and more streamlined method.

private StringBuffer rmDuplicateWS(StringBuffer sb)
{
    char ws = ' ';
    // trim the leading whitespace (the length check guards against an empty or all-space buffer)
    while (sb.length() > 0 && sb.charAt(0) == ws)
    {
        sb.deleteCharAt(0);
    }
    // now trim the trailing whitespace
    while (sb.length() > 0 && sb.charAt(sb.length() - 1) == ws)
    {
        sb.deleteCharAt(sb.length() - 1);
    }
    // loop until we reach the end, deleting duplicate ws instances
    int currentPos = 0;
    while (currentPos < sb.length() - 1)
    {
        if (sb.charAt(currentPos) == ws && sb.charAt(currentPos + 1) == ws)
        {
            sb.deleteCharAt(currentPos);
        }
        else
        {
            currentPos++;
        }
    }
    return sb;
}
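A quick sanity check for a method like this is to compare it against the String-based regex equivalent; slower, and it costs us the String round-trip we were avoiding, but it makes a handy oracle in a unit test (the class and method names below are hypothetical, purely for illustration):

```java
public class WsCheck {
    // Regex oracle: trim, then collapse runs of spaces to a single space
    public static String collapse(String s) {
        return s.trim().replaceAll(" +", " ");
    }

    public static void main(String[] args) {
        System.out.println("[" + collapse("  too   many    spaces  ") + "]");
        // prints [too many spaces]
    }
}
```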


Implementing a split (or a pseudo-split) for StringBuffers/Builders

March 15, 2011 Leave a comment

One of the functionalities regrettably absent from the StringBuilder/StringBuffer families is the nice inbuilt String method split(String regexp) (q.v. java.lang.String.split()), which produces a tokenised array of Strings, splitting around and consuming the supplied regexp. The cranky way of doing this with a StringBuffer is to convert your lightweight StringBuffer to a String, split that into an array of Strings, then convert the array of Strings back to a StringBuffer array or List, which to me looks somewhat like defeating the object of the entire exercise of working with StringBuffers. I have a marked preference for working with Lists as opposed to arrays, but I do realise that there are those of the other faith who have valid reasons for their heretical idolatry (j/k), so I’ll provide methods for both outcomes.

Given that we have a method for providing a List of token positions from a supplied StringBuffer (q.v. Locating token positions[..]), the way is clear for us to implement the new split method. (I have a somewhat improved version of that method, renamed getTokenPositions from the earlier findTokens, which I’ll supply as an appendage to this post.)

/** Method to split an inbound StringBuffer by (consumed) tokens and produce a List 
* @param sbx - the StringBuffer to split
* @param sbTok - a StringBuffer representation of the token to split on
* @return List of StringBuffers split out
*/
public List<StringBuffer> split(StringBuffer sbx, StringBuffer sbTok)
    {
    int tokSz = sbTok.length();
    List<StringBuffer> lix = new ArrayList<StringBuffer>();
    List<Integer> lPos = getTokenPositions(sbx, sbTok);
    if (lPos == null || lPos.isEmpty()) // no split? send the original sb back
    {
        lix.add(sbx);
        return lix;
    }

    int start = 0;
    for (Integer pos : lPos)
    {
        if (pos > start) // skip the empty segment a leading (or adjacent) token would produce
        {
            lix.add(new StringBuffer(sbx.subSequence(start, pos)));
        }
        start = pos + tokSz; // consume the token
    }
    if (start < sbx.length()) // the remainder after the last token
    {
        lix.add(new StringBuffer(sbx.subSequence(start, sbx.length())));
    }

    return lix;
    }
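Since StringBuffer has had its own indexOf(String, int) since 1.4, a compact cross-check of the expected split semantics can be built directly on it. This sketch (hypothetical class name) is handy for regression-testing the getTokenPositions-driven version, though note it keeps an empty leading segment where the buffer starts with the token, so adjust to taste:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitCheck {
    // Oracle for split semantics, built on the stdlib StringBuffer.indexOf
    public static List<StringBuffer> split(StringBuffer sbx, String tok) {
        List<StringBuffer> parts = new ArrayList<StringBuffer>();
        int start = 0;
        int pos;
        while ((pos = sbx.indexOf(tok, start)) != -1) {
            parts.add(new StringBuffer(sbx.subSequence(start, pos)));
            start = pos + tok.length(); // consume the token
        }
        parts.add(new StringBuffer(sbx.subSequence(start, sbx.length()))); // the remainder
        return parts;
    }

    public static void main(String[] args) {
        for (StringBuffer part : split(new StringBuffer("alpha--beta--gamma"), "--")) {
            System.out.println(part); // alpha, beta, gamma on separate lines
        }
    }
}
```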

To produce an array of StringBuffers, you merely need to change the method signature

public StringBuffer[] split(StringBuffer sbx, StringBuffer sbTok)

and modify the code where the returns occur (2 places) to read:

 return lix.toArray(new StringBuffer[0]);

(The no-argument toArray() returns an Object[] which cannot simply be cast to StringBuffer[]; passing in a typed array gets us a genuine StringBuffer[] back.)

Modified method for providing a List of token positions

I mentioned earlier I had a somewhat improved version of the findTokens method. The code for this (+ the comparator-helper List construction methods) follows:

 public List<Integer> getTokenPositions(StringBuffer sbx, StringBuffer tok)
    {
    List<Character> liTok = charListFromSb(tok);
    List<Integer> liOut = new ArrayList<Integer>();
    int sz = tok.length() - 1;
    int finish = sbx.length() - sz;
    char firstTok = tok.charAt(0);
    char lastTok = tok.charAt(sz);
        for (int i = 0; i < finish; i++) {
            // cheap first/last char test before the full comparison
            if ((sbx.charAt(i) == firstTok) && (sbx.charAt(i + sz) == lastTok))
            {
                List<Character> comp = charListFromSb(sbx, i, i + sz);
                if (comp.equals(liTok))
                {
                    liOut.add(i);
                }
            }
        }
    return liOut;
    }

 public List<Character> charListFromSb(StringBuffer sbx)
    {
        List<Character> liOut = new ArrayList<Character>();
        int iEnd = sbx.length();
        for (int i = 0; i < iEnd; i++) {
            liOut.add(sbx.charAt(i));
        }
    return liOut;
    }
 public List<Character> charListFromSb(StringBuffer sbx, int start, int finish)
    {
        List<Character> liOut = new ArrayList<Character>();
        for (int i = start; i <= finish; i++) { // finish is inclusive
            liOut.add(sbx.charAt(i));
        }
    return liOut;
    }
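For testing getTokenPositions, a reference implementation built on the stdlib StringBuffer.indexOf makes a convenient oracle (hypothetical class name; this is a sketch for cross-checking, not the method itself):

```java
import java.util.ArrayList;
import java.util.List;

public class TokenPositionsCheck {
    // Reference implementation via the stdlib StringBuffer.indexOf
    public static List<Integer> positions(StringBuffer sbx, String tok) {
        List<Integer> out = new ArrayList<Integer>();
        for (int i = sbx.indexOf(tok); i != -1; i = sbx.indexOf(tok, i + 1)) {
            out.add(i);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(positions(new StringBuffer("the cat sat on the mat"), "at"));
        // prints [5, 9, 20]
    }
}
```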

Working with XPath

February 16, 2011 1 comment

It has always struck me that XPath is a nice tool for XML forensics. It can be set up very easily and quickly and, from a programmatic point of view, used to open up and expose the intricacies of often very complex XML documents with a minimum of effort.

Here’s a (slightly simplified, refactored and cleaned) version of my basic setup class for examining documents with XPath.

import java.io.IOException;
import java.io.InputStream;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.namespace.QName; //not actually used in this vn. but can be handy to have around...
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class XPathBase {

    private DocumentBuilderFactory domFactory;
    private DocumentBuilder builder;
    private Document doc;
    private String xmlFile;
    private XPath xPath;
    private String resourceRoot = "";
    private InputStream inputStream;
    private XPathExpression expression;

    /**
    * Constructor takes 2 args, file to examine, and the resource location
    **/
    public XPathBase(String xFile, String resRoot) {
        resourceRoot = resRoot;
        xFile = resourceRoot + xFile;
        this.xmlFile = xFile;
        setDomObjects();
    }

    public InputStream getAsStream(String file)
    {
        inputStream = this.getClass().getResourceAsStream(file);
        return inputStream;
    }

    public void setDomObjects()
    {
        try {
            domFactory = DocumentBuilderFactory.newInstance();
            domFactory.setNamespaceAware(true);
            builder = domFactory.newDocumentBuilder();
            doc = builder.parse(getAsStream(xmlFile));
            xPath = XPathFactory.newInstance().newXPath();
        } catch (SAXException ex) {
            Logger.getLogger(XPathBase.class.getName()).log(Level.SEVERE, null, ex);
        } catch (IOException ex) {
            Logger.getLogger(XPathBase.class.getName()).log(Level.SEVERE, null, ex);
        } catch (ParserConfigurationException ex) {
            Logger.getLogger(XPathBase.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    public Document getDoc() {
        return doc;
    }

    public InputStream getInputStream() {
        return inputStream;
    }

    public XPath getxPath() {
        return xPath;
    }

    public String getXmlFile() {
        return xmlFile;
    }

    // [..] Getters & setters for other objects declared private may follow (i.e. add what you need, although typically you will only need getters for the XPath, the Document, and the InputStream)

    // Not really part of this class and would (ordinarily) be implemented elsewhere
    public void readXPath(String evalStr)
    {
        try {
            XPathExpression expr = xPath.compile(evalStr);
            Object result = expr.evaluate(doc, XPathConstants.NODESET);
            NodeList nodes = (NodeList) result;

            for (int i = 0; i < nodes.getLength(); i++) {
                if (nodes.item(i).getNodeValue() != null)
                {
                    System.out.println(nodes.item(i).getNodeValue());
                }
            }

        } catch (XPathExpressionException ex) {
            Logger.getLogger(XPathBase.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

}

As I mentioned in the source code, the readXPath method at the tail is not really a part of this class; it is provided for illustrative purposes and will allow us to quickly begin to get under the bonnet.

Let’s set up a piece of trivial XML for examination

<?xml version="1.0" encoding="UTF-8"?>

<root>
<people>
<person ptype = "author" century = "16th">William Shakespeare</person>
<person>Bill Smith</person>
<person ptype = "jockey">A P McCoy</person>
</people>
</root>

Assuming you had a file called test.xml in a resources folder you would action this as follows:


XPathBase xpb = new XPathBase("test.xml","resources/");
xpb.readXPath("//people/person/text()");

(Note that person has no element children, so an expression like //people/person/*/text() would match nothing here; the text() nodes hang directly off person.)
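If you’d rather check the expression without a resources folder, here is a self-contained sketch that parses the sample document from a String instead (hypothetical class name; the plumbing mirrors XPathBase but is deliberately condensed):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathDemo {
    static final String XML = "<root><people>"
            + "<person ptype='author' century='16th'>William Shakespeare</person>"
            + "<person>Bill Smith</person>"
            + "<person ptype='jockey'>A P McCoy</person>"
            + "</people></root>";

    // Evaluate the expression against the in-memory document and collect the text nodes
    public static List<String> names() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            XPath xp = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xp.compile("//people/person/text()")
                    .evaluate(doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getNodeValue());
            }
            return out;
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(names()); // [William Shakespeare, Bill Smith, A P McCoy]
    }
}
```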

The readXPath method really belongs in another class which is more concerned with the handling of XPath expressions and manipulation than the nuts and bolts of xml document setup and preparation. The following code is an example of how this might begin to look (it won’t look anything like this in the long run, but it will give you an idea of how refactoring and reshaping a class can pay big dividends).


import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import org.w3c.dom.NodeList;

public class XPathAnalyser {
    private String expression = "";
    private String file = "test.xml";
    private String resourceRoot = "resources/";
    private XPathBase xpb;
    private XPathExpression expr;

    public XPathAnalyser(String expr) {
        expression = expr;
        xpb = new XPathBase(file,resourceRoot);
    }
    public XPathAnalyser(String expr, String xFile, String resRoot) {
        expression = expr;
        file = xFile;
        resourceRoot = resRoot;
        xpb = new XPathBase(file,resourceRoot);
    }

public void readXPath(String evalStr)
    {
        try {
            XPathExpression expr = xpb.getxPath().compile(evalStr);
            Object result = expr.evaluate(xpb.getDoc(), XPathConstants.NODESET);
            NodeList nodes = (NodeList) result;

            for (int i = 0; i < nodes.getLength(); i++) {
                if(nodes.item(i).getNodeValue() != null)
                {
                         System.out.println(nodes.item(i).getNodeValue());
                }
            }

        } catch (XPathExpressionException ex) {
            Logger.getLogger(XPathAnalyser.class.getName()).log(Level.SEVERE, null, ex);
        }

	}

}

This will work but it has a few drawbacks, notably that the class is dependent on the previous implementation of XPathBase, and strong dependencies are a bad thing. A better implementation would take the XPath setup class in as a class object in its own right and introspect accordingly. This would allow full separation of context from implementation, the use of different document models, etc. We’ll live with this limitation for the moment while we begin to construct the XPathAnalyser in a more decomposed and useful shape. Most of the refactoring is about moving to a more rigorous setter/getter paradigm.

We can improve even as basic a method as readXPath with a bit of root and branch surgery. Making the compiled expression and the resultant NodeList private class variables gives us a lot more traction on the problem. Note the init method, which is invoked without args in the constructors and has an overload which allows a fresh expression to be safely supplied.

public class XPathAnalyser {
    private String expression = "";
    private String file = "test.xml";
    private String resourceRoot = "resources/";
    private XPathBase xpb;
    private XPathExpression expr;
    private NodeList nodes;

    public XPathAnalyser(String expr) {
        expression = expr;
        xpb = new XPathBase(file,resourceRoot);
        init();
    }
    public XPathAnalyser(String exprStr, String xFile, String resRoot) {
        expression = exprStr;
        file = xFile;
        resourceRoot = resRoot;
        xpb = new XPathBase(file,resourceRoot);
        init();
    }

    public void init()
    {
        setExpression();
        setNodeList();
    }

    public void init(String exprString)
    {
        expression = exprString;
        init();
    }

    public void setExpression()
    {
        try {
            expr = xpb.getxPath().compile(expression);
        } catch (XPathExpressionException ex) {
            Logger.getLogger(XPathAnalyser.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    public void setNodeList()
    {
        try {
            nodes = (NodeList) expr.evaluate(xpb.getDoc(), XPathConstants.NODESET);
        } catch (XPathExpressionException ex) {
            Logger.getLogger(XPathAnalyser.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    public void readXPath()
    {
        try {
            for (int i = 0; i < nodes.getLength(); i++) {
                if (nodes.item(i).getNodeValue() != null)
                {
                    System.out.println(nodes.item(i).getNodeValue());
                }
            }
        } catch (Exception ex) {
            Logger.getLogger(XPathAnalyser.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    public int getNodeListSize() {
        return nodes.getLength();
    }

    public NodeList getNodeList() {
        return nodes;
    }

}

The simple getters getNodeList() & getNodeListSize() are useful since we are now in a position to work with the object in a much more amenable fashion. We can dig a little deeper by adding a reader to the class to examine attributes.

public boolean containsAttributes()
{
    int k = getNodeListSize();
    for (int i = 0; i < k; i++) {
        if (nodes.item(i).hasAttributes())
        {
            return true;
        }
    }
    return false;
}

public void readAttributes()
{
    if (!containsAttributes())
    {return;}

    for (int i = 0; i < getNodeListSize(); i++) {
        NamedNodeMap nnm = nodes.item(i).getAttributes(); // needs org.w3c.dom.NamedNodeMap imported
        for (int j = 0; j < nnm.getLength(); j++) {
            String attr = nnm.item(j).getLocalName();
            System.out.println(nnm.getNamedItem(attr));
        }
    }
}

This can obviously be extended and modified as necessary but the crux of the matter is that once you can construct the XPath and produce a viable nodelist, there’s very little you can’t do in the way of parsing and dissecting an XML document.


Extending and improving the token location method

December 16, 2010 Leave a comment

I have been busily building up the methods which surround the token location method I outlined in yesterday’s post since I aim to build a robust StringBuffer/Builder helper class which is as flexible and intuitive as the locator methods expressed in the standard Java String class.

Obviously one thing I will want to do with the method outlined previously is to build a stripped-down version which only iterates the String to search once, for handling short token matches. And I’ll obviously need to determine the criteria for deciding under which circumstances to use which of the two eventual implementations of this method I arrive at. The NetBeans profiling tool is an obvious win here for examining assets such as memory and heap usage, and I have a nice little wrap-around timer utility which I can inject into my class for assessing the relative speeds of differently constructed search parameters passed as arguments. I’ll have a look at it over the weekend and, once that’s out of the way and the appropriate method is being called by a delegator method, I’ll optimise some of the syntax candy with which I am already beginning to surround this method. None of the methods are particularly pretty or built at this stage for speed; they’re built for facility of implementation and can be used pretty much as is, out of the box.

The obvious initial methods to sugar up are ones which can use the generated List object e.g. the trivial (and inefficient) countTokens method & its overload which follows:


/** Syntax candy to count the number of incidences of a token in a given char sequence */

public int countTokens(StringBuffer sb, StringBuffer sbx)
    {
            List<Integer> l = findTokens(sb, sbx);
            return l.size();
     }

/** Syntax candy ph to count the number of incidences of a token (expressed as items) in a given List */

public int countTokens(List<Integer> l)
    {
            return l.size();
     }

Next up there’s a standard boolean check to see whether there are any matched tokens:


public boolean containsMatch(StringBuffer sb, StringBuffer sbx)
    {
    if (countTokens(sb, sbx) < 1 )
        {
        return false;
        }
    return true;
    }

OK, that’s the rough & ready syntax candy out of the way; now let’s look at how we can leverage the information we have back in the List. When examining large strings (and this is particularly the case with large strings arriving from or in markup language formats such as HTML/XML), it’s often the case that you need to know the position of either a single char relative to the token’s position, or alternatively another token altogether. The char-based implementations for forward and reverse location are relatively simple. They both take the String to search, the offset character index point and the char which needs to be located as arguments, and look like this:


public int findPreviousMatch(StringBuffer sb, int startPt, char toLocate)
    {
    int loc = -1;
    int ctr = startPt;
    while (ctr > 0) // > 0, so the decrement can never take charAt below index 0
    {
            ctr--;
            if (sb.charAt(ctr) == toLocate)
            {return ctr;}
    }
    return loc;
    }

public int findNextMatch(StringBuffer sb, int startPt, char toLocate)
    {
    int loc = -1;
    int ctr = startPt;
    int len = sb.length() - 1; // last valid index, so the increment can never overrun the buffer
    while (ctr < len)
    {
            ctr++;
            if (sb.charAt(ctr) == toLocate)
            {return ctr;}
    }
    return loc;
    }
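The stdlib offers indexOf(String, int) and lastIndexOf(String, int) on StringBuffer itself, which make a convenient cross-check for the hand-rolled versions (they search for Strings rather than chars, but a one-char String does the job):

```java
public class LocateDemo {
    public static void main(String[] args) {
        StringBuffer sb = new StringBuffer("<tag attr>text</tag>");
        System.out.println(sb.indexOf(">", 1));      // first '>' at or after index 1 -> 9
        System.out.println(sb.lastIndexOf("<", 10)); // last '<' at or before index 10 -> 0
    }
}
```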

We need to do the same thing for tokens. Arguments are an int indicating the starting point to search around, the StringBuffer to search (sb) and the token to search for (sbx) expressed again as a StringBuffer.


public int findTokenPriorToOffset(int start, StringBuffer sb, StringBuffer sbx)
    {
    int loc = -1;
    if (start > sb.length())
    {start = sb.length();} // clamp start to sb.length() if start > the size of the inc string

    List<Integer> l = findTokens(sb, sbx);
    if (l.isEmpty()) {return loc;} // no match

    // only 1 item and bigger than start? return -1
    if ((l.size() == 1) && (l.get(0) > start))
    {
       return loc;
    }
    // only 1 item and less than start? return token startpoint
    if ((l.size() == 1) && (l.get(0) < start))
    {
       return l.get(0);
    }

    Iterator<Integer> it = l.iterator();
    while (it.hasNext())
    {
    int val = it.next();
    if (val > start) {return loc;}
    if (val < start)
    {
        loc = val;
    }

    }

    return loc;
}
public int findTokenAfterOffset(int start, StringBuffer sb, StringBuffer sbx)
    {
    int loc = -1;
    if (start > sb.length())
    {        return -1; } // won't be found.... ret -1

    List<Integer> l = findTokens(sb, sbx);
    if (l.isEmpty()) {       return loc; }  // no match

    // only 1 item and less than start? return -1
    if ((l.size() == 1) && (l.get(0) < start))
    {
       return loc;
    }
    // only 1 item and > start? return token startpoint
    if ((l.size() == 1) && (l.get(0) > start))
    {
       return l.get(0);
    }

    Iterator<Integer> it = l.iterator();
    while (it.hasNext())
    {
    int val = it.next();
       if (val > start) {return val;}
    }
    // fallthrough
    return loc;
}

HashMaps and how to bend them to your will

February 26, 2010 Leave a comment

When processing large volumes of String data, the ability to use HashMaps intelligently and creatively can be a very useful skillset to have in your armoury. Let’s consider a hypothetical situation where you want to understand the range and extent of a given author’s vocabulary (actually, it’s not so hypothetical, since I have a tool for doing probabilistic analysis of documents which utilises a number of methods and strategies very similar to those which I am about to expose).

By way of illustration, let’s start with an abridgement of the first few paragraphs of the wonderful short story by Lord Dunsany, How Nuth Would Have Practiced His Art Upon the Gnoles, a cautionary tale for the imprudent if ever there were. 

Despite the advertisements of rival firms, it is probable that every
tradesman knows that nobody in business at the present time has a
position equal to that of Mr. Nuth. To those outside the magic circle
of business, his name is scarcely known; he does not need to
advertise, he is consummate. He is superior even to modern
competition, and, whatever claims they boast, his rivals know it. His
terms are moderate, so much cash down when when the goods are
delivered, so much in blackmail afterwards. [..]

It must not be thought that I am a friend of Nuth’s; on the contrary
such politics as I have are on the side of Property; and he needs no
words from me, for his position is almost unique in trade, being among
the very few that do not need to advertise.

Gutenberg source material is derived from volunteers using scanning technologies of varying degrees of reliability, so when dealing with text of this nature (or, in fact, of any sort) there are inevitably going to be a number of hurdles to deal with, and we will need to accommodate the preponderance of these before parsing the text into our target HashMap. Duplicate white space, a problem I identified and pointed up in an earlier article, is potentially one; typographical errors will undoubtedly occur, line separation will need to be addressed, punctuation resolved, and, in the case of your source text being a play, dramatic instructions and character identification issues will need handling. I say that we will need to handle the greater majority of the issues up front, although the construction of a free-text document parser and the implementation of its results is by and large an iterative discovery process, where the text will inevitably yield up some unique problems of its own. What we want to end up with, prior to throwing the text into the HashMap, is a homogeneous collection of largely punctuation-free (exceptions being contractions like won’t and hasn’t and hyphenated words such as co-operative) single-space-delimited words which looks something like the abridged representation below:

Despite the advertisements of rival firms it is probable that every
tradesman knows that nobody in business at the present time has a
position equal to that of Mr Nuth [..]  do not need to advertise

In this instance, even with the punctuation and line wraps resolved, we will still at some point have to deal with issues relating to proper nouns, capitalisation of standard words, etc.

We could quickly count the words in the text by recycling the vowel count method explained in an earlier article, substituting the single whitespace char ' ' for the char[] array of vowels, so we would immediately know how many tokens we are going to have to parse. But we don’t really need to at this juncture, since we are going to String.split the text anyway and can get the word count back from the resultant array size.


HashMap ha = new HashMap();
String[] splitWords = theDoc.split(" "); // break the doc into an array based on cleansed ws

int len = splitWords.length; // and find out how many words we have

Now we need to iterate through our array of words. The rules of this game are simple. The keys of the hashmap will be unique words, the values will be the number we have thus far encountered as we walk though the array. If the word is NOT already in the hashmap, we have to create a new key for the word and set the value to 1. If, however, the word is already in the hashmap, we just increment the current value by 1. To paraphrase those pesky meerkats on TV, simples.

OK, let’s look at how we might do the iteration, carrying on from where we left off with the code snippet above…

for (int i = 0; i < len; i++ )
    {
    String word = splitWords[i].trim(); 

    if(!ha.containsKey(word)) // if the word isn't already in the hashmap....
    {
    ha.put(word, 1); // add the word to the HashMap as a key and set the value to 1...
    }
    else // it must be in the hashmap
    {
    Integer objectInteger = (Integer) ha.get(word); // ha.get returns Object so we cast to Integer
    int counter = objectInteger.intValue(); // and thence to an int
    counter += 1; // increment the value
    ha.put(word, counter); // and put the new value back in the hashmap by key
    }

    }

The only real nasty with HashMaps is that they have a tendency, by dint of their generic nature, to produce output with methods like get and put as Objects which you will need to cast appropriately. The upside to this is that you don’t have to overly concern yourself with what type of Objects you are putting in them so long as you deal with them appropriately.
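With generics (Java 5 onwards) the casting chore disappears entirely; here is a sketch of the same counting loop in that style (hypothetical class name):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCount {
    // The same counting loop, minus the Object-to-Integer casting
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String word : text.split(" ")) {
            Integer seen = counts.get(word); // null when the word is not yet a key
            counts.put(word, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("the cat and the hat").get("the")); // 2
    }
}
```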

Interrogating your HashMap to see what it looks like when you’re done is a snip if your hashmap is fully populated with both keys and values (which yours will be, since you pump the value for each word when you create the key). Here’s a quick example that dumps a hashmap to screen.

public void printPopulatedHash(HashMap hamp)
{
    Collection c = hamp.values(); // put all the values in c
    Collection b = hamp.keySet(); // and all the keys go in b
    Iterator itr = c.iterator(); // create an iterator for each...
    Iterator wd = b.iterator();
    System.out.println("HASHMAP CONTENTS");
    System.out.println("================");
    while (wd.hasNext()) {
        System.out.println("Key: " + wd.next() + " Value: " + itr.next());
    }
}
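One caveat: iterating keySet() and values() in parallel happens to produce matching orders for a HashMap in practice, but the Map contract doesn’t guarantee it. entrySet() pairs each key with its value explicitly, so a safer variant of the dump looks like this (hypothetical class name):

```java
import java.util.HashMap;
import java.util.Map;

public class HashDump {
    // entrySet() hands back key/value pairs directly, no parallel iterators needed
    public static void print(Map<String, Integer> map) {
        System.out.println("HASHMAP CONTENTS");
        System.out.println("================");
        for (Map.Entry<String, Integer> e : map.entrySet()) {
            System.out.println("Key: " + e.getKey() + " Value: " + e.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> m = new HashMap<String, Integer>();
        m.put("nuth", 2);
        print(m);
    }
}
```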

We could, of course, have obviated the need to split the text into an array and processed the corpus as a StringBuffer. This, for all the well-worn reasons would work a lot faster.

This is a fairly straightforward process with only a little array arithmetic involved, and I’ll address how you’d do it now, since the grist of the article, getting stuff into and out of HashMaps, has been covered. It’ll parse the Gutenberg Hamlet as a text file to a HashMap count of unique words in under 16 milliseconds on my clunky laptop. Benchmarking this against the String[] split version earlier, the StringBuffer method comes in at +/- 15% faster, which really is not significant if you’re only processing a single document, but if you have a lot of them to do, it all adds up.

public HashMap doSbHashing(String theDoc)
{
HashMap ha = new HashMap();
StringBuffer sb = new StringBuffer(theDoc.trim());
sb.append(' '); // stick a char whitespace on the end so it will handle the last word without an unnecessary if condition
int len = sb.length();
int ctr = 0;
boolean finished = false;
StringBuffer sbx = new StringBuffer(30); // not many words are longer than this unless you're dealing in chemicals, in which case fix accordingly

while (!finished)
{
       
    if ((sb.charAt(ctr) != ' '))
    {sbx.append(sb.charAt(ctr));} // add a char into the temp stringbuffer
    else
    {

    String sWord = sbx.toString(); // it must be a word because we hit whitespace
   
    if(!ha.containsKey(sWord)) // if there's NOT a mapping for the word as a key
    {
    ha.put(sWord, 1);
    sbx.delete(0, sbx.length());
    }
    else //the word is in the map already
    {
    
    Integer objectInteger = (Integer) ha.get(sWord);
    int counter = objectInteger.intValue();
    counter += 1;
    ha.put(sWord, counter);
    sbx.delete(0, sbx.length());
    }

    }
    ctr++;
    if (ctr == len)
    {finished = true;}
}
return ha;
}

You’ll note that the expression

ha.put(sWord, 1);
sbx.delete(0, sbx.length());

has a definite correspondence with its partner in crime

ha.put(sWord, counter);
sbx.delete(0, sbx.length());

and it’s certainly one that needs to be flagged as a contender for becoming a method in its own right. Moreover, in spotting this correspondence between the two snippets of code, we have the first inkling that we might also start to think in terms of extending the HashMap class as well, quite possibly in a number of directions. Certainly the “put into hash/zero-length corresponding tmp variable” is a first-class candidate for method re-engineering. It’s a very common occurrence when working with char[] arrays in combination with collection objects of all sorts.
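As a sketch, the extracted method might look something like this (the helper name putAndClear is hypothetical, purely to illustrate the shape):

```java
import java.util.HashMap;
import java.util.Map;

public class PutAndReset {
    // Hypothetical extraction of the repeated "put word, clear working buffer" pair
    static void putAndClear(Map<String, Integer> map, StringBuffer tmp, int value) {
        map.put(tmp.toString(), value);
        tmp.delete(0, tmp.length()); // zero-length the temp buffer ready for the next word
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        StringBuffer tmp = new StringBuffer("nuth");
        putAndClear(counts, tmp, 1);
        System.out.println(counts + " / tmp length: " + tmp.length()); // {nuth=1} / tmp length: 0
    }
}
```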

List to Array conversion optimisation

February 25, 2010 Leave a comment

I’ve started heavily messing around with profilers recently trying to scrape a millisecond here and there out of some of the more lurid and extravagant processes that my String parsers like to engage in since the gap between what is completely unfeasible and what will actually make a difference has narrowed considerably over the past week or so.

One thing which I did recently  spot as a big win was the replacement of something I wrote back in the Dark Ages by candlelight with a phoenix-quill pen in unicorn’s blood on chimaera-skin parchment, an iterative method for converting a List of Strings to an array of Strings. In the cold light of day, the recognition that one can write such convoluted and obviously inefficient nonsense and maintain it without really questioning it for so long is quite sobering. Less, in this case, to cite the poet Robert Browning, truly is more.

The new, improved and much-abridged method is a) significantly shorter and altogether more obvious in its intent (always a sign that things are improving in one’s code!) and b) typically runs somewhere between 5 and 20 times faster in my stress harnesses, dependent on the sizes and types of String it is dealing with. I’m obliged to maintain the old one for the time being, albeit with a Deprecated annotation, since it is a horribly pervasive method appearing here and there in a lot of my Dark Ages Java coding. I could just have renamed the methods, but I feel that keeping them distinct will force me to revisit code that makes use of the now-deprecated method, not, in itself, a bad thing since similar monsters may be there within. I will get round to those references though…. I promise…


 /**
     * Converts a list to an array - OPTIMAL METHOD
     * @param list the List to convert
     * @return String array emanating from the list
     */
    private String[] strArrayFromList(List<String> list)
    {
    return list.toArray(new String[0]);
    }

    /**
     * Converts a list to an array DEPRECATED, use the more efficient  strArrayFromList (see above) 
     * @param lInst List to process
     * @return String array derived from incoming list
     */
    @Deprecated
    public String[] listToArray(List<String> lInst)
    {
    int k = lInst.size();
    String[] out = new String[k];
    Iterator<String> it = lInst.iterator();
    int ctr = 0;
    while (it.hasNext())
    {
        out[ctr] = it.next();
        ctr++;
    }
    return out;
    }
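For the avoidance of doubt, here is the typed-array idiom in isolation (hypothetical class name); passing a typed array in is what lets the result be a genuine String[] rather than an Object[]:

```java
import java.util.Arrays;
import java.util.List;

public class ListToArrayDemo {
    public static void main(String[] args) {
        List<String> names = Arrays.asList("alpha", "beta", "gamma");
        // the no-arg toArray() returns Object[], which cannot be cast to String[];
        // handing in a typed array gets us a String[] directly
        String[] arr = names.toArray(new String[0]);
        System.out.println(arr.length + " " + arr[0]); // 3 alpha
    }
}
```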

Fast (StringBuffer-centric) location of substrings

February 20, 2010 Leave a comment

Over the past few days I’ve extended and refactored my StringBuffer-oriented fast String search classes a little. I’ve got to say that the performance improvement over regexps in dissecting large String objects (and we’re talking gigs of Strings that I’m playing with here) is paying significant dividends. Also, constant refactoring is a paradigm I’ve paid much attention to over the years; there’s no such thing in my book as “perfect code”, there is only code which exposes itself as being in less need of improvement than other code. One piece of code that was high on my hit list was the locateStringContents method, which was ripe for extension to locate substrings and not just single chars. The first thing that struck me on revisiting the method in question was that I had done something which I ought not to have done, which was to call a function unnecessarily in a for loop (here’s the code down to the offending line):

public int[] locateStringContents(StringBuffer sb, StringBuffer sbCheckable)   {
    List l = new ArrayList();
    for (int i = 0; i < sb.length(); i++)
[..]

So I fixed that so that it now looks like this (I had compounded the error in an inner loop with a similar call to sbCheckable.length(), so I fixed that at the same time, at the point where it would only be called once).

public int[] locateStringContents(StringBuffer sb, StringBuffer sbCheckable)    {
    List l = new ArrayList();
    int iCheckSize = sbCheckable.length();
    int iSbSize = sb.length();
    for (int i = 0; i < iSbSize; i++)
[..]

These look like small and superficial changes but when iterated over several hundred million times the time-savings can be significant and it was important to pick up whilst optimising my code to make it bleed. I’m now systematically doing similar clean-ups on a number of methods which contain similar lapses of coding. It’s all too easily done, particularly when working against tight timescales, and subsequent revisitation of code and classes which are substantial consumers of processor resources can often pay big dividends.

After the cleanup I turned my attention to what needed to be done to do some fast substring location. I approached this initially from a linear point of view, with a nested looping construct which matched character for character, but the end result was frankly horrible to look at as a piece of code and, moreover, inefficient by comparison with my subsequent approach. That was to deal with the problem pragmatically and eradicate the complex jungle of nested loops and conditional evaluations with a boolean method, containsMatch, which receives a slice of the right amount of String to examine on matching the first char and then evaluates the slice against the comparator sequence.


/**
*Method to expose all int start positions of a substring within a greater String, returning an int array containing their
* locations
**/
public int[] locateSubstrings(StringBuffer sb, StringBuffer sbCheckable)
{
List<Integer> l = new ArrayList<Integer>();
int iCheckSize = sbCheckable.length();
int iSbSize = sb.length() - iCheckSize; // the last valid start position, so no array overruns

for (int i = 0; i <= iSbSize; i++) // <=, so a match flush against the end is still found
{
if (containsMatch(sb.subSequence(i, i + iCheckSize), sbCheckable))
{
l.add(i);
}
}
return toIntArray(l);

 }

/**
*Method to determine whether a sliced CharSequence cs, derived from a String to evaluate, is equivalent to a comparative StringBuffer
**/
public boolean containsMatch(CharSequence cs, StringBuffer sCheckable)
{
int csLength = cs.length();
for (int i = 0; i < csLength; i++)
{
if (cs.charAt(i) != sCheckable.charAt(i)) { return false;} // fail if any one of the chars does not match
}
return true; // didn't fail ergo must be a match
}

public int[] toIntArray(List<Integer> integerList) {
int iSize = integerList.size();
int[] intArray = new int[iSize];

for (int i = 0; i < iSize; i++) {
intArray[i] = integerList.get(i); // auto-unboxing handles the Integer-to-int conversion
}
return intArray;
}
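A final note on testing: String.indexOf makes a cheap oracle for locateSubstrings, and it’s worth including a case where the match sits flush against the end of the buffer, since that is exactly the kind of result an off-by-one in the loop bound will silently drop (hypothetical class name):

```java
import java.util.ArrayList;
import java.util.List;

public class SubstringCheck {
    // Oracle for locateSubstrings, built on String.indexOf
    public static List<Integer> positions(String s, String tok) {
        List<Integer> out = new ArrayList<Integer>();
        for (int i = s.indexOf(tok); i != -1; i = s.indexOf(tok, i + 1)) {
            out.add(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // the second match ends exactly at the end of the string
        System.out.println(positions("abcab", "ab")); // [0, 3]
    }
}
```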