Home > Java > HashMaps and how to bend them to your will

HashMaps and how to bend them to your will

When processing large volumes of String data, the ability to use HashMaps intelligently and creatively can be a very useful skillset to have in your armoury.  Let’s consider a hypothetical situation where you want to understand the range and extent of a given author’s vocabulary (actually, it’s not so hypothetical since I have a tool for doing probabilitistic analysis of documents which utilises a number of  methods and strategies which are very similar to those which I am about to expose).

By way of illustration, let’s start with an abridgement of the first few paragraphs of the wonderful short story by Lord Dunsany, How Nuth Would Have Practiced His Art Upon the Gnoles, a cautionary tale for the imprudent if ever there were. 

Despite the advertisements of rival firms, it is probable that every
tradesman knows that nobody in business at the present time has a
position equal to that of Mr. Nuth. To those outside the magic circle
of business, his name is scarcely known; he does not need to
advertise, he is consummate. He is superior even to modern
competition, and, whatever claims they boast, his rivals know it. His
terms are moderate, so much cash down when when the goods are
delivered, so much in blackmail afterwards. [..]

It must not be thought that I am a friend of Nuth’s; on the contrary
such politics as I have are on the side of Property; and he needs no
words from me, for his position is almost unique in trade, being among
the very few that do not need to advertise.

Given that Gutenberberg source material is derived from volunteers using scanning technologies etc of varying degrees of reliability and when dealing with text of this nature (or in fact of any sort) there are inevitably going to be a number of hurdles to deal with, we will need to accommodate the preponderance of these before parsing the text into our target HashMap. Duplicate white space, a problem I have identified and pointed up in an earlier article is potentially one; typographical errors will undoubtedly occur, line separation will need to be addressed, punctuation resolved, and, in the case of your source text being a  play, dramatic instructions and character identification issues will need handling. I say that we will need to handle the greater majority of the issues up front, although the construction of a free-text document parser and the implementation of its results is often and by and large a somewhat iterative discovery process, where the text will inevitably yield up some unique problems of its own. What we want to end up with prior to throwing the text into the HashMap is a homogenous collection of largely punctuation-free (exceptions being contractions like won’t and hasn’t and hyphenated words such as co-operative) single-space delimited words which looks something not unlike the abridged representation below:

Despite the advertisements of rival firms it is probable that every
tradesman knows that nobody in business at the present time has a
position equal to that of Mr Nuth [..]  do not need to advertise

In this instance, even with the punctuation and line wraps resolved, we will still at some point have to deal with issues relating to proper nouns, capitalisation of standard words, etc.

We could quickly count the words in the text by recycling the vowel count method explained in an earlier article, substituting the single whitespace char ' ' for the char[] array of vowels, so we immediately know how many tokens we are going to have to parse but we don’t really need to nor care at this juncture since we are going to have to String.split the text and can get the wordcount back from the resultant array size in any case.


HashMap ha = new HashMap();
String[] splitWords = theDoc.split(" "); // break the doc into an array based on cleansed ws

int len = splitWords.length; // and find out how many words we have

Now we need to iterate through our array of words. The rules of this game are simple. The keys of the hashmap will be unique words, the values will be the number we have thus far encountered as we walk though the array. If the word is NOT already in the hashmap, we have to create a new key for the word and set the value to 1. If, however, the word is already in the hashmap, we just increment the current value by 1. To paraphrase those pesky meerkats on TV, simples.

OK, let’s look at how we might do the iteration, carrying on from where we left off with the code snippet above…

for (int i = 0; i < len; i++ )
    {
    String word = splitWords[i].trim(); 

    if(!ha.containsKey(word)) // if the word isn't already in the hashmap....
    {
    ha.put(word, 1); // add the word to the HashMap as a key and set the value to 1...
    }
    else // it must be in the hashmap
    {
    Integer objectInteger = (Integer) ha.get(word); // ha.get returns Object so we cast to Integer
    int counter = objectInteger.intValue(); // and thence to an int
    counter += 1; // increment the value
    ha.put(word, counter); // and put the new value back in the hashmap by key
    }

    }

The only real nasty with HashMaps is that they have a tendency, by dint of their generic nature, to produce output with methods like get and put as Objects which you will need to cast appropriately. The upside to this is that you don’t have to overly concern yourself with what type of Objects you are putting in them so long as you deal with them appropriately.

Interrogating your HashMap to see what it looks like when you’re done is a snip if your hashmap is fully populated with both keys and values (which your’s will be since you pump the value for each word when you create the key). Here’s a quick example that dumps a hashmap to screen.

public void printPopulatedHash(HashMap hamp)
{
Collection c = hamp.values(); // put all the values in c 
Collection b = hamp.keySet(); // and all the keys go in b
Iterator itr = c.iterator(); // create an iterator for each...
Iterator wd = b.iterator();
 System.out.println("HASHMAP CONTENTS");
 System.out.println("================");
while(wd.hasNext()){
System.out.println("Key: " + wd.next() + " Value: " + itr.next()); }

}

We could, of course, have obviated the need to split the text into an array and processed the corpus as a StringBuffer. This, for all the well-worn reasons would work a lot faster.

This is a fairly straightforward process with only a little array arithmetic involved and I’ll address how you’d do it now, since the grist of the article, getting stuff into and out of HashMaps has been covered. It’ll parse teh Gutenberg Hamlet as a text file to a HashMap count of unique words in under 16 milliseconds on my clunky laptop… Benchmarking this against the String[] split version earlier, the StringBuffer method comes in at +/- 15% faster, which really is not significant if you’re only processing a single document, but if you have a lot of them to do, it all adds up….

public HashMap doSbHashing(String theDoc)
{
HashMap ha = new HashMap();
StringBuffer sb = new StringBuffer(theDoc.trim());
sb.append(' '); // stick a char whitespace on the end so it will handle the last word without an unnecessary if condition;
int len = sb.length();
int ctr = 0;
boolean finished = false;
StringBuffer sbx = new StringBuffer(30); // not may words longer than this unless you're dealing in chemicals in which case fix accordingly

while (!finished)
{
       
    if ((sb.charAt(ctr) != ' '))
    {sbx.append(sb.charAt(ctr));} // add a char into the temp stringbuffer
    else
    {

    String sWord = sbx.toString(); // it must be a word becuase we hit whitespace
   
    if(!ha.containsKey(sWord)) // if there's NOT a mapping for the word as a key
    {
    ha.put(sWord, 1);
    sbx.delete(0, sbx.length());
    }
    else //the word is in the map already
    {
    
    Integer objectInteger = (Integer) ha.get(sWord);
    int counter = objectInteger.intValue();
    counter += 1;
    ha.put(sWord, counter);
    sbx.delete(0, sbx.length());
    }

    }
    ctr++;
    if (ctr == len)
    {finished = true;}
}
return ha;
}

You’ll note that the expression

ha.put(sWord, 1);
sbx.delete(0, sbx.length());

has a definite correspondence with its partner in crime

ha.put(sWord, ctr);
sbx.delete(0, sbx.length());

and its certainly one that needs to be flagged as a contender for becoming a method in its own right. Moreover in spotting this correspondence between the two snippets of code we have the first inkling that we might also start to think in terms of extending the HashMap class as well, quite possibly in a number of directions. Certainly the “put into hash/zero-length corresponding tmp variable” is a first class candidate for method re-engineering. It’s a very common occurrrence when working with char[] arrays in combination with collection objects of all sorts.

Advertisements
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: