Archive

Posts Tagged ‘HashMaps’

Inspecting HashMaps

November 7, 2010 Leave a comment

One query which seems to turn up frequently in the blog stat reports is how to iterate or print a HashMap so that its contents are visible. The bottom line is that HashMap itself doesn’t directly support an iterator. However the getKeys() method of HashMap returns a Set object which can be iterated without any problem.

public void printHashMap(HashMap ha)
{
Set keys = ha.keySet();
Iterator it = keys.iterator();

String sx = "";

while (it.hasNext()) {
          Object key = it.next();
        String sKey = key.toString();
        Object val = ha.get(key);
        String kVal = val.toString();
        sx += sKey + " - " + kVal + "\n";

    }
    System.out.println(sx);

}

Obviously once you’ve got the trick of handing off the iteration of a HashMap to the Set returned by getKeys() life gets altogether more comfortable; no longer is that HashMap a ferociously daunting object lurking in memory,a monster which you can’t be certain at any time as to what it contains, but it becomes an altogether more amenable and house-trained creature.

I’ll get round to a few more of these over the coming weeks, in between writing some more semantic parsing code which I’ll also be blogging about.

Categories: Java Tags: , ,

HashMap->ArrayList breakdown and reconstitution of documents in Java

March 2, 2010 Leave a comment

Given the general utility of HashMaps with ArrayLists as values, as illustrated in the earlier post HashMap key -> ArrayList, it’s fairly obvious that we can in most cases (actually all, but I’m open to persuasion that somewhere an exception may exist) represent a document, such as a short story by Lord Dunsany, a play by William Shakespeare, an XML or Soap document, in fact any document in which repetitious tokens occur or are likely to occur, as a set of keys indicating the tokens, and an arraylist of their relative positions in the document.

Let’s as an example take a block of text such as that found in Churchill’s speech of November 10, 1942 in the wake of Alexander and Montgomery repelling Rommel’s thrust at Alexandria: “Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning.” It’s short, sweet, and contains the criteria we need for a good example, notably multiple repetitive tokens. For the purposes of this examination we’ll totally neuter the punctuation and capitalisation so that we just have tokens:now this is not the end it is not even the beginning of the end but it is perhaps the end of the beginning. At this juncture it is still recognisable as Churchill’s text. Let’s turn this into a reproducible hashmap in which each unique word is a key to a value set which is an ArrayList containing instances of the positions of the word in the text which we supplied it with. I won’t bore you with repeating the addArrayListItem method (it’s in the post immediately below this one…) and the printOutHashMap method is somewhat self-evident but included for compeleteness.

public void someBoringMethodSig()
{
String inString = "now this is not the end it is not even the beginning of the end ";
inString += "but it is perhaps the end of the beginning";

HashMap ha = new HashMap();

String[] tok = inString.split(" ");
int k = tok.length;

 for (int i = 0; i< k; i++)
        {
        addArrayListItem(ha,tok[i],i );
        }

    printOutHashMap(ha);
   // assemble(ha);
    }
public void printOutHashMap(HashMap hamp)
{
Collection c = hamp.values();
Collection b = hamp.keySet();
Iterator itr = c.iterator();
Iterator wd = b.iterator();
    System.out.println("HASHMAP CONTENTS");
    System.out.println("================");
while(itr.hasNext()){
System.out.print("Key: >>" + wd.next() + "<< Values: " );//+ itr.next()); }
    ArrayList x = (ArrayList) itr.next();
    Iterator lit = x.iterator();
    while (lit.hasNext())
    {System.out.print(lit.next() + " ");

    }
    System.out.println("");
}

}

This will give us an output which looks very much like

HASHMAP CONTENTS
================
Key: >>not<< Values: 3 8
Key: >>of<< Values: 12 21
Key: >>but<< Values: 15
Key: >>is<< Values: 2 7 17
Key: >>beginning<< Values: 11 23
Key: >>it<< Values: 6 16
Key: >>now<< Values: 0
Key: >>even<< Values: 9
Key: >>the<< Values: 4 10 13 19 22
Key: >>perhaps<< Values: 18
Key: >>this<< Values: 1
Key: >>end<< Values: 5 14 20

We can obviously reassemble this by recourse to the values, either in forward (starting at 0) or in reverse order (finding the max int value) and reassembling in a decremental loop. The implications of this are significant when one considers the potential applications of this sort of document treatment from a number of perspectives, notably IT security, improved data transmissibility by reduction, cryptography (where one would not transmit the keys but a serializable numeric indicator to a separately transmitted (or held) common dictionary, etc, etc.

Generally the reassembly process looks something like the following :


public void assemble(HashMap hamp)
{
Collection c = hamp.values();
Collection b = hamp.keySet();

int posCtr = 0;

int max = 0;

Iterator xAll = hamp.values().iterator();
while(xAll.hasNext()){
  ArrayList xxx = (ArrayList) xAll.next();
  max += xxx.size();
  }

String stOut = "";
boolean unfinished = true;

while (unfinished)
{
Iterator itr = c.iterator();
Iterator wd = b.iterator();
start:
while(unfinished){

String tmp = (String) wd.next() ; // next key
ArrayList x = (ArrayList) itr.next(); // next ArrayList
Iterator lit = x.iterator(); // next ArrayList iterator

while (lit.hasNext())
{
//System.out.println(lit.next());
if((Integer.valueOf((Integer) lit.next()) == posCtr )) // if it matches the positional counter then this is the word we want...
{
stOut = stOut + " " + tmp ; // add the word in tmp to the output String
posCtr++;

wd = b.iterator(); //don't forget to reset the iterators to the beginning!
itr = c.iterator();
if(posCtr == max) // last word to deal with....
{
System.out.println("FINISHED");
unfinished = false;}
break start; // otherwise goto the start point, rinse, repeat (irl this would be handled more elegantly, illustrative code here!)

}

}

}
}
System.out.println("ASSEMBLED: " + stOut);
}

This can obviously be improved in a number of ways, particularly moving from a String-based paradigm to one in which char[] array representations replace Strings in the HashMap.

HashMap key -> ArrayList

March 1, 2010 Leave a comment

In the earlier and gentle introduction article on HashMaps, HashMaps and how to bend them to your will, I looked at how we might use a HashMap to do vocabulary count assessment of a passed String document. HashMaps are, as we have seen, ostensibly dumb key-value sets. I say ostensibly: the given key or value does not however have to be a single item as HashMap effects storage in both value and key as Object types. From a programmatic perspective this realisation has significant implications.

Let’s look at how we might use a keyed HashMap paired with values expressed as ArrayLists. The following methods will give you an idea of how to put int values in per key, the first method using a single value, the second handling an array of values. Obviously these methods are ripe for extension, exception handling and overloading but they are stripped down here from an illustrative point of view:


public HashMap<String, ArrayList> addArrayListItem(HashMap<String, ArrayList> hamp,String sKey,int value )
{
    if(!hamp.containsKey(sKey)) // no key found?
    {
    ArrayList lx = new ArrayList();
    lx.add(value);
    hamp.put(sKey, lx);
    }
    else // the key pre-exists so we just add the (new) value to the existing arraylist
    {
    ArrayList lx = hamp.get(sKey)    ;
    lx.add(value);
    hamp.put(sKey, lx);
    }
    return hamp;
}

public HashMap<String, ArrayList> addArrayListItems(HashMap<String, ArrayList> hamp,String sKey,int[] value )
{
    int iSize = value.length;
    ArrayList iList = (ArrayList) Arrays.asList(value);

    if(!hamp.containsKey(sKey)) // no key found? // create key and add list
    {
    ArrayList lx = new ArrayList();
    lx.addAll(iList);
    hamp.put(sKey, lx);
    }
    else // the key pre-exists so we just add the array of values to the existing list
    {
    ArrayList lx = hamp.get(sKey)    ;
    lx.addAll(iList);
    hamp.put(sKey, lx);
    }

    return hamp;
}

The getter method to retrieve by key is even easier (although again this method is open to extension, overloading, exception handling etc):

public ArrayList<int[]> getItems(HashMap<String, ArrayList> hamp,String sKey)
{
return hamp.get(sKey);
}

We can moreover use an ArrayList in an arbitrary, or, indeed, in a very carefully structured, fashion, e.g. to store not only integers but Strings, booleans, whatever, in order to achieve our purpose. Again, an untyped ArrayList is Object type agnostic as the following snippet of dummy code illustrates.

        ArrayList al = new ArrayList();
        int i = 1;
        String s = "Monty Python";
        boolean isVerySilly = true;
        al.add(i);
        al.add(s);
        al.add(isVerySilly);
        Iterator it = al.iterator();
        while (it.hasNext())
        {
            System.out.println(it.next());
        }

The realisation of this has similarly significant implications for the intelligent programmer. Exploiting the HashMap/ArrayList (or similar collection type) combination allows you to do some very powerful things with the data coming your way.

HashMaps and how to bend them to your will

February 26, 2010 Leave a comment

When processing large volumes of String data, the ability to use HashMaps intelligently and creatively can be a very useful skillset to have in your armoury.  Let’s consider a hypothetical situation where you want to understand the range and extent of a given author’s vocabulary (actually, it’s not so hypothetical since I have a tool for doing probabilitistic analysis of documents which utilises a number of  methods and strategies which are very similar to those which I am about to expose).

By way of illustration, let’s start with an abridgement of the first few paragraphs of the wonderful short story by Lord Dunsany, How Nuth Would Have Practiced His Art Upon the Gnoles, a cautionary tale for the imprudent if ever there were. 

Despite the advertisements of rival firms, it is probable that every
tradesman knows that nobody in business at the present time has a
position equal to that of Mr. Nuth. To those outside the magic circle
of business, his name is scarcely known; he does not need to
advertise, he is consummate. He is superior even to modern
competition, and, whatever claims they boast, his rivals know it. His
terms are moderate, so much cash down when when the goods are
delivered, so much in blackmail afterwards. [..]

It must not be thought that I am a friend of Nuth’s; on the contrary
such politics as I have are on the side of Property; and he needs no
words from me, for his position is almost unique in trade, being among
the very few that do not need to advertise.

Given that Gutenberberg source material is derived from volunteers using scanning technologies etc of varying degrees of reliability and when dealing with text of this nature (or in fact of any sort) there are inevitably going to be a number of hurdles to deal with, we will need to accommodate the preponderance of these before parsing the text into our target HashMap. Duplicate white space, a problem I have identified and pointed up in an earlier article is potentially one; typographical errors will undoubtedly occur, line separation will need to be addressed, punctuation resolved, and, in the case of your source text being a  play, dramatic instructions and character identification issues will need handling. I say that we will need to handle the greater majority of the issues up front, although the construction of a free-text document parser and the implementation of its results is often and by and large a somewhat iterative discovery process, where the text will inevitably yield up some unique problems of its own. What we want to end up with prior to throwing the text into the HashMap is a homogenous collection of largely punctuation-free (exceptions being contractions like won’t and hasn’t and hyphenated words such as co-operative) single-space delimited words which looks something not unlike the abridged representation below:

Despite the advertisements of rival firms it is probable that every
tradesman knows that nobody in business at the present time has a
position equal to that of Mr Nuth [..]  do not need to advertise

In this instance, even with the punctuation and line wraps resolved, we will still at some point have to deal with issues relating to proper nouns, capitalisation of standard words, etc.

We could quickly count the words in the text by recycling the vowel count method explained in an earlier article, substituting the single whitespace char ' ' for the char[] array of vowels, so we immediately know how many tokens we are going to have to parse but we don’t really need to nor care at this juncture since we are going to have to String.split the text and can get the wordcount back from the resultant array size in any case.


HashMap ha = new HashMap();
String[] splitWords = theDoc.split(" "); // break the doc into an array based on cleansed ws

int len = splitWords.length; // and find out how many words we have

Now we need to iterate through our array of words. The rules of this game are simple. The keys of the hashmap will be unique words, the values will be the number we have thus far encountered as we walk though the array. If the word is NOT already in the hashmap, we have to create a new key for the word and set the value to 1. If, however, the word is already in the hashmap, we just increment the current value by 1. To paraphrase those pesky meerkats on TV, simples.

OK, let’s look at how we might do the iteration, carrying on from where we left off with the code snippet above…

for (int i = 0; i < len; i++ )
    {
    String word = splitWords[i].trim(); 

    if(!ha.containsKey(word)) // if the word isn't already in the hashmap....
    {
    ha.put(word, 1); // add the word to the HashMap as a key and set the value to 1...
    }
    else // it must be in the hashmap
    {
    Integer objectInteger = (Integer) ha.get(word); // ha.get returns Object so we cast to Integer
    int counter = objectInteger.intValue(); // and thence to an int
    counter += 1; // increment the value
    ha.put(word, counter); // and put the new value back in the hashmap by key
    }

    }

The only real nasty with HashMaps is that they have a tendency, by dint of their generic nature, to produce output with methods like get and put as Objects which you will need to cast appropriately. The upside to this is that you don’t have to overly concern yourself with what type of Objects you are putting in them so long as you deal with them appropriately.

Interrogating your HashMap to see what it looks like when you’re done is a snip if your hashmap is fully populated with both keys and values (which your’s will be since you pump the value for each word when you create the key). Here’s a quick example that dumps a hashmap to screen.

public void printPopulatedHash(HashMap hamp)
{
Collection c = hamp.values(); // put all the values in c 
Collection b = hamp.keySet(); // and all the keys go in b
Iterator itr = c.iterator(); // create an iterator for each...
Iterator wd = b.iterator();
 System.out.println("HASHMAP CONTENTS");
 System.out.println("================");
while(wd.hasNext()){
System.out.println("Key: " + wd.next() + " Value: " + itr.next()); }

}

We could, of course, have obviated the need to split the text into an array and processed the corpus as a StringBuffer. This, for all the well-worn reasons would work a lot faster.

This is a fairly straightforward process with only a little array arithmetic involved and I’ll address how you’d do it now, since the grist of the article, getting stuff into and out of HashMaps has been covered. It’ll parse teh Gutenberg Hamlet as a text file to a HashMap count of unique words in under 16 milliseconds on my clunky laptop… Benchmarking this against the String[] split version earlier, the StringBuffer method comes in at +/- 15% faster, which really is not significant if you’re only processing a single document, but if you have a lot of them to do, it all adds up….

public HashMap doSbHashing(String theDoc)
{
HashMap ha = new HashMap();
StringBuffer sb = new StringBuffer(theDoc.trim());
sb.append(' '); // stick a char whitespace on the end so it will handle the last word without an unnecessary if condition;
int len = sb.length();
int ctr = 0;
boolean finished = false;
StringBuffer sbx = new StringBuffer(30); // not may words longer than this unless you're dealing in chemicals in which case fix accordingly

while (!finished)
{
       
    if ((sb.charAt(ctr) != ' '))
    {sbx.append(sb.charAt(ctr));} // add a char into the temp stringbuffer
    else
    {

    String sWord = sbx.toString(); // it must be a word becuase we hit whitespace
   
    if(!ha.containsKey(sWord)) // if there's NOT a mapping for the word as a key
    {
    ha.put(sWord, 1);
    sbx.delete(0, sbx.length());
    }
    else //the word is in the map already
    {
    
    Integer objectInteger = (Integer) ha.get(sWord);
    int counter = objectInteger.intValue();
    counter += 1;
    ha.put(sWord, counter);
    sbx.delete(0, sbx.length());
    }

    }
    ctr++;
    if (ctr == len)
    {finished = true;}
}
return ha;
}

You’ll note that the expression

ha.put(sWord, 1);
sbx.delete(0, sbx.length());

has a definite correspondence with its partner in crime

ha.put(sWord, ctr);
sbx.delete(0, sbx.length());

and its certainly one that needs to be flagged as a contender for becoming a method in its own right. Moreover in spotting this correspondence between the two snippets of code we have the first inkling that we might also start to think in terms of extending the HashMap class as well, quite possibly in a number of directions. Certainly the “put into hash/zero-length corresponding tmp variable” is a first class candidate for method re-engineering. It’s a very common occurrrence when working with char[] arrays in combination with collection objects of all sorts.