arrays | The Darker Side of Programming

HashMap key -> ArrayList

March 1, 2010 syzygy Leave a comment

In the earlier and gentle introduction article on HashMaps, HashMaps and how to bend them to your will, I looked at how we might use a HashMap to do vocabulary count assessment of a passed String document. HashMaps are, as we have seen, ostensibly dumb key-value sets. I say ostensibly: the given key or value does not however have to be a single item as HashMap effects storage in both value and key as Object types. From a programmatic perspective this realisation has significant implications.

Let’s look at how we might use a keyed HashMap paired with values expressed as ArrayLists. The following methods will give you an idea of how to put int values in per key, the first method using a single value, the second handling an array of values. Obviously these methods are ripe for extension, exception handling and overloading but they are stripped down here from an illustrative point of view:


public HashMap<String, ArrayList> addArrayListItem(HashMap<String, ArrayList> hamp,String sKey,int value )
{
    if(!hamp.containsKey(sKey)) // no key found?
    {
    ArrayList lx = new ArrayList();
    lx.add(value);
    hamp.put(sKey, lx);
    }
    else // the key pre-exists so we just add the (new) value to the existing arraylist
    {
    ArrayList lx = hamp.get(sKey)    ;
    lx.add(value);
    hamp.put(sKey, lx);
    }
    return hamp;
}

public HashMap<String, ArrayList> addArrayListItems(HashMap<String, ArrayList> hamp,String sKey,int[] value )
{
    int iSize = value.length;
    ArrayList iList = (ArrayList) Arrays.asList(value);

    if(!hamp.containsKey(sKey)) // no key found? // create key and add list
    {
    ArrayList lx = new ArrayList();
    lx.addAll(iList);
    hamp.put(sKey, lx);
    }
    else // the key pre-exists so we just add the array of values to the existing list
    {
    ArrayList lx = hamp.get(sKey)    ;
    lx.addAll(iList);
    hamp.put(sKey, lx);
    }

    return hamp;
}

The getter method to retrieve by key is even easier (although again this method is open to extension, overloading, exception handling etc):

public ArrayList<int[]> getItems(HashMap<String, ArrayList> hamp,String sKey)
{
return hamp.get(sKey);
}

We can moreover use an ArrayList in an arbitrary, or, indeed, in a very carefully structured, fashion, e.g. to store not only integers but Strings, booleans, whatever, in order to achieve our purpose. Again, an untyped ArrayList is Object type agnostic as the following snippet of dummy code illustrates.

        ArrayList al = new ArrayList();
        int i = 1;
        String s = "Monty Python";
        boolean isVerySilly = true;
        al.add(i);
        al.add(s);
        al.add(isVerySilly);
        Iterator it = al.iterator();
        while (it.hasNext())
        {
            System.out.println(it.next());
        }

The realisation of this has similarly significant implications for the intelligent programmer. Exploiting the HashMap/ArrayList (or similar collection type) combination allows you to do some very powerful things with the data coming your way.

Categories: Java, Uncategorized Tags: ArrayList, arrays, HashMaps, Java, String Manipulation

HashMaps and how to bend them to your will

February 26, 2010 syzygy Leave a comment

When processing large volumes of String data, the ability to use HashMaps intelligently and creatively can be a very useful skillset to have in your armoury. Let’s consider a hypothetical situation where you want to understand the range and extent of a given author’s vocabulary (actually, it’s not so hypothetical since I have a tool for doing probabilitistic analysis of documents which utilises a number of methods and strategies which are very similar to those which I am about to expose).

By way of illustration, let’s start with an abridgement of the first few paragraphs of the wonderful short story by Lord Dunsany, How Nuth Would Have Practiced His Art Upon the Gnoles, a cautionary tale for the imprudent if ever there were.

Despite the advertisements of rival firms, it is probable that every
tradesman knows that nobody in business at the present time has a
position equal to that of Mr. Nuth. To those outside the magic circle
of business, his name is scarcely known; he does not need to
advertise, he is consummate. He is superior even to modern
competition, and, whatever claims they boast, his rivals know it. His
terms are moderate, so much cash down when when the goods are
delivered, so much in blackmail afterwards. [..]

It must not be thought that I am a friend of Nuth’s; on the contrary
such politics as I have are on the side of Property; and he needs no
words from me, for his position is almost unique in trade, being among
the very few that do not need to advertise.

Given that Gutenberberg source material is derived from volunteers using scanning technologies etc of varying degrees of reliability and when dealing with text of this nature (or in fact of any sort) there are inevitably going to be a number of hurdles to deal with, we will need to accommodate the preponderance of these before parsing the text into our target HashMap. Duplicate white space, a problem I have identified and pointed up in an earlier article is potentially one; typographical errors will undoubtedly occur, line separation will need to be addressed, punctuation resolved, and, in the case of your source text being a play, dramatic instructions and character identification issues will need handling. I say that we will need to handle the greater majority of the issues up front, although the construction of a free-text document parser and the implementation of its results is often and by and large a somewhat iterative discovery process, where the text will inevitably yield up some unique problems of its own. What we want to end up with prior to throwing the text into the HashMap is a homogenous collection of largely punctuation-free (exceptions being contractions like won’t and hasn’t and hyphenated words such as co-operative) single-space delimited words which looks something not unlike the abridged representation below:

Despite the advertisements of rival firms it is probable that every
tradesman knows that nobody in business at the present time has a
position equal to that of Mr Nuth [..] do not need to advertise

In this instance, even with the punctuation and line wraps resolved, we will still at some point have to deal with issues relating to proper nouns, capitalisation of standard words, etc.

We could quickly count the words in the text by recycling the vowel count method explained in an earlier article, substituting the single whitespace char ' ' for the char[] array of vowels, so we immediately know how many tokens we are going to have to parse but we don’t really need to nor care at this juncture since we are going to have to String.split the text and can get the wordcount back from the resultant array size in any case.


HashMap ha = new HashMap();
String[] splitWords = theDoc.split(" "); // break the doc into an array based on cleansed ws

int len = splitWords.length; // and find out how many words we have

Now we need to iterate through our array of words. The rules of this game are simple. The keys of the hashmap will be unique words, the values will be the number we have thus far encountered as we walk though the array. If the word is NOT already in the hashmap, we have to create a new key for the word and set the value to 1. If, however, the word is already in the hashmap, we just increment the current value by 1. To paraphrase those pesky meerkats on TV, simples.

OK, let’s look at how we might do the iteration, carrying on from where we left off with the code snippet above…

for (int i = 0; i < len; i++ )
    {
    String word = splitWords[i].trim(); 

    if(!ha.containsKey(word)) // if the word isn't already in the hashmap....
    {
    ha.put(word, 1); // add the word to the HashMap as a key and set the value to 1...
    }
    else // it must be in the hashmap
    {
    Integer objectInteger = (Integer) ha.get(word); // ha.get returns Object so we cast to Integer
    int counter = objectInteger.intValue(); // and thence to an int
    counter += 1; // increment the value
    ha.put(word, counter); // and put the new value back in the hashmap by key
    }

    }

The only real nasty with HashMaps is that they have a tendency, by dint of their generic nature, to produce output with methods like get and put as Objects which you will need to cast appropriately. The upside to this is that you don’t have to overly concern yourself with what type of Objects you are putting in them so long as you deal with them appropriately.

Interrogating your HashMap to see what it looks like when you’re done is a snip if your hashmap is fully populated with both keys and values (which your’s will be since you pump the value for each word when you create the key). Here’s a quick example that dumps a hashmap to screen.

public void printPopulatedHash(HashMap hamp)
{
Collection c = hamp.values(); // put all the values in c 
Collection b = hamp.keySet(); // and all the keys go in b
Iterator itr = c.iterator(); // create an iterator for each...
Iterator wd = b.iterator();
 System.out.println("HASHMAP CONTENTS");
 System.out.println("================");
while(wd.hasNext()){
System.out.println("Key: " + wd.next() + " Value: " + itr.next()); }

}

We could, of course, have obviated the need to split the text into an array and processed the corpus as a StringBuffer. This, for all the well-worn reasons would work a lot faster.

This is a fairly straightforward process with only a little array arithmetic involved and I’ll address how you’d do it now, since the grist of the article, getting stuff into and out of HashMaps has been covered. It’ll parse teh Gutenberg Hamlet as a text file to a HashMap count of unique words in under 16 milliseconds on my clunky laptop… Benchmarking this against the String[] split version earlier, the StringBuffer method comes in at +/- 15% faster, which really is not significant if you’re only processing a single document, but if you have a lot of them to do, it all adds up….

public HashMap doSbHashing(String theDoc)
{
HashMap ha = new HashMap();
StringBuffer sb = new StringBuffer(theDoc.trim());
sb.append(' '); // stick a char whitespace on the end so it will handle the last word without an unnecessary if condition;
int len = sb.length();
int ctr = 0;
boolean finished = false;
StringBuffer sbx = new StringBuffer(30); // not may words longer than this unless you're dealing in chemicals in which case fix accordingly

while (!finished)
{
       
    if ((sb.charAt(ctr) != ' '))
    {sbx.append(sb.charAt(ctr));} // add a char into the temp stringbuffer
    else
    {

    String sWord = sbx.toString(); // it must be a word becuase we hit whitespace
   
    if(!ha.containsKey(sWord)) // if there's NOT a mapping for the word as a key
    {
    ha.put(sWord, 1);
    sbx.delete(0, sbx.length());
    }
    else //the word is in the map already
    {
    
    Integer objectInteger = (Integer) ha.get(sWord);
    int counter = objectInteger.intValue();
    counter += 1;
    ha.put(sWord, counter);
    sbx.delete(0, sbx.length());
    }

    }
    ctr++;
    if (ctr == len)
    {finished = true;}
}
return ha;
}

You’ll note that the expression

ha.put(sWord, 1);
sbx.delete(0, sbx.length());

has a definite correspondence with its partner in crime

ha.put(sWord, ctr);
sbx.delete(0, sbx.length());

and its certainly one that needs to be flagged as a contender for becoming a method in its own right. Moreover in spotting this correspondence between the two snippets of code we have the first inkling that we might also start to think in terms of extending the HashMap class as well, quite possibly in a number of directions. Certainly the “put into hash/zero-length corresponding tmp variable” is a first class candidate for method re-engineering. It’s a very common occurrrence when working with char[] arrays in combination with collection objects of all sorts.

Categories: Java Tags: arrays, HashMaps, Java, refactoring, String Manipulation, StringBuffer

List to Array conversion optimisation

February 25, 2010 syzygy Leave a comment

I’ve started heavily messing around with profilers recently trying to scrape a millisecond here and there out of some of the more lurid and extravagant processes that my String parsers like to engage in since the gap between what is completely unfeasible and what will actually make a difference has narrowed considerably over the past week or so.

One thing which I did recently spot as a big win was the replacement of something I wrote back in the Dark Ages by candlelight with a phoenix-quill pen in unicorn’s blood on chimaera-skin parchment, an iterative method for converting a List of Strings to an array of Strings. In the cold light of day, the recognition that one can write such convoluted and obviously inefficient nonsense and maintain it without really questioning it for so long is quite sobering. Less, in this case, to cite the poet Robert Browning, truly is more.

The new improved and much-abridged method is a) significantly shorter and altogether more obvious in what its intent is (always a sign that things are improving in one’s code!) and b) typically runs somewhere between 5 and 20 times faster in my stress harnesses dependant on the sizes and types of String it is dealing with. I’m obliged to maintain it for the time being albeit with a Deprecated annotation since it is a horribly pervasive method appearing here and there in a lot of my Dark Ages Java coding. I could just have renamed the methods but I feel that keeping them distinct will force me to revisit code that makes use of the now-deprecated method, not, in itself, a bad thing since similar monsters may be there within. I will get round to those references though…. I promise…


 /**
     * Converts a list to an array -OPTIMAL METHOD
     * @param l Inst
     * @return String array emanating from list
     */
    private String[] strArrayFromList(List<String> list)
    {
    return  list.toArray(new String[0]) ;
    }

    /**
     * Converts a list to an array DEPRECATED, use the more efficient  strArrayFromList (see above) 
     * @param lInst List to process
     * @return String array derived from incoming list
     */
    @Deprecated
    public String[] listToArray(List<String> lInst)
    {
    int k = lInst.size();
    String[] out = new String[k];
    Iterator it = lInst.iterator();
    int ctr = 0;
    while( it.hasNext())
    {
    out[ctr] = it.next().toString();
    ctr++; }
     return out;
    }

Categories: Java Tags: arrays, Java, refactoring, String Manipulation

Counting vowels… the faster AND dynamically reusable way…

February 5, 2010 syzygy 2 comments

OK this is a better solution than the earlier one (I’ve woken up and consumed the correct dose of coffee to make my brain operate at near-normal levels now). The reason it’s better of course is because it replaces context-specific code with code which is reusable beyond the narrow confines of its original context.

public int countStringContents(String s, String checklist)
{
StringBuffer sb = new StringBuffer(s);
StringBuffer checkable = new StringBuffer(checklist);
int retval = 0;
   for (int i = 0; i < sb.length(); i++)
    {
        char cTest = sb.charAt(i);
        for(int k = 0; k < checkable.length(); k++)
        {
        if (sb.charAt(i) == checkable.charAt(k))
        {
            retval ++;
            break;
        }
        }
    }
return retval;
}

All we’re really doing here is repping the case statement with a zippy little comparator loop operating on a Stringbuffer of characters we’re interested in – if the chars match it hops out of the loop and back up to the top of the outer loop. Let’s say we wanted to find all the capitalised Xs and Ys in a String in this example:

StringTools st = new StringTools();
String s = "Nothing but us three XXXs and two YYs";
String toCheck="XY";
int x = st.countStringContents(s, toCheck);
System.out.println(x);

This should return us 5. To check for vowels, we’d simply change the String toCheck value from “XY” to “aeiouAEIOU”.

Now let’s consider how we’d do this with the slightly more complex but still context-specific code in the earlier identifyVowels method.

public int[] locateStringContents(StringBuffer sb, StringBuffer sbCheckable)
    {  
    List l = new ArrayList();
    
    for (int i = 0; i < sb.length(); i++)
    {
        char cTest = sb.charAt(i);


        for(int k = 0; k < sbCheckable.length(); k++)
        {
        if (sb.charAt(i) == sbCheckable.charAt(k))
        {
            l.add(i); 
            break;
        }

        }    
    }
    return at.toIntArray(l); // see earlier posts for how this is implemented!
    }

We can test this as follows:

StringTools st = new StringTools();
    String s = "Nothing but us three XXXs and two YYs";
    String toCheck="XY";
    int x = st.countStringContents(s, toCheck);
    System.out.println(x);
    int[] xx = st.locateStringContents(new StringBuffer(s), new StringBuffer(toCheck));
        System.out.println(xx.length);
    for (int i = 0; i < xx.length; i++)
    {
        System.out.print(xx[i]);
        System.out.print(",");
    }

Categories: Java Tags: arrays, Java, refactoring, String Manipulation, StringBuffer

Counting vowels in Strings… the fast way…

February 5, 2010 syzygy Leave a comment

Came across an interesting little search in the dashboard of this blog the other day where someone was looking to count the number of vowels in a given String, or potentially an array of Strings. Doing this by direct String manipulation or regexping the values out could be a very time-consuming operation so I ran up a little piece of code to facilitate this (I can see some very practical applications for it for my own ends, so operating here from a position of enlightened self-interest!).

Anyway it’s straightforward enough to do via the StringBuffer:

public int vowelCounter(String s)
{
StringBuffer sb = new StringBuffer(s);
int retval = 0;
   for (int i = 0; i < sb.length(); i++)
    {
        char cTest = sb.charAt(i);
                
        
        switch (cTest) {
            case 'a':  retval++; break;
            case 'e':  retval++; break;
            case 'i': retval++; break;
            case 'o':  retval++; break;
            case 'u':  retval++; break;
            case 'A':  retval++; break;
            case 'E':  retval++; break;
            case 'I': retval++; break;
            case 'O':  retval++; break;
            case 'U':  retval++; break;
            default: ;
        }
       
    }
return retval;
 }

Obviously if it’s more than just one String, such as an array of Strings or other Collection type, you just have to have a counter variable, then iterate through the array or other collection type calling the vowelCounter method with the String to be dealt with at the time, incrementing the counter variable as necessary on each loop.

We can make this problem more interesting by extending it a little and asking ‘what if we need to know where the vowels are?’ Well, if you’ve looked at the post on the subject of cleaning whitespace, there’s a nice little StringBuffer prototype there for this already in the identifyWS method, which I’ll repeat here to save you scrolling down the page.

 private int[] identifyWS(StringBuffer sb)
    {
    List l = new ArrayList();
    for (int i = 0; i < sb.length(); i++)
    {
        if (sb.charAt(i) == ' ')
        {
        l.add(i);
        }
    }
    return at.toIntArray(l);
    }

The modfication required here to make this work is trivial, and really all we need to do is rename the method and substitute a slightly modfied version of the switch statement in vowelCount for the if condition:

 private int identifyVowels(StringBuffer sb)
    {
    List l = new ArrayList();
    for (int i = 0; i < sb.length(); i++)
    {
        char cTest = sb.charAt(i);
               
        
        switch (cTest) {
            case 'a':   l.add(i); break;
            case 'e':  l.add(i); break;            
            case 'i':  l.add(i); break;
            case 'o':   l.add(i); break;
            case 'u':   l.add(i); break;
            case 'A':   l.add(i); break;
            case 'E':   l.add(i); break;
            case 'I':  l.add(i); break;
            case 'O':   l.add(i); break;
            case 'U':   l.add(i); break;
            default: break;
        }
    }
    return toIntArray(l);
    }

And you’ll need the toIntArray method, also given earlier:


public int[] toIntArray(List<Integer> integerList) {
int[] intArray = new int[integerList.size()];
for (int i = 0; i < integerList.size(); i++) {
intArray[i] = integerList.get(i);
}
return intArray;
}

This will return you an array of positions of all the vowels in the passed String.

Categories: Java, Uncategorized Tags: arrays, Java, String Manipulation, StringBuffer

Implementing a fast method for cleaning arrays returned as Strings

February 1, 2010 syzygy Leave a comment

In the article of 30.1.10 on String Sorts I talked about a means of cleansing an array returned as a String by using for illustrative purposes one which had the somewhat iffy method name of “topAndTail” and which acts using Strings rather than a StringBuffer, e.g. so that when you return your array via .toString() you end up with output that looks like:

[ a, b, c, d, e ]

which you want back as plain-text

a b c d e

I mentioned that there was a more efficient way of doing this via a StringBuffer. Typically I would implement this StringBuffer-acting method as follows:

public String fastTopAndTail(String toDo)
{
StringBuffer sb = new StringBuffer(toDo);
sb.replace(0, 1, ""); // kill the first [
int k = sb.length();
sb.replace(k - 1, k, ""); // last [ killed
k = sb.length();
for (int i = 0; i < k; i++)
{
if(sb.charAt(i) == ',')
{
sb.replace(i,i + 1,"");
k = sb.length(); 
}
}
return sb.toString();
}

It’s doing exactly the same work as the earlier example topAndTail method but will run a lot faster under a heavy load since it isn’t making lots of copies of String objects. BTW, the call to keep the variable k up to date in the loop is redundant here & you’re welcome to rip it from your copy without expecting adverse effects but by default I like to keep a handle on the k value since many of my StringBuffer methods have a very similar structure and often the char is not being replaced but deleted or added to. If, however, this were production code it would go 🙂

Categories: Java Tags: arrays, Java, String Manipulation, String sorts, StringBuffer

Fun with String vectors… Part 1

September 13, 2009 syzygy Leave a comment

Working extensively with Strings inevitably entails working with arrays. Lots of them. In this article I will be looking at how some of the pain can be taking out of this in our putative replacement class for String.

First off, one thing which makes working with String arrays easier is to convert your array to a Vector, ArrayList or one of the other Collection classes; your mileage will vary based around a number of criteria, not the least being what you intend to do, whether you need to handle this in a threaded manner, etc, etc and due consideration of the right Collection class to utilise is important; I have opted for Vector for this illustration because it has broadly based utility.

Arrays inherently are messy to work with and serial abuse of them inevitably results in an unhealthy proliferation of for loops, often nested to the point of Lovecraftian insanity.

Let’s therefore pretend for the purposes of simplicity & clarity in this article that we only have the Vector class with which to work. First off we’ll need to add

import java.util.Vector;

to our class header.

On demand empty String Vector

Next we’ll need a method to create an instant on-demand String Vector method:

public  Vector<String> getPreparedVector()
{
Vector<String> vOutput = new Vector<String>();
return vOutput;
}

Converting String[] to a Vector

Next up we need to have a method to pass a String array to a Vector since most of the inbuilt String class methods we frequently need recourse to produce output as an array. This we can do as follows:

/*
* @param inArray - the array to convert to a Vector
* @return the populated Vector from incoming String array
*/
public  Vector getVectorFromStringArray(String[] inArray)
{
// since this will be the same length as s but parsed
Vector<String> v = getPreparedVector();
for (int i = 0; i < inArray.length; i++)
{
v.add(inArray&#91;i&#93;);
}
return v;
}&#91;/sourcecode&#93;

<strong>Vector to String[]</strong>

And of course we will likely need a method to reverse the process, i.e. to convert the Vector back to an array...

/**
* Will convert a passed vector to a string array  
* @param Vector v - the vector for conversion
* @return String array of the converted vector
*/
public String[] vectorToStringArray( Vector v )
{
int count = v.size();
String[] outArray = new String[count];
v.copyInto(outArray);
return outArray;
}

Vector is responsible for very little of the code in the above method, and you are probably seeing already why it is a better candidate for hardcore work than an array: it has functionality which obviates the need for intricate loops. Most of the effort here derives from having to accomodate the somewhat Byzantine complexities of the Array class.

Growing a vector

One immediate advantage which a Vector has over an array is that the number of elements which it contains can be increased or decreased dynamically. The API documentation for Vector makes this point explicitly:

The Vector class implements a growable array of objects. Like an array, it contains components that can be accessed using an integer index. However, the size of a Vector can grow or shrink as needed to accommodate adding and removing items after the Vector has been created.

Each vector tries to optimize storage management by maintaining a capacity and a capacityIncrement. The capacity is always at least as large as the vector size; it is usually larger because as components are added to the vector, the vector’s storage increases in chunks the size of capacityIncrement. An application can increase the capacity of a vector before inserting a large number of components; this reduces the amount of incremental reallocation.

Inevitably, we will want to add and remove elements from our Vector; a royal pain with an Array, but a moment’s work in a Vector…

Vector<String> v = getPreparedVector();
v.add("Hello");
v.add("World");
v.add("A superfluous element");
v.add("etc");
v.remove(3);

[..]

Categories: Java Tags: arrays, Java, Programming, String Manipulation, vectors

The Darker Side of Programming

Archive

HashMap key -> ArrayList

HashMaps and how to bend them to your will

List to Array conversion optimisation

Counting vowels… the faster AND dynamically reusable way…

Counting vowels in Strings… the fast way…

Implementing a fast method for cleaning arrays returned as Strings

Fun with String vectors… Part 1

Coding undreamed of by H.P. Lovecraft

Email Subscription

Archives

Meta