Archive

Posts Tagged ‘Document assembly’

HashMap->ArrayList breakdown and reconstitution of documents in Java

March 2, 2010 Leave a comment

Given the general utility of HashMaps with ArrayLists as values, as illustrated in the earlier post HashMap key -> ArrayList, it’s fairly obvious that we can in most cases (actually all, but I’m open to persuasion that somewhere an exception may exist) represent a document, such as a short story by Lord Dunsany, a play by William Shakespeare, an XML or Soap document, in fact any document in which repetitious tokens occur or are likely to occur, as a set of keys indicating the tokens, and an arraylist of their relative positions in the document.

Let’s as an example take a block of text such as that found in Churchill’s speech of November 10, 1942 in the wake of Alexander and Montgomery repelling Rommel’s thrust at Alexandria: “Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning.” It’s short, sweet, and contains the criteria we need for a good example, notably multiple repetitive tokens. For the purposes of this examination we’ll totally neuter the punctuation and capitalisation so that we just have tokens:now this is not the end it is not even the beginning of the end but it is perhaps the end of the beginning. At this juncture it is still recognisable as Churchill’s text. Let’s turn this into a reproducible hashmap in which each unique word is a key to a value set which is an ArrayList containing instances of the positions of the word in the text which we supplied it with. I won’t bore you with repeating the addArrayListItem method (it’s in the post immediately below this one…) and the printOutHashMap method is somewhat self-evident but included for compeleteness.

public void someBoringMethodSig()
{
String inString = "now this is not the end it is not even the beginning of the end ";
inString += "but it is perhaps the end of the beginning";

HashMap ha = new HashMap();

String[] tok = inString.split(" ");
int k = tok.length;

 for (int i = 0; i< k; i++)
        {
        addArrayListItem(ha,tok[i],i );
        }

    printOutHashMap(ha);
   // assemble(ha);
    }
public void printOutHashMap(HashMap hamp)
{
Collection c = hamp.values();
Collection b = hamp.keySet();
Iterator itr = c.iterator();
Iterator wd = b.iterator();
    System.out.println("HASHMAP CONTENTS");
    System.out.println("================");
while(itr.hasNext()){
System.out.print("Key: >>" + wd.next() + "<< Values: " );//+ itr.next()); }
    ArrayList x = (ArrayList) itr.next();
    Iterator lit = x.iterator();
    while (lit.hasNext())
    {System.out.print(lit.next() + " ");

    }
    System.out.println("");
}

}

This will give us an output which looks very much like

HASHMAP CONTENTS
================
Key: >>not<< Values: 3 8
Key: >>of<< Values: 12 21
Key: >>but<< Values: 15
Key: >>is<< Values: 2 7 17
Key: >>beginning<< Values: 11 23
Key: >>it<< Values: 6 16
Key: >>now<< Values: 0
Key: >>even<< Values: 9
Key: >>the<< Values: 4 10 13 19 22
Key: >>perhaps<< Values: 18
Key: >>this<< Values: 1
Key: >>end<< Values: 5 14 20

We can obviously reassemble this by recourse to the values, either in forward (starting at 0) or in reverse order (finding the max int value) and reassembling in a decremental loop. The implications of this are significant when one considers the potential applications of this sort of document treatment from a number of perspectives, notably IT security, improved data transmissibility by reduction, cryptography (where one would not transmit the keys but a serializable numeric indicator to a separately transmitted (or held) common dictionary, etc, etc.

Generally the reassembly process looks something like the following :


public void assemble(HashMap hamp)
{
Collection c = hamp.values();
Collection b = hamp.keySet();

int posCtr = 0;

int max = 0;

Iterator xAll = hamp.values().iterator();
while(xAll.hasNext()){
  ArrayList xxx = (ArrayList) xAll.next();
  max += xxx.size();
  }

String stOut = "";
boolean unfinished = true;

while (unfinished)
{
Iterator itr = c.iterator();
Iterator wd = b.iterator();
start:
while(unfinished){

String tmp = (String) wd.next() ; // next key
ArrayList x = (ArrayList) itr.next(); // next ArrayList
Iterator lit = x.iterator(); // next ArrayList iterator

while (lit.hasNext())
{
//System.out.println(lit.next());
if((Integer.valueOf((Integer) lit.next()) == posCtr )) // if it matches the positional counter then this is the word we want...
{
stOut = stOut + " " + tmp ; // add the word in tmp to the output String
posCtr++;

wd = b.iterator(); //don't forget to reset the iterators to the beginning!
itr = c.iterator();
if(posCtr == max) // last word to deal with....
{
System.out.println("FINISHED");
unfinished = false;}
break start; // otherwise goto the start point, rinse, repeat (irl this would be handled more elegantly, illustrative code here!)

}

}

}
}
System.out.println("ASSEMBLED: " + stOut);
}

This can obviously be improved in a number of ways, particularly moving from a String-based paradigm to one in which char[] array representations replace Strings in the HashMap.