Archive

Posts Tagged ‘String Manipulation’

Improved method for removing duplicate white space

March 23, 2011 1 comment

On the principle that constant refactoring is a good thing, I revisited my method for removing duplicate white space from Strings / StringBuffers. The result was extremely positive, a much cleaner and more streamlined method.

private StringBuffer rmDuplicateWS(StringBuffer sb)
{
int currentPos = 0;
char ws = ' ';
// trim the leading whitespace

while(sb.charAt(0)  == ws)
{
sb.deleteCharAt(0);
}
// now get the trailing whitespace

while(sb.charAt(sb.length() - 1)  == ws)
{
sb.deleteCharAt(sb.length() - 1);
}
// loop until we reach the end, deleting duplicate ws instances

boolean chk = true;
while(chk)
{
if((sb.charAt(currentPos) == ws) && (sb.charAt(currentPos + 1) == ws) )
{sb.deleteCharAt(currentPos);}
else
{currentPos++;}
if(currentPos == sb.length() - 1)
{chk = false;} // exit
}

return sb;
}

Advertisements

Implementing a split (or a pseudo-split) for StringBuffers/Builders

March 15, 2011 Leave a comment

One of the functionalities regrettably absent from the StringBuilder/StringBuffer families is the inbuilt and nice String method split(String regexp) (q.v. java.lang.String.split()), which will produce a tokenised array of Strings based around and consuming the supplied regexp token. The cranky way of doing this with a Stringbuffer is to cast your lightweight StringBuffer to a String, split to an array of Strings, then cast the array of Strings back to a StringBuffer array or List, which to me looks somewhat like defeating the object of the entire exercise in working with StringBuffers. I have a marked preference for working with Lists as opposed to arrays but I do realise that there are those of the other faith who have valid reasons for their heretical idolatry (j/k) so I’ll provide methods for both outcomes.

Given that we have a method for providing a List of token positions from a supplied StringBuffer (q.v. Locating token positions[..]) (and I have a somewhat improved method for doing this which I will anyway supply as an appendage to this post – the refactored method has been renamed getTokenPositions as opposed to the earlier findTokens) the way is clear for us to implement the new split method.

/** Method to split an inbound StringBuffer by (consumed) tokens and produce a List 
* @param StringBuffer sbx - the StringBuffer to split
* @param StringBuffer sbTok - a StringBuffer representation of token(s) to use to split
* @return List of StringBuffers split out
*/
public List split(StringBuffer sbx, StringBuffer sbTok)
    {
    int tokSz = sbTok.length();
    List lix = new ArrayList();
    List lPos = getTokenPositions(sbx, sbTok );
    if( lPos.isEmpty() || lPos == null) // no split?  send the original sb back
    {
        lix.add(sbx);
        return lix;
    }

    int start = 0;
    if(lPos.get(0) == 0)
    {
    start += tokSz;
    }

    int iSz = lPos.size();

        for (int i = 0; i < iSz; i++) {
            StringBuffer sbnew = new StringBuffer();
        if(i + 1 == iSz)
        {
        sbnew = new StringBuffer(sbx.subSequence(start, sbx.length()));
        }
        else
        {
                sbnew = new StringBuffer(sbx.subSequence(start, lPos.get(i + 1)));
                start = lPos.get(i + 1) + tokSz;
            }
           // System.out.println(sbnew.toString());
            lix.add(sbnew);
        }

    return lix;
    }

To produce an Array of StringBuffers, you merely need to change the return method signature

public StringBuffer[] split(StringBuffer sbx, StringBuffer sbTok)

and modify the code where the returns occur (2 places) to read:

 return (StringBuffer[]) lix.toArray();

Modified method for providing a List of token positions

I mentioned earlier I had a somewhat improved version of the findTokens method. The code for this (+ the comparator-helper List construction methods) follows:

 public List getTokenPositions(StringBuffer sbx, StringBuffer tok )
    {
    List liTok = charListFromSb(tok);
    List liOut = new ArrayList();
    int sz = tok.length() - 1;
    int finish = sbx.length() - sz;
    char firstTok = tok.charAt(0);
    char lastTok = tok.charAt(sz);
        for (int i = 0; i < finish; i++) {
            if ( (sbx.charAt(i) == firstTok)   && (sbx.charAt(i + sz) == lastTok) )
            {
            List comp =  charListFromSb(sbx, i, i+ sz);
            if (comp.equals(liTok))
              {
                boolean add = liOut.add(i);
              }
            }
        }
    return liOut;
    }

 public List charListFromSb(StringBuffer sbx)
    {
        List liOut = new ArrayList();
        int iEnd = sbx.length();
        for (int i = 0; i < iEnd; i++) {
            boolean add = liOut.add(sbx.charAt(i));
        }

    return liOut;
    }
 public List<Character> charListFromSb(StringBuffer sbx, int start, int finish)
    {
       
        List<Character> liOut = new ArrayList<Character>();
        for (int i = start; i <= finish; i++) {
            boolean add = liOut.add(sbx.charAt(i));
        }
    return liOut;
    }

Extending and the improving the token location method

December 16, 2010 Leave a comment

I have been busily building up the methods which surround the token location method I outlined in yesterday’s post since I aim to build a robust StringBuffer/Builder helper class which is as flexible and intuitive as the locator methods expressed in the standard Java String class.

Obviously one thing I will want to do with the method outlined previously is to build a stripped down version which only iterates the String to search once for handling short token matches. And I’ll obviously need to determine the criteria for deciding under what circumstances to use which of the two eventual implementations of this method I arrive at. The NetBeans profiling tool is an obvious win here for examining assets such as memory and heap usage, and I have a nice little wrap-around timer utility which I can inject into my class for assessing relative speeds of differently constructed search parameters passed as arguments. I’ll have a look at it over the weekend and once that’s out of the way and the appropriate method is being called by a delegator method, I’ll optimise some of the syntax candy which I am already beginning to surround this method with. None of the methods are particularly pretty or built at this stage for speed, they’re built for facility of implementation and can be used as is pretty much out of the box.

The obvious initial methods to sugar up are ones which can use the generated List object e.g. the trivial (and inefficient) countTokens method & its overload which follows:


/** Syntax candy to count the number of incidences of a token in a given char sequence */

public int countTokens(StringBuffer sb, StringBuffer sbx)
    {
            List<Integer> l = findTokens(sb, sbx);
            return l.size();
     }

/** Syntax candy ph to count the number of incidences of a token (expressed as items) in a given List */

public int countTokens(List<Integer> l)
    {
            return l.size();
     }

Next up there’s a standard boolean check to see whether there are any matched tokens:


public boolean containsMatch(StringBuffer sb, StringBuffer sbx)
    {
    if (countTokens(sb, sbx) < 1 )
        {
        return false;
        }
    return true;
    }

OK that’s the rough & ready syntax candy out of the way, now let’s look at how we can leverage the information we have back in the List. Examining large strings (& this is particularly the case with large strings arriving from or in markup language formats such as HTML/XML etc)  it’s often the case that you need to know about the position of  either a single char relative to the token’s position, or alternatively another token altogether. The char based implementations for forward and reverse location are relatively simple. They both take the String to search, the offset character index point and the char which needs to be located as arguments, and look like this:


public int findPreviousMatch(StringBuffer sb, int startPt, char toLocate)
    {
    int loc = -1;
    int ctr = startPt;
    while (ctr >= 0)
    {
            ctr--;
            if (sb.charAt(ctr) == toLocate)
            {return ctr;}
    }
    return loc;
    }

public int findNextMatch(StringBuffer sb, int startPt, char toLocate)
    {
    int loc = -1;
    int ctr = startPt;
    int len = sb.length();
    while (ctr < len)
    {
            ctr++;
            if (sb.charAt(ctr) == toLocate)
            {return ctr;}
    }
    return loc;
    }

We need to do the same thing for tokens. Arguments are an int indicating the starting point to search around, the StringBuffer to search (sb) and the token to search for (sbx) expressed again as a StringBuffer.


public int findTokenPriorToOffset(int start, StringBuffer sb, StringBuffer sbx)
    {
    int loc = -1;
    if (start > sb.length())
    {start = sb.length();} // move start to be the sb.length() if start > the size of inc string

    int pos = start;
    List<Integer> l = findTokens(sb, sbx);
    if(l.size() == 0){return loc;} // no match

    // only 1 item and bigger than start? return -1
    if ((l.size() == 1) && ((Integer) l.get(1) > start) )
    {
       return loc;
    }
    // only 1 item and less than start? return token startpoint
    if ((l.size() == 1) && ((Integer) l.get(1) < start) )
    {
       return (Integer) l.get(1);
    }

    Iterator it = l.iterator();
    while(it.hasNext())
    {
    int val = (Integer) it.next();
    if (val > start){return loc;}
    if (val < start)
    {
        loc = val;
    }

    }

    return loc;
}
public int findTokenAfterOffset(int start, StringBuffer sb, StringBuffer sbx)
    {
    int loc = -1;
    if (start > sb.length())
    {        return  -1; } // won't be found.... ret -1

    int pos = start;
    List<Integer> l = findTokens(sb, sbx);
    if(l.size() == 0){       return loc; }  // no match

    // only 1 item and less than start? return -1
    if ((l.size() == 1) && ((Integer) l.get(1) < start) )
    {
       return loc;
    }
    // only 1 item and &gt; start? return token startpoint
    if ((l.size() == 1) && ((Integer) l.get(1) > start) )
    {
       return (Integer) l.get(1);
    }
    Iterator it = l.iterator();
   

 while(it.hasNext())
    {
    int val = (Integer) it.next();
       if (val > start){return val;}
    }
    // fallthrough
    return loc;
}

Locating token positions in StringBuffers

December 15, 2010 Leave a comment

This might be useful to someone along the way; I certainly have practical uses for it of my own for fast exact matching & location of long tokens in StringBuffers. It’s probably up for a certain amount of optimisation since it double scans the searchable base StringBuffer object, on the obverse side in certain situations this is probably more efficient than trying to do it all in one bite since it builds two List objects, a preliminary one which contains possible matches, and from this list of possible matches it then refines its search to provide a definitive List of exact matches – the longer the length of the search token the more efficient this ultimately is. The method could also be expanded or enhanced to do regex style pattern matching and case insensitive matching.

/**
* Method for searching the StringBuffer sb to identify the int locations of instances of the contents of StringBuffer sbx
*
* Returns a list of Integer positions of all occurrences of an exact match of the passed StringBuffer sbx in
*  StringBuffer sb
*/
public List<Integer> findTokens(StringBuffer sb, StringBuffer sbx)
    {
            int ctr = 0;
            int len = sb.length();
            int k = sbx.length();
            char tokenStart = sbx.charAt(0);
            char tokenEnd = sbx.charAt(k - 1);

            List possibles = new ArrayList();
            for (int i = 0; i < (len - (k - 1)); i++) {
            if((sb.charAt(i) == tokenStart) && (sb.charAt(i + (k - 1)) == tokenEnd))
            {
                possibles.add(i);
            }
            }

            List definites = new ArrayList();
            Iterator it = possibles.iterator();
            while (it.hasNext())
            {
            int start = (Integer) it.next();
            boolean OK = true;
            int tokCtr = 0;
            for (int i = start; i < start + (k - 1); i++) {
                if(sb.charAt(i) != sbx.charAt(tokCtr))
                {OK = false;} // probably ought to break/label here if you want to make it bleed (I don't, need the trace!)

                tokCtr++;
                }
               if(OK) // don't add if not ok!
               {
                    definites.add(start);
                }
            }
            return definites;
     }

Embedding Jython script in Java with ScriptEngine…

March 3, 2010 1 comment

In an idle hour I did some more playing about with the Java ScriptEngine (see the earlier post on !Reinventing the Wheel) and decided to see just how easy it was to introduce another ScriptEngine into the mix. It was a toss-up between JRuby and Jython, but in the balance I went with Jython because it has certain benefits to work I frequently get engaged in relating to middleware, although I am altogether less familiar and comfortable with Jython/Python than JRuby/Ruby.

It was spectacularly easy, in fact depressingly so, not a technical challenge in sight. Download and classpath jar. Register Jython with ScriptEngine, run up a JUnit to test it’s available etc recycling the stringEval method outlined in the earlier post to test POC.

OK next up we want Java to be able to execute a Jython script proper. Here’s a simple Jython script to get the current directory (scarfed & modified from WebsphereTools):

import sys
import java.io as jio
currentdir = javaio.File (".")
print "Current directory : " + currentdir.getCanonicalPath()

Worked first time in standalone mode from Jython itself.

Next up is to see whether we can make it execute this script when embedded as a String in Java. Again depressingly easy.

static final private ScriptEngineManager sm = new ScriptEngineManager();
static final private ScriptEngine sEngine = sm.getEngineByName("jython");
static final String lsep = System.getProperty("line.separator");

public static void main(String[] args)
{
String a = "import sys" + lsep;
        a += "import java.io as javaio" + lsep;
        a+= "currentdir = javaio.File (\".\")" + lsep;
        a+= "print \"Current directory : \" + currentdir.getCanonicalPath()" + lsep;
       Object anon = runScript(a, "jython");
}

 public static Object runScript(String script, String engine) {
        Object res = "";
        if (engine.equals("jython")) {
            try {
                res = sEngine.eval(script);
            } catch (ScriptException ex) {
                Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
            }
        }
return res;
}

Since I’m using NetBeans (can I say how much I love NetBeans again?) the answer this will print for me is:

Current directory : C:\Users\XXX\Documents\NetBeansProjects\TestEngineFactory (where XXX = my user base dir, names changed to protect the innocent for all the fairly obvious reasons).

The only real gotchas when doing something like this are to remember to ensure quotes are escaped correctly, that lines have a line separator and that you don’t trim or otherwise format incoming Jython script Strings when reading existing Jython scripts since Jython (being a Python superset) has those dull indentation rules etc.

Think I’ll stop blogging now and see how easy it is to do the same job with JRuby to access my arsenal of Ruby classes and mixins. I’ll let you know how I got on with an update later in the week.

HashMap->ArrayList breakdown and reconstitution of documents in Java

March 2, 2010 Leave a comment

Given the general utility of HashMaps with ArrayLists as values, as illustrated in the earlier post HashMap key -> ArrayList, it’s fairly obvious that we can in most cases (actually all, but I’m open to persuasion that somewhere an exception may exist) represent a document, such as a short story by Lord Dunsany, a play by William Shakespeare, an XML or Soap document, in fact any document in which repetitious tokens occur or are likely to occur, as a set of keys indicating the tokens, and an arraylist of their relative positions in the document.

Let’s as an example take a block of text such as that found in Churchill’s speech of November 10, 1942 in the wake of Alexander and Montgomery repelling Rommel’s thrust at Alexandria: “Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning.” It’s short, sweet, and contains the criteria we need for a good example, notably multiple repetitive tokens. For the purposes of this examination we’ll totally neuter the punctuation and capitalisation so that we just have tokens:now this is not the end it is not even the beginning of the end but it is perhaps the end of the beginning. At this juncture it is still recognisable as Churchill’s text. Let’s turn this into a reproducible hashmap in which each unique word is a key to a value set which is an ArrayList containing instances of the positions of the word in the text which we supplied it with. I won’t bore you with repeating the addArrayListItem method (it’s in the post immediately below this one…) and the printOutHashMap method is somewhat self-evident but included for compeleteness.

public void someBoringMethodSig()
{
String inString = "now this is not the end it is not even the beginning of the end ";
inString += "but it is perhaps the end of the beginning";

HashMap ha = new HashMap();

String[] tok = inString.split(" ");
int k = tok.length;

 for (int i = 0; i< k; i++)
        {
        addArrayListItem(ha,tok[i],i );
        }

    printOutHashMap(ha);
   // assemble(ha);
    }
public void printOutHashMap(HashMap hamp)
{
Collection c = hamp.values();
Collection b = hamp.keySet();
Iterator itr = c.iterator();
Iterator wd = b.iterator();
    System.out.println("HASHMAP CONTENTS");
    System.out.println("================");
while(itr.hasNext()){
System.out.print("Key: >>" + wd.next() + "<< Values: " );//+ itr.next()); }
    ArrayList x = (ArrayList) itr.next();
    Iterator lit = x.iterator();
    while (lit.hasNext())
    {System.out.print(lit.next() + " ");

    }
    System.out.println("");
}

}

This will give us an output which looks very much like

HASHMAP CONTENTS
================
Key: >>not<< Values: 3 8
Key: >>of<< Values: 12 21
Key: >>but<< Values: 15
Key: >>is<< Values: 2 7 17
Key: >>beginning<< Values: 11 23
Key: >>it<< Values: 6 16
Key: >>now<< Values: 0
Key: >>even<< Values: 9
Key: >>the<< Values: 4 10 13 19 22
Key: >>perhaps<< Values: 18
Key: >>this<< Values: 1
Key: >>end<< Values: 5 14 20

We can obviously reassemble this by recourse to the values, either in forward (starting at 0) or in reverse order (finding the max int value) and reassembling in a decremental loop. The implications of this are significant when one considers the potential applications of this sort of document treatment from a number of perspectives, notably IT security, improved data transmissibility by reduction, cryptography (where one would not transmit the keys but a serializable numeric indicator to a separately transmitted (or held) common dictionary, etc, etc.

Generally the reassembly process looks something like the following :


public void assemble(HashMap hamp)
{
Collection c = hamp.values();
Collection b = hamp.keySet();

int posCtr = 0;

int max = 0;

Iterator xAll = hamp.values().iterator();
while(xAll.hasNext()){
  ArrayList xxx = (ArrayList) xAll.next();
  max += xxx.size();
  }

String stOut = "";
boolean unfinished = true;

while (unfinished)
{
Iterator itr = c.iterator();
Iterator wd = b.iterator();
start:
while(unfinished){

String tmp = (String) wd.next() ; // next key
ArrayList x = (ArrayList) itr.next(); // next ArrayList
Iterator lit = x.iterator(); // next ArrayList iterator

while (lit.hasNext())
{
//System.out.println(lit.next());
if((Integer.valueOf((Integer) lit.next()) == posCtr )) // if it matches the positional counter then this is the word we want...
{
stOut = stOut + " " + tmp ; // add the word in tmp to the output String
posCtr++;

wd = b.iterator(); //don't forget to reset the iterators to the beginning!
itr = c.iterator();
if(posCtr == max) // last word to deal with....
{
System.out.println("FINISHED");
unfinished = false;}
break start; // otherwise goto the start point, rinse, repeat (irl this would be handled more elegantly, illustrative code here!)

}

}

}
}
System.out.println("ASSEMBLED: " + stOut);
}

This can obviously be improved in a number of ways, particularly moving from a String-based paradigm to one in which char[] array representations replace Strings in the HashMap.

HashMap key -> ArrayList

March 1, 2010 Leave a comment

In the earlier and gentle introduction article on HashMaps, HashMaps and how to bend them to your will, I looked at how we might use a HashMap to do vocabulary count assessment of a passed String document. HashMaps are, as we have seen, ostensibly dumb key-value sets. I say ostensibly: the given key or value does not however have to be a single item as HashMap effects storage in both value and key as Object types. From a programmatic perspective this realisation has significant implications.

Let’s look at how we might use a keyed HashMap paired with values expressed as ArrayLists. The following methods will give you an idea of how to put int values in per key, the first method using a single value, the second handling an array of values. Obviously these methods are ripe for extension, exception handling and overloading but they are stripped down here from an illustrative point of view:


public HashMap<String, ArrayList> addArrayListItem(HashMap<String, ArrayList> hamp,String sKey,int value )
{
    if(!hamp.containsKey(sKey)) // no key found?
    {
    ArrayList lx = new ArrayList();
    lx.add(value);
    hamp.put(sKey, lx);
    }
    else // the key pre-exists so we just add the (new) value to the existing arraylist
    {
    ArrayList lx = hamp.get(sKey)    ;
    lx.add(value);
    hamp.put(sKey, lx);
    }
    return hamp;
}

public HashMap<String, ArrayList> addArrayListItems(HashMap<String, ArrayList> hamp,String sKey,int[] value )
{
    int iSize = value.length;
    ArrayList iList = (ArrayList) Arrays.asList(value);

    if(!hamp.containsKey(sKey)) // no key found? // create key and add list
    {
    ArrayList lx = new ArrayList();
    lx.addAll(iList);
    hamp.put(sKey, lx);
    }
    else // the key pre-exists so we just add the array of values to the existing list
    {
    ArrayList lx = hamp.get(sKey)    ;
    lx.addAll(iList);
    hamp.put(sKey, lx);
    }

    return hamp;
}

The getter method to retrieve by key is even easier (although again this method is open to extension, overloading, exception handling etc):

public ArrayList<int[]> getItems(HashMap<String, ArrayList> hamp,String sKey)
{
return hamp.get(sKey);
}

We can moreover use an ArrayList in an arbitrary, or, indeed, in a very carefully structured, fashion, e.g. to store not only integers but Strings, booleans, whatever, in order to achieve our purpose. Again, an untyped ArrayList is Object type agnostic as the following snippet of dummy code illustrates.

        ArrayList al = new ArrayList();
        int i = 1;
        String s = "Monty Python";
        boolean isVerySilly = true;
        al.add(i);
        al.add(s);
        al.add(isVerySilly);
        Iterator it = al.iterator();
        while (it.hasNext())
        {
            System.out.println(it.next());
        }

The realisation of this has similarly significant implications for the intelligent programmer. Exploiting the HashMap/ArrayList (or similar collection type) combination allows you to do some very powerful things with the data coming your way.