Posts Tagged ‘XML’

Working with XPath

February 16, 2011

It has always struck me that XPath is a nice tool for XML forensics. It can be set up quickly and easily and, from a programmatic point of view, can be used to open up and expose the intricacies of often very complex XML documents with a minimum of effort.

Here’s a (slightly simplified, refactored and cleaned) version of my basic setup class for examining documents with XPath.

import java.io.IOException;
import java.io.InputStream;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.namespace.QName; //not actually used in this vn. but can be handy to have around...
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class XPathBase {

    private DocumentBuilderFactory domFactory;
    private DocumentBuilder builder;
    private Document doc;
    private String xmlFile;
    private XPath xPath;
    private String resourceRoot = "";
    private InputStream inputStream;

    /**
     * Constructor takes 2 args: the file to examine and the resource location.
     */
    public XPathBase(String xFile, String resRoot) {
        resourceRoot = resRoot;
        this.xmlFile = resourceRoot + xFile;
        setDomObjects();
    }

    // Resolves the file as a classpath resource, relative to this class's package
    public InputStream getAsStream(String file) {
        inputStream = this.getClass().getResourceAsStream(file);
        return inputStream;
    }

    // Parses the document (namespace aware) and creates the XPath object
    public void setDomObjects() {
        try {
            domFactory = DocumentBuilderFactory.newInstance();
            domFactory.setNamespaceAware(true);
            builder = domFactory.newDocumentBuilder();
            doc = builder.parse(getAsStream(xmlFile));
            xPath = XPathFactory.newInstance().newXPath();
        } catch (SAXException ex) {
            Logger.getLogger(XPathBase.class.getName()).log(Level.SEVERE, null, ex);
        } catch (IOException ex) {
            Logger.getLogger(XPathBase.class.getName()).log(Level.SEVERE, null, ex);
        } catch (ParserConfigurationException ex) {
            Logger.getLogger(XPathBase.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    public Document getDoc() {
        return doc;
    }

    public InputStream getInputStream() {
        return inputStream;
    }

    public XPath getxPath() {
        return xPath;
    }

    public String getXmlFile() {
        return xmlFile;
    }

    // [..] Getters & setters for other objects declared private may follow (i.e. add what you need,
    // although typically you will only need getters for the XPath, the Document, and the InputStream)

    // Not really part of this class and would (ordinarily) be implemented elsewhere
    public void readXPath(String evalStr) {
        try {
            XPathExpression expr = xPath.compile(evalStr);
            Object result = expr.evaluate(doc, XPathConstants.NODESET);
            NodeList nodes = (NodeList) result;

            // Print the value of every matched node that has one (text and attribute nodes do)
            for (int i = 0; i < nodes.getLength(); i++) {
                if (nodes.item(i).getNodeValue() != null) {
                    System.out.println(nodes.item(i).getNodeValue());
                }
            }
        } catch (XPathExpressionException ex) {
            Logger.getLogger(XPathBase.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

}

As I mentioned in the source code, the readXPath method at the tail is not really a part of this class. It is provided for illustrative purposes and will allow us to quickly begin to get under the bonnet.

Let’s set up a piece of trivial XML for examination:

<?xml version="1.0" encoding="UTF-8"?>

<root>
<people>
<person ptype = "author" century = "16th">William Shakespeare</person>
<person>Bill Smith</person>
<person ptype = "jockey">A P McCoy</person>
</people>
</root>

Assuming you had a file called test.xml in a resources folder you would action this as follows:


XPathBase xpb = new XPathBase("test.xml","resources/");
// person elements have no child elements, so select their text nodes directly
xpb.readXPath("//people/person/text()");
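To try expressions without setting up a resources folder, here is a minimal self-contained sketch that parses the same sample from a string. The class name XPathQuickCheck and its select method are my own illustrative names, not part of the setup class above. The last expression in main shows why an extra /* step finds nothing here: the person elements contain only text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original XPathBase code.
class XPathQuickCheck {

    // The sample document from the post, inlined as a string.
    static final String SAMPLE =
        "<root><people>"
        + "<person ptype=\"author\" century=\"16th\">William Shakespeare</person>"
        + "<person>Bill Smith</person>"
        + "<person ptype=\"jockey\">A P McCoy</person>"
        + "</people></root>";

    // Evaluates an XPath expression against SAMPLE and returns the node values.
    static List<String> select(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(SAMPLE)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xPath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (Exception ex) {
            throw new IllegalStateException("Could not evaluate " + expression, ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(select("//people/person/text()"));           // all three names
        System.out.println(select("//person[@ptype='jockey']/text()")); // filtered by attribute
        System.out.println(select("//people/person/*/text()"));         // no child elements, so empty
    }
}
```

The predicate form [@ptype='jockey'] is standard XPath 1.0 and is often the quickest way to narrow a node set before doing anything programmatic with it.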

The readXPath method really belongs in another class which is more concerned with the handling of XPath expressions and manipulation than the nuts and bolts of xml document setup and preparation. The following code is an example of how this might begin to look (it won’t look anything like this in the long run, but it will give you an idea of how refactoring and reshaping a class can pay big dividends).


import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import org.w3c.dom.NodeList;

public class XPathAnalyser {
    private String expression = "";
    private String file = "test.xml";
    private String resourceRoot = "resources/";
    private XPathBase xpb;
    private XPathExpression expr;

    public XPathAnalyser(String expr) {
        expression = expr;
        xpb = new XPathBase(file,resourceRoot);
    }
    public XPathAnalyser(String expr, String xFile, String resRoot) {
        expression = expr;
        file = xFile;
        resourceRoot = resRoot;
        xpb = new XPathBase(file,resourceRoot);
    }

    public void readXPath(String evalStr)
    {
        try {
            XPathExpression expr = xpb.getxPath().compile(evalStr);
            Object result = expr.evaluate(xpb.getDoc(), XPathConstants.NODESET);
            NodeList nodes = (NodeList) result;

            for (int i = 0; i < nodes.getLength(); i++) {
                if (nodes.item(i).getNodeValue() != null)
                {
                    System.out.println(nodes.item(i).getNodeValue());
                }
            }

        } catch (XPathExpressionException ex) {
            Logger.getLogger(XPathAnalyser.class.getName()).log(Level.SEVERE, null, ex);
        }

    }

}

This will work but it has a few drawbacks, notably that the class is dependent on the previous implementation of XPathBase, and strong dependencies are a bad thing. A better implementation would take the XPath setup class in as a class object in its own right and introspect accordingly. This would allow full separation of context from implementation, the use of different document models, etc. We’ll live with this limitation for the moment while we begin to construct the XPathAnalyser in a more decomposed and useful shape. Most of the refactoring is about moving to a more rigorous setter/getter paradigm.
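For what it’s worth, one way the looser coupling could look is sketched below. Note that this swaps introspection for a plain interface, which is a different (and simpler) technique. The names XPathContext, InjectedAnalyser, count and fromString are all hypothetical; the interface simply mirrors the getDoc() and getxPath() getters that XPathBase already exposes, so XPathBase could implement it unchanged.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical analyser that is handed its context rather than building an XPathBase itself.
class InjectedAnalyser {

    private final XPathContext context;

    InjectedAnalyser(XPathContext context) {
        this.context = context;
    }

    // Counts the nodes matched by the expression in whatever document the context supplies.
    int count(String expression) {
        try {
            NodeList nodes = (NodeList) context.getxPath()
                    .evaluate(expression, context.getDoc(), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException ex) {
            throw new IllegalArgumentException("Bad XPath: " + expression, ex);
        }
    }

    // Illustrative context built from an in-memory string instead of a classpath resource.
    static XPathContext fromString(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            final Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            final XPath xPath = XPathFactory.newInstance().newXPath();
            return new XPathContext() {
                @Override public Document getDoc() { return doc; }
                @Override public XPath getxPath() { return xPath; }
            };
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse XML", ex);
        }
    }

    public static void main(String[] args) {
        InjectedAnalyser analyser =
                new InjectedAnalyser(fromString("<people><person/><person/></people>"));
        System.out.println(analyser.count("//person"));
    }
}

// Hypothetical abstraction: anything that can supply a parsed document and an XPath object.
interface XPathContext {
    Document getDoc();
    XPath getxPath();
}
```

The analyser no longer knows where the document came from, so a file-backed, string-backed or test-double context can be swapped in without touching it.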

We can improve even as basic a method as readXPath with a bit of root and branch surgery. Making the compiled expression and the resultant NodeList private class variables gives us a lot more traction on the problem. Note the overloaded init method which is invoked without args in the constructor and has an overload which allows a fresh expression to be safely supplied.

public class XPathAnalyser {
    private String expression = "";
    private String file = "test.xml";
    private String resourceRoot = "resources/";
    private XPathBase xpb;
    private XPathExpression expr;
    private NodeList nodes;

    public XPathAnalyser(String expr) {
        expression = expr;
        xpb = new XPathBase(file,resourceRoot);
        init();
    }
    public XPathAnalyser(String exprStr, String xFile, String resRoot) {
        expression = exprStr;
        file = xFile;
        resourceRoot = resRoot;
        xpb = new XPathBase(file,resourceRoot);
        init();
    }

    public void init()
    {
        setExpression();
        setNodeList();
    }

    public void init(String exprString)
    {
        expression = exprString;
        init();
    }

    // Compiles the current expression string
    public void setExpression()
    {
        try {
            expr = xpb.getxPath().compile(expression);
        } catch (XPathExpressionException ex) {
            Logger.getLogger(XPathAnalyser.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    // Evaluates the compiled expression against the document and keeps the result
    public void setNodeList()
    {
        try {
            nodes = (NodeList) expr.evaluate(xpb.getDoc(), XPathConstants.NODESET);
        } catch (XPathExpressionException ex) {
            Logger.getLogger(XPathAnalyser.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    public void readXPath()
    {
        try {

            for (int i = 0; i < nodes.getLength(); i++) {
                if (nodes.item(i).getNodeValue() != null)
                {
                    System.out.println(nodes.item(i).getNodeValue());
                }
            }

        } catch (Exception ex) {
            Logger.getLogger(XPathAnalyser.class.getName()).log(Level.SEVERE, null, ex);
        }

    }

    public int getNodeListSize() {
        return nodes.getLength();
    }

    public NodeList getNodeList() {
        return nodes;
    }

}

The simple getters getNodeList() & getNodeListSize() are useful since we are now in a position to work with the object in a much more amenable fashion. We can dig a little deeper by adding a reader to the class to examine attributes. Note that this needs an expression which selects elements (e.g. //people/person), since text nodes carry no attributes.

// Requires: import org.w3c.dom.NamedNodeMap;
public boolean containsAttributes()
{
    int k = getNodeListSize();
    for (int i = 0; i < k; i++) {
        if (nodes.item(i).hasAttributes())
        {
            return true;
        }
    }
    return false;
}

public void readAttributes()
{
    if (!containsAttributes())
    {
        return;
    }

    for (int i = 0; i < getNodeListSize(); i++) {
        NamedNodeMap nnm = nodes.item(i).getAttributes();
        if (nnm == null)
        {
            continue; // only element nodes have an attribute map
        }
        for (int j = 0; j < nnm.getLength(); j++) {
            // getNamedItem looks attributes up by node name
            String attr = nnm.item(j).getNodeName();
            System.out.println(nnm.getNamedItem(attr));
        }
    }
}

This can obviously be extended and modified as necessary but the crux of the matter is that once you can construct the XPath and produce a viable nodelist, there’s very little you can’t do in the way of parsing and dissecting an XML document.
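As one example of such an extension, attributes can also be selected directly with an XPath attribute step rather than by walking each element’s NamedNodeMap. The following is a minimal sketch under that assumption; AttributeLister and its list method are hypothetical names, and the sample is the same trivial document used earlier.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original XPathAnalyser.
class AttributeLister {

    static final String SAMPLE =
        "<root><people>"
        + "<person ptype=\"author\" century=\"16th\">William Shakespeare</person>"
        + "<person>Bill Smith</person>"
        + "<person ptype=\"jockey\">A P McCoy</person>"
        + "</people></root>";

    // Returns "name=value" for every attribute node the expression selects.
    static List<String> list(String xml, String expression) {
        List<String> found = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList attrs = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                found.add(attr.getNodeName() + "=" + attr.getNodeValue());
            }
        } catch (Exception ex) {
            throw new IllegalStateException("Could not evaluate " + expression, ex);
        }
        return found;
    }

    public static void main(String[] args) {
        // @* selects every attribute; @ptype selects only that one
        System.out.println(list(SAMPLE, "//person/@*"));
        System.out.println(list(SAMPLE, "//person/@ptype"));
    }
}
```

The relative order of attributes on a single element is not something to rely on, so treat the result as a set rather than a sequence.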

Categories: Java

Implementing a simple stream based XML Parser

October 18, 2009

The Wiktionary XML dump -> db is virtually sorted, and was fairly trivial to code, albeit comparatively slow to process (3 mins +/- on my naff laptop to get a full list of available titles, for example, although this also includes a lot of metapages e.g. categories etc). Setup was straightforward enough, one block of code to create the InputStream & instance the methods class, and then the methods class itself.

// Needs java.io.FileInputStream, java.io.InputStream,
// javax.xml.stream.XMLInputFactory and javax.xml.stream.XMLStreamReader
try {
    InputStream in = new FileInputStream("F:/wiktionary/enwiktionary/wiktionary.xml");
    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLStreamReader parser = factory.createXMLStreamReader(in);
    // instance the custom xml examiner class which contains whatever methods needed...
    XMLStreamMethods xsm = new XMLStreamMethods();
    // call required methods
    xsm.examine(parser);
}
catch (Exception e) { e.printStackTrace(); }

The XMLStreamMethods class (which probably needs a better name!) was largely scarfed and adapted from code from XML.com. I’ll probably get discursive about the methods once this is closer to a decent cut, but effectively I’m really only interested in title and text tag contents, and most of the processing work is in grabbing the IPA renditions from the volumes of surrounding text fluff. Suffice it to say that all that was really required from the example XML.com source was to hack the tag names from headers to title and text and then call workers accordingly.
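To give a flavour of the approach, here is a minimal sketch of a StAX loop that collects the contents of a named element. This is not the XMLStreamMethods class itself, which isn’t shown here; the class name ElementScanner, the contents method and the tiny inline document are all assumptions made for illustration.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical stand-in for a stream examiner; not the original XMLStreamMethods.
class ElementScanner {

    // Streams through the source and collects the text of every element with the given local name.
    static List<String> contents(Reader source, String elementName) {
        List<String> found = new ArrayList<>();
        try {
            XMLStreamReader parser = XMLInputFactory.newInstance().createXMLStreamReader(source);
            while (parser.hasNext()) {
                if (parser.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(parser.getLocalName())) {
                    // getElementText reads up to the matching end tag (text-only elements)
                    found.add(parser.getElementText());
                }
            }
            parser.close();
        } catch (XMLStreamException ex) {
            throw new IllegalStateException("Could not read the stream", ex);
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up miniature of the dump layout: page/title and page/revision/text
        String xml = "<mediawiki>"
                + "<page><title>apple</title><revision><text>fruit</text></revision></page>"
                + "<page><title>pear</title><revision><text>also fruit</text></revision></page>"
                + "</mediawiki>";
        System.out.println(contents(new StringReader(xml), "title"));
        System.out.println(contents(new StringReader(xml), "text"));
    }
}
```

Because the reader only ever holds the current event, memory use stays flat however large the dump is, which is the whole point of going stream based.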

Categories: Java

Wiktionary dump XML content to schema in > 2 minutes

October 3, 2009

I am still playing with Wiktionary IPA as a source for my implementation of TTS. Conrad Irwin suggested the dump download from Wiktionary might be more appropriate & complete than spidering on demand and building a local map that way, and I could not but agree with him. The Wiki dumps had completely slipped my mind when wrangling with the miscellaneous other little subtasks that this project seems to be engendering.

Anyway, with the downloaded XML weighing in at 120MB+ it looked like this was going to be a nightmare. Then I remembered James Clark’s trang, a tool I haven’t used in a long time. Just unzip and classpath the executable jar, then run as follows:

$ java -jar trang.jar -I xml -O xsd wiktionary_280909.xml /schemas/wiktionary.xsd

& a couple of minutes later, viola (sic).

The output schema looks like this in case you ever need it:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" targetNamespace="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:export-0.3="http://www.mediawiki.org/xml/export-0.3/">
  <xs:import namespace="http://www.w3.org/2001/XMLSchema-instance" schemaLocation="xsi.xsd"/>
  <xs:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="xml.xsd"/>
  <xs:element name="mediawiki">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="export-0.3:siteinfo"/>
        <xs:element maxOccurs="unbounded" ref="export-0.3:page"/>
      </xs:sequence>
      <xs:attribute name="version" use="required" type="xs:decimal"/>
      <xs:attribute ref="xsi:schemaLocation" use="required"/>
      <xs:attribute ref="xml:lang" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="siteinfo">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="export-0.3:sitename"/>
        <xs:element ref="export-0.3:base"/>
        <xs:element ref="export-0.3:generator"/>
        <xs:element ref="export-0.3:case"/>
        <xs:element ref="export-0.3:namespaces"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="sitename" type="xs:NCName"/>
  <xs:element name="base" type="xs:anyURI"/>
  <xs:element name="generator" type="xs:string"/>
  <xs:element name="case" type="xs:NCName"/>
  <xs:element name="namespaces">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" ref="export-0.3:namespace"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="namespace">
    <xs:complexType mixed="true">
      <xs:attribute name="key" use="required" type="xs:integer"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="page">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="export-0.3:title"/>
        <xs:element ref="export-0.3:id"/>
        <xs:element minOccurs="0" ref="export-0.3:redirect"/>
        <xs:element minOccurs="0" ref="export-0.3:restrictions"/>
        <xs:element ref="export-0.3:revision"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="title" type="xs:string"/>
  <xs:element name="redirect">
    <xs:complexType/>
  </xs:element>
  <xs:element name="restrictions" type="xs:string"/>
  <xs:element name="revision">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="export-0.3:id"/>
        <xs:element ref="export-0.3:timestamp"/>
        <xs:element ref="export-0.3:contributor"/>
        <xs:element minOccurs="0" ref="export-0.3:minor"/>
        <xs:element minOccurs="0" ref="export-0.3:comment"/>
        <xs:element ref="export-0.3:text"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="timestamp" type="xs:NMTOKEN"/>
  <xs:element name="contributor">
    <xs:complexType>
      <xs:choice>
        <xs:element ref="export-0.3:ip"/>
        <xs:sequence>
          <xs:element ref="export-0.3:username"/>
          <xs:element ref="export-0.3:id"/>
        </xs:sequence>
      </xs:choice>
    </xs:complexType>
  </xs:element>
  <xs:element name="ip" type="xs:string"/>
  <xs:element name="username" type="xs:string"/>
  <xs:element name="minor">
    <xs:complexType/>
  </xs:element>
  <xs:element name="comment" type="xs:string"/>
  <xs:element name="text">
    <xs:complexType mixed="true">
      <xs:attribute ref="xml:space" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="id" type="xs:integer"/>
</xs:schema>

That gives me enough to work with, now to tear into the XML with JAXB…..

NB (Later). Well I played with JAXB, but unfortunately there are a number of facets of JAXB which will make this less amenable than XMLBeans, notably the issues surrounding preservation of whitespace & full support for all schema constructs. Close, but unfortunately no cigar on this occasion. So I am now working up an XMLBeans cut, and this is looking promising.

NNB (Later again). Well, it looks like XMLBeans won’t play nice with large files, which is a shame because there is a lot of stuff in there that I could have done with. Off to play with StAX /cry….

Categories: Java