Implementing a simple stream based XML Parser

The Wiktionary XML dump to database import is virtually sorted, and was fairly trivial to code, albeit comparatively slow to run: roughly 3 minutes on my naff laptop to get a full list of available titles, for example, although that list also includes a lot of meta pages (categories and so on). Setup was straightforward enough: one block of code to create the InputStream and instantiate the methods class, and then the methods class itself.

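import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

// try-with-resources closes the file even if parsing fails
try (InputStream in = new FileInputStream("F:/wiktionary/enwiktionary/wiktionary.xml")) {
    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLStreamReader parser = factory.createXMLStreamReader(in);
    // instantiate the custom XML examiner class, which holds whatever methods are needed
    XMLStreamMethods xsm = new XMLStreamMethods();
    // call the required methods
    xsm.examine(parser);
    parser.close();
} catch (Exception e) {
    e.printStackTrace();
}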

The XMLStreamMethods class (which probably needs a better name!) was largely scarfed and adapted from code on XML.com. I'll probably get discursive about the methods once this is closer to a decent cut, but effectively I'm only interested in the contents of the title and text tags, and most of the processing work is in grabbing the IPA renditions from the volumes of surrounding text fluff. Suffice it to say that all that was really required of the XML.com example source was to change the tag names from headers to title and text and then call the workers accordingly.
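For anyone wanting something concrete in the meantime, here is a rough sketch of what an examine method along those lines might look like. It is not the actual XMLStreamMethods code: the class name TitleTextExaminer, the worker callback and the ipaIn helper are all made up for illustration, and the IPA extraction assumes pronunciations appear in the wikitext as {{IPA|...}} templates with slash-delimited transcriptions, which may not hold for every entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical stand-in for the XMLStreamMethods class; names are illustrative only.
class TitleTextExaminer {

    // Assumption: pronunciations are marked up as {{IPA|...}} templates.
    private static final Pattern IPA_TEMPLATE =
            Pattern.compile("\\{\\{IPA\\|([^}]*)\\}\\}");

    // Walks the stream and hands each (title, text) pair to the worker.
    static void examine(XMLStreamReader parser, BiConsumer<String, String> worker)
            throws XMLStreamException {
        String title = null;
        while (parser.hasNext()) {
            if (parser.next() != XMLStreamConstants.START_ELEMENT) {
                continue; // only element starts are of interest
            }
            String tag = parser.getLocalName();
            if ("title".equals(tag)) {
                // getElementText reads up to the matching end tag
                title = parser.getElementText();
            } else if ("text".equals(tag)) {
                worker.accept(title, parser.getElementText());
            }
        }
    }

    // Returns the slash-delimited arguments of every {{IPA|...}} template found.
    static List<String> ipaIn(String wikitext) {
        List<String> found = new ArrayList<>();
        Matcher m = IPA_TEMPLATE.matcher(wikitext);
        while (m.find()) {
            for (String arg : m.group(1).split("\\|")) {
                String trimmed = arg.trim();
                if (trimmed.length() > 1 && trimmed.startsWith("/") && trimmed.endsWith("/")) {
                    found.add(trimmed);
                }
            }
        }
        return found;
    }
}
```

With something like this in place, the setup block would call TitleTextExaminer.examine(parser, (title, text) -> ...) and pass each text through ipaIn. Handing pages to a callback one at a time, rather than collecting them, keeps memory flat on a dump of this size.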
