Wiktionary dump XML content to schema in > 2 minutes

I am still playing with Wiktionary IPA as a source for my TTS implementation. Conrad Irwin suggested that the Wiktionary dump download might be more appropriate and complete than spidering on demand and building a local map that way, and I could not but agree with him. The wiki dumps had completely slipped my mind while I was wrangling the miscellaneous other little subtasks that this project seems to be engendering.

Anyway, with the downloaded XML weighing in at 120MB+ it looked like this was going to be a nightmare. Then I remembered James Clark’s trang, a tool I haven’t used in a long time. Just unzip the distribution and run the executable jar as follows:

$ java -jar trang.jar -I xml -O xsd wiktionary_280909.xml /schemas/wiktionary.xsd

& a couple of minutes later, viola (sic).

The output schema looks like this in case you ever need it:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" targetNamespace="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:export-0.3="http://www.mediawiki.org/xml/export-0.3/">
  <xs:import namespace="http://www.w3.org/2001/XMLSchema-instance" schemaLocation="xsi.xsd"/>
  <xs:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="xml.xsd"/>
  <xs:element name="mediawiki">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="export-0.3:siteinfo"/>
        <xs:element maxOccurs="unbounded" ref="export-0.3:page"/>
      </xs:sequence>
      <xs:attribute name="version" use="required" type="xs:decimal"/>
      <xs:attribute ref="xsi:schemaLocation" use="required"/>
      <xs:attribute ref="xml:lang" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="siteinfo">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="export-0.3:sitename"/>
        <xs:element ref="export-0.3:base"/>
        <xs:element ref="export-0.3:generator"/>
        <xs:element ref="export-0.3:case"/>
        <xs:element ref="export-0.3:namespaces"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="sitename" type="xs:NCName"/>
  <xs:element name="base" type="xs:anyURI"/>
  <xs:element name="generator" type="xs:string"/>
  <xs:element name="case" type="xs:NCName"/>
  <xs:element name="namespaces">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" ref="export-0.3:namespace"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="namespace">
    <xs:complexType mixed="true">
      <xs:attribute name="key" use="required" type="xs:integer"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="page">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="export-0.3:title"/>
        <xs:element ref="export-0.3:id"/>
        <xs:element minOccurs="0" ref="export-0.3:redirect"/>
        <xs:element minOccurs="0" ref="export-0.3:restrictions"/>
        <xs:element ref="export-0.3:revision"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="title" type="xs:string"/>
  <xs:element name="redirect">
    <xs:complexType/>
  </xs:element>
  <xs:element name="restrictions" type="xs:string"/>
  <xs:element name="revision">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="export-0.3:id"/>
        <xs:element ref="export-0.3:timestamp"/>
        <xs:element ref="export-0.3:contributor"/>
        <xs:element minOccurs="0" ref="export-0.3:minor"/>
        <xs:element minOccurs="0" ref="export-0.3:comment"/>
        <xs:element ref="export-0.3:text"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="timestamp" type="xs:NMTOKEN"/>
  <xs:element name="contributor">
    <xs:complexType>
      <xs:choice>
        <xs:element ref="export-0.3:ip"/>
        <xs:sequence>
          <xs:element ref="export-0.3:username"/>
          <xs:element ref="export-0.3:id"/>
        </xs:sequence>
      </xs:choice>
    </xs:complexType>
  </xs:element>
  <xs:element name="ip" type="xs:string"/>
  <xs:element name="username" type="xs:string"/>
  <xs:element name="minor">
    <xs:complexType/>
  </xs:element>
  <xs:element name="comment" type="xs:string"/>
  <xs:element name="text">
    <xs:complexType mixed="true">
      <xs:attribute ref="xml:space" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="id" type="xs:integer"/>
</xs:schema>

That gives me enough to work with. Now to tear into the XML with JAXB…

NB (Later). Well, I played with JAXB, but unfortunately a number of its facets make it less amenable here than XMLBeans, notably the issues surrounding preservation of whitespace and full support for all schema constructs. Close, but no cigar on this occasion. So I am now working up an XMLBeans cut, and this is looking promising.

NNB (Later again). Well, it looks like XMLBeans won’t play nice with large files, which is a shame because there is a lot of stuff in there that I could have done with. Off to play with StAX /cry…
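For what it is worth, here is a rough sketch of what a StAX pass over the dump could look like, using only the standard javax.xml.stream API. It is not the code I ended up with, just an illustration under a few assumptions: the class name WiktionaryPageStreamer and the callback shape are made up for this example, and it assumes (as the schema above suggests) one revision per page and that title and text contain plain character data only. The cursor reader holds one event at a time, so memory use should not grow with the size of the file.

```java
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.function.BiConsumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; the name is not from any library.
public class WiktionaryPageStreamer {

    /**
     * Walks the dump with a cursor-style StAX reader and hands each page's
     * title and raw wikitext to the callback. Returns the number of pages seen.
     */
    public static int streamPages(Reader in, BiConsumer<String, String> onPage)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Deliver adjacent character data as a single event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        // The dump carries no DTD, so DTD processing is switched off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);

        XMLStreamReader reader = factory.createXMLStreamReader(in);
        String title = null;
        int pages = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = reader.getLocalName();
                if ("title".equals(name)) {
                    // Remember the title until the matching <text> turns up.
                    title = reader.getElementText();
                } else if ("text".equals(name)) {
                    // Assumption: <text> holds character data only, so
                    // getElementText() can return it in one piece.
                    onPage.accept(title, reader.getElementText());
                    pages++;
                }
            }
        } finally {
            reader.close();
        }
        return pages;
    }

    public static void main(String[] args) throws Exception {
        // With a path argument, stream that file; otherwise use a tiny inline sample.
        String sample =
            "<mediawiki xmlns=\"http://www.mediawiki.org/xml/export-0.3/\">"
          + "<page><title>cat</title><id>1</id><revision><id>2</id>"
          + "<text xml:space=\"preserve\">sample wikitext </text>"
          + "</revision></page></mediawiki>";
        try (Reader in = args.length > 0
                ? Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)
                : new StringReader(sample)) {
            int count = streamPages(in,
                (title, text) -> System.out.println(title + ": " + text.length() + " chars"));
            System.out.println(count + " page(s)");
        }
    }
}
```

Run with the dump file as the only argument, e.g. java WiktionaryPageStreamer wiktionary_280909.xml. The text comes back as the parser read it, whitespace included, which sidesteps the whitespace issue that bit me with JAXB. Pulling the IPA out of the wikitext is a separate job for the callback.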
