Wiktionary dump XML content to schema in > 2 minutes

October 3, 2009

I am still playing with Wiktionary IPA as a source for my implementation of TTS. Conrad Irwin suggested that the Wiktionary dump download might be more appropriate & complete than spidering on demand and building a local map that way, and I could not but agree with him. The wiki dumps had completely slipped my mind while I was wrangling the miscellaneous other little subtasks that this project seems to be engendering.

Anyway, with the downloaded XML weighing in at 120MB+, it looked like this was going to be a nightmare. Then I remembered James Clark’s trang, a tool I haven’t used in a long time. Just unzip it and run the executable jar as follows:

$ java -jar trang.jar -I xml -O xsd wiktionary_280909.xml /schemas/wiktionary.xsd

& a couple of minutes later, viola (sic).

The output schema looks like this in case you ever need it:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" targetNamespace="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:export-0.3="http://www.mediawiki.org/xml/export-0.3/">
  <xs:import namespace="http://www.w3.org/2001/XMLSchema-instance" schemaLocation="xsi.xsd"/>
  <xs:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="xml.xsd"/>
  <xs:element name="mediawiki">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="export-0.3:siteinfo"/>
        <xs:element maxOccurs="unbounded" ref="export-0.3:page"/>
      </xs:sequence>
      <xs:attribute name="version" use="required" type="xs:decimal"/>
      <xs:attribute ref="xsi:schemaLocation" use="required"/>
      <xs:attribute ref="xml:lang" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="siteinfo">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="export-0.3:sitename"/>
        <xs:element ref="export-0.3:base"/>
        <xs:element ref="export-0.3:generator"/>
        <xs:element ref="export-0.3:case"/>
        <xs:element ref="export-0.3:namespaces"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="sitename" type="xs:NCName"/>
  <xs:element name="base" type="xs:anyURI"/>
  <xs:element name="generator" type="xs:string"/>
  <xs:element name="case" type="xs:NCName"/>
  <xs:element name="namespaces">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" ref="export-0.3:namespace"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="namespace">
    <xs:complexType mixed="true">
      <xs:attribute name="key" use="required" type="xs:integer"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="page">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="export-0.3:title"/>
        <xs:element ref="export-0.3:id"/>
        <xs:element minOccurs="0" ref="export-0.3:redirect"/>
        <xs:element minOccurs="0" ref="export-0.3:restrictions"/>
        <xs:element ref="export-0.3:revision"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="title" type="xs:string"/>
  <xs:element name="redirect">
    <xs:complexType/>
  </xs:element>
  <xs:element name="restrictions" type="xs:string"/>
  <xs:element name="revision">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="export-0.3:id"/>
        <xs:element ref="export-0.3:timestamp"/>
        <xs:element ref="export-0.3:contributor"/>
        <xs:element minOccurs="0" ref="export-0.3:minor"/>
        <xs:element minOccurs="0" ref="export-0.3:comment"/>
        <xs:element ref="export-0.3:text"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="timestamp" type="xs:NMTOKEN"/>
  <xs:element name="contributor">
    <xs:complexType>
      <xs:choice>
        <xs:element ref="export-0.3:ip"/>
        <xs:sequence>
          <xs:element ref="export-0.3:username"/>
          <xs:element ref="export-0.3:id"/>
        </xs:sequence>
      </xs:choice>
    </xs:complexType>
  </xs:element>
  <xs:element name="ip" type="xs:string"/>
  <xs:element name="username" type="xs:string"/>
  <xs:element name="minor">
    <xs:complexType/>
  </xs:element>
  <xs:element name="comment" type="xs:string"/>
  <xs:element name="text">
    <xs:complexType mixed="true">
      <xs:attribute ref="xml:space" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="id" type="xs:integer"/>
</xs:schema>

That gives me enough to work with. Now to tear into the XML with JAXB…

NB (Later). Well, I played with JAXB, but unfortunately a number of facets of JAXB make it less amenable to this job than XMLBeans, notably the issues surrounding preservation of whitespace & full support for all schema constructs. Close, but unfortunately no cigar on this occasion. So I am now working up an XMLBeans cut, and this is looking promising.

NNB (Later again). Well, it looks like XMLBeans won’t play nicely with large files, which is a shame because there is a lot of stuff in there that I could have done with. Off to play with StAX /cry…
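For what it’s worth, a minimal sketch of what the StAX route might look like, assuming all that is wanted from each page is its title and its wikitext. The class DumpPageReader and the PageHandler callback are hypothetical names made up for illustration; only the javax.xml.stream calls are the real API. Element names are taken from the schema above, and matching is on local name only, so the export-0.3 namespace is ignored.

```java
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper (not from the post): streams pages out of a dump with StAX. */
final class DumpPageReader {

    /** Illustrative callback, invoked once per <text> element encountered. */
    interface PageHandler {
        void onPage(String title, String text);
    }

    /** Walks the document once, holding only the current page in memory. */
    static void readPages(Reader xml, PageHandler handler) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        try {
            String title = null;
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = reader.getLocalName();
                if ("title".equals(name)) {
                    // title is text-only in the schema, so getElementText is safe here
                    title = reader.getElementText();
                } else if ("text".equals(name)) {
                    // getElementText returns the content as-is, whitespace included
                    handler.onPage(title, reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
    }

    /** Usage: pass the path of the dump file as the first argument. */
    public static void main(String[] args) throws Exception {
        try (Reader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            readPages(in, (title, text) ->
                    System.out.println(title + " (" + text.length() + " chars)"));
        }
    }
}
```

Because the reader only ever moves forward, memory use should stay roughly flat however big the dump is, which is the property XMLBeans was missing here. Pulling the IPA out of the wikitext is a separate job that this sketch does not attempt.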

Categories: Java

Prototype for an IPA driven TTS processor

September 27, 2009

While FreeTTS works and does what it needs to do, I am thinking seriously about the way ahead with open source TTS applications. There is some chatter about TTS mechanisms beginning in the Wikimedia Strategy discussions currently taking place, and I am probably going to get involved in this to a greater or lesser extent if the proposals are adopted – I’ve been engaged with Wikipedia since not long after the outset under the nick sjc. With this in mind, I have started looking for the things which might facilitate this both within & without the Wikimedia context.

IPA Rendition

The contender for immediate consideration is the open-content dictionary, Wiktionary. Many of the entries have IPA and/or SAMPA representations along with the definitions of the words. IPA seems preeminent and, despite its train-spotter connotations and antiquity, more or less works. This gives any putative voice-driven service or application with access to IPA definitions an immediate head start. Offline usage is potentially an issue unless a method for caching each word against its IPA rendition is provided, which means a lightweight database or a biggish local hashmap supporting word/IPA comparators. Most of the IPA-specific characters sit in Unicode’s IPA Extensions block, \u0250 to \u02AF (the plain Latin letters that IPA also uses live elsewhere). We can grab these quickly:

/**
 * Top level method to produce a char array within a given range.
 * @return an array of the Unicode characters U+0250 to U+02AF inclusive
 */
public char[] genIpaCharacterArray() {
    char begin = '\u0250';
    char end = '\u02AF';
    return generateUnicodeCharArray(begin, end);
}

/**
 * Produces an array of Unicode chars from beginning to end.
 * @param cBegin - starting character
 * @param cEnd - ending character
 * @return char[] array containing everything from cBegin to cEnd inclusive
 */
public char[] generateUnicodeCharArray(char cBegin, char cEnd) {
    int size = cEnd - cBegin + 1; // +1 so that cEnd itself is included
    char[] vOut = new char[size];
    for (int i = 0; i < size; i++) {
        vOut[i] = (char) (cBegin + i);
    }
    return vOut;
}

Now that we’ve got our basic IPA characters, we will probably want a byte rendition of each, and I need to find a way to do this. .WAV, the obvious file format for the job, has a drawback in that it does not appear to be markable from a javax.sound perspective, so I guess I’m going to have to either find a way to code round this or find some other mechanism. I’m not helped much in this respect by the fact that my experience with javax.sound is limited.

So now I want to isolate a word from a string, do a lookup on Wiktionary, parse the IPA definition (all easily done with my HttpSpider & MarkupElementParser classes, which I’ll get round to blogging about someday) and play the sound element for each IPA symbol. This is of course ignoring obvious stuff like cadence, rhythm and stress for the moment, which will also need to be addressed long term. Identifying which IPA sound file(s) to play can easily be done by data lookup or a hashmap.
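A rough sketch of the hashmap variant of that lookup follows. Everything in it is an assumption made for illustration: the class name PhonemeLookup, the .wav file names, and above all the simplification that one Unicode code point equals one phoneme, which ignores diphthongs, affricates and tie bars. Symbols with no registered sound (slashes, stress marks and the like) are silently skipped.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical lookup table (not from the post): IPA symbol to sound file name. */
final class PhonemeLookup {

    // Keyed by code point so that symbols are compared as whole characters.
    private final Map<Integer, String> wavBySymbol = new HashMap<>();

    /** Associates the first code point of symbol with a sound file name. */
    void register(String symbol, String wavFile) {
        wavBySymbol.put(symbol.codePointAt(0), wavFile);
    }

    /** Returns the sound files for a transcription, in order, skipping unknown symbols. */
    List<String> filesFor(String ipa) {
        List<String> files = new ArrayList<>();
        ipa.codePoints().forEach(cp -> {
            String file = wavBySymbol.get(cp);
            if (file != null) {
                files.add(file);
            }
        });
        return files;
    }
}
```

The resulting list, converted with toArray(new String[0]), is the kind of String[] that a method playing an array of .wav files would take.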

What I expect I’ll try initially is something like the quick and dirty method I threw together last night with half a dozen hastily voiced .wav files of phonemes:

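// abbreviated list of imports, IRL I'm much more selective about these
import java.io.*;
import java.net.*;
import java.util.*;
import javax.sound.sampled.*;

public class VoiceMaster extends Thread
{
    // for threadedness
    public synchronized void invoke(String[] fileNames)
    {
        (new VoiceMaster()).playWord(fileNames);
    }

    // Plays an array of .wav files, one after another
    public void playWord(String[] filesToPlay)
    {
        File[] sf = new File[filesToPlay.length];
        AudioInputStream[] audioInputStream = new AudioInputStream[filesToPlay.length];
        try {
            for (int i = 0; i < filesToPlay.length; i++)
            {
                sf[i] = new File(filesToPlay[i]);
                // Create stream from file, throws IOException or
                // UnsupportedAudioFileException
                audioInputStream[i] = AudioSystem.getAudioInputStream(sf[i]);
                playAudioStream(audioInputStream[i]);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Assumed implementation of the helper called above: copies one stream to
    // the default output line. Note that drain() belongs to the line, not to
    // the AudioInputStream; it blocks until the queued audio has been played.
    private void playAudioStream(AudioInputStream in)
            throws IOException, LineUnavailableException
    {
        try (SourceDataLine line = AudioSystem.getSourceDataLine(in.getFormat())) {
            line.open(in.getFormat());
            line.start();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf, 0, buf.length)) != -1) {
                line.write(buf, 0, n);
            }
            line.drain();
        } finally {
            in.close();
        }
    }
}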

This is obviously a lumpy prototype, and IRL we won’t want to be reinstantiating this class every time we want to play a word. Moreover, it lacks any sort of context whatsoever & doesn’t do anything very much, but by lunchtime with another 40 or 50 .wav files it may just give me something roughly comparable with the FreeTTS system with the plus (or demerit depending on your take on things) that it’ll be my voice and not kevin’s. The real playSound function will probably look much different, acquiring sound either from one large markable sound file or a byte array so that there is not the constant loading and reloading of wavs.
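As a sketch of the byte-array idea, assuming every phoneme clip has already been decoded to raw PCM in one shared format: the clips can be joined into a single in-memory AudioInputStream. The class name PhonemeBuffer is hypothetical, and loading the clips from the .wav files is left out. A stream backed by a ByteArrayInputStream also supports mark/reset, which sidesteps the markability worry above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.List;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;

/** Hypothetical helper (not from the post): joins PCM clips into one stream. */
final class PhonemeBuffer {

    /**
     * Concatenates raw PCM clips that all share the given format.
     * The clips are assumed to hold whole frames only.
     */
    static AudioInputStream concat(List<byte[]> clips, AudioFormat format) {
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        for (byte[] clip : clips) {
            joined.write(clip, 0, clip.length);
        }
        byte[] pcm = joined.toByteArray();
        // The stream length is expressed in frames, not bytes.
        long frames = pcm.length / format.getFrameSize();
        return new AudioInputStream(new ByteArrayInputStream(pcm), format, frames);
    }
}
```

The result can be written to a SourceDataLine in one pass, so a whole word costs one line open and one drain instead of one per phoneme, and nothing is reloaded from disk between words.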