This was a rough problem. I’m using Python’s SAX modules for parsing JMdict, and ran into a problem regarding its use of XML entities. Nothing wrong with JMdict, but the expansion is rather verbose and does not lend itself well to what I’m doing. I don’t want “word containing irregular kanji usage”, but rather, I want the “iK” code.
Unfortunately, the default ExpatParser doesn’t have a clear way to disable this. But if you read the docs closely enough, you can find out about setting a “default handler” for it, which has the side effect of disabling internal expansion.
This isn’t a perfect fix, but here’s the class I used to get this done:
    # Code is Copyright 2009 by Paul Goins
    # Released into the public domain - but let me know if this is useful!

    from xml.sax.expatreader import ExpatParser

    class ExpatParserNoEntityExp(ExpatParser):

        """An overridden Expat parser class which disables entity expansion."""

        def reset(self):
            ExpatParser.reset(self)
            self._parser.DefaultHandler = self.dummy_handler

        def dummy_handler(self, *args, **kwargs):
            pass
So, use an instance of this class instead of one created by xml.sax.make_parser (and call its parse method rather than xml.sax.parse, which builds its own parser). Then, in your xml.sax.handler.ContentHandler, specify a handler for skippedEntity(self, name), since that’s where your internal entities are going to pop up. By the way, built-ins such as &amp;amp; and &amp;lt; still seem to be parsed; at least in the case of JMdict it seems to just affect the entities declared in the inline DTD.
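To make the wiring concrete, here is a minimal sketch of how the pieces might fit together on Python 3. The parser class is repeated so the snippet runs on its own. The EntityCodeHandler class, the collect_entity_codes helper and the tiny SAMPLE document are made up for illustration and are not part of JMdict or the standard library. Note that the class relies on the private _parser attribute of ExpatParser, so treat it as something that happens to work with CPython's expatreader rather than a supported API.

```python
import io
from xml.sax.expatreader import ExpatParser
from xml.sax.handler import ContentHandler


class ExpatParserNoEntityExp(ExpatParser):
    """Same class as above, repeated so this snippet is self-contained."""

    def reset(self):
        ExpatParser.reset(self)
        # Setting a default handler turns off internal entity expansion.
        self._parser.DefaultHandler = self.dummy_handler

    def dummy_handler(self, *args, **kwargs):
        pass


class EntityCodeHandler(ContentHandler):
    """Hypothetical handler that just records skipped entity names."""

    def __init__(self):
        super().__init__()
        self.codes = []

    def skippedEntity(self, name):
        # Receives "iK" rather than the long expansion text.
        self.codes.append(name)


# A made-up, JMdict-like document with one entity in its inline DTD.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE JMdict [
<!ELEMENT JMdict (ke_inf*)>
<!ELEMENT ke_inf (#PCDATA)>
<!ENTITY iK "word containing irregular kanji usage">
]>
<JMdict><ke_inf>&iK;</ke_inf></JMdict>
"""


def collect_entity_codes(xml_bytes):
    """Parse xml_bytes and return the entity names that were skipped."""
    parser = ExpatParserNoEntityExp()
    handler = EntityCodeHandler()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.codes


if __name__ == "__main__":
    print(collect_entity_codes(SAMPLE))
```

For the sample above, the handler should end up with the single code iK instead of the expanded description. In a real JMdict handler you would probably track the current element in startElement and attach the code to it, rather than keeping one flat list.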
Hope this helps someone!