Since there’s no real page for JBLite yet: JBLite is a library of code for converting Jim Breen’s dictionary files into one or more SQLite 3 databases. It is intended to support J-Ben 2, but its BSD license allows anyone to use it. (Note: the generated data, of course, still falls under the original license of JMdict.)
Anyway, I’ve finished creating a converter for JMdict. The code is available from http://repo.or.cz/w/jblite.git.
python -m jblite.jmdict <output_file> JMdict.gz
should allow anyone with Python 2.5+ to convert current versions of JMdict to SQLite 3. This has been tested on Windows 7, and I expect no problems with Linux.
Eventually the code is intended to provide a Python interace to the database as well, something which could be used inside other applications (such as J-Ben 2). This does not yet exist. So, if anyone wants to use this, here’s the schema of the database. If there are errors, let me know.
If you have any questions, please check jmdict_db_schema.rst in the git repository. It gives a few design details.
Happy hacking!
thanks for this. looks very promising. but it fails for me … tables are not being created. thoughts on why this is?
C:\dev\jblite>c:\Python26\ArcGIS10.0\python.exe -m jblite.jmdict test.db JMdict.gz
**** connecting to test.db
Traceback (most recent call last):
File “c:\Python26\ArcGIS10.0\lib\runpy.py”, line 122, in _run_module_as_main
“__main__”, fname, loader, pkg_name)
File “c:\Python26\ArcGIS10.0\lib\runpy.py”, line 34, in _run_code
exec code in run_globals
File “C:\dev\jblite\jblite\jmdict.py”, line 694, in
main()
File “C:\dev\jblite\jblite\jmdict.py”, line 679, in main
results = db.search(search_query)
File “C:\dev\jblite\jblite\jmdict.py”, line 175, in search
entries_from = self.search_from_japanese(query)
File “C:\dev\jblite\jblite\jmdict.py”, line 200, in search_from_japanese
entries_by_keb = self._search_keb(unicode_query)
File “C:\dev\jblite\jblite\jmdict.py”, line 221, in _search_keb
self.cursor.execute(query, args)
sqlite3.OperationalError: no such table: k_ele
Michael:
I think I updated the command line a bit since I made the original post. Sorry!
If you’re running 0.4 (the latest release), basically it goes like this:
————————-
JMdict conversion:
c:\python26\python jmdict.py -i data\JMdict.gz data\JMdict.db
(It’ll say “No results found.” Ignore this.)
JMdict test query (Japanese only right now):
c:\python26\python jmdict.py data\JMdict.db <query>
————————-
Plus, KANJIDIC2 conversion is also working:
c:\python26\python kd2.py -i data\KANJIDIC2.xml.gz data\kd2.db
KANJIDIC2 test queries:
c:\python26\python kd2.py data\kd2.db --lookup <character(s)>
c:\python26\python kd2.py data\kd2.db --search <yomigana>
————————-
And fair warning: the conversion of JMdict takes a lot of memory. If you got a gig of memory or more, you’re probably okay. It’s the price for me using cElementTree in the converter…
Ah. Much better. Thanks very much for this. JBrute currently uses a special purpose indexing engine (Lucene). I’d like to try with a a RDBMS.
que herramienta mas grandiosa, realmente muchas gracias por compartirla..
Hi. I have not tried out JMdictdb, but this looks promising without having to download all of those tools. However, I have a few questions.
I tried to use the regular JMdict xml file with multiple languages and this parser did not work. However, it worked with the JMDict_e xml file.
Can you give more examples on how to query things like definitions, kanji, and kana? This database is fairly complicated.
This seems to work with FTS when I do MATCH even though there is not a FTS table. Why is that?
Hi John,
Unfortunately, I no longer maintain this project. It’s been a long time since I’ve looked at it.
When I wrote it, it was intended to work with the original JMdict (not just JMdict_e), and it did work with it. I can’t remember whether I coded this in Python 2 or 3; my guess is 2. You may want to try both, in case it’s an issue with how the program handles encoding of Unicode strings. Also, if you were trying this on Windows, maybe you may have better luck on Linux? I’ve sometimes had problems with encoding issues when running via a Windows console window.
I can’t really provide any examples because it’s been so long; I don’t have any. This code is basically “as-is”. All I can say is, I tried to capture basically everything which was in the original JMDict; a lot of the complexity is because of JMDict’s own structure.
Unfortunately I’m not familiar with what FTS is; I cannot comment.
Again, the code is as-is, but maybe it’s a start for you. If you use it, great! If not, well, I hope you find something that works for you. ๐
Best Regards,
Paul
Ah, I was just looking at alternatives to parsing it myself since I know it would take a long time. I tried to parse the text file JMdict_e.txt but it was very difficult. I don’t think it’s possible to get some information either unless you spend a lot of time studying it and writing if statements. Like trying to find (por: espada) or something in the middle of an entry. It is regular expression hell too.
FTS is full text search. You can use MATCH to match words instead of LIKE queries. It is supposed to be faster, but of course it is a different query with different results. It is mainly used for the definitions.
How do you know this database works if you only wrote one example then in your readme? ๐
> How do you know this database works if you only wrote one example then in your readme? ๐
Because I was using this in the development of J-Ben 2. That actually might be a good way to see usage, is to look at the J-Ben 2 source code.
J-Ben 2 is very limited, but it does have search-by-literal support which works with JMdict, or at least used to.
Here’s what I can quickly scrape from source code:
jben-1.9.7/python/jben/dict.py:81:
This is then assigned to a different object as self.dict…
Finally:
python/jben/interface/gtk/widget/kanjidict.py:21:
Anyway: J-Ben 2 supports search by literal, and its display, etc., is driven by jblite. If you want examples beyond the above, that’s where I’d suggest looking.
Update: Oops, this is what I get for just waking up and writing this in a hurry. Obviously I gave you an example of the kanjidict usage… And it looks like worddict.py, which would use jmdict, is just a stub. The only reason I can say “jmdict” “works” is because I was tinkering around with it and it appeared to work, and there was more than that one example.
However, I don’t have this stuff anymore, nor am I going to search for it. Sorry, but this project is not being maintained and I really am no longer interested in doing so.
Best of luck,
– Paul
Let me add one more response: maybe this did not really “work”; I think my memory was clouded in that I know the kanjidic portion was working to a minimal degree with by-literal search. I’m pretty sure I had JMdict working to where I could parse the entire JMdict file without error (so whatever errors you’re having, I don’t know; I did not used to get them back when I wrote this.)
That being said, actually doing a search is something that now, when I really think about it, I’m not sure I had this quite working; else I would have wired it into J-Ben 2’s search tab. The high-level query interface likely did not make it before I got burned out on the project.
Basically: *if* the parsing works, you should at least have an SQLite database that you can query in some way to get the data out. How to query it is left up to you – it is SQLite, you can query the schema, and toy with it from there.
If the parsing itself doesn’t work… I don’t really want to work on this, but could you at least give me a traceback of the error? I might at least be able to provide you a hint ๐
Best Regards,
Paul
I’m looking at J-Ben 1 now. It seems you parsed it using the EDICT text file? What did your sqlite tables look like back then?
J-ben 2 all versions seem to not work for me. gives me a dll import module error but fails to tell me what I need.
J-Ben 1 did not use SQLite. J-Ben 2 was a total rewrite of J-Ben 1, using Python as the implementation language and SQLite as a database format for better supporting queries. (Unfortunately, I burned out on the project and did not finish it.)
J-Ben 2 beta… actually, I just downloaded and ran it via the installer, and I was able to do word searches. Granted it’s very naive in how it searches, but it did work.
If you’re running from sources (presumably), read the enclosed INSTALL.txt; it should give installation details for both Windows and Linux. That being said, Windows installation is much less trivial than Linux. You have a much higher chance of success if you run via sources on Linux.
Let me poke around (briefly) and see if I can find something with regards to word search…
From jben-1.9.7/python/jben/interface/gtk/widget/search_frame.py:95:
…where self.dict is a jblite.jmdict.Database instance.
Hope this helps… and even if you are getting some sort of dll import module error, can you please include a traceback or some sort of full error message? Even if it’s not complete, maybe I can give you a hint on context. Without it, I really cannot help much more than what I’ve done here.
Nah, I was running the Windows installer and got that error. Yeah, nevermind about the sqlite thing on J-Ben 1. The source code for J-Ben 2 is not available unfortunately as well. I applaud you though on J-Ben 1 because that is some hard C++ code. I’m too rusty on that language so I couldn’t do that myself.
Traceback (most recent call last):
File “jben.py”, line 9, in
File “jben\main.pyc”, line 19, in main
File “jben\app.pyc”, line 15, in __init__
File “jben\interface\gtk\__init__.pyc”, line 7, in
File “gtk\__init__.pyc”, line 40, in
File “gtk\__gtk.pyc”, line 12, in
File “gtk\__gtk.pyc”, line 10, in __load
ImportError: DLL load failed: The specified procedure could not be found.
This is on 1.9.3 Beta on VISTA. I also tried on WINEBOTTLER on Mac OS X and didn’t work.
If you find the time, please give more SQL examples on how to query the parsed JMDict sqlite database.
in for the blanks, sorry didn’t type it right
module for the blanks
Thanks with regards to J-Ben 1. It was not fun working with GTKmm; that’s perhaps the main reason for the Python rewrite. Some of the complexity may of course have been my own.
Thanks for the traceback as well – that actually helps *a lot*. It’s a GTK dependency, and I likely did not quite have that bundled right in the Windows build. You may get lucky if you can download and install GTK+ from somewhere and have that in your PATH. However, I won’t create a new installer nor debug this.
Even if it’s broken, I abandoned this project and am spending too much time on it now. If it’s broken, it stays broken.
I don’t have any more examples; sorry. Best thing I can refer you to is the J-Ben 2 sources.
J-Ben 2 sources are very much available; the .tar.gz file from the “downloads” page is a source bundle. Link: http://jben2.vultaire.net/files/jben2/jben-1.9.7.tar.gz
Additionally, there’s sources on repo.or.cz: http://repo.or.cz/w/jben2_gui.git. It appears that this link was lost at some time in the past from the J-Ben 2 project page; at least I cannot easily find it. So, I’m providing it here – I don’t think there’s a very wide delta between Git and the 1.9.7 tarball, though.
Beyond that: I really think I’m spending too much time on this; I’m sorry. I have to ask you to work though the J-Ben 2 sources, and strongly advise doing so under Linux. Never tried it on OS X, and building on Windows has lots of requirements and is a real pain (part of the reason I’m not interested in repairing the broken build).
Best of luck, and Best Regards,
– Paul
(Last updated 9:19pm Pacific Daylight Time – my apologies if you saw the first version of this post; I was not doing well and was more abrasive than I should have been. ๐ While I don’t really want to invest more time in the project, I also do want you to at least have a lead to be able to work from.)
I didn’t see your message beforehand so I don’t know what you said. ^^
I’ve figured out how to get the information under entry in the JMDict database, but how do you match the entity table’s entity and expansion with say k_inf.entity or re_inf.entity without doing another query “FROM entity”? Entity and Entry don’t seem to be related and Entity doesn’t seem to have a FK so I was doing things like LEFT JOIN entity ON entity.id = ke_inf.fk but that didn’t work.
SELECT entity.entity, entity.expansion FROM entry
LEFT JOIN k_ele ON entry.id = k_ele.fk
LEFT JOIN ke_inf ON k_ele.id = ke_inf.fk
LEFT JOIN entity ON entity.id = entry.id
WHERE k_ele.value = “ๆ็ฝ”
This returns entity and expansion 19 and 20 from entity, but it should return just 19. It’s looking at the fk of k_ele and mapping that to the entity table instead of ke_inf mapping to entity. How can I do a query like this?
I do appreciate you making this parser by the way. Most people just parse the text file which is a nightmare to do yourself. I couldn’t find a parser for it either.
Correct; entity and entry are basically not related.
Entity refers to an XML entity. If you look at the JMDict XML you will understand. You’ll see strings like “&v5u-s;” which refer to the “v5u-s” entity, meaning “Godan verb with `u’ ending (special class)”. These fields from the XML are represented by an “entity” column in the sqlite database, and the entity can be looked up by joining with the entity table. Another way of saying it: entity is just a string table.
In other words, for basic queries you should not need to join with entity, and the join in the example between entry and entity does not make sense.
I’m not quite sure what you’re trying to do, but assuming that you’re trying to search for “ๆ็ฝ” in the db, I would try something like:
This will get you the entry ID of the entry in question. Also, ent_seq is useful since it’s the unique ID used within the JMdict XML doc, so if you want to cross-reference the original doc this’ll let you do it.
With the entry ID, here’s how you can get the Japanese readings of a given entry:
…And finally, here’s the glosses, i.e. the foreign language meanings:
Hope this helps!
Ah, thanks ^^
I’ve run into another query dilemma. This is particularly hard, so I’m wondering if you know how to do this.
Let’s say I have a table with kanji, kana, and entry columns.
How can I put kanji, kana, and gloss values in these columns while taking into account re_restr (reading restricted to x kana), stagk (sense limited to this kanji), and stagr (sense limited to this reading)? This is proving to be very difficult.
I tried to do
SELECT k_ele.value, r_ele.value, gloss.value FROM entry
LEFT JOIN k_ele ON entry.id = k_ele.fk
LEFT JOIN r_ele ON entry.id = r_ele.fk
LEFT JOIN sense ON entry.id = sense.fk
LEFT JOIN gloss ON sense.id = gloss.fk
LEFT JOIN stagr ON sense.id = stagr.fk
LEFT JOIN stagk ON sense.id = stagk.fk
but that doesn’t work… and doesn’t take into account stagr and stagk correctly.
I know
SELECT gloss.value FROM entry
LEFT JOIN sense ON entry.id = sense.fk
LEFT JOIN gloss ON sense.id = gloss.fk
LEFT JOIN stagr ON sense.id = stagr.fk
WHERE stagr.value != “(SELECT r_ele.value FROM entry LEFT JOIN r_ele ON entry.id = r_ele.fk)”
works to get the gloss value for the readings associated with stagr values, but I dunno how to incorporate that into the query at the beginning because whenever I do LEFT JOIN on k_ele or r_ele to get k_ele.value and r_ele.value, the results are wrong
I was able to get the kanji associated with each stagr and the correct gloss with this query, but that’s not what i want unfortunately.
SELECT k_ele.value, r_ele.value, gloss.value FROM entry
LEFT JOIN k_ele ON entry.id = k_ele.fk
LEFT JOIN r_ele ON entry.id = r_ele.fk
LEFT JOIN sense ON entry.id = sense.fk
LEFT JOIN gloss ON sense.id = gloss.fk
LEFT JOIN stagk ON sense.id = stagk.fk
LEFT JOIN stagr ON sense.id = stagr.fk
WHERE stagr.value != “(SELECT r_ele.value FROM entry LEFT JOIN r_ele ON entry.id = r_ele.fk)”
AND r_ele.value = stagr.value
I want to get all of the regular k_ele.value, r_ele.value, and gloss_value, but with specific gloss changed according to stagk and stagr. I’m not concerned with re_restr.
Hi,
No time to look at the comments in detail yet…
Are you sure that query is correct? Maybe SQLite works differently than MySQL (which I’m more used to), but I would have expected something more like:
i.e. no quotes around the subquery.
Further, if you’re doing particularly complex stuff, you may want to consider just doing several queries and storing all the data in an object. i.e. read the readings into a list or lists, read the glosses into a list or lists, etc. That might be easier to work with than trying to force it into a single SQL query. …just a thought; as I said, I haven’t looked at what you’re doing in detail.
Looking a little deeper: while it may be possible to cram this all into a single query (with nested subquery(s)), I’m not sure this is what I would advise.
I think you should separate things into several steps:
1. Find the entry. This is some form of search query with the target being one or more entry IDs.
2. Get all data needed for operating on the the entry or entries. Very possibly, this will be more than one SQL query. If you jam everything into one, you’re likely to have lots of duplicated fields, etc. “SELECT DISTINCT” helps only so far. Focus on relatively simple and easy to understand queries if you can. Don’t “optimize” by cramming stuff into one query unless you know you need to because of a profiled bottleneck. Besides, a massive SQL query can be hard to maintain and debug.
3. Operate on the retrieved data. This means you may retrieve more data than you needed, but you can use your programming language to filter, etc.
I don’t really have a “solution” to your problem, but hopefully this perspective helps ๐
Regarding this database, I don’t see where these values are stored
kombinat
system
kombinat and system. They’re not in lsource and not in gloss so I’m not sure. Maybe the parser needs to be updated?
hmm doesn’t like XML
lsource xml:lang=”rus” ls_type=”part” kombinat /lsource
lsource xml:lang=”eng” ls_type=”part” system /lsource
lsource should have a value column i think
try [ code language=”sql”][/code] for quoting SQL or [ code language=”xml”][/code] for quoting XML. (Take the space out in the command; I had to add it in so it didn’t get interpreted.)
lsource: seems JMDict’s format has been updated; the parser no longer matches. It would need to be updated.
I think I fixed it. Wasn’t too hard ๐
https://gist.github.com/John61590/5744756
I just used the advice you gave at post #18 and had tremendous success. It is so much faster doing those queries than how I was doing it.
I used to do
SELECT pos.entity FROM entry
LEFT JOIN sense ON entry.id = sense.fk
LEFT JOIN k_ele ON entry.id = k_ele.fk
LEFT JOIN r_ele ON entry.id = r_ele.fk
LEFT JOIN pos ON sense.id = pos.fk
WHERE r_ele.value = “ใจใใใ” AND k_ele.value = “็ญ้”
but for EVERYTHING in sense… it was so slow you have no idea. Now, I’ve been trying to figure out if I really need to do LEFT JOINs on everything or not to try and optimize it even more. In some cases I haven’t had the need to but I’m still looking. Sometimes I’m wrong.
Hi John,
Thanks for your efforts, and glad it seems to be working out!
Side topic first: I confess, this has been kind of fun for me lately. It’s been a long time since I’ve toyed with developing this. I’ve been investigating ways to reduce memory usage, improve speed, and otherwise future-proof the library. The first two may be of limited use since they really only effect creation of the SQLite DB, but the third looks like something I can maybe make some meaningful changes for. If I can have the DB be auto-generated based upon the DTD, then changes such as the one you had to perform may become less necessary.
Answering your question:
Left joins allow us to handle the case where the parent-child relationship between XML nodes is 0 or many (0:m). If we knew a particular relationship were a 1 to many (1:m) relationship, I think an inner join would be fine. But if that relationship really turned out to be 0:m, you’d potentially risk missing some data in the case where there are no child entries.
It may even be worth toying with the speed of reducing or eliminating table joins and doing more of that type of logic in Python; however, I think this is likely a balancing act that may have diminishing returns.
Best suggestion: use the sqlite3 command line client from sqlite.org, connect directly to the database, and try using the EXPLAIN QUERY PLAN function to get details about how your queries are interpreted. This can help you determine if you’re trying to grab too much, etc. Then you can determine whether to refactor based upon real data rather than stabbing in the dark. ๐
Example session (using Linux since Windows console generally works poorly in UTF-8 mode):
This type of data may give you hints as to where you may want to add indices to the database. For example, jblite does not provide indices on k_ele.value nor r_ele.value; for this type of query it may be beneficial.
Hope this gives you something to work with!
– Paul
“jblite does not provide indices on k_ele.value nor r_ele.value” no wonder my queries were slow before I used entryID and senseID -_-
By the way, I’ve been trying to come up with a way to make a certain table with an output of something like this:
Let’s have this example from the text file: (I find this to be good in debugging code since so many things going on)
ๆผใ(P);ๅงใ;ๆบใ [ใใ] /(v5s,vt) (1) (ๆผใ only) to push/to press/(2) (ๆผใ, ๅงใ only) to apply pressure from above/to press down/(3) (ๆผใ, ๆบใ only) (See ๅคใๆผใ) to stamp (i.e. a passport)/to apply a seal/(4) (ๆผใ only) to affix (e.g. gold leaf)/(5) (ๆผใ only) to press (someone for something)/to urge/to compel/to influence/(6) (ๆผใ, ๅงใ only) to overwhelm/to overpower/to repress/(7) (ๆผใ only) to push (events along)/to advance (a plan)/(8) (ๆผใ only) to do in spite of …/to do even though …/to force/(9) (ๆผใ only) (See ๅฟตใๆผใใปใญใใใใ,้ง็ฎใๆผใใปใ ใใใใ) to make sure/(10) (ๆผใ only) to be pressed for time/(11) (ๆผใ only) to advance troops/to attack/(12) (ๆผใ only) (of light) to be diffused across an entire surface/(P)/EntL1180470X/
Kanji Kana Entry
… … …/…
ๆผใ ใใ to push/to press
ๆผใใใใ to apply pressure from above/to press down
ๆผใใใใ to stamp (i.e. a passport)/to apply a seal
…
ๆผใ … …
but for everything in k_ele.value and r_ele.value. The only way to do this I think is to do multiple queries to get multiple sense IDs but I’m not really sure how to get a specific sense ID without knowing a single gloss value. I was able to return all of the gloss values with a sense ID found from a gloss value though of course.
I think in this particular case it’s best for me not to answer, and to let you experiment and try to figure it out on your own. In part because I don’t know an answer off-hand; basically I would need to do the same work you need to in order to figure this out. ๐
Hi Paul Goins! How are you?
I need help from you.
Now I am developing Japanese-English dictionary. Search by japanese words is working perfectly. But by english words is not perfect. For example search with ‘boy’ is giving result with ‘otokonoko’ in the 17th line. How to make search by english words better?
I have done in this way:
//with SQLite Expert Professional
I have created db according to this schema.
Then created FTS table:
CREATE VIRTUAL TABLE search_eng_fts using FTS3(entry_id, re_value, ke_value, g_value);
//with SQLite Expert Professional
//filling temporary table
INSERT INTO temp.searchtemp(entry_id,re_value,ke_value,g_value)
SELECT e.id AS entry_id,
re.value AS re_value,
GROUP_CONCAT(DISTINCT ke.value) AS ke_value,
GROUP_CONCAT(DISTINCT g.value) AS g_value
FROM entry e
INNER JOIN r_ele re ON e.id = re.fk
INNER JOIN sense s ON e.id = s.fk
INNER JOIN gloss g ON s.id = g.fk
LEFT JOIN k_ele ke ON e.id = ke.fk
WHERE g.lang IS NULL
GROUP BY s.id, re_value
ORDER BY re_value;
//with SQLite Expert Professional
//filling fts table
INSERT INTO search_eng_fts(entry_id, re_value, ke_value, g_value)
SELECT entry_id, re_value, ke_value,
GROUP_CONCAT(g_value, “; “) AS g_value
FROM temp.searchtemp
GROUP BY entry_id, re_value
ORDER BY re_value;
//in Eclipse
query = querySelectFromTableWhere + ” g_value MATCH ‘”
+ lookingFor + “‘ AND g_value LIKE ‘”+lookingFor+”%’ LIMIT 100″;
cursor = db.rawQuery(query, null);
if (!cursor.moveToFirst() && lookingFor.length() > 2) {
query = querySelectFromTableWhere + ” g_value MATCH ‘”
+ lookingFor + “*’ AND g_value LIKE ‘”+lookingFor+”%’ LIMIT 100″;
cursor = db.rawQuery(query, null);
if(!cursor.moveToFirst() && lookingFor.length() > 3)
{
query = querySelectFromTableWhere + ” g_value MATCH ‘*”
+ lookingFor + “*’ AND g_value LIKE ‘”+lookingFor+”%’ LIMIT 100”;
}
}
What I am doing wrong? Please give some suggestion how I can improve search by english words. I hope for your help. Thank you!
Hi Joe,
I’m not really an expert at full-text search. I only got very rudimentary searching to work, as in, “if ‘boy’ is anywhere in the glosses for this entry, show this entry.” This is an area I need to learn more about myself; I don’t really have a good answer.
jblite is good for generating a MySQL database from Jim Breen’s XML databases, but beyond that, it’s up to you to figure out how to query it, etc., to make it work for your application.
Wish I could offer more help, but I think it’s best for you to try to figure this out yourself.
If there’s one thing I may suggest: I’ve written an updated script for generating jmdict and kanjidic2 databases. It’s located at https://github.com/Vultaire/jben_web_scripts. Maybe it may be useful? It’s a simpler and more flexible solution than jblite. If you’re serious about using my scripts to generate a database for your app, I suggest using this new script.
Best of luck,
– Paul