[sword-devel] Script to find a best fit v11n
Greg Hellings
greg.hellings at gmail.com
Thu Jun 19 15:24:58 EDT 2025
On Thu, Jun 19, 2025 at 9:07 AM DM Smith <dmsmith at crosswire.org> wrote:
> Greg,
> There’s an extraneous %s in the output.
>
Ah, not surprising. That is the old, Python 2-era way of formatting variables
into a string: C-style printf syntax, with the arguments supplied in a tuple
after an overload of the modulus operator (so it would look like
`"this is a string: %s" % (a_string, )` ). The modern preferred way is an
f-string, where you prefix a string with the character `f` and then reference
variables in the string with {variable_name} syntax (e.g.
`f"this is a string: {a_string}"`). That %s can be killed off, or replaced
with an f-string equivalent.
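For illustration, here are the two styles side by side. This is a generic sketch and not a line from av11n.py; the names `ids`, `joined` and `stray` are made up for the example.

```python
# Hypothetical data; not taken from av11n.py.
ids = ["Gen.1.1", "Gen.1.3"]
joined = ", ".join(ids)

# Old printf-style formatting: the % operator with a tuple of arguments.
old_style = "Missing IDs: %s" % (joined,)

# f-string equivalent (Python 3.6 and later).
new_style = f"Missing IDs: {joined}"

# A "%s" that never has the % operator applied is printed literally,
# which is one way a leftover placeholder can end up in the output.
stray = "%s " + joined

print(old_style)
print(new_style)
print(stray)
```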
>
> If you put the enumeration after the line "There are 93 OT IDs and 5 NT
> IDs in v11n which aren’t in your file.” Then you wouldn’t need the heading
> "The following IDs don’t appear in your file:”
>
Yeah, I had been putting the IDs out to stderr with the logging utility
previously. It was only yesterday when I was squashing the remaining Python
3 compat issues that I realized I should just drop them into a print
statement. As a result, they are kinda crazy. In fact, I pass them through a
`sort` call, so they won't be in either canonical or document order,
unless the document has its verses sorted alphabetically by osisID
attribute for some inexplicable reason.
> It’d also be nice to format it a few per line, indented appropriately.
>
Perhaps broken up by book? Or by book/chapter? So it's like:
Verses missing from:
  Gen
    1 - 1, 3, 5, 7
    2 - 11, 22
  Exo
    27 - 1
There is a long way to go to improve the output, especially of this detail
portion. It was, after all, only intended as debugging output for me while
I was writing it.
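A minimal sketch of that book/chapter grouping, assuming every ID is a plain three-part `Book.Chapter.Verse` string. The function names are hypothetical and nothing here is taken from av11n.py. Chapters and verses are sorted numerically, so Ps.12 lands before Ps.102; books simply keep the order in which they were first seen, since true canonical order would need the book list from the versification.

```python
from collections import defaultdict

def group_by_book_chapter(osis_ids):
    """Group 'Book.Chapter.Verse' strings into {book: {chapter: [verses]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for osis_id in osis_ids:
        book, chapter, verse = osis_id.split(".")
        grouped[book][int(chapter)].append(int(verse))
    return grouped

def format_missing(osis_ids):
    """Render the grouping as indented text, numerically sorted within a book."""
    grouped = group_by_book_chapter(osis_ids)
    lines = ["Verses missing from:"]
    for book in grouped:  # books keep first-seen order
        lines.append(f"  {book}")
        for chapter in sorted(grouped[book]):
            verses = ", ".join(str(v) for v in sorted(grouped[book][chapter]))
            lines.append(f"    {chapter} - {verses}")
    return "\n".join(lines)

sample = ["Gen.1.1", "Gen.1.3", "Ps.12.9", "Ps.102.29", "Gen.2.11"]
print(format_missing(sample))
```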
>
> I’d be happy to iterate over any suggestions we agree on.
>
As I am not a user of it, nor an intended consumer of it, feel free to
improve it as needed! I quickly hacked it together and tossed it out into
the world at someone's off-handed request. I don't create modules, though,
so I have no vested interest in preserving its current operation in any
particular form. And, if this thread has shown anything, it's that likely
Peter has been the only user to date. So I doubt you'll disturb anyone else
with it.
If you need my support for anything, I'm happy to lend a hand.
Pulling in comments from your other email on this thread:
> I like that it's very simple to read. Having a summary is good. And the
> other email which lists the exact ids extra/missing per testament is very
> helpful.
> I think that enumerating the names of the extra/missing books and
> extra/missing chapters would be good. No sense in enumerating the ids
> within these.
That probably would be good. I didn't include detection for an entire
missing chapter or book, but it shouldn't be too terribly difficult to
enhance it with that. A simple brute force check of every detected missing
book or chapter to see if there are any matched verses can reveal that
pretty easily.
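One way that check could look, as a sketch only. It uses set lookups rather than a literal nested loop, but the idea is the same: a book or chapter counts as wholly missing when the file has no verse at all in it. The function name and the plain `Book.Chapter.Verse` ID shape are assumptions.

```python
def wholly_missing(expected_ids, present_ids):
    """Return (books, chapters) from expected_ids with no verse in present_ids."""
    def book_of(osis_id):
        return osis_id.split(".")[0]

    def chapter_of(osis_id):
        book, chapter, _verse = osis_id.split(".")
        return (book, int(chapter))

    present_books = {book_of(i) for i in present_ids}
    present_chapters = {chapter_of(i) for i in present_ids}

    missing_books = sorted({book_of(i) for i in expected_ids} - present_books)
    # Chapters of a wholly missing book are not reported a second time.
    missing_chapters = sorted(
        c for c in {chapter_of(i) for i in expected_ids}
        if c not in present_chapters and c[0] not in missing_books
    )
    return missing_books, missing_chapters

expected = ["Gen.1.1", "Gen.1.2", "Gen.2.1", "Exod.1.1"]
present = ["Gen.1.1"]
print(wholly_missing(expected, present))
```

Swapping the two arguments would report the extra books and chapters instead.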
> I ran mine against an input that was a test case for osis2mod’s infinite
> loop and it had 2 extra books and 13 extra chapters. This wouldn’t be
> obvious in your results.
True, mine would just complain about hundreds or even thousands of
mismatches and silently swallow the list of what those are. I had a few of
those that I omitted from the sample output I captured. For instance, there
are large portions of the canon for the Catholic versifications missing
from the KJV file. It just reports something absurd like "There are 4,741
missing verses" or whatever it is.
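The count-then-list behaviour described in this thread (list the IDs only when there are fewer than 100 of them) could be sketched roughly like this. It is not the av11n.py source: the function names, the `nt_books` argument and the `limit` parameter are all assumptions for the example.

```python
def count_by_testament(osis_ids, nt_books):
    """Return (ot_count, nt_count) for a collection of 'Book.Chapter.Verse' IDs."""
    nt = sum(1 for i in osis_ids if i.split(".")[0] in nt_books)
    return len(osis_ids) - nt, nt

def report_missing(v11n_ids, file_ids, nt_books, limit=100):
    """Summarise IDs in the v11n but not in the file; list them only below limit."""
    missing = set(v11n_ids) - set(file_ids)
    if not missing:
        return ["Your file has all the references in this v11n"]
    ot, nt = count_by_testament(missing, nt_books)
    lines = [f"There are {ot} OT IDs and {nt} NT IDs in v11n which aren't in your file."]
    if len(missing) < limit:
        # Plain alphabetical sort, so not canonical order.
        lines.append(", ".join(sorted(missing)))
    return lines

# Tiny made-up versification and file, with a one-book "New Testament".
v11n = ["Gen.1.1", "Gen.1.2", "Matt.1.1"]
in_file = ["Gen.1.1"]
for line in report_missing(v11n, in_file, nt_books={"Matt"}):
    print(line)
```

The extra-IDs direction is the same set difference with the operands swapped.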
> Is it an advantage or disadvantage to be compiled against SWORD lib vs
> slurping header files?
Like most things, it's a trade-off. Working with the bindings requires that
the Sword bindings are installed on the host system. For someone running on
Windows, this is particularly non-trivial. For someone running on macOS
it's not too difficult to install from source (I don't believe Homebrew
builds them). For users of major Linux distributions, it's downright
trivial. On Fedora it has been as simple as a single
`dnf install python3-sword` command for a long time now, and it looks like
the bindings are also available for Ubuntu starting in 25.04 with an
`apt install python3-sword`.
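A script that depends on the bindings could check for them up front and print a hint instead of a traceback. This is only a sketch: the module name `Sword` is my assumption about what the python3-sword package installs, so adjust it if your build names it differently.

```python
import importlib.util

def bindings_available(module_name="Sword"):
    """Return True if the named top-level module can be located (no import is run)."""
    # "Sword" as the default is an assumption, not something verified here.
    return importlib.util.find_spec(module_name) is not None

if not bindings_available():
    print("SWORD Python bindings not found; try your package manager "
          "(e.g. `dnf install python3-sword`) or build them from source.")
```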
Advantages of the binding method are that it doesn't rely on parsing a C
header file, nor on the file laying out the values in a certain way. It
also can be used offline easily, doesn't require parsing the output of HTML
in order to find all the applicable files, and is likely slightly faster.
Not that the speed probably matters for a single run of this, but if you're
bulk processing files the speed advantages can add up.
Disadvantages of the binding method are that it requires you to fall back
to a source build if you are using this to test a canon.h file or if you
want to use a canon file that isn't available in the package manager of
your Linux distribution. Building from source isn't terribly onerous for
most of us contributors but it might be more of a problem for a module
maintainer. Then again, how often do we add a new versification to the code
base?
So there are pros and cons to each. I was fresh off getting the
bindings to compile when I wrote the first draft of av11n.py so I naturally
went that direction. I also try to avoid writing parsers when I can
leverage existing ones, as grammars can be notoriously complex to get
correct. So that dictated my choices as much as did anything else, really!
Another possible enhancement might be a CLI flag to limit the testing range
to a particular book (or testament) at a time. I have heard people talk
about having modules split up to one book per file or similar. If they
could say, "Only check this file against Joshua" then it could keep down a
significant amount of extra output. But again - I'm not really an intended
user of it!
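A rough sketch of what such a flag might look like with argparse. The `--book` option and both function names are hypothetical; av11n.py has no such flag today.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Compare an OSIS file against versifications")
    parser.add_argument("osis_file", help="path to the OSIS XML file")
    # Hypothetical flag: may be given more than once, e.g. --book Josh --book Judg
    parser.add_argument("--book", action="append", default=[],
                        help="limit the check to this OSIS book ID (repeatable)")
    return parser

def restrict_to_books(osis_ids, books):
    """Keep only IDs whose book part is in books; an empty list means no filter."""
    if not books:
        return list(osis_ids)
    wanted = set(books)
    return [i for i in osis_ids if i.split(".")[0] in wanted]

# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["kjv.osis.xml", "--book", "Josh"])
print(restrict_to_books(["Josh.1.1", "Judg.1.1"], args.book))
```

The same filter would have to be applied to both the v11n IDs and the file IDs before comparing them, otherwise every other book shows up as missing.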
--Greg
> DM
>
> On Jun 19, 2025, at 12:12 AM, Greg Hellings <greg.hellings at gmail.com>
> wrote:
>
> And here's an example now that I've fixed the output of the osisIDs when
> there are fewer than 100 of them:
>
> [vagrant at localhost ~]$ ./av11n.py kjv.osis.xml
>
>
>
> Checking Calvin:
> ----------------
> The following IDs don’t appear in your file:
> %s 1Kgs.22.54, 1Sam.20.43, 1Sam.24.23, 3John.1.15, Acts.24.28, Eccl.12.15,
> Eccl.12.16, Ezek.21.33, Ezek.21.34, Ezek.21.35, Ezek.21.36, Ezek.21.37,
> Hos.12.15, Isa.8.23, Job.39.31, Job.39.32, Job.39.33, Job.39.34, Job.39.35,
> Job.39.36, Job.39.37, Job.39.38,
> Job.40.25, Job.40.26, Job.40.27, Job.40.28, Jonah.2.11, Mark.10.53,
> Mark.9.51, Num.13.34, Num.30.17, Ps.102.29, Ps.108.14, Ps.12.9, Ps.140.14,
> Ps.142.8, Ps.18.51, Ps.19.15, Ps.20.10, Ps.21.14, Ps.22.32, Ps.3.9,
> Ps.30.13, Ps.31.25, Ps.34.23, Ps.36.13,
> Ps.38.23, Ps.39.14, Ps.4.9, Ps.40.18, Ps.41.14, Ps.42.12, Ps.44.27,
> Ps.45.18, Ps.46.12, Ps.47.10, Ps.48.15, Ps.49.21, Ps.5.13, Ps.51.20,
> Ps.51.21, Ps.52.10, Ps.52.11, Ps.53.7, Ps.54.8, Ps.54.9, Ps.55.24,
> Ps.56.14, Ps.57.12, Ps.58.12, Ps.59.18, Ps.6.11,
> Ps.60.13, Ps.60.14, Ps.61.9, Ps.62.13, Ps.63.12, Ps.64.11, Ps.65.14,
> Ps.67.8, Ps.68.36, Ps.69.37, Ps.7.18, Ps.70.6, Ps.75.11, Ps.76.13,
> Ps.77.21, Ps.8.10, Ps.80.20, Ps.81.17, Ps.83.19, Ps.84.13, Ps.85.14,
> Ps.88.19, Ps.89.53, Ps.9.21, Ps.92.16, Rev.12.18
> There are 93 OT IDs and 5 NT IDs in v11n which aren’t in your file.
> The following IDs don’t appear in v11n:
>
> %s 1Kgs.22.54, 1Sam.20.43, 1Sam.24.23, 3John.1.15, Acts.24.28, Eccl.12.15,
> Eccl.12.16, Ezek.21.33, Ezek.21.34, Ezek.21.35, Ezek.21.36, Ezek.21.37,
> Hos.12.15, Isa.8.23, Job.39.31, Job.39.32, Job.39.33, Job.39.34, Job.39.35,
> Job.39.36, Job.39.37, Job.39.38,
> Job.40.25, Job.40.26, Job.40.27, Job.40.28, Jonah.2.11, Mark.10.53,
> Mark.9.51, Num.13.34, Num.30.17, Ps.102.29, Ps.108.14, Ps.12.9, Ps.140.14,
> Ps.142.8, Ps.18.51, Ps.19.15, Ps.20.10, Ps.21.14, Ps.22.32, Ps.3.9,
> Ps.30.13, Ps.31.25, Ps.34.23, Ps.36.13,
> Ps.38.23, Ps.39.14, Ps.4.9, Ps.40.18, Ps.41.14, Ps.42.12, Ps.44.27,
> Ps.45.18, Ps.46.12, Ps.47.10, Ps.48.15, Ps.49.21, Ps.5.13, Ps.51.20,
> Ps.51.21, Ps.52.10, Ps.52.11, Ps.53.7, Ps.54.8, Ps.54.9, Ps.55.24,
> Ps.56.14, Ps.57.12, Ps.58.12, Ps.59.18, Ps.6.11,
> Ps.60.13, Ps.60.14, Ps.61.9, Ps.62.13, Ps.63.12, Ps.64.11, Ps.65.14,
> Ps.67.8, Ps.68.36, Ps.69.37, Ps.7.18, Ps.70.6, Ps.75.11, Ps.76.13,
> Ps.77.21, Ps.8.10, Ps.80.20, Ps.81.17, Ps.83.19, Ps.84.13, Ps.85.14,
> Ps.88.19, Ps.89.53, Ps.9.21, Ps.92.16, Rev.12.18
> There are 1 OT IDs and 29 NT IDs in your file which don’t appear
> in v11n.
>
>
> On Wed, Jun 18, 2025 at 11:00 PM Greg Hellings <greg.hellings at gmail.com>
> wrote:
>
>> Here is an example of the first lines of running my script against the
>> kjv.osis.xml file from the git repo:
>>
>>
>> Checking Calvin:
>> ----------------
>> There are 93 OT IDs and 5 NT IDs in v11n which aren’t in your
>> file.
>> There are 0 OT IDs and 30 NT IDs in your file which don’t appear
>> in v11n.
>>
>> Checking Catholic:
>> ------------------
>> There are 4530 OT IDs and 3 NT IDs in v11n which aren’t in your
>> file.
>> There are 0 OT IDs and 133 NT IDs in your file which don’t appear
>> in v11n.
>>
>> Checking Catholic2:
>> -------------------
>> There are 4638 OT IDs and 3 NT IDs in v11n which aren’t in your
>> file.
>> There are 0 OT IDs and 133 NT IDs in your file which don’t appear
>> in v11n.
>>
>> Checking DarbyFr:
>> -----------------
>> There are 31 OT IDs and 4 NT IDs in v11n which aren’t in your
>> file.
>> There are 0 OT IDs and 30 NT IDs in your file which don’t appear
>> in v11n.
>>
>> This continues on to include such output as
>>
>>
>>
>> Checking KJV:
>> -------------
>> Your file has all the references in this v11n
>> Your file has no extra references
>>
>>
>>
>> Checking KJVA:
>> --------------
>> There are 5717 OT IDs and 0 NT IDs in v11n which aren’t in your
>> file.
>> Your file has no extra references
>>
>> giving a clear example of a winner for this particular file.
>>
>> Meanwhile, running it against the kjva.osis.xml file includes this in the
>> results:
>>
>> ...
>>
>> Checking KJV:
>> -------------
>> Your file has all the references in this v11n
>> There are 2 OT IDs and 5715 NT IDs in your file which don’t
>> appear in v11n.
>>
>> Checking KJVA:
>> --------------
>> Your file has all the references in this v11n
>> Your file has no extra references
>> ...
>>
>> Fiddling with the file has shown me there are a couple of places where I
>> need to tweak it for Python 3 compatibility that I missed the last time I
>> updated. But fixing those couple of little syntax issues resulted in it
>> running just fine in a Fedora 41 VM with nothing more to do than invoke
>> `dnf install python3-sword` to set up the system to use it.
>>
>> --Greg
>>
>> On Wed, Jun 18, 2025 at 10:40 PM Greg Hellings <greg.hellings at gmail.com>
>> wrote:
>>
>>> My script eschews percentages because they seemed relatively pointless
>>> to me for measuring a mismatch like this. Instead it gives a count of both
>>> Old and New Testament osisIDs that it finds missing and another that it
>>> finds unexpectedly for a given versification. If the total of either count
>>> is fewer than 100, the IDs for that particular count are printed to the
>>> console. It will do this for every registered versification in the version
>>> of the library it was compiled against, allowing the user to select
>>> whichever one seems best to them based on the results.
>>>
>>> On Wed, Jun 18, 2025, 10:25 PM David Haslam <dfhdfh at protonmail.com>
>>> wrote:
>>>
>>>> It’s not just the number of “missing” verses that should figure in the
>>>> percentage score, but also the number of verses that get concatenated to
>>>> the last one in a chapter.
>>>>
>>>> The differences in v11n for the Psalms will be especially significant
>>>> for this, in that some v11n renumber many of them. Likewise for the last
>>>> few chapters in the book of Job.
>>>>
>>>> Aside: It would be cool to enhance the utility emptyvss by providing a
>>>> command line option that would ignore books that are not included in the
>>>> scope parameter in the conf file.
>>>>
>>>> Regards,
>>>>
>>>> David
>>>>
>>>> On Thu, Jun 19, 2025 at 03:18, DM Smith <dmsmith at crosswire.org> wrote:
>>>>
>>>> David,
>>>>
>>>> Because it only considers the xml, scope is automatically built into
>>>> it. It is only comparing what is present in the xml with what is part of
>>>> the av11ns.
>>>>
>>>> It might be good to add the enumeration of missing verses.
>>>>
>>>> — DM
>>>>
>>>> On Jun 18, 2025, at 4:02 PM, David Haslam <dfhdfh at protonmail.com>
>>>> wrote:
>>>>
>>>> Does it take account of the Scope key in the .conf file for a less than
>>>> complete Bible ?
>>>>
>>>> David
>>>>
>>>> On Wed, Jun 18, 2025 at 20:51, DM Smith <dmsmith at crosswire.org> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Several have commented on how hard it is to test an OSIS xml file
>>>> against v11ns especially since it goes off into an infinite loop. (I’ve
>>>> posted a patch that fixes that) But it is still a process of trial and
>>>> error to find an appropriate v11n.
>>>>
>>>> So, I’ve been iterating with chatGPT to create a python script to find
>>>> a best fit v11n. Since I don’t know python, I can’t vouch for the script
>>>> beyond it worked for a simple test case that had an extra chapter for
>>>> Genesis and had some extra verses at the end of a chapter in that book.
>>>>
>>>> I offer it, as a starting place. See the attached file.
>>>>
>>>> It has a --debug flag.
>>>> The first argument is expected to be the OSIS xml file.
>>>> The second argument is optional and gives the location to the include
>>>> directory of svn/sword/trunk/include with all the canon*.h files. If you
>>>> don’t supply the argument, it uses the web to load the canon*.h files from
>>>> https://www.crosswire.org/svn/sword/trunk/include.
>>>>
>>>> It will score the fitness of each of the v11ns. It gives the score as a
>>>> %, but I don’t know what that means. I told it that it should prioritize
>>>> book matches, then chapter matches and finally verse matches. I don’t know
>>>> how well it did that scoring. I didn’t test for that.
>>>>
>>>> The output is alphabetized. If more than one v11n have the same high
>>>> score, they are listed.
>>>>
>>>> In His Service,
>>>> DM
>>>>
>>>> _______________________________________________
>>>> sword-devel mailing list: sword-devel at crosswire.org
>>>> http://crosswire.org/mailman/listinfo/sword-devel
>>>> Instructions to unsubscribe/change your settings at above page