[sword-devel] Script to find a best fit v11n

DM Smith dmsmith at crosswire.org
Thu Jun 19 17:12:43 EDT 2025


> On Jun 19, 2025, at 3:24 PM, Greg Hellings <greg.hellings at gmail.com> wrote:
> 
> 
> 
> On Thu, Jun 19, 2025 at 9:07 AM DM Smith <dmsmith at crosswire.org> wrote:
>> Greg,
>> There’s an extraneous %s in the output.
> 
> Ah, not surprising. That is the old, Python 2 way of formatting variables into a string, similar to C style printf syntax with variable arguments coming in a tuple after an overload of the modulus operator (so it would look like `"this is a string: %s" % (a_string, )` ). The modern preferred way is with an f-string, where you preface a string with the character `f` and then reference variables in the string with {variable_name} syntax (e.g. `f"this is a string: {a_string}"`). That %s can be killed off, or replaced with an f-string equivalent.
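
For anyone following along, here is a minimal sketch of the two formatting styles described above, plus what a stray %s does when nothing is substituted into it. The variable names and the sample osisID are placeholders for illustration and are not taken from av11n.py.

```python
# Placeholder value for illustration; not taken from av11n.py.
a_string = "Gen.1.1"

# Old printf-style formatting: arguments follow the % operator as a tuple.
old_style = "this is a string: %s" % (a_string,)

# f-string: the variable is referenced inline with {name}.
new_style = f"this is a string: {a_string}"

# A "%s" left in a plain string is never substituted; it prints literally,
# which is one way an extraneous %s can end up in the output.
stray = "%s 1Kgs.22.54"

print(old_style)
print(new_style)
print(stray)
```

Both styles build the same string; the f-string simply keeps the variable next to the text it belongs to.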
>  
>> 
>> If you put the enumeration after the line “There are 93 OT IDs and 5 NT IDs in v11n which aren’t in your file.”, then you wouldn’t need the heading “The following IDs don’t appear in your file:”
> 
> Yeah, I had been putting the IDs out to stderr with the logging utility previously. It was only yesterday when I was squashing the remaining Python 3 compat issues that I realized I should just drop them into a print statement. They are, thusly, kinda crazy. In fact, I pass them through a `sort` call, so they won't be in either canonical or document order - unless the document has its verses sorted alphabetically by osisID attribute for some inexplicable reason.
>  
>> It’d also be nice to format it a few per line, indented appropriately.
> 
> Perhaps broken up by book? Or by book/chapter, so it's like:
> Verses missing from:
> Gen
>    1 - 1, 3, 5, 7
>    2 - 11, 22
> Exo
>   27 - 1
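
A grouped layout along those lines could be produced roughly as follows. This is only a sketch: `format_missing` and `book_order` are hypothetical names, and it assumes every osisID has exactly the `Book.chapter.verse` form with numeric chapter and verse. Because canonical book order depends on the v11n, the caller may pass it in; otherwise books appear in first-seen order.

```python
from collections import defaultdict


def group_ids(osis_ids):
    """Group 'Book.chapter.verse' IDs into {book: {chapter: [verses]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for osis_id in osis_ids:
        book, chapter, verse = osis_id.split(".")
        grouped[book][int(chapter)].append(int(verse))
    return grouped


def format_missing(osis_ids, book_order=None):
    """Render missing IDs grouped by book and chapter, a few per line."""
    grouped = group_ids(osis_ids)
    # Books named in book_order come first; any others keep first-seen order.
    books = [b for b in (book_order or []) if b in grouped]
    books += [b for b in grouped if b not in books]
    lines = ["Verses missing from:"]
    for book in books:
        lines.append(book)
        # Numeric sort, so chapter 9 precedes chapter 12 (unlike a text sort).
        for chapter in sorted(grouped[book]):
            verses = ", ".join(str(v) for v in sorted(grouped[book][chapter]))
            lines.append(f"  {chapter:>3} - {verses}")
    return "\n".join(lines)


print(format_missing(
    ["Exod.27.1", "Gen.2.22", "Gen.1.3", "Gen.1.1", "Gen.2.11"],
    book_order=["Gen", "Exod"],
))
```

Sorting chapters and verses as integers also avoids the alphabetical ordering that a plain `sort` of the ID strings gives.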
> 
> There is a long way to go to improve the output, especially of this detail portion. It was, after all, only intended as debugging output for me while I was writing it.
>  
>> 
>> I’d be happy to iterate over any suggestions we agree on.
> 
> As I am not a user of it, nor an intended consumer of it, feel free to improve it as needed! I quickly hacked it together and tossed it out into the world at someone's off-handed request. I don't create modules, though, so I have no vested interest in preserving its current operation in any particular form. And, if this thread has shown anything, it's that likely Peter has been the only user to date. So I doubt you'll disturb anyone else with it.
> 
> If you need my support for anything, I'm happy to lend a hand.
> 
> Pulling in comments from your other email on this thread:
> 
> > I like that it's very simple to read. Having a summary is good. And the other email which lists the exact ids extra/missing per testament is very helpful.
> > I think that enumerating the names of the extra/missing books and extra/missing chapters would be good. No sense in enumerating the ids within these.
> 
> That probably would be good. I didn't include detection for an entire missing chapter or book, but it shouldn't be too terribly difficult to enhance it with that. A simple brute force check of every detected missing book or chapter to see if there are any matched verses can reveal that pretty easily.
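
The brute force check described above might look something like this. It is a sketch under assumptions: `summarize_missing` is a hypothetical helper, IDs are assumed to be plain `Book.chapter.verse` strings, and a book or chapter counts as entirely missing when the file contains no verse from it at all.

```python
def summarize_missing(missing_ids, present_ids):
    """Split missing IDs into whole books, whole chapters, and single verses."""
    present_books = {i.split(".")[0] for i in present_ids}
    present_chapters = {tuple(i.split(".")[:2]) for i in present_ids}

    missing_books, missing_chapters, missing_verses = set(), set(), []
    for osis_id in missing_ids:
        book, chapter, _verse = osis_id.split(".")
        if book not in present_books:
            # No verse of this book is in the file: report the book once.
            missing_books.add(book)
        elif (book, chapter) not in present_chapters:
            # The book is present, but nothing from this chapter is.
            missing_chapters.add(f"{book}.{chapter}")
        else:
            missing_verses.append(osis_id)
    return sorted(missing_books), sorted(missing_chapters), missing_verses


books, chapters, verses = summarize_missing(
    missing_ids=["Tob.1.1", "Tob.1.2", "Gen.50.1", "Gen.1.5"],
    present_ids=["Gen.1.1", "Gen.1.2"],
)
print(books, chapters, verses)
```

The same function works for extra IDs by swapping the roles: pass the extra IDs as `missing_ids` and the v11n's IDs as `present_ids`.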
> 
> > I ran mine against an input that was a test case for osis2mod’s infinite loop and it had 2 extra books and 13 extra chapters. This wouldn’t be obvious in your results.
> 
> True, mine would just complain about hundreds or even thousands of mismatches and silently swallow the list of what those are. I had a few of those that I omitted from the sample output I captured. For instance, there are large portions of the canon for the Catholic versifications missing from the KJV file. It just lists something absurd like "There are 4,741 missing verses" or whatever it is.
> 
> > Is it an advantage or disadvantage to be compiled against SWORD lib vs slurping header files?
> 
> Like most things, it's a trade-off. Working with the bindings requires that the Sword bindings are installed on the host system. For someone running on Windows, this is particularly non-trivial. For someone running on macOS it's not too difficult to install from source (I don't believe Homebrew builds them). For users of major Linux distributions, it's downright trivial. On Fedora it has been as simple as a single `dnf install python3-sword` command for a long time now, and it looks like the bindings are also available for Ubuntu starting in 25.04 with an `apt install python3-sword`.

Regarding building SWORD on a Mac, I use Homebrew for extra packages. I tried to run ./autogen.sh, but it failed on libtoolize, which Homebrew doesn’t have. Then I ran cmake, which failed because icu4c required C++17 or better. After hacking CMakeLists.txt, I got it to work. I’ll see if I can use that to run your script.

> Advantages of the binding method are that it doesn't rely on parsing a C header file, nor on the file laying out the values in a certain way. It also can be used offline easily, doesn't require parsing the output of HTML in order to find all the applicable files, and is likely slightly faster. Not that the speed probably matters for a single run of this, but if you're bulk processing files the speed advantages can add up.

I wrote mine so that it can use the include/canon*.h files from a prior local SVN clone. This is very fast. I’d be curious to see how it differs in speed from yours. The default is to go against the web, which is painfully slow. (Note, it doesn’t yet do the standard disclaimer for the web.) Not a big deal if it is a single run. Peter mentioned that he does additional analysis of the files in problematic areas that cannot be done by the script.

Using the Python bindings does have the advantage of not reinventing the wheel. I was impressed with ChatGPT’s regular expressions to slurp the arrays and how concise it was to read the files. There really wasn’t any difficulty in parsing the files. Since the canon*.h files are very static, their layout is not likely to change in a way that affects the parse, so I don’t think this is that big a deal.
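
To give a flavor of the header-slurping approach, here is a rough sketch. It is not the script attached earlier in the thread. It assumes book entries of the form {"Name", "OsisID", "Abbrev", chapter_count} and a flat vm...[] array holding the last verse number of each chapter in book order. The embedded SAMPLE text and all the function names are made up for illustration; a real canon*.h file may need more careful handling.

```python
import re

# Made-up excerpt in the assumed layout; not copied from a real canon*.h file.
SAMPLE = '''
struct sbook otbooks_sample[] = {
  {"Genesis", "Gen", "Gen", 3},
  {"Exodus", "Exod", "Exod", 2},
  {"", "", "", 0}
};
int vm_sample[] = {
  // Genesis
  31, 25, 24,
  // Exodus
  22, 25
};
'''

# Captures name, OSIS ID and chapter count; the empty terminator entry is skipped.
BOOK_RE = re.compile(
    r'\{\s*"([^"]+)"\s*,\s*"([^"]+)"\s*,\s*"[^"]*"\s*,\s*(\d+)\s*\}'
)


def parse_books(text):
    """Return [(osis_id, chapter_count), ...] in file order."""
    return [(osis, int(count)) for _name, osis, count in BOOK_RE.findall(text)]


def parse_verse_maxima(text):
    """Return the flat list of per-chapter verse counts from the vm array."""
    body = re.search(r'\bvm\w*\s*\[\]\s*=\s*\{(.*?)\}', text, re.S).group(1)
    body = re.sub(r'//[^\n]*', '', body)  # drop // comments before reading numbers
    return [int(n) for n in re.findall(r'\d+', body)]


def build_canon(text):
    """Map each OSIS book ID to its list of last-verse numbers per chapter."""
    maxima = iter(parse_verse_maxima(text))
    return {osis: [next(maxima) for _ in range(count)]
            for osis, count in parse_books(text)}


print(build_canon(SAMPLE))
```

From a mapping like this, the full set of expected osisIDs for a v11n can be generated and compared against the IDs found in the OSIS file.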


> 
> Disadvantages of the binding method are that it requires you to fall back to a source build if you are using this to test a canon.h file or if you want to use a canon file that isn't available in the package manager of your Linux distribution. Building from source isn't terribly onerous for most of us contributors, but it might be more of a problem for a module maintainer. Then again, how often do we add a new versification to the code base?

So, it’s not something we’d expect a module maker to succeed at if not on Un*x. Maybe someone has a library release for macOS or Windows that could be used?

> 
> So there are pros and cons between them. I was freshly off of getting the bindings to compile when I wrote the first draft of av11n.py so I naturally went that direction. I also try to avoid writing parsers when I can leverage existing ones, as grammars can be notoriously complex to get correct. So that dictated my choices as much as did anything else, really!

My computer science master’s degree was in compiler writing! It’s definitely not for the faint of heart!

> 
> Another possible enhancement might be a CLI flag to limit the testing range to a particular book (or testament) at a time. I have heard people talk about having modules split up to one book per file or similar. If they could say, "Only check this file against Joshua" then it could keep down a significant amount of extra output. But again - I'm not really an intended user of it!

Great idea. That is essentially David’s suggestion of a scope argument.
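
A flag like that could be wired up with argparse along these lines. This is a sketch only: the `--book` option, `build_parser` and `in_scope` are hypothetical and do not exist in either script today, and the file name is a placeholder.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Check an OSIS file against versifications (sketch)."
    )
    parser.add_argument("osis_file", help="path to the OSIS XML file")
    # Hypothetical flag; may be repeated, e.g. --book Josh --book Judg
    parser.add_argument("--book", action="append", metavar="OSIS_BOOK",
                        help="only compare IDs that belong to this book")
    return parser


def in_scope(osis_id, books):
    """True when no restriction was given or the ID's book is listed."""
    return not books or osis_id.split(".", 1)[0] in books


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["josh.osis.xml", "--book", "Josh"])
v11n_ids = ["Gen.1.1", "Josh.1.1", "Josh.1.2"]
scoped_ids = [i for i in v11n_ids if in_scope(i, args.book)]
print(scoped_ids)
```

Filtering the v11n's IDs before the comparison keeps the "missing" list limited to the books the file is meant to contain.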

And I’m not an intended user of it either. I’m just trying to get people to use something other than osis2mod to pick a versification. Looking at the Jira issues on osis2mod, in one issue a person listed their script that looped over the v11ns and called osis2mod with each. Yuck!

> 
> --Greg
> 
>> 
>> DM
>> 
>>> On Jun 19, 2025, at 12:12 AM, Greg Hellings <greg.hellings at gmail.com> wrote:
>>> 
>>> And here's an example now that I've fixed the output of the osisIDs when there are fewer than 100 of them:
>>> 
>>> [vagrant at localhost ~]$ ./av11n.py kjv.osis.xml                                                                                 
>>>                                                                                                                                
>>> Checking Calvin:
>>> ----------------   
>>>         The following IDs don’t appear in your file:
>>> %s 1Kgs.22.54, 1Sam.20.43, 1Sam.24.23, 3John.1.15, Acts.24.28, Eccl.12.15, Eccl.12.16, Ezek.21.33, Ezek.21.34, Ezek.21.35, Ezek.21.36, Ezek.21.37, Hos.12.15, Isa.8.23, Job.39.31, Job.39.32, Job.39.33, Job.39.34, Job.39.35, Job.39.36, Job.39.37, Job.39.38,
>>> Job.40.25, Job.40.26, Job.40.27, Job.40.28, Jonah.2.11, Mark.10.53, Mark.9.51, Num.13.34, Num.30.17, Ps.102.29, Ps.108.14, Ps.12.9, Ps.140.14, Ps.142.8, Ps.18.51, Ps.19.15, Ps.20.10, Ps.21.14, Ps.22.32, Ps.3.9, Ps.30.13, Ps.31.25, Ps.34.23, Ps.36.13,
>>> Ps.38.23, Ps.39.14, Ps.4.9, Ps.40.18, Ps.41.14, Ps.42.12, Ps.44.27, Ps.45.18, Ps.46.12, Ps.47.10, Ps.48.15, Ps.49.21, Ps.5.13, Ps.51.20, Ps.51.21, Ps.52.10, Ps.52.11, Ps.53.7, Ps.54.8, Ps.54.9, Ps.55.24, Ps.56.14, Ps.57.12, Ps.58.12, Ps.59.18, Ps.6.11,
>>> Ps.60.13, Ps.60.14, Ps.61.9, Ps.62.13, Ps.63.12, Ps.64.11, Ps.65.14, Ps.67.8, Ps.68.36, Ps.69.37, Ps.7.18, Ps.70.6, Ps.75.11, Ps.76.13, Ps.77.21, Ps.8.10, Ps.80.20, Ps.81.17, Ps.83.19, Ps.84.13, Ps.85.14, Ps.88.19, Ps.89.53, Ps.9.21, Ps.92.16, Rev.12.18
>>>         There are 93 OT IDs and 5 NT IDs in v11n which aren’t in your file.
>>>         The following IDs don’t appear in v11n:                                                                                
>>> %s 1Kgs.22.54, 1Sam.20.43, 1Sam.24.23, 3John.1.15, Acts.24.28, Eccl.12.15, Eccl.12.16, Ezek.21.33, Ezek.21.34, Ezek.21.35, Ezek.21.36, Ezek.21.37, Hos.12.15, Isa.8.23, Job.39.31, Job.39.32, Job.39.33, Job.39.34, Job.39.35, Job.39.36, Job.39.37, Job.39.38,
>>> Job.40.25, Job.40.26, Job.40.27, Job.40.28, Jonah.2.11, Mark.10.53, Mark.9.51, Num.13.34, Num.30.17, Ps.102.29, Ps.108.14, Ps.12.9, Ps.140.14, Ps.142.8, Ps.18.51, Ps.19.15, Ps.20.10, Ps.21.14, Ps.22.32, Ps.3.9, Ps.30.13, Ps.31.25, Ps.34.23, Ps.36.13,
>>> Ps.38.23, Ps.39.14, Ps.4.9, Ps.40.18, Ps.41.14, Ps.42.12, Ps.44.27, Ps.45.18, Ps.46.12, Ps.47.10, Ps.48.15, Ps.49.21, Ps.5.13, Ps.51.20, Ps.51.21, Ps.52.10, Ps.52.11, Ps.53.7, Ps.54.8, Ps.54.9, Ps.55.24, Ps.56.14, Ps.57.12, Ps.58.12, Ps.59.18, Ps.6.11,
>>> Ps.60.13, Ps.60.14, Ps.61.9, Ps.62.13, Ps.63.12, Ps.64.11, Ps.65.14, Ps.67.8, Ps.68.36, Ps.69.37, Ps.7.18, Ps.70.6, Ps.75.11, Ps.76.13, Ps.77.21, Ps.8.10, Ps.80.20, Ps.81.17, Ps.83.19, Ps.84.13, Ps.85.14, Ps.88.19, Ps.89.53, Ps.9.21, Ps.92.16, Rev.12.18
>>>         There are 1 OT IDs and 29 NT IDs in your file which don’t appear in v11n.
>>> 
>>> 
>>> On Wed, Jun 18, 2025 at 11:00 PM Greg Hellings <greg.hellings at gmail.com> wrote:
>>>> Here is an example of the first lines of running my script against the kjv.osis.xml file from the git repo:
>>>> 
>>>> 
>>>> Checking Calvin:
>>>> ----------------
>>>>         There are 93 OT IDs and 5 NT IDs in v11n which aren’t in your file.
>>>>         There are 0 OT IDs and 30 NT IDs in your file which don’t appear in v11n.
>>>> 
>>>> Checking Catholic:
>>>> ------------------
>>>>         There are 4530 OT IDs and 3 NT IDs in v11n which aren’t in your file.
>>>>         There are 0 OT IDs and 133 NT IDs in your file which don’t appear in v11n.
>>>> 
>>>> Checking Catholic2:
>>>> -------------------
>>>>         There are 4638 OT IDs and 3 NT IDs in v11n which aren’t in your file.
>>>>         There are 0 OT IDs and 133 NT IDs in your file which don’t appear in v11n.
>>>> 
>>>> Checking DarbyFr:
>>>> -----------------
>>>>         There are 31 OT IDs and 4 NT IDs in v11n which aren’t in your file.
>>>>         There are 0 OT IDs and 30 NT IDs in your file which don’t appear in v11n.
>>>> 
>>>> This continues on to include such output as
>>>> 
>>>>                                                                                                                                
>>>> Checking KJV:
>>>> ------------- 
>>>>         Your file has all the references in this v11n
>>>>         Your file has no extra references                                                                                      
>>>>                                                                                                                                
>>>> Checking KJVA:         
>>>> --------------
>>>>         There are 5717 OT IDs and 0 NT IDs in v11n which aren’t in your file.
>>>>         Your file has no extra references
>>>> 
>>>> giving a clear example of a winner for this particular file.
>>>> 
>>>> Meanwhile, running it against the kjva.osis.xml file includes this in the results:
>>>> 
>>>> ...
>>>> 
>>>> Checking KJV:        
>>>> -------------        
>>>>         Your file has all the references in this v11n
>>>>         There are 2 OT IDs and 5715 NT IDs in your file which don’t appear in v11n.
>>>>                                                                
>>>> Checking KJVA:                                                                                                                 
>>>> --------------                                                                                                                 
>>>>         Your file has all the references in this v11n
>>>>         Your file has no extra references
>>>> ...
>>>> 
>>>> Fiddling with the file has shown me there are a couple of places where I need to tweak it for Python 3 compatibility that I missed the last time I updated. But fixing those couple of little syntax issues resulted in it running just fine in a Fedora 41 VM with nothing more to do than invoke `dnf install python3-sword` to set up the system to use it.
>>>> 
>>>> --Greg
>>>> 
>>>> On Wed, Jun 18, 2025 at 10:40 PM Greg Hellings <greg.hellings at gmail.com> wrote:
>>>>> My script eschews percentages because they seemed relatively pointless to me for measuring a mismatch like this. Instead it gives a count of both Old and New Testament osisIDs that it finds missing and another that it finds unexpectedly for a given versification. If the total of either count is fewer than 100, the IDs for that particular count are printed to the console. It will do this for every registered versification in the version of the library it was compiled against, allowing the user to select whichever one seems best to them based on the results.
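
The counting approach described above boils down to two set differences. The following is a simplified sketch, not the code from av11n.py: `compare` is a hypothetical name, the split into OT and NT counts is left out because it needs the v11n's book lists, and the threshold of 100 is taken from the description.

```python
def compare(file_ids, v11n_ids, limit=100):
    """Report IDs missing from the file and IDs the v11n does not know."""
    missing = sorted(set(v11n_ids) - set(file_ids))  # in v11n, not in file
    extra = sorted(set(file_ids) - set(v11n_ids))    # in file, not in v11n

    report = []
    for label, ids in (("missing from your file", missing),
                       ("not in the v11n", extra)):
        report.append(f"{len(ids)} IDs {label}")
        # Only enumerate short lists; long ones would drown the summary.
        if 0 < len(ids) < limit:
            report.append("  " + ", ".join(ids))
    return report


result = compare(
    file_ids=["Gen.1.1", "Gen.1.2", "Gen.1.99"],
    v11n_ids=["Gen.1.1", "Gen.1.2", "Gen.1.3"],
)
print("\n".join(result))
```

Running this once per registered versification gives the per-v11n summaries shown in the sample output; a v11n with zero missing and zero extra IDs is the obvious best fit.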
>>>>> 
>>>>> On Wed, Jun 18, 2025, 10:25 PM David Haslam <dfhdfh at protonmail.com> wrote:
>>>>>> It’s not just the number of “missing” verses that should figure in the percentage score, but also the number of verses that get concatenated to the last one in a chapter.
>>>>>> 
>>>>>> The differences in v11n for the Psalms will be especially significant for this, in that some v11n renumber many of them. Likewise for the last few chapters in the book of Job.
>>>>>> 
>>>>>> Aside: It would be cool to enhance the utility emptyvss by providing a command line option that would ignore books that are not included in the scope parameter in the conf file.
>>>>>> 
>>>>>> Regards,
>>>>>> 
>>>>>> David
>>>>>> 
>>>>>> On Thu, Jun 19, 2025 at 03:18, DM Smith <dmsmith at crosswire.org> wrote:
>>>>>>> 
>>>>>>> David,
>>>>>>> 
>>>>>>> Because it only considers the xml, scope is automatically built into it. It is only comparing what is present in the xml with what is part of the av11ns. 
>>>>>>> 
>>>>>>> It might be good to add the enumeration of missing verses.
>>>>>>> 
>>>>>>> — DM
>>>>>>> 
>>>>>>>> On Jun 18, 2025, at 4:02 PM, David Haslam <dfhdfh at protonmail.com> wrote:
>>>>>>>> 
>>>>>>>> Does it take account of the Scope key in the .conf file for a less than complete Bible?
>>>>>>>> 
>>>>>>>> David
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Jun 18, 2025 at 20:51, DM Smith <dmsmith at crosswire.org> wrote:
>>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> Several have commented on how hard it is to test an OSIS xml file against v11ns, especially since osis2mod goes off into an infinite loop. (I’ve posted a patch that fixes that.) But it is still a process of trial and error to find an appropriate v11n.
>>>>>>>>> 
>>>>>>>>> So, I’ve been iterating with ChatGPT to create a Python script to find a best fit v11n. Since I don’t know Python, I can’t vouch for the script beyond the fact that it worked for a simple test case that had an extra chapter for Genesis and had some extra verses at the end of a chapter in that book.
>>>>>>>>> 
>>>>>>>>> I offer it, as a starting place. See the attached file.
>>>>>>>>> 
>>>>>>>>> It has a --debug flag.
>>>>>>>>> The first argument is expected to be the OSIS xml file.
>>>>>>>>> The second argument is optional and gives the location of the include directory of svn/sword/trunk/include with all the canon*.h files. If you don’t supply the argument, it uses the web to load the canon*.h files from https://www.crosswire.org/svn/sword/trunk/include.
>>>>>>>>> 
>>>>>>>>> It will score the fitness of each of the v11ns. It gives the score as a %, but I don’t know what that means. I told it that it should prioritize book matches, then chapter matches and finally verse matches. I don’t know how well it did that scoring. I didn’t test for that.
>>>>>>>>> 
>>>>>>>>> The output is alphabetized. If more than one v11n has the same high score, they are all listed.
>>>>>>>>> 
>>>>>>>>> In His Service,
>>>>>>>>>  DM
>>>>>>>>> 
>>>>>>>> _______________________________________________ 
>>>>>>>> sword-devel mailing list: sword-devel at crosswire.org
>>>>>>>> http://crosswire.org/mailman/listinfo/sword-devel 
>>>>>>>> Instructions to unsubscribe/change your settings at above page
>>>>>>> 
