[sword-devel] Script to find a best fit v11n
Greg Hellings
greg.hellings at gmail.com
Thu Jun 19 23:20:49 EDT 2025
On Thu, Jun 19, 2025 at 4:13 PM DM Smith <dmsmith at crosswire.org> wrote:
>
> On Jun 19, 2025, at 3:24 PM, Greg Hellings <greg.hellings at gmail.com>
> wrote:
>
>
> Like most things, it's a trade-off. Working with the bindings requires
> that the Sword bindings are installed on the host system. For someone
> running on Windows, this is particularly non-trivial. For someone running
> in macOS it's not too difficult to install from source (I don't believe
> Homebrew builds them). For users of major Linux distributions, it's
> downright trivial. On Fedora it has been as simple as a single `dnf
> install python3-sword` command for a long time now, and it looks like the
> bindings are also available for Ubuntu starting in 25.04 with an `apt
> install python3-sword` as well.
>
>
> Regarding building SWORD on a Mac, I use homebrew for extra packages. I
> tried to run ./autogen.sh, but it failed on libtoolize, which homebrew
> doesn’t have. Then I ran cmake, which failed because icu4c required C++17
> or better. After hacking CMakeLists.txt, I got it to work. I’ll see if
> I can use that to run your script.
>
For these purposes, neither ICU nor CLucene is needed. The script only pulls
the versification data, which is built into the core of the library.
>
> Advantages of the binding method are that it doesn't rely on parsing a C
> header file, nor on the file laying out the values in a certain way. It
> also can be used offline easily, doesn't require parsing the output of HTML
> in order to find all the applicable files, and is likely slightly faster.
> Not that the speed probably matters for a single run of this, but if you're
> bulk processing files the speed advantages can add up.
>
>
> The way I wrote mine is that it could use the include/canon*.h files from
> a prior local SVN clone. This is very fast. I’d be curious to see how it
> differs in speed from yours. The default is to go against the web, which is
> painfully slow. (Note, it doesn’t yet do the standard disclaimer for the
> web.) Not a big deal if it is a single run. Peter mentioned that he does
> additional analysis of the files in problematic areas that cannot be done
> by the script.
>
> Using the Python bindings does have the advantage of not re-inventing the
> wheel. I was impressed with ChatGPT’s regular expressions to slurp the
> arrays and how concise it was to read the files. There really wasn’t any
> difficulty in parsing the files. Since the canon*.h files are very static
> and not likely to affect the parse, I don’t think this is that big a deal.
>
Yeah, the canon header files are pretty well structured and follow a
standard format that makes them easier for humans, and thus regexes, to
swallow. The thought of parsing them that way had simply never crossed my
mind.
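
For anyone curious what the regex approach might look like, here is a
minimal sketch. It assumes a header laid out roughly like the canon*.h
files: an array of book rows followed by a flat "vm" array of verse counts
per chapter. The sample text, the book names, and the helper names
(parse_books, parse_int_array, verse_maxima) are all made up for
illustration, and I have not checked the patterns against every real header.

```python
import re

# Toy excerpt in the general shape of a canon*.h file. The books and the
# numbers are invented; real headers are far longer.
SAMPLE = '''
struct sbook otbooks[] = {
  {"Book One", "One", "One", 3},
  {"Book Two", "Two", "Two", 2},
  {"", "", "", 0}
};

int vm[] = {
  // Book One
  31, 25, 24,
  // Book Two
  22, 25
};
'''

# One book row: {"Name", "OSIS", "Abbrev", chapter_count}
BOOK_RE = re.compile(
    r'\{\s*"([^"]+)"\s*,\s*"([^"]+)"\s*,\s*"([^"]+)"\s*,\s*(\d+)\s*\}'
)


def parse_books(text):
    """Return (osis_name, chapter_count) pairs; the empty terminator row is skipped."""
    return [(m.group(2), int(m.group(4))) for m in BOOK_RE.finditer(text)]


def parse_int_array(text, name):
    """Return the integers of `int name[] = { ... };`, ignoring // comments."""
    pattern = re.compile(
        r'int\s+' + re.escape(name) + r'\s*\[\]\s*=\s*\{(.*?)\}\s*;', re.S
    )
    match = pattern.search(text)
    if match is None:
        return []
    body = re.sub(r'//[^\n]*', '', match.group(1))
    return [int(n) for n in re.findall(r'\d+', body)]


def verse_maxima(text):
    """Map each book to its list of per-chapter verse counts."""
    counts = parse_int_array(text, 'vm')
    table, pos = {}, 0
    for osis, chapters in parse_books(text):
        table[osis] = counts[pos:pos + chapters]
        pos += chapters
    return table


if __name__ == '__main__':
    print(verse_maxima(SAMPLE))
```

The trade-off is the one already mentioned: this only works as long as the
header keeps its layout, whereas the bindings read the same data from the
library itself.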
>
>
>
> Disadvantages of the binding method are that it requires you to fall
> back to a source build if you are using this to test a canon.h file or if
> you want to use a canon file that isn't available in the package manager of
> your Linux distribution. Building from source isn't terribly onerous for
> most of us contributors but it might be more of a problem for a module
> maintainer. Then again, how often do we add a new versification to the code
> base?
>
>
> So, it’s not something we’d expect a module maker to succeed at if not on
> Un*x. Maybe someone has a library release for the MacOS or Windows that
> could be used?
>
Because our Python bindings are built as part of the library and generated
by Swig, they aren't distributed onto PyPI (the Python Package Index),
which is the standard way of installing third party Python modules. To
install from PyPI, one simply uses the "pip" tool, or other standard Python
package installers. But for ours, the binding code is generated by Swig
from the library code and we don't then subsequently distribute the module
separately. Doing so would not be terribly difficult, but it is not a route
we have taken previously.

Of course, installing Python modules that include C bindings necessitates
having the Python.h file available as well as a compatible version of a C
compiler. For the official Python distributions, this is always and only
MSVC - or at least it has been in the past. Officially Python has not
historically even supported building for Windows with gcc. It's enough of a
bugbear that I've never even bothered with installing modules with Python
on Windows.

Nowadays, though, we don't really need to. Anyone who wants to can install
Ubuntu under the WSL and just take advantage of the existing apt package
and Python in there. As for macOS, I haven't a good solution there. I only
use it as demanded for work. Probably best to just let people who want it
compile it from source there, and let them know there isn't any need for
the ICU add-ons.

An alternative is to go beyond a Python script and create a full utility in
C that does this same work. That would make distribution much easier to all
of the platforms. The reason I did not initially take that route is that
Python is so convenient for working with XML, whereas the library has no
such mechanism to readily parse it and query in the same way. Obviously it
can be done, as osis2mod is already doing that work. Its parsing code could
be repurposed to this effect.
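
To give a feel for why Python is convenient here, the sketch below pulls
verse IDs out of an OSIS fragment using only the standard library. It is
not the code from av11n.py. The sample document and the helper names
(verse_ids, max_verse_per_chapter) are made up, and it assumes plain
Book.Chapter.Verse IDs with no work prefix or sub-verse suffix.

```python
import xml.etree.ElementTree as ET

OSIS_NS = "http://www.bibletechnologies.net/2003/OSIS/namespace"

# Toy OSIS fragment: one book uses container verses, the other uses
# milestone verses (sID/eID pairs).
SAMPLE = f'''<osis xmlns="{OSIS_NS}">
  <osisText osisIDWork="Toy" osisRefWork="Bible">
    <div type="book" osisID="Josh">
      <chapter osisID="Josh.1">
        <verse osisID="Josh.1.1">...</verse>
        <verse osisID="Josh.1.2">...</verse>
      </chapter>
    </div>
    <div type="book" osisID="Judg">
      <chapter osisID="Judg.1">
        <verse osisID="Judg.1.1" sID="Judg.1.1"/>...<verse eID="Judg.1.1"/>
      </chapter>
    </div>
  </osisText>
</osis>'''


def verse_ids(xml_text):
    """Return every verse osisID in document order."""
    root = ET.fromstring(xml_text)
    ids = []
    for verse in root.iter(f"{{{OSIS_NS}}}verse"):
        osis_id = verse.get("osisID")
        if osis_id:  # end milestones carry only an eID, so they are skipped
            ids.extend(osis_id.split())  # one attribute may list several IDs
    return ids


def max_verse_per_chapter(ids):
    """Map (book, chapter) to the highest verse number seen."""
    maxima = {}
    for osis_id in ids:
        book, chapter, verse = osis_id.split(".")
        key = (book, int(chapter))
        maxima[key] = max(maxima.get(key, 0), int(verse))
    return maxima


if __name__ == "__main__":
    print(max_verse_per_chapter(verse_ids(SAMPLE)))
```

Doing the equivalent in C means either pulling in an XML library or reusing
a parser such as the one in osis2mod, which is the extra work described
above.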
>
> So there are pros and cons between them. I was freshly off of getting the
> bindings to compile when I wrote the first draft of av11n.py so I naturally
> went that direction. I also try to avoid writing parsers when I can
> leverage existing ones, as grammars can be notoriously complex to get
> correct. So that dictated my choices as much as did anything else, really!
>
>
> My computer science master’s degree was in compiler writing! It’s
> definitely not for the faint of heart!
>
Mine was in AI. Also not for the faint of heart, but much more approachable
as a consumer product.
>
>
> Another possible enhancement might be a CLI flag to limit the testing
> range to a particular book (or testament) at a time. I have heard people
> talk about having modules split up to one book per file or similar. If they
> could say, "Only check this file against Joshua" then it could keep down a
> significant amount of extra output. But again - I'm not really an intended
> user of it!
>
>
> Great idea. So David’s suggestion of a scope argument.
>
Basically, yes.
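
A rough sketch of how such a flag could work with argparse follows. The
"--scope" name, the positional "osis_file" argument, and the helper names
are all hypothetical; neither script has this option as far as this thread
says. It assumes the scope is given as one or more OSIS book names.

```python
import argparse


def build_parser():
    """Build a toy command line with a repeatable --scope option."""
    parser = argparse.ArgumentParser(description="Toy v11n checker")
    parser.add_argument("osis_file", help="OSIS XML file to check")
    parser.add_argument(
        "--scope",
        action="append",
        default=[],
        metavar="BOOK",
        help="OSIS book name to check; may be given more than once",
    )
    return parser


def in_scope(osis_id, scope):
    """True if the verse's book is in scope; an empty scope means everything."""
    return not scope or osis_id.split(".")[0] in scope


def filter_ids(ids, scope):
    """Keep only the verse IDs that fall inside the requested scope."""
    return [osis_id for osis_id in ids if in_scope(osis_id, scope)]


if __name__ == "__main__":
    args = build_parser().parse_args(["joshua.xml", "--scope", "Josh"])
    print(filter_ids(["Josh.1.1", "Judg.1.1"], args.scope))
```

Filtering the verse list before the comparison keeps the report limited to
the books the file actually contains.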
>
> And I’m not an intended user of it either. I’m just trying to get people
> to use something other than osis2mod to pick a versification. Looking at
> the Jira issues on osis2mod, in one issue a person listed their script that
> looped over the v11ns and called osis2mod with each. Yuck!
>
Yeah, using osis2mod in that way seems fraught with trouble. But in the
absence of knowing more about the internals of the library, I can see why
someone would take that approach.
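
By way of contrast, once the verse limits are in memory, every candidate
can be scored in a single pass with no external process. This is only a
toy illustration of the idea, not how either script actually scores a fit:
the table names "ToyA" and "ToyB", the book "Bk", the numbers, and the
helpers fits and rank are all invented. It only counts verses that have no
slot in a table and ignores verses the module is missing.

```python
# Invented tables: book name -> list of verse counts per chapter.
TABLES = {
    "ToyA": {"Bk": [6, 12]},
    "ToyB": {"Bk": [6, 13]},
}


def fits(ref, table):
    """True if (book, chapter, verse) has a slot in the given table."""
    book, chapter, verse = ref
    chapters = table.get(book, [])
    return 1 <= chapter <= len(chapters) and 1 <= verse <= chapters[chapter - 1]


def rank(refs, tables):
    """Return (name, misfit_count) pairs, best fit first, ties broken by name."""
    scores = {
        name: sum(not fits(ref, table) for ref in refs)
        for name, table in tables.items()
    }
    return sorted(scores.items(), key=lambda item: (item[1], item[0]))


if __name__ == "__main__":
    observed = [("Bk", 1, 6), ("Bk", 2, 13)]
    print(rank(observed, TABLES))
```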
--Greg