[sword-devel] Proposal for a new SWORD filter to display word dividers

Michael H cmahte at gmail.com
Thu May 1 11:53:08 EDT 2025


I've been involved with translation projects (Thai) that let the
translators do their thing, with the idea of introducing encoding afterward.
The tendency with these languages is that if you leave it to the translator,
there will be no mark between words, justlettersalltogetherinsequence,
which is normal in their world.  However, Bible markup has verse units and
word units, and to get to the word unit, you need SOMETHING.

What we used to introduce them after the fact was called (kukut, quecut?),
which inserted a hair space (because the typesetting program or Paratext
abused other glyphs, but this was 2014) wherever a dictionary/grammar
algorithm suggested there should be a word break.
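
To make the dictionary approach concrete, here is a minimal sketch in
Python.  It is NOT the tool we used; it is a generic greedy longest-match
segmenter, and the names segment, lexicon and sep are hypothetical.  It
assumes the divider is U+200A HAIR SPACE and that the lexicon is a plain
set of known words.

```python
HAIR_SPACE = "\u200a"  # assumed divider; any other code point can be passed as sep


def segment(text, lexicon, sep=HAIR_SPACE):
    """Greedy longest-match segmentation; unknown characters pass through singly."""
    max_len = max(map(len, lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in lexicon or size == 1:
                words.append(text[i:i + size])
                i += size
                break
    return sep.join(words)
```

A segmenter like this is only as good as its lexicon, and greedy matching
picks the longest candidate even when a shorter one is the right reading.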

Bottom line is that a human saying where the words break is FAR better than
a computer.  You'll finish a project like this sooner by addressing it as
early as possible and marking the word breaks in the early stages for
proper markup than by leaving it to a computer with a dictionary (and we're
talking about minority languages where that dictionary is not even at an
alpha level).

However, when it came time to render the files into readable text on screen
and paper, we ultimately reverted to tagging them with word tags and
removing the fake zero width space, because regardless of the Unicode code
point we used, it turned into an actual character in the text stream.  That
doubled the stretch or squeeze that publishing programs impose to make
lines justified, visually introducing asymmetry into the text and causing
complaints.  More specifically, on lines that were tight (more letters than
average), the word breaks were smaller than normal, and people complained.
When the lines were loose (fewer letters than average), the words had
visible space between them, but I don't recall anyone complaining.  The
complaints were universally about words being too close.

So, from experience, using word tags is a lot more resilient across all
methods of using the text later, whether it's a sword module, or paper, or
epub or on a projector screen in a church.
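
As a rough illustration of that conversion (a sketch under stated
assumptions, not our actual toolchain): the Python below assumes the
divider is U+200B ZERO WIDTH SPACE, wraps each run between dividers in a
bare <w> element, and then renders the tagged text with or without a
visible divider such as U+00B7 MIDDLE DOT.  The function names tag_words
and render are hypothetical.

```python
import re
from xml.sax.saxutils import escape, unescape

ZWSP = "\u200b"  # assumed divider; the source text may use another code point


def tag_words(text, divider=ZWSP):
    """Wrap each divider-delimited run in <w>...</w> and drop the divider."""
    out = []
    # Split on ordinary whitespace first so real spaces survive untouched.
    for chunk in re.split(r"(\s+)", text):
        if not chunk or chunk.isspace():
            out.append(chunk)
            continue
        words = [w for w in chunk.split(divider) if w]
        out.append("".join(f"<w>{escape(w)}</w>" for w in words))
    return "".join(out)


def render(tagged, visible_divider=""):
    """Strip the <w> tags, optionally showing a divider between adjacent words."""
    text = tagged.replace("</w><w>", visible_divider)
    return unescape(re.sub(r"</?w>", "", text))
```

Because the divider lives in the markup rather than in the text stream,
the same tagged source can be rendered with render(tagged) for print and
render(tagged, "\u00b7") for checking, and nothing extra is left for a
justification engine to stretch.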

And kukut also introduced Unicode code points that marked where words could
break with a hyphen.  That got translated into a hyphenation database
similar to modern hunspell.  By the time we finished the files, they had no
embedded hyphenation points, for the same reason (the points would get
stretched and squeezed, causing reader confusion).
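
A minimal sketch of that kind of extraction, in Python.  I am assuming
here that the break marker is U+00AD SOFT HYPHEN, which may not be what
the tool actually emitted, and the function name harvest_hyphenation and
the "hy-phen-ate" table format are hypothetical, not a real hunspell file
format.

```python
SHY = "\u00ad"  # assumption: SOFT HYPHEN marks the permitted break points


def harvest_hyphenation(words, marker=SHY):
    """Return (clean_words, table); table maps each word to its break pattern."""
    table = {}
    clean_words = []
    for word in words:
        plain = word.replace(marker, "")
        clean_words.append(plain)
        if marker in word:
            # Store break points in the familiar "hy-phen-ate" notation.
            table[plain] = word.replace(marker, "-")
    return clean_words, table
```

The cleaned words go back into the text, and the table is kept alongside
it as a separate resource for whichever renderer needs hyphenation.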

On Thu, May 1, 2025 at 8:22 AM Peter von Kaehne <refdoc at gmx.net> wrote:

> David, that is misreading what I said.
>
> If we want to create a new feature then it is a module maker's
> responsibility to create the markup.
>
> The markup which lends itself to toggling display is proper XML markup.
>
> If there are modules which currently use ZWNJ or something else, then
> this is fine and good, but in that form they cannot and should not get
> the benefit of such a feature. They would require updating.
>
> Sent from Outlook for iOS <https://aka.ms/o0ukef>
> ------------------------------
> *From:* David F. Haslam <df.haslam at btinternet.com>
> *Sent:* Thursday, May 1, 2025 1:40 pm
> *To:* Peter von Kaehne <refdoc at gmx.net>; SWORD Developers' Collaboration
> Forum <sword-devel at crosswire.org>
> *Subject:* Re: [sword-devel] Proposal for a new SWORD filter to display
> word dividers
>
> But we can help them towards that goal by making module development less
> onerous.
> Then perhaps they might use our derived module to help them check their
> translation at each stage
> without them having to keep asking us to rebuild the module for them with
> all the demanding file format transformations such a task entails.
> There's nothing that forbids us to accept a module containing ZWSP
> characters *per se*.
>
> And, btw, existing CrossWire module *KhmerNT* contains 223,198 ZWSP.
> Since it was released on 2012-02-15 nobody has batted an eyelid that it
> used this means to mark lexical word boundaries.
> Not you, not me, not anyone in the core development team.
> So it's a bit rich to say over 13 years later that "it is our job to ....
> and apply it".
>
> Our ministry as a Society should include actively assisting translators,
> not merely distributing their finished product.
>
> Aside: We've not heard from any other team members yet.
>
> David
>
> On 2025-05-01 13:07, Peter von Kaehne wrote:
>
> I would not expect any Bible translator to do anything.
>
> If they tell us they used whatever to mark up whatever then it is our job
> as module team to take whatever and find the appropriate semantic markup
> and apply it.
>
> This is not different.
>
> Peter
>
> ------------------------------
> *From:* sword-devel <sword-devel-bounces at crosswire.org> on behalf of
> David Haslam <dfhdfh at protonmail.com>
> *Sent:* Thursday, May 1, 2025 12:59 pm
> *To:* SWORD Developers' Collaboration Forum <sword-devel at crosswire.org>
> *Cc:* David Haslam <df.haslam at btinternet.com>
> *Subject:* Re: [sword-devel] Proposal for a new SWORD filter to display
> word dividers
>
> Hi Peter,
>
> Undoubtedly, but we cannot demand or expect most Bible translators to be
> XML aficionados.
>
> It's even difficult to teach some members of a translation team to use the
> ZWSP properly.
>
> "If you cannot see it, key it again" can so easily become the *modus
> operandi*.
> Witness the following in the same chapter prior to my involvement.
> After I replaced all ZWSP by MIDDLE DOT, just look at the tangle!!!
> *See attached text file*.
>
> So we should do "belt and braces" to help the weak.
> Also called "going the extra mile". 😎
>
> But worry not. My feedback is already helping the Khmer translation team.
>
> Best regards,
>
> David
>
> Sent with Proton Mail <https://pr.tn/ref/SWXT9A5YZ67G> secure email.
>
> On Thursday, May 1st, 2025 at 12:47 PM, Peter von Kaehne <refdoc at gmx.net>
> wrote:
>
> I think this is not difficult per se, but it should be properly encoded.
>
> <w> seems correct; using zero-width characters seems not correct.
>
> Peter
>
> ------------------------------
> *From:* sword-devel <sword-devel-bounces at crosswire.org> on behalf of
> David Haslam <dfhdfh at protonmail.com>
> *Sent:* Thursday, May 1, 2025 11:30 am
> *To:* sword-devel mailing list <sword-devel at crosswire.org>
> *Cc:* David Haslam <df.haslam at btinternet.com>
> *Subject:* [sword-devel] Proposal for a new SWORD filter to display word
> dividers
>
> I wish to propose that we design in a new SWORD filter.
>
> The conf key would be:
>
>    - *GlobalOptionFilter=ShowWordDividers*
>
>
> In the writing systems for the various languages of SE Asia (*Thai*,
> *Khmer*, *Lao*, *Myanmar*) there is [generally] *no space between words*.
>
> In this respect, they are like many European languages before the start of silent
> reading
> <https://www.amazon.com/Space-Between-Words-Origins-Medieval/dp/080474016X>.
> The descriptive term is *Scriptura Continua*.
>
> Some Bible translations for this region are already making use of one of
> the ZERO WIDTH characters to invisibly mark the divisions between lexical
> words.
> Options include:
>
>    - U+200B ZERO WIDTH SPACE
>    - U+200C ZERO WIDTH NON-JOINER
>    - U+FEFF ZERO WIDTH NO-BREAK SPACE
>
> They exclude:
>
>    - U+200D ZERO WIDTH JOINER
>
> A further possibility, even without requiring a full study Bible with
> Strong's, etc, is to simply wrap each lexical word within the OSIS *w*
>  element.
> One without any OSIS attributes would suffice for this purpose. Likewise,
> for the *seg* element.
>
> My proposal is that we design a feature to *show/hide word dividers* by
> displaying them using a suitable visible but non-intrusive character.
> My suggestion is to use this Unicode character by default:
>
>
>    - U+00B7 MIDDLE DOT
>
>
> We could even allow the actual visible character to be specified in a
> second conf key, thus:
>
>
>    - VisibleWordDivider=U+00B7
>
>
> Benefits would include:
>
>    - Helps with language learning to know where lexical words start and
>    end
>    - Helps with front-end search for whole words, exact phrase or all
>    words
>    - Helps with checking the accuracy of Bible translations by clearly
>    displaying lexical word boundaries at the touch of a single key in the
>    front-end
>    - Paves the way for Study Bible with the addition of Strong's mark-up,
>    etc.
>
>
> Here's a sample of Khmer verse text with the MIDDLE DOT as the visible
> word divider:
>
> *Obad.1.1*
> នេះ·ជា·សុបិន·និមិត្ដ·របស់·លោក·អូបាឌា
> ព្រះអម្ចាស់·ជា·ព្រះ·មាន·បន្ទូល·ពី·ក្រុង·អេដំម ។
> យើង·បាន·ឮ·ដំណឹង·មក·ពី·ព្រះអម្ចាស់ គឺ·មាន·ទូត·ម្នាក់·បាន·បញ្ជូន·ឲ្យ·ទៅ
> ក្នុង·ចំណោម·ជន·ជាតិ·ទាំង·ឡាយ·ដោយ·ពាក្យ·ថា "ចូរ·ក្រោក·ឡើង !
> ចូរ·យើង·ក្រោក·ឡើង·ធ្វើ·ចម្បាំង·ទាស់·និង·គេ"
>
>
> cf. Here's what it looks like with the ZWSP as the invisible word divider:
>
> *Obad.1.1*
> នេះ​ជា​សុបិន​និមិត្ដ​របស់​លោក​អូបាឌា
> ព្រះអម្ចាស់​ជា​ព្រះ​មាន​បន្ទូល​ពី​ក្រុង​អេដំម ។
> យើង​បាន​ឮ​ដំណឹង​មក​ពី​ព្រះអម្ចាស់ គឺ​មាន​ទូត​ម្នាក់​បាន​បញ្ជូន​ឲ្យ​ទៅ
> ក្នុង​ចំណោម​ជន​ជាតិ​ទាំង​ឡាយ​ដោយ​ពាក្យ​ថា "ចូរ​ក្រោក​ឡើង !
> ចូរ​យើង​ក្រោក​ឡើង​ធ្វើ​ចម្បាំង​ទាស់​និង​គេ"
>
>
> If SWORD developers agree that my proposal merits consideration, please
> would you start on the software development.
>
>
> Best regards,
>
> David
>
>
>
> _______________________________________________
> sword-devel mailing list: sword-devel at crosswire.org
> http://crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page