[sword-devel] Proposal for a new SWORD filter to display word dividers
David Haslam
dfhdfh at protonmail.com
Fri May 2 12:28:44 EDT 2025
So I said to myself "What a beautiful world!" "Walk before you run, David!"
My walking stage was to replace all the ZWSP by <milestone marker="" type="x-lexical-word-divider" subtype="x-ZWSP"/>
My running stage would've been to replace all the ZWSP by this [sans bullets/EOLs]: (where the first marker is a ZWSP)
- <seg type="x-variant" subType="x-1"><milestone marker="" type="x-lexical-word-divider" subtype="x-ZWSP"/></seg>
- <seg type="x-variant" subType="x-2"><milestone marker="·" type="x-lexical-word-divider" subtype="x-MDOT"/></seg>
This complex kludge was to emulate the proposed new filter by using GlobalOptionFilter=OSISVariants
I suppose the concept might be simplified as follows, but this would be less self-documenting:
- <seg type="x-variant" subType="x-1" marker=""/><seg type="x-variant" subType="x-2" marker="·"/>
This assumes that the marker attribute is valid for use in the seg element. Is that so?
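
For illustration, the "walking stage" substitution described above could be sketched as below. This is only a sketch in Python, not part of SWORD or of any existing module-building tool; the function names `milestone` and `zwsp_to_milestones` are hypothetical, and the input is assumed to be verse text that is already XML-escaped.

```python
from xml.sax.saxutils import quoteattr

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def milestone(marker: str, subtype: str) -> str:
    """Build one word-divider milestone (hypothetical helper).

    The type and subtype values are the ones used in this message.
    """
    return (
        f"<milestone marker={quoteattr(marker)} "
        f'type="x-lexical-word-divider" subtype="{subtype}"/>'
    )


def zwsp_to_milestones(verse_text: str) -> str:
    """Replace every ZWSP in the verse text with a milestone element."""
    return verse_text.replace(ZWSP, milestone(ZWSP, "x-ZWSP"))


if __name__ == "__main__":
    # Two placeholder "words" separated by a ZWSP.
    print(zwsp_to_milestones("word1\u200bword2"))
```

The same helper could build the MIDDLE DOT variant for the "running stage" by calling `milestone("\u00b7", "x-MDOT")`.
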
Best regards,
David
Sent with [Proton Mail](https://pr.tn/ref/SWXT9A5YZ67G) secure email.
On Friday, May 2nd, 2025 at 4:25 PM, Peter von Kaehne <refdoc at gmx.net> wrote:
> Then, if a module offers the option, striptext may need to introduce a space prior to indexing.
>
> Sent from [Outlook for iOS](https://aka.ms/o0ukef)
> ---------------------------------------------------------------
>
> From: sword-devel <sword-devel-bounces at crosswire.org> on behalf of David Haslam <dfhdfh at protonmail.com>
> Sent: Friday, May 2, 2025 4:21 pm
> To: sword-devel mailing list <sword-devel at crosswire.org>
> Cc: David Haslam <df.haslam at btinternet.com>
> Subject: Re: [sword-devel] Proposal for a new SWORD filter to display word dividers
>
> Thanks DM,
>
> Then we have a serious problem in SWORD that Peter’s initial feedback failed to foresee.
>
> Is anything to be done about this ?
>
> Aside: WWJD = What Would JSword Do ?
>
> David
>
> Sent from [Proton Mail](https://proton.me/mail/home) for iOS
>
> On Fri, May 2, 2025 at 15:10, DM Smith <dmsmith at crosswire.org> wrote:
>
>> Fast Search could benefit from this, but SWORD uses plain text as an input to Lucene index creation and searching. The implementation of the Lucene analyzer that SWORD uses for all texts uses ASCII space as a word break. The presence of various zero width characters would not help. Nor would using <w>…</w><w>…</w> without an ASCII space.
>>
>> Plain text, also called strip text, does double duty: presentation without markup and preparation for indexing. For latinate texts, this works fine. I don’t know if LocalStripFilter could help in this.
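
As a rough illustration of the point above, a strip step that prepares text for a space-delimited index could turn the invisible dividers into ASCII spaces. This is a Python sketch only, not SWORD's strip filter or its Lucene analyzer; `strip_for_indexing` is a hypothetical name, and treating all three zero-width characters as word dividers is an assumption that may not hold for every script.

```python
# Zero-width characters listed in the original proposal as possible dividers.
WORD_DIVIDERS = ("\u200b", "\u200c", "\ufeff")


def strip_for_indexing(plain_text: str) -> str:
    """Turn invisible word dividers into ASCII spaces (hypothetical helper).

    Runs of whitespace are then collapsed to a single ASCII space so that
    a tokenizer splitting on spaces sees one token per lexical word.
    """
    for divider in WORD_DIVIDERS:
        plain_text = plain_text.replace(divider, " ")
    return " ".join(plain_text.split())


if __name__ == "__main__":
    tokens = strip_for_indexing("word1\u200bword2 word3").split(" ")
    print(tokens)
```

The text shown to the reader would stay unchanged; only the copy handed to the indexer would gain the spaces.
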
>>
>> In Him,
>> DM
>>
>>> On May 1, 2025, at 7:46 AM, Peter von Kaehne <refdoc at gmx.net> wrote:
>>>
>>> I think this is not difficult per se, but it should be properly encoded.
>>>
>>> <w> seems correct, using zero width characters seems not correct.
>>>
>>> Peter
>>>
>>> Sent from [Outlook for iOS](https://aka.ms/o0ukef)
>>> ---------------------------------------------------------------
>>>
>>> From: sword-devel <sword-devel-bounces at crosswire.org> on behalf of David Haslam <dfhdfh at protonmail.com>
>>> Sent: Thursday, May 1, 2025 11:30 am
>>> To: sword-devel mailing list <sword-devel at crosswire.org>
>>> Cc: David Haslam <df.haslam at btinternet.com>
>>> Subject: [sword-devel] Proposal for a new SWORD filter to display word dividers
>>>
>>> I wish to propose that we design in a new SWORD filter.
>>>
>>> The conf key would be:
>>>
>>> - GlobalOptionFilter=ShowWordDividers
>>>
>>> In the writing systems for the various languages of SE Asia (Thai, Khmer, Lao, Myanmar) there is [generally] no space between words.
>>>
>>> In this respect, they are like many European languages before the start of [silent reading](https://www.amazon.com/Space-Between-Words-Origins-Medieval/dp/080474016X). The descriptive term is Scriptura Continua.
>>>
>>> Some Bible translations for this region are already making use of one of the ZERO WIDTH characters to invisibly mark the divisions between lexical words.
>>> Options include:
>>>
>>> - U+200B ZERO WIDTH SPACE
>>> - U+200C ZERO WIDTH NON-JOINER
>>> - U+FEFF ZERO WIDTH NO-BREAK SPACE
>>>
>>> They exclude:
>>>
>>> - U+200D ZERO WIDTH JOINER
>>>
>>> A further possibility, even without requiring a full study Bible with Strong's, etc, is to simply wrap each lexical word within the OSIS w element.
>>> One without any OSIS attributes would suffice for this purpose. Likewise, for the seg element.
>>>
>>> My proposal is that we design a feature to show/hide word dividers by displaying them using a suitable visible but non-intrusive character.
>>> My suggestion is to use this Unicode character by default:
>>>
>>> - U+00B7 MIDDLE DOT
>>>
>>> We could even allow the actual visible character to be specified in a second conf key, thus:
>>>
>>> - VisibleWordDivider=U+00B7
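
To make the two proposed conf keys concrete, the show/hide behaviour could be sketched as follows. This is a Python sketch, not SWORD code; `parse_divider` and `show_word_dividers` are hypothetical names, and the `U+XXXX` value syntax is taken from the example key above, with ZWSP assumed to be the divider in the module text.

```python
DEFAULT_DIVIDER = "\u00b7"  # U+00B7 MIDDLE DOT, the default suggested above
ZWSP = "\u200b"             # assumed divider character in the module text


def parse_divider(conf_value: str) -> str:
    """Convert a conf value such as 'U+00B7' into the character it names."""
    value = conf_value.strip()
    if not value.upper().startswith("U+"):
        raise ValueError(f"expected U+XXXX, got {conf_value!r}")
    return chr(int(value[2:], 16))


def show_word_dividers(text: str, enabled: bool,
                       divider: str = DEFAULT_DIVIDER) -> str:
    """Swap each ZWSP for the visible divider when the option is on."""
    return text.replace(ZWSP, divider) if enabled else text


if __name__ == "__main__":
    dot = parse_divider("U+00B7")
    print(show_word_dividers("word1\u200bword2", enabled=True, divider=dot))
```

With the option off, the text passes through untouched, so the default display would stay exactly as it is today.
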
>>>
>>> Benefits would include:
>>>
>>> - Helps with language learning to know where lexical words start and end
>>> - Helps with front-end search for whole words, exact phrase or all words
>>> - Helps with checking the accuracy of Bible translations by clearly displaying lexical word boundaries at the touch of a single key in the front-end
>>> - Paves the way for a Study Bible with the addition of Strong's mark-up, etc.
>>>
>>> Here's a sample of Khmer verse text with the MIDDLE DOT as the visible word divider:
>>>
>>>> Obad.1.1
>>>> នេះ·ជា·សុបិន·និមិត្ដ·របស់·លោក·អូបាឌា ព្រះអម្ចាស់·ជា·ព្រះ·មាន·បន្ទូល·ពី·ក្រុង·អេដំម ។ យើង·បាន·ឮ·ដំណឹង·មក·ពី·ព្រះអម្ចាស់ គឺ·មាន·ទូត·ម្នាក់·បាន·បញ្ជូន·ឲ្យ·ទៅ ក្នុង·ចំណោម·ជន·ជាតិ·ទាំង·ឡាយ·ដោយ·ពាក្យ·ថា "ចូរ·ក្រោក·ឡើង ! ចូរ·យើង·ក្រោក·ឡើង·ធ្វើ·ចម្បាំង·ទាស់·និង·គេ"
>>>
>>> cf. Here's what it looks like with the ZWSP as the invisible word divider:
>>>
>>>> Obad.1.1
>>>> នេះជាសុបិននិមិត្ដរបស់លោកអូបាឌា ព្រះអម្ចាស់ជាព្រះមានបន្ទូលពីក្រុងអេដំម ។ យើងបានឮដំណឹងមកពីព្រះអម្ចាស់ គឺមានទូតម្នាក់បានបញ្ជូនឲ្យទៅ ក្នុងចំណោមជនជាតិទាំងឡាយដោយពាក្យថា "ចូរក្រោកឡើង ! ចូរយើងក្រោកឡើងធ្វើចម្បាំងទាស់និងគេ"
>>>
>>> If SWORD developers agree that my proposal merits consideration, please would you start on the software development.
>>>
>>> Best regards,
>>>
>>> David
>>>
>>> Sent with [Proton Mail](https://pr.tn/ref/SWXT9A5YZ67G) secure email.
>>>
>>> _______________________________________________
>>> sword-devel mailing list: sword-devel at crosswire.org
>>> http://crosswire.org/mailman/listinfo/sword-devel
>>> Instructions to unsubscribe/change your settings at above page