Suggestions for Sane OSIS Scripture Sharing

A draft document by Kahunapule Michael Johnson (13 July 2007)

Introduction

Accelerating and improving the Bible translation and publication process with computers requires, among other things, encoding the Scriptures in some sort of digital format. The history of computer-aided Bible translation and publication is filled with diverse solutions to this problem. Each of these ways of encoding the text of the Holy Bible has advantages and disadvantages, and they vary in which applications and languages they suit best. Unfortunately, the diversity of formats causes problems in sharing, reusing, archiving, and processing data, and it spawns an equally diverse set of programs that cannot share data with each other.

The Open Scripture Information Standard (OSIS) is one attempt to create a unified format for encoding Scriptures and Scripture-related information in an XML schema that can be used by many people in many applications with many different requirements. OSIS is very flexible. It can be used in more ways and with greater variations than “Standard Format Markers” (SFM) backslash codes. This is both a blessing and a curse. The blessing is that there is one XML Scripture encoding format that at least appears to do everything we could ever want it to, and more. The curse is that it can be used in so many incompatible ways that it is really no more a single format than the many different dialects of SFM were before USFM came along.

To truly reap the potential benefits of OSIS for Bible translation and publication applications, it is necessary to agree upon some additional guidelines concerning both the syntax and semantics of the subset of OSIS we use. Work has been started on defining an “OSIS Best Practice” standard for use in Bible translation. That work, so far, has focused on restricting the syntax of OSIS by stripping out less-useful tags. This is a good start. This document focuses more on the semantics of the tags within the OSIS Best Practice subset, the way they should be used, and the meanings associated with those uses. It is an attempt to answer the question “What is the best way to use OSIS for Bible translation and publication?”

This document assumes that you are already familiar with the OSIS schema and the OSIS Users Manual, so you may want to look at those first if you are not.

Goals and Definitions

It is very helpful in building a consensus around any proposed standard if we can agree where we are trying to go and what we are trying to accomplish.

Reusability – we want our Scripture files to be usable for multiple applications without significant manual reformatting, such as different printed editions (standard, pocket testament, large print, etc.), Bible study software, Internet publication, future revisions, related language adaptations, etc. This is the main reason that we use markup (like OSIS and USFM) that marks text with the function of a style (like “section heading” or “Name of Deity”) instead of presentation details that vary with different uses of the same Bible (like “10-point Helvetica bold, centered” or “small caps”).

Interchange – being able to pass Scripture translations between applications with no loss of essential information. Although this is not a concern for many users of OSIS, because their OSIS texts are both written and read by the same application, interchange is a requirement for any Scripture file format to be truly useful in the context of Bible translation and publication, at least insofar as OSIS is used to represent actual Scriptures. We expect an OSIS Scripture file written by one application to be readable by any other OSIS-compatible application without loss of any essential information, as long as the OSIS used is within the subset we care about. In discussing interchange considerations, I will refer to OSIS writers and OSIS readers. By those terms, I mean software applications that write or read OSIS-compliant files.

Archive – being able to pass Scripture translations between current applications and future applications decades from now with no loss of essential information and no significant manual processing. Commercial word processing and typesetting applications are notoriously bad in this respect, as file formats and applications that read (or used to read) them change frequently and sometimes become totally obsolete and unavailable. Therefore, we need a well-thought-out format that we believe will be readable far into the future, even if that means that we would have to write some simple software to read it.

Implementability – being able to build software that reads and writes OSIS files in a reasonable amount of time with minimal ambiguity and no errors that affect data integrity. This is the driving requirement behind defining a sensible subset rather than trying to implement the entire OSIS standard, especially in OSIS readers.

Common Bible text style support – being able to mark up all of the types of text normally found in Bible translations for minority languages, without spending significant resources supporting text styles not normally used in our work.

Flexibility – being able to accommodate exceptions and unusual situations in ways that minimize problems with existing applications that don't know about the specific exception being used.

Note that none of the goals stated above involve direct human editing of the OSIS text. OSIS is simply not suitable for that sort of thing. OSIS is also not well-suited for many computer applications without some sort of transformation from OSIS to another application-specific format and back. That is OK. OSIS is designed to be written and read by computer programs that are used for a wide variety of purposes. Although OSIS is suitable for encoding many kinds of documents, we will limit our discussion here to the use of OSIS for representing minority-language Bible translations.

General Recommendations

These general recommendations are really the core of a subset of OSIS that should be optimal for representing Bibles and Bible portions in the context of minority language (and some majority language) Bible translation and publication. A few of these recommendations are discussed in greater detail in sections of their own, later.

  1. All new or revised OSIS documents must validate against the schema at http://www.bibletechnologies.net/osisCore.2.1.1.xsd

  2. All new or revised OSIS documents should comply with the recommendations and instructions given in the OSIS Users Manual (http://bibletechnologies.net/utilities/fmtdocview.cfm?id=28871A67-D5F5-4381-B22EC4947601628B&method=title) to the maximum extent practical.

  3. OSIS writers should use verse markup as containers of verse contents instead of as milestones, making the Book/Chapter/Verse hierarchy primary in the XML of OSIS. OSIS readers should accept verse containers and verse milestones as equally valid. (See the OSIS Users Manual for an explanation of these options.) This, of course, implies that paragraphs that start or stop within a verse, instead of between verses, must be marked up in milestone form.

  4. OSIS writers should not allow any text attributes to cross verse boundaries. Stop and restart such attributes as necessary with each verse. For example, if you choose to mark the words of Jesus Christ, each verse of the Sermon on the Mount would have its own <q who='Jesus' marker=''> ... </q> markup instead of just one at the beginning and one at the end of the sermon. OSIS readers must read OSIS so marked, and should ideally also be able to handle such attributes crossing verse boundaries.

  5. Be consistent in your use of markup. For example, if you choose to mark quotations of Jesus to facilitate the possible production of a red letter edition, then mark them all accurately. If you don't mark them at all, anticipating no desire for such an edition, that is OK, too, but don't just mark some and not others. (Marking Jesus' words does not require that they be displayed or printed any differently. It just gives the publisher the option of using a different color ink in those quotations or not.)

  6. Use all of the tags you need to properly represent your text, and no more.

  7. Use only the set of OSIS tags that correspond to USFM tags, unless you have an excellent reason to use an additional tag.

  8. Specify all quotation punctuation explicitly in the text of the Bible or in marker attributes of q elements. Specify an empty marker attribute in any q element if you don't have any quotation punctuation that should be inserted with that q element.

  9. Keep the XML tree depth and OSIS tag vocabulary as simple as practical for the given text.

  10. OSIS writers should keep sID/eID strings under 32 characters in length. OSIS readers should not care what they actually are, as long as corresponding markers are identical and non-corresponding markers are not.

  11. Use the canonical="true" attribute on Psalm titles.

Why Use Verse Containers?

The short answer as to why we should use OSIS verse elements as containers instead of milestone pairs is that this maximizes compatibility with applications that use completed OSIS-encoded Bible translations. This approach also maps reasonably closely to the way Paratext and Translation Editor represent Scripture text styles internally.

The OSIS standard deals with the fact that Bible texts have multiple overlapping hierarchies that don't nest nicely. Quotations, paragraphs, stanzas, poetry verses, and Bible verses can and do overlap in ways that don't conform to the strict tree-structured view of the world enforced by XML. OSIS solves this problem by allowing any elements that could overlap to be used either as containers or as pairs of milestones. Milestone pairs are implemented as pairs of empty XML elements with an sID attribute on the first one and an eID attribute on the last one. The two milestones are unambiguously matched by making the text of those attributes match each other and no other sID or eID in the document. There are many different ways of generating unique matching identifier strings for sID and eID. My favorite way is to just use the OSIS ID of the starting verse in the set, concatenated with a dot and a count of identifiers started in that verse. All a reader should check, however, is that the contents of the sID and its corresponding eID match. In most cases, the contents of the sID and eID attributes are entirely redundant, but OSIS writers must generate them properly in case an OSIS reader relies on them.

Naturally, having two different ways of doing things, one of which is very XML-like (containers) and one of which is not (milestone pairs), complicates processing of the text, especially with XML tools like XSLT. Being consistent in the way the verses are written allows some simplification of processing of OSIS texts compared to the complexity of handling every case that is allowed by the wider OSIS specification.
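The identifier scheme described above (starting verse's OSIS ID, a dot, and a per-verse count) and the reader-side matching check can be sketched in Python as follows. This is only an illustration under stated assumptions: the names MilestoneIds and check_milestones are hypothetical, the input is assumed to be a well-formed fragment without an XML namespace, and the 32-character limit is taken from the general recommendations above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


class MilestoneIds:
    """Hypothetical helper: hands out IDs of the form '<osisID>.<count>'."""

    def __init__(self):
        self._counts = defaultdict(int)

    def new_id(self, osis_id):
        # Count how many identifiers have been started in this verse so far.
        self._counts[osis_id] += 1
        return f"{osis_id}.{self._counts[osis_id]}"


def check_milestones(root, max_len=32):
    """Return a list of problems found in the sID/eID pairs under root."""
    problems = []
    seen = set()      # every sID encountered so far
    open_ids = set()  # sIDs still waiting for their eID
    for elem in root.iter():  # iter() walks elements in document order
        sid = elem.get("sID")
        eid = elem.get("eID")
        if sid is not None:
            if sid in seen:
                problems.append(f"duplicate sID: {sid}")
            if len(sid) >= max_len:
                problems.append(f"sID not under {max_len} characters: {sid}")
            seen.add(sid)
            open_ids.add(sid)
        if eid is not None:
            if eid in open_ids:
                open_ids.discard(eid)
            else:
                problems.append(f"eID without an open sID: {eid}")
    problems.extend(f"sID never closed: {s}" for s in sorted(open_ids))
    return problems


ids = MilestoneIds()
print(ids.new_id("Matt.5.21"), ids.new_id("Matt.5.21"), ids.new_id("Matt.5.22"))
```

Note that the check only compares identifier strings, as a reader should. It does not care how the writer generated them.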

Although we prefer to think of Scriptures in terms of syntactic units like paragraphs and sentences, the majority of computer-based applications for Scriptures address Scriptures as a database of verses. From that database, it should be possible to reconstruct the text, with all significant attributes and styles, without reference to prior or later verses. This is easier to do when (1) verse elements are used as containers, and (2) any significant text styles that cross verse boundaries are stopped and restarted at verse boundaries, as is the current practice with Paratext and USFM. Note that encoding with a verse-priority scheme in no way constrains how any application presents Scripture translations for viewing or editing, so this is not a problem for any of our uses. If it were necessary for some reason, verse-priority OSIS could be re-encoded as paragraph-priority OSIS using XSLT or another process. It is more likely, however, that OSIS texts would be imported to and exported from some other internal representation used in an application, such as the database format of Translation Editor.
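To illustrate the "database of verses" view, here is a minimal Python sketch that splits a fragment encoded with verse containers into self-contained records keyed by osisID. The function name verse_records is hypothetical, and the sketch assumes a well-formed fragment with no XML namespace. Verses in milestone form are simply skipped.

```python
import xml.etree.ElementTree as ET


def verse_records(osis_fragment):
    """Map each verse container's osisID to its own serialized markup."""
    root = ET.fromstring(osis_fragment)
    records = {}
    for verse in root.iter("verse"):
        osis_id = verse.get("osisID")
        if osis_id is None:
            continue  # milestone-form verses (sID/eID) are not handled here
        # Serialize the verse without the text that follows its end tag.
        tail, verse.tail = verse.tail, None
        records[osis_id] = ET.tostring(verse, encoding="unicode")
        verse.tail = tail
    return records


fragment = (
    '<p><verse osisID="Matt.5.21"><q who="Jesus" marker="">'
    "You have heard.</q></verse>\n"
    '<verse osisID="Matt.5.22"><q who="Jesus" marker="">'
    "But I tell you.</q></verse></p>"
)
for osis_id, markup in verse_records(fragment).items():
    print(osis_id, markup)
```

Because the who="Jesus" markup is restarted in each verse, every record carries its own attributes and can be rendered without consulting its neighbors.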

Quotation Quandary

There are at least three ways of using quotation markup associated with punctuation within OSIS-encoded Scriptures:

  1. Using OSIS markup with a language-specific process and style sheet to generate quotation punctuation from the markup. On input, quotations are marked only at their beginning and end with q elements. The process removes all existing marker attributes and milestones of type “cQuote” and regenerates them: each q element gets a marker attribute showing what the actual punctuation should be, and new milestone elements of type “cQuote”, with marker attributes containing the actual punctuation, are generated for open quote reminders at the beginnings of paragraphs or stanzas, where appropriate. Only the translators should work with text in the input form. OSIS files marked this way must not be released to anyone else without first running this process, so that the marker attributes all accurately indicate what quotation punctuation should be included, or without converting to another method of handling quotations. This use is not compatible with the use of the q element for marking quotations just for alternate rendition of text style (like Old Testament quotes in the New Testament or Words of Jesus for red letter editions) unless the process specifically recognizes those additional uses of the q element and skips them. If quotation punctuation is not appropriate for your translation and q elements are used, then you must include an empty marker attribute with each q element. The primary advantage of this process is that quotation nesting rules can be handled by the computer, and changes propagated automatically when quotation markup is inserted or deleted, or when paragraph divisions are inserted or moved. The end result of the process is markup that reflects preferred practice for new projects.

  2. Converting existing projects from SFM or other formats, where <<, >>, <, and > are used to represent double and single typographic quotation marks. If the angle bracket markup is used consistently, it is possible to generate the same kind of markup as above from the input. This is the preferred approach in this case.

  3. Converting existing projects from USFM or other formats, where quotation punctuation appropriate to the language is already in the text of the file. Here, there may be ambiguity in the meaning of the right single typographic quote/apostrophe/glottal stop or possibly other punctuation. When that is so, it is usually better to leave the quotation punctuation in the text of the document and not use q markup. That way, automated conversions are still possible, and quotation punctuation, if any, should be rendered correctly by any OSIS reader.
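The second option above (legacy text that uses <<, >>, <, and > consistently) can be sketched in Python as follows. This is a rough illustration, not a complete converter: the function name convert_verses is hypothetical, the English-style marks in OPEN and CLOSE are assumptions that a real project would replace with its own punctuation, cQuote reminders are not generated, and an ambiguous run such as >>> is resolved greedily as >> followed by >.

```python
import re
from xml.sax.saxutils import escape

TOKENS = re.compile(r"<<|>>|<|>")
OPEN = {"<<": "“", "<": "‘"}    # assumed marks; language-specific in practice
CLOSE = {">>": "”", ">": "’"}


def convert_verses(verses):
    """Turn (osisID, legacy text) pairs into (osisID, OSIS markup) pairs.

    Quotations become q milestone pairs, so they may cross verse boundaries.
    """
    stack = []    # sIDs of quotations that are still open
    counts = {}   # identifiers started in each verse
    out = []
    for osis_id, text in verses:
        parts = []
        pos = 0
        for match in TOKENS.finditer(text):
            parts.append(escape(text[pos:match.start()]))
            token = match.group()
            if token in OPEN:
                counts[osis_id] = counts.get(osis_id, 0) + 1
                qid = f"{osis_id}.{counts[osis_id]}"
                stack.append(qid)
                parts.append(f'<q sID="{qid}" marker="{OPEN[token]}"/>')
            else:
                if not stack:
                    raise ValueError(f"unmatched closing quote in {osis_id}")
                qid = stack.pop()
                parts.append(f'<q eID="{qid}" marker="{CLOSE[token]}"/>')
            pos = match.end()
        parts.append(escape(text[pos:]))
        out.append((osis_id, "".join(parts)))
    if stack:
        raise ValueError(f"quotation never closed: {stack[-1]}")
    return out


legacy = [
    ("Matt.5.21", "<<You have heard, <You shall not murder;> and more."),
    ("Matt.5.22", "But I tell you.>>"),
]
for osis_id, markup in convert_verses(legacy):
    print(osis_id, markup)
```

The sketch raises an error rather than guessing when the angle brackets do not balance, since inconsistent input is exactly the case where the third option is the safer choice.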

The q element of OSIS, along with its marker attribute, is used to mark the beginning and end of quotations and to give the appropriate punctuation for that particular translation's language and style. In addition, quote continuation reminders at the beginnings of paragraphs are encoded with the milestone element with a type attribute of cQuote. For example, a paragraph that begins in the middle of a one-level-deep quote would begin with the empty XML element <milestone type="cQuote" marker="“"/>. Note that in every case, quotation punctuation may be in the marker attribute of q or milestone elements, or in the text of the verses, but never both. In addition, only one of those options should be used in any one OSIS document: quotation punctuation must either all be in marker attributes of q elements or cQuote milestones, or all in the text, like any other punctuation. The former is “more pure” OSIS. The latter is much more practical with respect to automated conversions from existing texts in commonly used Scripture file formats.

There is a totally separate, but related, use of the q element in OSIS. That is to mark Words of Jesus for optional rendering in red (or another font attribute) or use as a search attribute. Although this is not as common in minority languages, there are some projects where the people want such features in their own language translation, because that is what a Bible looks like in the national language. There are many ways to handle this in OSIS. The simplest way is to just add who="Jesus" to existing q markup, but this is inconsistent with the desire to be able to select any verse or verse range and render it properly without having to consult prior verses. Therefore, the recommended way to handle this is to use additional q elements in container form (not milestone pairs) that don't cross verse boundaries or each other, and that have empty marker attributes and the appropriate who attribute. In some cases, the two uses of q markup can be combined simply by putting the actual punctuation in the marker attribute, but if the quote of Jesus crosses at least one verse boundary, separation of these functions is recommended.

An analogous case is the marking of Old Testament quotations in the New Testament. Some translations render these differently. For example, the NASB uses small caps, and the NKJV uses slanted text for these. This markup could also be used as a search parameter in Bible study software. In USFM, these quotations are usually marked with \qt ... \qt*. In OSIS, there are several ways, but the recommended way for consistency is to use the seg element with a type attribute of otPassage.

In the most complex case, where quotation marks are put in marker attributes and both Old Testament quotes in the New Testament and Words of Jesus are marked, there would be a q milestone pair at the beginning and end of the Sermon on the Mount, plus q container markup in each verse. For example, Matthew 5:21-22 (preceded by verse 20, which closes the previous paragraph) might be encoded like this:

<verse osisID="Matt.5.20">For I tell you that unless your righteousness exceeds that of the scribes and Pharisees, there is no way you will enter into the Kingdom of Heaven.</verse></p>
<p><verse osisID="Matt.5.21"><q who="Jesus" marker=""><milestone type="cQuote" marker="“"/>You have heard that it was said to the ancient ones, <q sID="Matt.5.21.1" who="OT" marker="‘"/><seg type="otPassage">You shall not murder;</seg><q eID="Matt.5.21.1" marker="’"/><note type="crossReference">Exodus 20:13</note> and <q sID="Matt.5.21.2" marker="‘"/>Whoever shall murder shall be in danger of the judgment.<q eID="Matt.5.21.2" marker="’"/></q></verse>
<verse osisID="Matt.5.22"><q who="Jesus" marker="">But I tell you that everyone who is angry with his brother without a cause<note type="translation">NU omits “without a cause”.</note> shall be in danger of the judgment; and whoever shall say to his brother, <q sID="Matt.5.22.1" marker="‘"/>Raca<note type="translation">“Raca” is an Aramaic insult, related to the word for “empty” and conveying the idea of empty-headedness.</note>!<q eID="Matt.5.22.1" marker="’"/> shall be in danger of the council; and whoever shall say, <q sID="Matt.5.22.2" marker="‘"/>You fool!<q eID="Matt.5.22.2" marker="’"/> shall be in danger of the fire of Gehenna.<note type="translation">or, Hell</note></q></verse></p>

In the above example, the q elements for Jesus' Words have empty marker attributes, because the actual quotation starts before these verses and ends after them. The quotations that carry actual punctuation are written as milestone pairs, because an XML end tag cannot carry attributes, so a container could not record a closing mark that differs from the opening one. Also note that the container form of paragraph markup could be used here, but if a paragraph boundary were within a verse, it would force the paragraph markup into the milestone pair format. It is acceptable, when always using containers for verse markers, to always use milestone pair markup for paragraphs.
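As a rough sketch of the reader side, the following Python function turns marker attributes back into visible punctuation for one verse. It rests on several assumptions: the name render_verse is hypothetical, the fragment has no XML namespace, notes are simply dropped, and the marker of a q or milestone element is emitted at the position where that element starts. A q container with a non-empty marker would therefore get its punctuation at the opening only.

```python
import xml.etree.ElementTree as ET


def render_verse(verse, skip=("note",)):
    """Return the verse text with marker attributes rendered as punctuation."""
    parts = []

    def walk(elem):
        if elem.tag in skip:
            return  # drop the note body; the caller still appends its tail
        if elem.tag in ("q", "milestone"):
            parts.append(elem.get("marker", ""))
        parts.append(elem.text or "")
        for child in elem:
            walk(child)
            parts.append(child.tail or "")

    walk(verse)
    return "".join(parts)


fragment = (
    '<verse osisID="Matt.5.22"><q who="Jesus" marker="">whoever shall say, '
    '<q sID="Matt.5.22.1" marker="‘"/>You fool!'
    '<q eID="Matt.5.22.1" marker="’"/> shall be in danger'
    '<note type="translation">or, Hell</note>.</q></verse>'
)
print(render_verse(ET.fromstring(fragment)))
```

The empty marker on the who="Jesus" container contributes nothing to the output, which is why that markup can be restarted in every verse without disturbing the punctuation.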

In a simpler case, where no red letter editions are anticipated, Old Testament quotes in the New Testament will not be typeset any differently, and the OSIS text is converted from a legacy text where quotation punctuation is in the text without any markup indicating what is what, it is acceptable to encode the same passage like this:

<verse osisID="Matt.5.20">For I tell you that unless your righteousness exceeds that of the scribes and Pharisees, there is no way you will enter into the Kingdom of Heaven.</verse></p>
<p><verse osisID="Matt.5.21">“You have heard that it was said to the ancient ones, ‘You shall not murder;’<note type="crossReference">Exodus 20:13</note> and ‘Whoever shall murder shall be in danger of the judgment.’</verse>
<verse osisID="Matt.5.22">But I tell you that everyone who is angry with his brother without a cause<note type="translation">NU omits “without a cause”.</note> shall be in danger of the judgment; and whoever shall say to his brother, ‘Raca<note type="translation">“Raca” is an Aramaic insult, related to the word for “empty” and conveying the idea of empty-headedness.</note>!’ shall be in danger of the council; and whoever shall say, ‘You fool!’ shall be in danger of the fire of Gehenna.<note type="translation">or, Hell</note></verse></p>

USFM/OSIS Tag Correspondence

OSIS and USFM have significant differences in design, syntax, and semantics, so there isn't a simple one-to-one correspondence for all tags. There is, however, enough of a correspondence that USFM (and other dialects of SFM) can be automatically converted to OSIS without significant manual intervention, at least for the main body of a Scripture portion. Some information from USFM texts is used in many places. For example, the value of the chapter from a USFM \c marker gets repeated in the OSIS ID of every verse in the chapter, and may be used in a scope element or attribute. Therefore, the list of corresponding markups is just intended to point you to the correct place in the OSIS Users Manual and OSIS schema, and not to serve as a simple search and replace table. It takes more processing than that to convert, as much of what is implicit or derived information in USFM is made explicit in OSIS. There is also a strictly-nested XML structure to OSIS that is not in USFM (except by inference). Therefore, this list is really more of a list of clues as to which of many alternate encoding options within OSIS are recommended for each USFM markup.

Appendix F of the OSIS Users Manual contains a list of corresponding OSIS markup for most USFM markers. Here are a few that could use some clarification:

USFM  OSIS                            Comment

wj    q[@who="Jesus" and @marker=""]  Words of Jesus Christ. See the discussion above.

c     chapter                         Note that the chapter value must be remembered for use in osisID values within the chapter.

k     seg[@type="keyword"]            The USFM \k for keyword can be a paragraph style or a character style, depending on context.

nb    (none)                          This marker is not necessary in OSIS. It is enough to mark the actual paragraph boundaries with p, and it is OK if paragraphs cross chapter boundaries.
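The comment on \c above can be illustrated with a small Python sketch that remembers the current book and chapter while reading USFM lines and writes them into the osisID of each verse container. This is a deliberately narrow illustration: the function name usfm_verses is hypothetical, the BOOKS table covers only three books, each verse is assumed to sit on a single \v line, and every other marker is ignored.

```python
from xml.sax.saxutils import escape

# Partial, illustrative mapping from USFM book codes to OSIS book names.
BOOKS = {"GEN": "Gen", "MAT": "Matt", "MRK": "Mark"}


def usfm_verses(usfm):
    """Return OSIS verse containers for the \\v lines of a USFM text."""
    book = chapter = None
    out = []
    for line in usfm.splitlines():
        line = line.strip()
        if line.startswith("\\id "):
            book = BOOKS[line.split()[1]]
        elif line.startswith("\\c "):
            chapter = line.split()[1]  # remembered for every following verse
        elif line.startswith("\\v "):
            _, number, text = line.split(" ", 2)
            osis_id = f"{book}.{chapter}.{number}"
            out.append(f'<verse osisID="{osis_id}">{escape(text)}</verse>')
    return out


sample = "\\id MAT\n\\c 5\n\\p\n\\v 21 You have heard\n\\v 22 But I tell you\n\\c 6\n\\v 1 Be careful"
for verse in usfm_verses(sample):
    print(verse)
```

Even this toy conversion shows why the correspondence table is not a search and replace list: the chapter number appears once in USFM but is made explicit in every verse of the OSIS output.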



References

OSIS 2.1.1 Schema: http://www.bibletechnologies.net/osisCore.2.1.1.xsd

OSIS Users Manual: http://bibletechnologies.net/utilities/fmtdocview.cfm?id=28871A67-D5F5-4381-B22EC4947601628B&method=title

OSIS web site: http://bibletechnologies.net/

Unified Standard Format Markers (USFM): http://ubs-icap.org/usfm