Bible File Encoding for Bible Translators, Publishers, and Software Developers

By Kahunapule Michael Johnson

Welcome to the new Tower of Babel. Computers have done a great deal to assist us in the process of Bible translation, Bible publication (both in print and in electronic forms), and Bible translation revision. Unfortunately, there are many issues surrounding the ways that computers are used to process Scripture texts, most notably:

Rapid hardware technology change and obsolescence
Rapid software change and obsolescence
Practically unlimited ways of doing essentially the same things
Changes in standards and the tension between standards and flexibility
Different ways of legitimately looking at the same data

This document primarily addresses the way Scripture texts can be encoded as computer files to obtain the maximum benefit.

Character Encoding and Writing Systems
Scripture Markup Standards
Why not just use commercial software file formats?
Single source, many uses
Challenges of an XML Scripture Encoding Schema
Scripture Publication Formats
The Overlap Problem
The quotation problem
    The Importance of Lossless Encoding
    Reading Scripture Files and Rendering Quotations
    My Quotation Punctuation Bias
Scope definition and complexity control
Compatibility
Standardization
Conclusion

Character Encoding and Writing Systems

One of the most basic encoding decisions that makes it possible to process texts like the Holy Bible and even this document is to set up a correspondence between the characters we write with and numbers. Computers fundamentally deal with binary integers in their circuitry. From these binary integers, computer scientists and programmers have come up with ways of encoding decimal numbers, hexadecimal numbers, floating point numbers, and even alphabet characters and punctuation. Back when transistors were expensive and computers were a new thing, people encoded letters with as few as 5 binary digits (bits). This was seriously limited, and required using “shift” characters to change modes between upper case, lower case, and “figures”. A more common encoding came later, called the American Standard Code for Information Interchange (ASCII), using 7 bits per character. This supported 128 characters, including the entire English alphabet, with both upper and lower case characters, common punctuation, and control characters. Later, when more characters were needed, another bit was added to the characters, doubling the number supported to 256. This allowed the inclusion of certain characters common in the major European language alphabets, and some line drawing characters. This code space soon got crowded, and a concept of “code pages” was invented, where the exact mapping of code points (numbers) to characters varied by region, so that code pages could be optimized for a given language. This, of course, complicated computer font design. If there wasn’t a code page that worked for your situation already, you could invent one, along with custom fonts to match. As a result, just knowing the value of a code point is not enough to know what character it represents, unless you also know the code page definition and the font(s) used.
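The code page ambiguity can be seen in a few lines of Python. This is only a sketch: the three code pages used here (cp437, cp1252, and cp1251) are illustrative choices that happen to ship with Python's standard codecs, not code pages named anywhere in this article.

```python
# One byte value, three code pages, three different characters.
RAW = bytes([0xE9])

def decode_under(code_pages, raw=RAW):
    """Return {code page name: character} for the same raw byte."""
    return {name: raw.decode(name) for name in code_pages}

readings = decode_under(["cp437", "cp1252", "cp1251"])
for name, char in readings.items():
    print(f"byte 0xE9 read as {name}: {char}")

# Code space sizes mentioned above: 7 bits and 8 bits.
seven_bit, eight_bit = 2 ** 7, 2 ** 8
print(seven_bit, eight_bit)
```

Without the code page name stored alongside the data, a program has no way to tell which of the three readings the author intended.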

Currently, the best solution to the character encoding dilemma is to use Unicode. You can read more about Unicode and Bible translation here. Unicode by itself is a great help, but even better is the support for multiple writing systems that use Unicode, such as the Graphite project. Unicode is one ray of hope to escape Tower-of-Babel confusion in Scripture file handling. The other is Scripture markup standards.
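As a small illustration of why Unicode helps, the Python sketch below looks up three characters from three different scripts. Each one has its own code point and name, so no code page needs to be known to interpret them. The sample characters are arbitrary choices for the example.

```python
import unicodedata

def describe(text):
    """List (code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# Latin, Greek, and Cyrillic characters in one string.
sample = "\u00e9\u0398\u0439"
for code_point, name in describe(sample):
    print(code_point, name)

# UTF-8 is one common way to store Unicode text in a file.
encoded = sample.encode("utf-8")
round_trip = encoded.decode("utf-8")
```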

Scripture Markup Standards

Scripture markup standards specify the way we encode Scripture data in files with markers that tell us what book, chapter, and verse we are in, what kind of text this is (e.g., prose, poetry, Hebrew Psalm titles, introductory notes, or footnotes), and perhaps some information on the target language, where it is spoken, etc. Several open standards are usable for Bible translation authoring, editing, checking, and publication: USFM, USFX, Legacy SFM, OSIS, XSEM, GBF, and Zefania. The table below highlights some of the significant features of each, and gives links for more information on each of them, along with my personal biases and recommendations. This table is arranged with the most recent standards on top.

Encoding	Description	Encodes poetry structure and character styles	Supports preservation of quotation punctuation	Complexity	Software support	Acceptance level	Recommendation
USFX (18 March 2005 - present)	USFX is just an XML schema to represent USFM data in a fairly straightforward, simple manner. USFX does have a few extensions that may make it useful as a conversion hub between USFM and other standards, such as OSIS.	Yes	Yes	Easy	Haiola is its main user, but there are a few others. However, by converting to USFM, you can use all existing applications that support USFM.	USFX started as an internal standard for the Onyx project (since renamed to WordSend and then mostly absorbed into Haiola), but it is freely available for others to use. Since it is based on USFM and easily convertible to and from USFM, its usefulness depends entirely on the acceptance of USFM. USFX changes to enable new features where it makes sense to do so, with an eye towards backward compatibility.	Use USFX in conjunction with USFM where it makes sense with new software tools, or as a stepping stone between USFM and whatever you want to do with XSLT and other XML software tools.
USFM	This is the preferred Scripture markup file format for current use by field Bible translators. It uses “\” codes to mark book, chapter, verse, paragraphs, etc. There is a companion XML format, called USX, which maps one-to-one to USFM markers, and is defined by how Paratext generates and imports it.	Yes	Yes	Easy	Excellent: Paratext, Translation Editor, and multiple publication paths via SIL Pathway, Publishing Assistant, Haiola, etc.	Widely accepted and used	This is the most conservative and reliable Scripture file format to use for current projects. Although USFM is not an XML format, its companion USX is, and it can also be converted to USFX, which is XML, with no loss of information, using free software. USFM's twin, USX, is the preferred format of the Every Tribe Every Nation Digital Bible Library.
OSIS	This is an XML format that results from an attempt to create a universally-accepted Scripture interchange format that could replace the others.	Yes	Sometimes. It depends on how the <q> and <speech> markers were used when the document was created.	Very Difficult	Poor (as I write this, I'm aware of one Microsoft Word 2003 plug-in, but I don't think it got much use)	The International Forum of Bible Translation Agencies appears from their web site to have endorsed the development of OSIS, but OSIS isn’t yet widely accepted and used by the actual working linguists within any of these organizations. The American Bible Society used to host the BibleTechnologies.net web site for OSIS, but pulled the plug on it in February 2014.	OSIS should only be used with caution. It is probably not a good choice for archival use or massive conversions of texts because of the great deal of manual labor involved and the excessive ambiguities of OSIS. An improper subset of OSIS called OXES is a little better, but never seemed to get much traction in actual use.
Zefania (September 2004)	This is a very simple XML markup that is useful for some Bible study software applications, but it does not allow markup of many of the kinds of text present in a practical Bible translation.	NO	Yes	Easy	Limited to some Bible study software programs	Accepted by the developers of the Zefania Bible study software	Zefania is not suitable for use by most Bible translators, nor is it good for Bible study software publishers who would like to display poetry like the translators envisioned it. It is simple, but just a little too simple.
XSEM (September 2001)	The XML Scripture Encoding Model (XSEM) was intended to do and be the same thing OSIS is attempting, but didn’t gather a critical mass of followers. It is superior to OSIS in some ways.	Yes	Maybe. The documentation is a little unclear in this area, but XSEM may suffer the same defect as OSIS.	Difficult	Poor, currently limited to a GBF-to-XSEM converter as far as I know.	Initially embraced by SIL, but now SIL is using USX instead of OSIS or XSEM based on some ease of software development issues related to the recursive definitions used in XSEM. Although XSEM is technically superior to OSIS in many respects, lack of actual use by software developers and working linguists makes this a dubious standard.	Since I don’t know of anyone besides me who wrote any significant software to use XSEM, and I will be concentrating on where I perceive the practical advantages (USFX, USX, and USFM) in the future, this is probably not a good encoding to use.
GBF (January 1998)	This is a simple markup using markers in <> pairs, but it is not XML. It was invented independently of both SFM and XML, and does not support all the features of USFM. It has been used in some Bible translation and publication tasks, however, because it encodes a good minimal set of Scripture text types.	Yes	Yes	Easy	Converters to HTML, text, XSEM, OSIS, TeX, RTF, etc., exist. The Sword Project supports GBF for import to their Bible study software.	Limited in use and scope.	GBF should not be used for new projects, except possibly as a step towards conversion to a more standard format. New projects should use USFM, USX, or USFX, instead.
Legacy SFM (various versions have been around for decades)	These variations of “\” code sets (Standard Format Markers) have been in use in various places for both Scripture files and other uses, like dictionaries.	Yes	Yes	Easy	Good: Paratext with custom style sheets, some Word Macros	Accepted and used in various forms in various places, but no one variant is widely accepted.	Users of older SFM standard sets of codes for marking up Scripture texts should consider upgrading to USFM for superior archive value, compatibility with new software tools, and flexibility in typesetting locations.

As you can see, there are multiple competing standards, each with their own advantages and disadvantages. I could say that we should just all use USFM and dispense with the rest, but it isn’t that easy.

If you haven’t guessed by now, I consider USFM to be the current best choice for Bible translators to use. However, USFM has a couple of problems. The most significant of them is that it isn’t XML, but with easy conversion to and from USX (via Paratext) or USFX (via Haiola), this is not a significant problem in any practical sense.
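To give a feel for how simple “\” markup is to read by program, here is a minimal Python sketch that collects verse text from a tiny USFM-style sample. It assumes only the \id, \c, \p, \v, and \q1 markers, uses invented placeholder text, and ignores most of what real USFM files contain (footnotes, character styles, headings, and so on), so it should not be mistaken for a real USFM parser.

```python
import re

SAMPLE = r"""\id GEN
\c 1
\p
\v 1 First verse text.
\v 2 Second verse text,
\q1 continued on a poetry line.
"""

# A marker is a backslash plus a word; its content runs to the next backslash.
TOKEN = re.compile(r"\\(\w+)\s*([^\\]*)")

def read_verses(usfm):
    """Return {(book, chapter, verse): text} for a simple USFM-style string."""
    book = chapter = verse = None
    result = {}
    for marker, content in TOKEN.findall(usfm):
        content = " ".join(content.split())  # collapse line breaks and spaces
        if marker == "id":
            book = content.split()[0]
        elif marker == "c":
            chapter, verse = int(content), None
        elif marker == "v":
            # Verse numbers stay as text, since they are not always plain integers.
            verse, _, text = content.partition(" ")
            result[(book, chapter, verse)] = text
        elif verse is not None and content:
            # Paragraph and poetry markers: the text still belongs to the current verse.
            result[(book, chapter, verse)] += " " + content
    return result

verses = read_verses(SAMPLE)
```

Note that the paragraph and poetry markers do not interrupt the verse: the verse is a point marked in the text, not a container.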

One of the blessings of XML is that it is very flexible. You can represent almost any data in an almost unlimited number of ways. That is also one of the curses of XML. To be really useful, the scope of what you can do with XML must be restrained with a schema (or with a DTD) to essentially define what the markup options are, and what they mean. This should also be supplemented with documentation that explains the proper use of the XML schema in more detail than the schema XML document does itself. One of the first serious attempts to create an XML Scripture format that would be suitable for replacing Standard Format Markers (SFM) was XSEM. Enthusiasm for this schema seems to have peaked and waned before anyone but me wrote any software to support it, and I didn’t write much. Later came OSIS, trying to learn from XSEM and be all things Scripture to all people. Almost. After that, SFM was given a boost with the USFM definition, which attempted (successfully, I hope) to unify the many SFM variants in the world. USFX was invented as a very straightforward conversion of USFM to XML. Shortly after that, USX was independently invented as an alternate way to directly convert USFM to XML and back with no loss. Both are good. I think USFX has a few practical advantages, so I still use it as a hub format, but the good news is that you don't have to choose just one or the other, since with both Paratext and Haiola, you can freely convert between them. Of course, not everyone has access to Paratext, but Haiola is free.

Why not just use commercial software file formats?

We use file markup formats designed specifically for Scripture texts for three reasons:

Longevity.
Ease of adapting to different uses, such as different size books, electronic books, etc.
Standardization makes writing software to assist in the Bible translation and publication processes easier and of benefit to more people.

Commercial software formats change over time and sometimes go obsolete. For example, Adobe PageMaker is no longer being developed, but has been replaced with Adobe InDesign. Adobe InDesign cannot read PageMaker 5.0 files. Does anybody remember WordStar? How about Borland Sprint? Ventura Publisher is no longer being maintained, and may not be available for long. Archiving such file formats would require archiving the software AND the computer it ran on for reliable recovery, as well as the data itself. Even then, there is no guarantee that an antique computer will keep running indefinitely.

Commercial proprietary software formats are harder to parse to do the things we want to do with Scripture files, like running checking programs like those that come built into Paratext... and there are many of them, so the job would have to be done many times. The same is true when we use multiple standards for Scripture file formats, unless there are easy ways to convert back and forth between “standard” formats.

Commercial software allows many inconsistent and different ways of doing things, all assuming that the output desired can be obtained any way you like, as long as the page looks OK. That isn’t necessarily true, as some ways of doing things, like relying heavily on manual formatting instead of styles, make working with the results much harder for a publisher.

It is reasonably easy to write software to convert from a format like USFM to a commercial software format like RTF, but very difficult to convert in the other direction. Indeed, a generalized automatic conversion from RTF to USFM is not possible unless constraints are placed on the RTF like use of specific style names to correspond to USFM tags.

In other words, by using a standard Scripture file format, the data remains intelligible and usable longer, and it is easier to use it for many different purposes.

Single Source, Many Uses

One of the reasons we use Scripture markup like USFM is so that we can adapt the data for many uses, with no change in the source. Part of what makes this possible is to use meaning-based markup instead of presentation-based markup. In other words, we mark what a style of text means, not how to display it. For example, we mark something as a subtitle, not as 10-point bold text. By merging the Scripture data with external style information, the same Scripture files can be used for many applications. You may want to have a standard printed edition, a large-print edition for the visually impaired, a pocket testament, HTML for display on the World-Wide-Web, and modules for use in Bible study software. This can all come from the same source. Therefore, the markup doesn’t say how a verse marker should be displayed, just that verse 10 (for example) starts here. Programs like usfm2word.exe in the Onyx project take that information, and feed it to Microsoft Word 2003 as a verse number in “Verse marker” style, which is defined in the “seed file” used for that conversion. This “seed file” specifies the exact font typeface, size, and relative position to use for the verse marker. The same sort of thing is done for normal Scripture prose and poetry, and for titles and such. Producing a different size volume is a matter of specifying a different seed file, with the appropriate style definitions changed. That is why there is intentionally no way to specify “use 10-point Helvetica Bold here” or “set paragraph indent to 3 mm.” in the markup languages. There is a way to say “this is a subtitle”, “this is an inscription”, or “this is the 2nd line of a poetry line set.”
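The idea of merging one meaning-based source with different external style definitions can be sketched in Python as follows. The token kinds, the two style tables, and the render function are all hypothetical names made up for this illustration. They are not the seed file format or any part of Onyx.

```python
import html

# The single source: each piece of text is labeled by what it is, not how it looks.
SOURCE = [
    ("subtitle", "The Creation"),
    ("verse", "3"),
    ("prose", "Let there be light."),
]

# Two external "style sheets" for the same source (assumed, simplified).
HTML_STYLE = {"subtitle": "<h2>{}</h2>", "verse": "<sup>{}</sup>", "prose": "{}"}
PLAIN_STYLE = {"subtitle": "{}\n", "verse": "[{}] ", "prose": "{}"}

def render(tokens, style, escape=str):
    """Apply a style table to meaning-based tokens; escape adapts text to the output format."""
    return "".join(style[kind].format(escape(text)) for kind, text in tokens)

as_html = render(SOURCE, HTML_STYLE, escape=html.escape)
as_plain = render(SOURCE, PLAIN_STYLE)
print(as_html)
print(as_plain)
```

Producing a third output, such as a large-print edition, would mean adding another style table, with no change to the source.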

Scripture Publication Formats

In addition to the formats listed above, there are some Scripture publication formats that are worthy of mention, even though they are not as useful for the entire chain of Scripture translation through publication. Almost every Bible study software package seems to have its own unique format, usually a proprietary one. Sometimes the format, or an import format, is made known so that new translations can be imported (like with Bibleworks 5 and later). Sometimes it is kept as a trade secret (as with Libronix). There was one earlier attempt to create a common format for Bible study software from different publishers to use the same Bible files with each other’s programs. This attempt, called Standard Template for Electronic Publishing (STEP) is still in use by some, but the committee that controlled it seems to have disbanded.

Other more general book formats work for electronic publishing work, too, like PDF, Microsoft Reader, HTML, etc. These each have their advantages and disadvantages.

The most important thing to notice about Scripture publication formats is that there are more of them than I have bothered to count, but all of them can be generated, with varying degrees of automation or difficulty, from a good format like USFM. For example, USFM can be converted to WordML using Onyx, then using plugins or native features in Microsoft Word, converted from there to PDF, Microsoft Reader, Rich Text Format, HTML, etc. You can also use Onyx to convert USFM to USFX, then use XSLT to convert to various formats, like other XML formats, HTML, and assorted Bible study software import formats.

Challenges of an XML Scripture Schema

There are five challenges that an XML Scripture schema must overcome to be successful:

The Overlap Problem: XML elements must be neatly nested, but Scripture text elements are not. Chapters and verses overlap with sections, paragraphs, and stanzas, which overlap with quotations and character styles.
The quotation problem: there are multiple legitimate and some illegitimate ways of looking at quotations, and these are not compatible with each other, especially in the context of the multitude of living languages and their various styles.
Scope definition and complexity control. Should a Scripture file format include tags to embed video clips? I don’t think so, but someone will. It turns out that getting a consensus on this is hard.
Compatibility with existing Scripture file formats, including the ability to convert back and forth.
Standardization: the need for a standard is so strong that even a bad standard will be accepted and used if enough other people do so.

There are also some less challenging considerations to cover:

Versification and referencing. Some of the early Bible study software programs assumed that every Bible used the same versification scheme as the King James Version of the Holy Bible. There are, however, many slight variations in that scheme in use in existing Bible translations. Sometimes two different versification schemes are encoded in the same translation. It is good to have a description of what versification system is desired for the purpose of checking new translations against. When scrolling Bible translations with differing versification schemes in parallel in Bible study software, a list of equivalents to a "standard" versification is useful. For example, displaying 3 John 15 in the UBS 4th Edition Greek New Testament or the NRSV should bring up a parallel display of 3 John 14 in the NIV Bible. Versification schemes can be specified in separate documents (useful for checking new translations), or simply inferred from the markup of what is present in the Bible. Reference equivalents may be more useful to include in the Scripture file itself for publication purposes.
Digital rights management. Even though I invented and contributed the encryption algorithm used by The Sword Project for point-of-sale control of copyrighted, restricted-distribution Bible translations, I don’t particularly like applying such controls to Bible files. Somehow, it goes against the grain of my mission in life to translate, proclaim, publish, and live the Word of God as widely as possible, and help those who do. Nevertheless, I respect the legal rights of the copyright holders, and realize that this sort of thing is necessary for them to allow electronic distribution of their Bible translations in some cases. My view of digital rights management with respect to Scripture files is that it is best applied as a wrapper around the unencumbered format, be that USFM or whatever. These files can then only be “unwrapped” and read by approved software when given a valid registration ID.
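Returning to the versification point above, a list of equivalents can be as simple as a lookup table of exceptions. The Python sketch below is an assumption about how such a table might look. It contains only the single 3 John example given in the text, and the scheme labels, book codes, and function name are chosen for illustration only.

```python
# Exceptions keyed by (scheme, book, chapter, verse); anything not listed
# is assumed to have the same reference in every scheme.
EQUIVALENTS = {
    ("NRSV", "3JN", 1, 15): {"NIV": ("3JN", 1, 14)},
}

def parallel_reference(scheme, book, chapter, verse, target):
    """Return the reference in the target scheme that parallels the given one."""
    exceptions = EQUIVALENTS.get((scheme, book, chapter, verse), {})
    return exceptions.get(target, (book, chapter, verse))

mapped = parallel_reference("NRSV", "3JN", 1, 15, "NIV")
unchanged = parallel_reference("NRSV", "JHN", 3, 16, "NIV")
print(mapped, unchanged)
```

A real table would need many more entries and would have to handle verses that split or merge, which this sketch does not attempt.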

The Overlap Problem

XML is a wonderful thing, but it requires that all elements be strictly nested. You can’t let ranges of elements overlap. For example, <bold>Bold stuff <italic>bold and italic stuff</bold> italic stuff</italic> is not well-formed XML. Such a thing would have to be arranged as <bold>Bold stuff <italic>bold and italic stuff</italic></bold><italic> italic stuff</italic> or something like that. Scripture files tend to follow two different hierarchies that do not properly nest: the book/chapter/verse hierarchy, and the book/section/paragraph or stanza/character style hierarchy. In addition, it is sometimes desirable to mark quotations with markup, and those don’t always nest within either of those hierarchies. The solution is to allow one of the main hierarchies to exist as an XML hierarchy, and mark the other one with “milestones,” that is, XML elements that mark the logical start and end of an entity that overlaps the other hierarchy, but do not contain that entity.
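The nesting requirement is easy to confirm with any XML parser. The short Python sketch below feeds the two bold/italic arrangements from the paragraph above to the standard library parser. The wrapping root element and the function name are additions for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment):
    """Return True if the fragment parses as XML once wrapped in a root element."""
    try:
        ET.fromstring(f"<root>{fragment}</root>")
        return True
    except ET.ParseError:
        return False

overlapping = "<bold>Bold stuff <italic>bold and italic stuff</bold> italic stuff</italic>"
nested = ("<bold>Bold stuff <italic>bold and italic stuff</italic></bold>"
          "<italic> italic stuff</italic>")

print(is_well_formed(overlapping))
print(is_well_formed(nested))
```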

Bible translators and print publishers prefer to look at the Bible as poetry and prose, primarily, with chapter and verse markers being treated as points in the text that are marked. This approach encourages looking at the Bible text as it was originally written, rather than as a list of verses. Bible study software designers prefer to look at the text the way they search it: as a database of verses, arranged in a book/chapter/verse hierarchy. Zefania is an example of an XML hierarchy designed by software developers. It actually ignores paragraph and poetry structure, and just encodes the contents of verses in XML elements. XSEM is more complete in that it handles all three structures (paragraph, verse, and quotation), but emphasizes the print publication application by making the paragraph hierarchy primary, and always encoding verse markers and quotation markers as milestones. OSIS tries to be all things to all people by allowing you to encode verses and quotations either as containers or milestones, but falls short in the all things to all people department in that it doesn’t support paragraph boundaries of various sorts as milestones.

USFX basically does what XSEM does, but with the intention of later extending it to another, related schema that has verses as containers. This latter solution is not yet complete, in that the second schema is not yet fully defined and documented, but the idea is that you would author in USFX or USFM, then convert to this other schema without losing any information, then end up with a structure that is optimized for on-the-fly web page generation and Bible study programs.

With the ease of converting from one XML to another XML format via either XSLT or procedural programming languages, starting with a schema that gives priority to the paragraph and stanza structure of text and later converting it to a schema that gives priority to verses is not a problem. No matter which structure is more naturally represented by the XML structure, the other structures can be represented with milestones.
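Here is a minimal Python sketch of that kind of conversion: a paragraph-first document with verse milestones is read, and the text is regrouped by verse. The element names (book, p, and v with an n attribute) are simplified stand-ins invented for this example. They are not the actual USFX or XSEM element definitions.

```python
import xml.etree.ElementTree as ET

# Paragraphs are containers; verses are empty milestone elements.
PARAGRAPH_FIRST = """<book>
<p><v n="1"/>First verse. <v n="2"/>Second verse starts here</p>
<p>and continues in a new paragraph. <v n="3"/>Third verse.</p>
</book>"""

def verses_from_milestones(xml_text):
    """Collect the text that follows each verse milestone, across paragraph breaks."""
    verses = {}
    current = None
    for p in ET.fromstring(xml_text).iter("p"):
        # Text before the first milestone continues the verse already in progress.
        pieces = [(None, p.text or "")]
        pieces += [(v.get("n"), v.tail or "") for v in p.iter("v")]
        for number, text in pieces:
            if number is not None:
                current = number
            if current is not None and text.strip():
                verses[current] = (verses.get(current, "") + " " + text.strip()).strip()
    return verses

by_verse = verses_from_milestones(PARAGRAPH_FIRST)
print(by_verse)
```

This sketch throws away the paragraph boundaries. A lossless conversion would instead record them as milestones inside the verse containers.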

The Quotation Problem

There are many ways to look at quotations relative to a markup of Scripture text, each with their own advantages and disadvantages. Because of this, some serious disagreements exist with respect to how they ought to be handled. You would think that such little jots and tittles would not cause such problems, but they do, especially because we aren’t talking about marking up a novel or something. We are talking about the Word of God. I am rather passionate about God’s Holy Word. I think that my biases will be obvious here, but I hope to represent other views fairly, too.

First, let us consider the theological implications of quotation punctuation. The original Greek and Hebrew manuscripts did not use quotation marks. This means that some people claim that they can freely add or subtract quotation marks as they see fit. I disagree with that, because even though the source languages did not require or use quotation marks, target languages often use them or even require them. Quotation punctuation is not part of the source text, but it is often an integral part of the target language Bible translation, and as such must be preserved accurately. While some variations are perfectly acceptable in some target languages, they are not acceptable in others. Therefore, the only safe policy, in my opinion, is to preserve accurately whatever the translators decided.

Different target languages use different methods of marking quotations, some with punctuation and some with spoken markers. Even within one language, there are variations in dialects, changes in usage over time, and different stylistic choices in the way quotations are punctuated. Sometimes, multiple methods of marking quotations are acceptable. Sometimes this is not true. It all depends on the language, the dialect, the target language style choices, and the translational choices made by the translators. In my view, it is up to the translators to prayerfully decide what is and is not acceptable with respect to punctuation of quotations. Their decisions should be embedded in the encoded Scripture text unambiguously, if not by direct representation of the actual punctuation, then by markup that can be used to create the punctuation in a way that is approved by the translators for that particular language. I do not believe that placement of quotation punctuation should be left to the discretion of programmers who don’t know the target language, as unmodified OSIS most likely would leave it.

Traditionally, with SFM, quotation punctuation has been done by putting the correct punctuation in the text of the document itself, just like any other punctuation. However, because most keyboard layouts don’t allow for easy typing of typographic (curly) quotes, people have used typing shortcuts like << for “ or the French-style opening quotes. These get changed to the real corresponding punctuation through some sort of global search-and-replace operation, so the end result, for archiving or printing, contains the actual correct punctuation, as placed by the translators. This is a good and acceptable solution that probably covers 90% or more of the needs and desires of Bible translators and publishers.
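The global search-and-replace step might look like the Python sketch below. Only the << shortcut comes from the description above. The other three entries in the table are assumptions added to make the example complete, and a real project would use whatever shortcuts its translators agreed on. Since the shortcuts are angle brackets, this applies to plain SFM text, not to XML.

```python
# Assumed shortcut table. Longer shortcuts come first so that "<<" is not
# consumed as two separate "<" shortcuts.
SHORTCUTS = [
    ("<<", "\u201c"),  # opening double quote
    (">>", "\u201d"),  # closing double quote
    ("<", "\u2018"),   # opening single quote
    (">", "\u2019"),   # closing single quote
]

def expand_shortcuts(text):
    """Replace typing shortcuts with the real quotation punctuation."""
    for shortcut, mark in SHORTCUTS:
        text = text.replace(shortcut, mark)
    return text

typed = "<<She said, <Stop!> and left.>>"
final = expand_shortcuts(typed)
print(final)
```

Runs of adjacent closing marks such as >>> are ambiguous under this simple scheme, and the sketch makes no attempt to resolve them.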

What about the other 10%?

There are some other things that people sometimes like to do with quotations, that are not necessarily covered just by including the proper punctuation. These include:

1. Consistent initial generation of quotation punctuation from markup, according to the standard rules appropriate for that translation.
2. Consistent re-generation of quotation punctuation from markup, according to an alternate set of rules appropriate for a translation.
3. Automatic readjustment of quotation levels when pulling a quote from Scriptures into another document.
4. Rendering of selected quotations with an alternate text style.
5. Performing advanced search functions in Bible study software based on (among other things) who was speaking.

Item #1 can be a big help in avoiding errors where someone adds a paragraph or stanza break in the middle of a quotation, but forgets to add the open quote reminders, and in balancing the even/odd level quote mark alternation in English and similar languages. Indeed, some people have mistakenly assumed that we can just program the standard English quotation punctuation rules into all readers of Scripture files, always using markup for beginning and ending of quotations instead of quotation punctuation, and just let the machine sort out the details of placing the actual punctuation as a matter of display processing. While this view is obviously (to me) deficient in that it disregards the impracticality of coding the rules for quotation punctuation of all Bible translations into all software that would read the same file, there is substantial merit in making provision for this sort of processing in the cases where it makes sense. The place where it makes sense is in the initial translation and revision of a Bible translation. The process of inserting quotation punctuation according to rules should be under the control of the Bible translators. The translators should also be able to specify exceptions to the rules. Natural languages are like that. They have different rules. Some have open quote reminders. Some don’t. Some use different characters for open quote reminders than for the initial opening. Sometimes quotes can be ended in many ways. Rules have exceptions, and sometimes the exceptions just aren’t something that is easy to code into a rule. I speak from experience, here. I wrote the standard English rules of quotation punctuation into a checking program, and corrected many mistakes in a Bible text. I also found exactly two places in a whole modern English Bible translation where the rules just didn’t work right, and at that point, was very thankful that I had written a checker, not a generator that had to be relied on to get everything right every run. 
It is a simple thing to ignore two warning messages from a checker program, but a different matter to fix a rules-based quotation generator to get things right every time. USFM has no facility to do this (although USFX has been extended to allow this), but there are programs supporting USFM that allow checking a translation against a set of quotation punctuation rules. XSEM and OSIS almost support this feature, but allow for no exceptions, and present grave difficulties in implementation.
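To show the difference in spirit between a checker and a generator, here is a minimal Python sketch of a checker for the English double/single alternation. It is not the checking program I wrote, and it is deliberately incomplete: it knows nothing about open quote reminders at paragraph breaks, and it cannot tell an apostrophe from a closing single quote, so it will sometimes warn about correct text. With a checker, that just means a warning to ignore.

```python
PAIRS = {"\u201c": "\u201d", "\u2018": "\u2019"}  # opening mark -> closing mark
CLOSERS = set(PAIRS.values())

def check_quotes(text):
    """Return a list of warnings; an empty list means nothing suspicious was found."""
    warnings = []
    stack = []  # (opening mark, position) for each quotation still open
    for pos, ch in enumerate(text):
        if ch in PAIRS:
            # English alternates: double at odd levels, single at even levels.
            expected = "\u201c" if len(stack) % 2 == 0 else "\u2018"
            if ch != expected:
                warnings.append(f"position {pos}: expected {expected}, found {ch}")
            stack.append((ch, pos))
        elif ch in CLOSERS:
            if stack and PAIRS[stack[-1][0]] == ch:
                stack.pop()
            else:
                warnings.append(f"position {pos}: unmatched {ch}")
    for ch, pos in stack:
        warnings.append(f"position {pos}: {ch} is never closed")
    return warnings

good = check_quotes("He said, \u201cShe said, \u2018Stop!\u2019\u201d")
unclosed = check_quotes("He said, \u201cStop!")
wrong_level = check_quotes("He said, \u2018Stop!\u2019")
```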

The need for item #2 should become rather obvious shortly after using the facilities of #1. After making some edits to the text, probably including changing some quotation markers, you may want to run the program again to generate quotation marks. Of course, if the quotation marks are already there, this is bad, because you then start building up a large inventory of excess punctuation. It gets even more complicated when you change the rules or the punctuation used. Even if your target language only uses one set of punctuation and rules that go with that punctuation, with developing orthographies, it is not unheard of for people to decide they want to change things, possibly including punctuation. Just saying that the punctuation should not be put into the encoded source file is not a good idea, because then there are some serious problems with writing readers of the encoding. I think that both XSEM and OSIS make this error, although the OSIS committee has been more vocal about insisting on this option. A third option that works much better has been built into USFX, as an extension beyond what USFM supports. This option is to mark quotation start and end points with <quoteStart /> and <quoteEnd /> milestones. A process can then be run to generate or regenerate quotation marks according to rules specific to this particular Bible translation, placing the appropriate quotation punctuation inside the quoteStart and quoteEnd elements, and placing generated open quote reminders inside quoteRemind elements. For example:

He said, <quoteStart />She said, <quoteStart />My head hurts!
Please stop pulling my hair!<quoteEnd /><quoteEnd />

after one run becomes:

He said, <quoteStart>“</quoteStart>She said, <quoteStart>‘</quoteStart>My head hurts!
<quoteRemind>“‘</quoteRemind>Please stop pulling my hair!<quoteEnd>’</quoteEnd><quoteEnd>”</quoteEnd>

Then, if the paragraph break is removed and the standard English quotation punctuation generation process run again, this becomes:

He said, <quoteStart>“</quoteStart>She said, <quoteStart>‘</quoteStart>My head hurts! Please stop pulling my hair!<quoteEnd>’</quoteEnd><quoteEnd>”</quoteEnd>

Of course, if you just run the same process again without changing the text, you get the same thing back, as quotation marks are removed from their containing markup and then put back. The beauty of this idea is that for a reader of this markup that doesn’t know the right punctuation generation rules for the translation in question (more likely if it isn’t standard English), the default action is simple: just use whatever is in the text from the last automatic generation action. Indeed, the markup and punctuation could all be done manually if you wanted to do that work, with lots of flexibility. Exceptions can be coded by simply putting the punctuation in the main text instead of in quoteStart and quoteEnd elements.
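To make the regeneration step concrete, here is a minimal Python sketch of such a process. It is not the actual USFX tooling: the function name regenerate_quotes is made up for illustration, it assumes standard English rules (double marks at the outer level, single marks at the next level, alternating after that), it treats a line break as a paragraph break as in the example above, and it uses regular expressions rather than a real XML parser.

```python
import re

# Assumed English rules: even nesting levels use double marks, odd use single.
OPEN = ["\u201c", "\u2018"]
CLOSE = ["\u201d", "\u2019"]

# Empty milestones, previously filled elements, stale reminders, paragraph breaks.
TOKEN = re.compile(
    r"<quoteStart\s*/>|<quoteStart>[^<]*</quoteStart>"
    r"|<quoteEnd\s*/>|<quoteEnd>[^<]*</quoteEnd>"
    r"|<quoteRemind>[^<]*</quoteRemind>"
    r"|\n"
)


def regenerate_quotes(text):
    """Discard old generated marks and write fresh ones for the current text."""
    depth = 0  # number of quotations currently open
    out = []
    pos = 0
    for m in TOKEN.finditer(text):
        out.append(text[pos:m.start()])
        pos = m.end()
        tok = m.group(0)
        if tok.startswith("<quoteStart"):
            out.append("<quoteStart>" + OPEN[depth % 2] + "</quoteStart>")
            depth += 1
        elif tok.startswith("<quoteEnd"):
            depth = max(depth - 1, 0)  # tolerate an unbalanced end marker
            out.append("<quoteEnd>" + CLOSE[depth % 2] + "</quoteEnd>")
        elif tok.startswith("<quoteRemind"):
            pass  # stale reminder: dropped, regenerated below if still needed
        else:  # paragraph break inside or outside a quotation
            out.append("\n")
            if depth > 0:
                marks = "".join(OPEN[i % 2] for i in range(depth))
                out.append("<quoteRemind>" + marks + "</quoteRemind>")
    out.append(text[pos:])
    return "".join(out)
```

Run on the first example above, this sketch produces the second, and running it again on its own output changes nothing. A translation with different marks or rules would need a different table and possibly different logic, which is exactly why the generated marks are stored in the file.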

#3, Automatic readjustment of quotation levels when pulling a quote from Scriptures into another document, deals with things like turning:

God said, “Let there be light,” and there was light.

into:

The third verse in the Bible says, “God said, ‘Let there be light,’ and there was light.”

(Note the transformation of the quote marks around “Let there be light” from double to single quotes.) The same sort of operation would add the appropriate opening and/or closing quotes to a partial quote. This is a rather esoteric operation of no practical value to most Bible translation and publication chores, but it could be a useful feature in some Bible study software that supports such processing on the extraction of quotes. Then again, some people might argue that this is changing the Bible translation text— and it might be if the punctuation rules applied didn't match the punctuation rules that should have been applied.
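As an illustration only, the level readjustment could be sketched in Python as follows, assuming the quotation boundaries are marked with quoteStart and quoteEnd milestones as described above and that standard English rules apply. The function names render and embed are hypothetical, not part of any existing Bible study program.

```python
import re

# Assumed English rules: even nesting levels use double marks, odd use single.
OPEN = ["\u201c", "\u2018"]
CLOSE = ["\u201d", "\u2019"]

# Matches an empty or already filled quoteStart/quoteEnd element.
MILESTONE = re.compile(r"<(quoteStart|quoteEnd)(?:\s*/>|>[^<]*</\1>)")


def render(marked, base_depth=0):
    """Render marked text as plain text, starting at the given nesting depth."""
    depth = base_depth

    def mark(m):
        nonlocal depth
        if m.group(1) == "quoteStart":
            s = OPEN[depth % 2]
            depth += 1
        else:
            depth -= 1
            s = CLOSE[depth % 2]
        return s

    return MILESTONE.sub(mark, marked)


def embed(intro, marked):
    """Quote a marked passage inside another sentence, one level deeper."""
    return intro + OPEN[0] + render(marked, base_depth=1) + CLOSE[0]


verse = "God said, <quoteStart />Let there be light,<quoteEnd /> and there was light."
print(render(verse))
print(embed("The third verse in the Bible says, ", verse))
```

Working from the markup rather than from the punctuation characters avoids having to guess whether a given ’ is a closing quotation mark or an apostrophe.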

Item #4, rendering of selected quotations with an alternate text style, has deep roots in the tradition of publication of majority language Bibles, but is not commonly used in minority language Bible translations. The two most significant examples of this are the so-called “red letter editions” that render the Words of Jesus in red print, and the rendition of Old Testament quotes in the New Testament in some different typeface. The NASB uses small caps for the latter, while the NKJV uses slanted (not italic) text. These two styles may overlap, too, when Jesus quotes an Old Testament passage. People get religious about this, with some claiming that it is wrong to mark the Words of Jesus in red, as if the rest of the text were less inspired, while others regard this admitted tradition of men as a valuable help, just like the subtitles and such that people tend to mix in with the actual Scripture. Nevertheless, for a markup language to accurately encode a reference translation like the NKJV, these text styles should be supported. Besides, just because the words of Jesus are marked with a different style doesn’t mean that they must be printed differently. Many publishers sell both “black letter” and “red letter” editions of the same Bible translation, charging a little more for the latter to cover the extra printing press pass and the red ink. My view of the situation is that I could take or leave printing the Words of Jesus or OT quotes differently, but in cases where publishers are likely to want to do that, I would rather have the translators encode where these things are than have someone else add the markup later, since I have more faith in the translators’ ability to pay attention to these details. They need to do so anyway to get the punctuation right in most languages.

Here is one of the first main divergences between the philosophy of USFM and that of OSIS or XSEM. In USFM, there is no markup for quotation mark beginning and end. There is only markup for text styles associated with OT quotes and Jesus’ quotes (\qt ...\qt* and \wj ...\wj*). These text styles are not allowed to cross paragraph boundaries, so they must be stopped and restarted at paragraph boundaries when the quotation continues past the paragraph boundary. On the other hand, OSIS and XSEM both lack character style markers corresponding to the USFM markup. Instead, you are expected to derive that style information from the markup for quotations or speech, using the “who” attribute. Of course, variations in reasonable markup of the who attribute make mapping to these styles a little problematic, but it can work. At least it could work, if the quotation markup of OSIS and XSEM were not defined to always cause generation of punctuation. As of OSIS 2.1.1, published in March 2006, OSIS can indeed be used in that way by specifying a marker attribute for <q> or <speech>.
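The stop-and-restart rule for USFM character styles can be sketched in a few lines of Python. This is a simplified illustration, not a USFM parser: the function name split_wj_at_paragraphs is hypothetical, and it assumes that \p is the only paragraph marker in use, whereas real USFM has many more paragraph-level markers.

```python
import re

# A \wj ...\wj* span, possibly running across paragraph markers.
WJ_SPAN = re.compile(r"\\wj (.*?)\\wj\*", re.DOTALL)
# Assumption: \p is the only paragraph marker that can occur inside a span.
PARA = re.compile(r"\s*\\p\s+")


def split_wj_at_paragraphs(usfm):
    """Close \\wj before each \\p inside a span and reopen it afterwards."""

    def fix(m):
        parts = PARA.split(m.group(1))
        return "\n\\p\n".join("\\wj " + part + "\\wj*" for part in parts)

    return WJ_SPAN.sub(fix, usfm)


sample = "\\p\n\\wj First part.\n\\p\nSecond part.\\wj*"
print(split_wj_at_paragraphs(sample))
```

A converter going from quotation-based markup to USFM would need a step like this, since a single quotation element may span several paragraphs while a USFM character style may not.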

Item #5, performing advanced search features of Bible texts based on who was speaking, is actually a feature that I don’t think exists in any Bible study software I have used, so I obviously would not miss its absence. It is a WIBNI (wouldn’t it be nice if) creeping feature that might some day catch on. In the meantime, an external, translation-independent database could reasonably be constructed with this information, and this feature, or something like it, could be implemented in Bible study software without burdening the people encoding every Bible translation imaginable.

There is a tension between some of these features, but it is possible to allow for all of them, even if it is not possible using either USFM or OSIS.

The Importance of Lossless Encoding

I tend to look at any encoding of text, Biblical or otherwise, in light of what I studied in the disciplines of data compression and cryptography. In both of these fields, there are both lossy and lossless encodings. Probably the more familiar example is in the compressed encoding of digital pictures. The most common digital picture format is probably JPEG, which takes as a parameter a “quality” setting that really determines how much detail in the picture can be thrown away. Picture files can be made much smaller by throwing away less important details, and still look almost the same. Likewise, audio files can be compressed a great deal by throwing away parts of the sound that our ears don’t hear very well or at all. However, when encoding data in which every bit is important, like the number of your bank account and the password that gives you access to it, you don’t want any bit thrown away. Loss of just one digit could make your account inaccessible. If this happens to be a numbered Swiss bank account containing fifty million euros, a single bit error could be very costly. You could use lossless compression, like the common .zip file format, but lossy compression is just not appropriate.

When it comes to Scripture file encoding, I actually adhere to a dual standard. I will tolerate some loss in metadata, some elements of formatting, and loss of portions or even entire classes of helps, such as publishers’ subtitles. I insist on lossless encoding of the actual inspired text of the Holy Bible itself, and of the translations of the inspired text of the Holy Bible. I am persistent in my intolerance of anything less. Since I consider quotation punctuation (or the absence thereof) to be a part of the text of a Bible translation, I insist that when I encode a Bible translation with any given Scripture file format, then read or view that Bible translation using any standard-compliant reader of that particular Scripture file format, I get back exactly the same Bible text that I started with, including, without any ambiguity, the quotation punctuation intended by the Bible translators who produced that translation. It is OK with me if the translators specify two or more variants of acceptable ways to mark quotations, but I had better get at least one of them from any proper reader of that format. If I encode the ASV in a red-letter edition, I expect to see no quotation marks in the output. If I encode a translation that uses different quotation marks than English, then I expect to see those marks, not the English equivalents. I can do that with USFM (or USFX). I can’t do that with the current unmodified OSIS, at least not without supplemental style information that I have no assurance that anyone else using the same OSIS file will use in the same way.
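The difference between the two kinds of encoding is easy to demonstrate on a line of text. The following Python sketch is only an illustration with made-up function names: one round trip uses the lossless zlib compression from the standard library, and the other uses a deliberately lossy conversion to plain ASCII that silently drops the quotation marks.

```python
import zlib

# Sample text containing curly quotation marks (non-ASCII characters).
text = "He said, \u201cLet there be light.\u201d"


def lossless_round_trip(s):
    """Compress and decompress; every character must come back unchanged."""
    data = s.encode("utf-8")
    return zlib.decompress(zlib.compress(data)).decode("utf-8")


def lossy_ascii(s):
    """Drop every non-ASCII character, including curly quotation marks."""
    return s.encode("ascii", "ignore").decode("ascii")


print(lossless_round_trip(text) == text)
print(lossy_ascii(text))
```

The second function still returns readable text, which is what makes this kind of loss dangerous: nothing looks broken unless you compare the result with the original.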

Reading Scripture Files and Rendering Quotations

The weakest link in the chain of use of any Scripture file format that encodes quotations as markup is in the software that reads the files and renders the output, including all proper punctuation. This reader may be part of a typesetting program (a subject near to my heart right now), a Bible study program, a Bible search web site, a program to convert from one Scripture file format to another, or any of several kinds of aids to Bible translators. We have discussed above the many things we might want to do with quotation markup. The reader of the file format cannot operate on what we might want to do. It has to be able to read the file format and determine what we actually said we are doing this time. To do that, it must be able to determine the answers to the following questions concerning quotation markup:

Am I required to generate quotation punctuation based on this markup?
If I am required to generate quotation punctuation based on this markup, what are the markers and rules to be used?
Is this quotation to be rendered in a different text style? For example, is this an Old Testament quote, and do my style settings specify that this should be rendered with a specific font style?
Is this quotation marked only for the purpose of searches?

The answers to some questions can reasonably be obtained from external style information that need not be part of the Scripture markup, and some should be obtained from style information that is not part of the Scripture markup. For example, it is good to mark OT quotes in the NT in the Scripture markup, but not to specify that OT quotes in the NT should be rendered with 10-point Gentium italics. Deciding if this is a red-letter edition or not is something that belongs in external style sheets, even though quotes of Jesus are marked in the Scripture file itself.

Relegating quotation punctuation generation to external markup is only acceptable to my way of thinking if you don’t expect all readers of the format to do this generation. If you do expect that all readers will do this generation, then the information necessary to do it properly must be included in the Scripture file markup itself. If you only expect some but not all readers of a Scripture file to generate quotation punctuation from markup, then the generated punctuation should be stored in the file by those special Scripture file readers for the other Scripture file readers to use. The only Scripture file format in the above table that has provisions for this sort of selective generation, however, is USFX, and then only in extensions not found in USFM. This feature could be added to OSIS and/or XSEM easily enough, using attributes on the quotation markup in OSIS or defining the contents of qStart and qEnd in XSEM.
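For a reader that does not generate punctuation itself, using the stored marks is nearly trivial, which is the point of storing them. A minimal Python sketch, assuming the quoteStart, quoteEnd, and quoteRemind elements described earlier and using a hypothetical function name, might look like this:

```python
import re

# Opening, closing, or empty quoteStart/quoteEnd/quoteRemind tags.
QUOTE_TAGS = re.compile(r"</?(?:quoteStart|quoteEnd|quoteRemind)\s*/?>")


def render_stored(fragment):
    """Strip the quotation tags and keep whatever punctuation they contain."""
    return QUOTE_TAGS.sub("", fragment)


fragment = (
    "He said, <quoteStart>\u201c</quoteStart>She said, "
    "<quoteStart>\u2018</quoteStart>My head hurts!"
)
print(render_stored(fragment))
```

This reader needs no knowledge of the translation’s punctuation rules. If the milestones are still empty because no generation pass has been run, it simply outputs no quotation marks.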

My Quotation Punctuation Bias

Allowing people to generate quotation punctuation from markup is good, because, in the right contexts, it can reduce punctuation errors.
Forcing people to generate quotation punctuation from markup is bad, because there is no way that all readers of the markup will get the punctuation right for all Bible translations. There are just too many variations in what is required and what is permitted, and trying to specify all possible variations is too complex. Even if you could get it right, forcing quotation punctuation into the markup instead of the text of the Scripture introduces serious problems in automated conversion of existing Scripture texts from and to other formats.

Scope Definition and Complexity Control

XML schemas can be specified to handle any kind of data in multitudes of ways. XML is so flexible that you could design a schema to include Bible texts, cookie recipes, movies, and plans for constructing nuclear weapons, all in one document. Such a bizarre combination is of no practical value. That example may seem far-fetched, but there is a temptation to keep including more things in a Scripture file format than really belong there. By striving to do one thing really well, like encoding Scriptures, it is more likely that we will get that one thing done right. It is OK, and even desirable, to have separate schemas for related things, such as dictionaries. Even when sticking to Scripture texts, there are many types of text, notes, figures, etc., that Bible translators may want to use. A balance must be struck between providing enough styles for everything that people might reasonably need and overwhelming people with choices they will never need or too many ways to do the same thing.

Having more choices of ways to do essentially the same thing may sound like a good thing, until you start dealing with the many varied, creative, and sometimes contradictory interpretations and implementations people come up with.

One of the keys to reliability in any software system is keeping it simple. The simpler the file formats and the simpler the processing required, the more reliable the software that processes the files is likely to be. K. I. S. S. — keep it simple, sir. A good schema specifies file formats that are as complex as they need to be, and no more. USFM is a good example of about as complex as a Scripture file format needs to be. USFX inherits that simplicity, and tries to resist the temptation to make the data model more complex just because XML can do that more easily than backslash codes can. For example, USFX does not define sections with headers and following paragraphs. It just defines headings of various sorts, paragraphs, stanzas, and the stuff that would go in a section. The logical sections are still there and their boundaries could be determined automatically if there was a need to do so. It is unlikely, however, that there would be a need to do so for any reason other than converting to a more complex XML format.

Compatibility

Making any change, even an improvement, in a Scripture file format absolutely requires a migration path for older data. Unless all of the software tools that operated on the old format can be replaced with the rollout, the migration path must include a bidirectional conversion. There must be a way to convert data from the old to the new format and from the new to the old format, without losing any important data. When the conversion is nontrivial, the conversion must be automatic if we are to reasonably expect people to keep converting back and forth to use both new and old software tools during the transition time. Replacing software is a slower process than we would like. However, Bible translation itself can take a long enough time for people to experience several file format transitions. If we do a good enough job thinking through the file formats and anticipating needs, seeing future problems and avoiding them, and doing the design properly, we can minimize these transitions and the disruptions they cause. We can also facilitate development of new software tools.
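One practical way to check a bidirectional conversion is a round-trip test: convert each sample to the new format and back, and report any sample that does not come back unchanged. The Python sketch below shows the idea with a pair of toy converters for a single character style. The converters, the element name, and the function names are illustrative assumptions, not real USFM or USFX tools.

```python
import re


def old_to_new(text):
    """Toy converter: backslash-coded \\wj ...\\wj* spans to XML-style elements."""
    return re.sub(r"\\wj (.*?)\\wj\*", r"<wj>\1</wj>", text, flags=re.DOTALL)


def new_to_old(text):
    """Toy converter in the opposite direction."""
    return re.sub(
        r"<wj>(.*?)</wj>",
        lambda m: "\\wj " + m.group(1) + "\\wj*",
        text,
        flags=re.DOTALL,
    )


def find_lossy_samples(samples, forward, backward):
    """Return the samples that do not survive a forward-then-backward trip."""
    return [s for s in samples if backward(forward(s)) != s]


samples = ["\\wj Sample words.\\wj*", "plain text"]
print(find_lossy_samples(samples, old_to_new, new_to_old))
```

An empty result on a large and varied set of real texts does not prove that the conversion is lossless, but a single reported sample proves that it is not.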

At this instant in time, the published specifications for OSIS and USFM are fundamentally incompatible with respect to quotation punctuation, making a lossless, bidirectional, fully automatic conversion between fully compliant USFM and fully compliant OSIS impossible. This is true even if you stick to the subset of OSIS features that USFM supports.

Standardization

There is a keen desire among many people, myself included, to have a good XML standard for Scripture encoding. Having one good standard solves many problems. Consider the standard QWERTY keyboard layout. That is what people sell, because that is what people know how to type on. People learn to type on the QWERTY keyboard, because that is what people sell. This is a little bit less of the case now, with software-defined keyboard layouts, but it still reflects a major standard. People perceive benefits in that standard. The standard doesn’t have to be technically superior to be accepted, just widely accepted. Indeed, the QWERTY keyboard layout is designed to intentionally slow down the typist, for the purpose of reducing jamming of the mechanical typewriter’s strike bars. Except for a few unusual people like me, everyone keeps using it anyway. (I type using the Dvorak keyboard layout, slightly modified to permit me to type any character in any language being worked on in PNG.) The QWERTY keyboard layout is a widely accepted standard. This pull towards a standard is one of the factors contributing to Microsoft’s near monopoly in the PC operating system market. People write software for the most widely used operating system, and people like to buy the operating system with the best choices in software. VHS video tape won out over Betamax, in spite of the better picture quality of the latter, because more movies were available to be watched for the former. Wide use counts for a lot.

In the context of Scripture format files, standardization provides a foundation upon which software can be built to assist in Bible translation editing, adaptation, checking, publishing, archiving, and revision. If there are many incompatible standards, software developers are left making choices as to which standard(s) to support. If lossless bidirectional conversions can be made between standards, then it is as if those standards are really just one standard. If not, then there is a separation in the tools and in the large amount of accumulated Scripture texts available in one format but not necessarily in another.

As I look around at the available Scripture file formats used for Bible translation work, I only see one with a critical mass of support: USFM and its close cousins, the SFM used by various entities before USFM was put forth. Conversion from other SFM dialects to USFM is usually a minor issue, involving a few search-and-replace operations. I see in USFM the collected wisdom and experience of decades of work. In XSEM, I see a first serious attempt to define an XML Scripture interchange schema. I expected it to become THE Scripture XML standard when it was published. I was wrong. Why? Apparently, it wasn’t fully embraced by the people who would write software to support it, for a variety of reasons, including a lack of simplicity. The next serious attempt was OSIS. The OSIS committee seems to have spent more effort in promoting their proposed standard, and on the surface, at least, they present the appearance of having the acceptance of all of the major players (i. e. the Forum of Bible Translation Agencies). This support, however, is largely an illusion until the standard is in use, and it is seriously marred by their lack of responsiveness to their customer base. Adoption of OSIS as a real, in-use standard, rather than just a proposed standard with lots of lip service, depends on both technical and social factors. I tried to support OSIS early on. I still do, to a degree, not because I like OSIS, but because I see the need for an XML Scripture interchange standard. With some minor modifications, OSIS could be workable as that standard— not elegant, not simple, but workable.

Conclusion

For now, the best choice for a Scripture file markup language for Bible translators is USFM. If you want to use XML tools like XSLT with that, you can use Haiola to convert to USFX or Paratext to convert to USX, then convert that to whatever you need, like HTML, XSLFO, etc. OSIS developed a following with the Crosswire Sword Project, and you can convert from USFM or USFX to OSIS via Haiola. Any comments?
