by Kahunapule Michael Johnson http://kahunapule.org 2 March 2006 (2 Adar 5766)
The history of Bible file format development is diverse and displays a variety of underlying philosophies. Some of those seem obvious. Some of those are open to debate. They all impact the use and interpretation of Bible file formats. Some of these issues have profound impact on the cost and reliability of software that reads and writes these files, on the utility of these files and suitability for various purposes, and on their potential acceptance. Some of these issues impinge on the area of theology and deep religious convictions. Some have little impact other than that they need to be defined and agreed on to effectively form a Bible file format specification.
This is an area of great interest to me, because the way Scripture text is represented on disk has a profound impact on the programs that can be written to assist in the tasks of Bible translation, Bible publication, and Bible study. Bible file formats are the very foundation on which such software rests. They impact the functions available, the features that can be implemented, usability, and data exchange compatibility of software that deals with Scripture texts.
The primary purpose of this document is to provide a basis for discussion of Bible file format philosophies, and to make explicit some of the common assumptions that people make about Scripture files. I think it is important to shed light on these assumptions, because they often conflict, but with greater understanding, it should be possible to resolve some misunderstandings and to at least recognize genuine disagreements and deal with them appropriately.
A proprietary specification is the most common method used to electronically encode English Bibles for use in Bible study programs. A proprietary specification is one that is developed by one publisher for his own use or for the use of an exclusive set of publishers. It is protected by trade secret, patent, copyright, or other intellectual property laws. An open specification is one that is openly published (e.g., on the World Wide Web) with permission granted for others to use it, both to encode their own Scripture files and to read Scripture files encoded by others. Note that the use of an open specification for encoding a Scripture file does not imply that the Scripture itself is “open.” Scripture translations and some eclectic compilations from original language source manuscripts are protected by copyright law unless they are old enough that the copyright has expired or they have been explicitly donated to the Public Domain. A particular Scripture translation may also be flagged as restricted and/or encrypted to assist in the enforcement of copyright and license restrictions, even in an open Bible file format. In the case of encryption in an open format, the keys used may be kept secret, but the encryption method and file encoding would be open.
The main advantage of a proprietary specification is that it is entirely under the control of the publisher(s) creating that specification. Therefore, it can be optimized for their particular use without getting any kind of consent or action from anyone else. This can possibly lead to better software performance and better reliability. It can also be used to “lock in” a customer base to the publisher’s software offerings, because customers are loath to pay for and replace their whole library. (Note that this is not an advantage to customers.) It also makes possible a marketing model where you give away the reader but sell the books that can be read only by that reader.
The main advantages of open specifications are the potential for freedom of choice for end users in selecting which software to use with their existing documents, freedom for developers to take advantage of existing libraries of documents that may exist using that specification, and freedom of open competition in software and Bible publishing markets. If the specification is widely used by Bible translators, Bible publishers, and software publishers, and if the specification is good, then end users benefit greatly from open Scripture encoding specifications. If the specification is not optimized for the intended use, or not widely used, it can hinder the work of those who use it.
I prefer the use of open specifications for Scripture file encoding, in general, but there are some areas where proprietary Scripture formats make a lot of sense. One of those is where a program needs to use a proprietary internal format for technical reasons that make the software work better (or work at all) for a particular application. In those cases, the program should import and export an open format. Even better is if the internal format is documented and made open.
Examples of open Scripture encoding formats include OSIS, USFM, USFX, and XSEM. Examples of proprietary Scripture encoding formats include Libronix, Olive Tree Software’s Bible reader format, the format used by Zondervan’s BibleSource for Windows, and almost all commercial Bible study software Scripture file formats. An interesting in-between standard is the Standard Template for Electronic Publishing (STEP), which is proprietary to a group of publishers, but membership in that group was open to others to join for a fee.
One issue that strongly influences the final appearance of a Scripture encoding format is the scope of what is to be encoded. Items that might be included in or excluded from the scope are:
Pure Canonical Scripture text.
Poetry and prose formatting.
Footnotes, cross references, and alternate readings.
Added section titles.
Primary chapter and verse markup.
Secondary chapter and verse markup.
Introductory material and helps.
Pointers to maps and illustrations.
Maps and illustrations in the same file
Interlinear texts.
Strong’s numbers or other mapping to a lexicon.
Grammatical information about words in the text (gender, number, tense, etc.).
Sidebars
Bible study notes and verse-by-verse commentaries
Topical commentaries
Topical reference indexes
Conditional line break information related to various column widths for poetry lines
Page break information
Tables in Scripture (maybe for lists of offerings?)
Tables in helps (such as for lists of equivalent weights, measures, and monetary units).
Front and back matter.
Cover designs
Metadata such as library catalog information.
Markup indicating beginning and end of quotations.
Presentation forms of quotations.
Markup of quotes indicating who is speaking.
Markup of various types of text that may be presented differently, such as Old Testament quotes in the New Testament, words of Jesus Christ, inscriptions, letter openings and closings, extended quotations to be indented, words translated from different source texts, words in the target language that are borrowed, words supplied from context to make the target text make sense that aren’t verbatim in the source text, etc.
Audio clips
Video clips
Recipes for chicken soup
Academic grammar papers
Mathematical treatises
Photo albums
Theology dissertations
Sermons
Feature-length movies
Extension mechanisms to include subjects or attributes not foreseen in the specification
Presentation-specific information (fonts, column widths, etc.)
Highlighting and coloring information
Other stuff
Note that the list above contains some things that clearly belong in a Scripture file format, and some things that clearly do not. There are also many things that people would tend to disagree on. About the only things that should be mandatory in the above list are some canonical Scripture text and markers to indicate where it goes (book, chapter, and verse). I marked the items above that I thought were good to support (at least as an option) with a period at the end of the bullet, but your list may differ from mine. In consensus-building, it may be necessary to bloat the feature set some to get more people on board, but if it gets too bloated, people start finding easier ways. Each choice brings with it implications about suitability for a given task, complexity of the software to manage that task, etc. For example, if you say that we would include pointers to image files for maps and illustrations, that is reasonably easy to specify and encode (kind of like HTML). However, if you want to embed image files in the same document, that gets a little more complicated, especially when using a text format like XML.
My take on this is that a good Scripture file format is specialized, and limited to Scriptures. For other kinds of text, there are other formats that are more suitable, like Open Document Text or Rich Text Format for illustrated, formatted text of almost any sort, and various other formats for multimedia presentations. Many things, such as commentaries, are most likely better left in a separate document with a different, generalized file format that is not designed to be a Bible file, but which contains reference links that make it easy to find the Scriptures being discussed.
It is fairly common for file formats for Bible study programs to be very "bare bones" in that they support only canonical text, maybe some footnotes, maybe a limited amount of markup for supplied text (italics), and a copyright notice. Many (like BibleWorks) do not even preserve poetry and prose formatting (which makes some translations that capitalize the beginning of poetic lines look like they are rife with capitalization errors). On the other end of the spectrum is a format like OSIS, which could probably be used to encode almost anything on the list above.
The best starting point for deciding what Scripture file format is best suited for an application is to decide exactly what this file format will be used for. Potential applications include:
Common source for many formats (e.g., HTML for World Wide Web distribution, e-Books, pocket testaments, whole printed Bibles, large print Bibles, study Bibles, red- and black-letter editions, electronic book formats, etc.)
Bibles for Bible study software
Bible translation
Bible typesetting and print publishing
Interchange of Bible texts between different applications
Academic analysis of Bible texts
Representation of Interlinear texts
Use only with one specific software application
Non-Bible texts for Bible study software (general books and commentaries)
Authoring and publication of Bible-related books, such as commentaries
Authoring and publication of any other book or document, with or without such a thing as verse numbers
The choice of intended uses to support has a profound effect on a Scripture file format. For example, if the intended use is as a proprietary format for use with one Bible study program, then there is no need to conform to any open specification (and this might actually be undesirable), nor is there a need to support any features that particular program does not use. On the other hand, if the format is to be used for interchange of Scripture data between different applications, then there must be some agreement between developers of the various applications concerning the file format. If the format is to be used for a common source for many distribution formats, then it should encode meaning-based attributes (such as "words of Jesus Christ" or "inscription") instead of presentation-specific attributes (red text or 12-point Times Roman black small caps).
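To make the difference between meaning-based and presentation-specific attributes concrete, here is a minimal Python sketch. The tag names (`words_of_jesus`, `inscription`) and the two style sheets are hypothetical and are not taken from any existing specification. The point is only that the same encoded text can feed a red-letter edition and a black-letter edition without being re-encoded.

```python
import html

# Hypothetical semantic tags mapped to presentation by each publisher.
RED_LETTER = {
    "words_of_jesus": '<span style="color:red">{}</span>',
    "inscription": '<span style="font-variant:small-caps">{}</span>',
}
BLACK_LETTER = {
    "words_of_jesus": "{}",  # same color as surrounding text
    "inscription": '<span style="font-variant:small-caps">{}</span>',
}


def render(runs, stylesheet):
    """Render (tag, text) runs as HTML; a tag of None means ordinary text."""
    out = []
    for tag, text in runs:
        template = stylesheet.get(tag, "{}")  # unknown tags fall back to plain text
        out.append(template.format(html.escape(text)))
    return "".join(out)


# Placeholder wording, not a quotation from any translation.
RUNS = [(None, "Jesus said, "), ("words_of_jesus", "Follow me.")]

print(render(RUNS, RED_LETTER))
print(render(RUNS, BLACK_LETTER))
```

Had the source stored "red text" instead of "words of Jesus Christ", the black-letter edition would have to guess what the color meant.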
If the intended use is for publishing general books and such, then it may be better to use a more general format, like Open Document Text.
In the above list, I’m most interested in the first 7 applications, and not really interested in the last 3, because of the work I do. I’m especially interested in applications that assist in the translation and publication of minority language Bibles and Scripture portions. Your interests may vary.
There are many ways of looking at Scripture texts. Some views involve practical aspects of the way the text is used. Some involve religious points of view. The most common views are the verse database view, the print publication view, the deep overlapping hierarchy view, and the flat attribute view. Other views are possible.
The verse database view is exactly that: a database containing verses of the Holy Bible, accessed by a key that is the combination of book, chapter, and verse. The peripheral, non-canonical stuff may reside elsewhere, but in this view, the book, chapter, and verse structure is usually fixed to that used by the 66 books of the KJV Old and New Testaments. Sometimes variations to include additional books found in the Roman Catholic Bible and minor versification variations are tolerated. This view assumes that just dumping the text of the verse or verse range accessed to a computer screen is sufficient. A good example of this approach is the original QuickVerse format (which also did some interesting compression of the verse number data). Sometimes the verse text may also include formatting information, such as markup for red for the words of Jesus Christ or italics. Some even include line breaks and indentations for poetry and prose. (The Sword Project does this for some of their Bible modules, but not others.)
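As a rough illustration of the verse database view, the Python sketch below keys verse text by a (book, chapter, verse) tuple. The three-letter book codes and the placeholder strings are assumptions made for the example, and real formats of this kind usually add compression and indexing that are not shown here.

```python
from typing import Dict, List, Tuple

VerseKey = Tuple[str, int, int]  # (book code, chapter, verse)

# Placeholder strings stand in for the actual verse text.
VERSES: Dict[VerseKey, str] = {
    ("GEN", 1, 1): "(text of Genesis 1:1)",
    ("GEN", 1, 2): "(text of Genesis 1:2)",
    ("GEN", 1, 3): "(text of Genesis 1:3)",
    ("JHN", 3, 16): "(text of John 3:16)",
}


def lookup(book: str, chapter: int, verse: int) -> str:
    """Return the text of a single verse; raises KeyError if it is absent."""
    return VERSES[(book, chapter, verse)]


def lookup_range(book: str, chapter: int, first: int, last: int) -> List[str]:
    """Return the verses of one chapter from first to last, skipping gaps."""
    keys = [(book, chapter, v) for v in range(first, last + 1)]
    return [VERSES[k] for k in keys if k in VERSES]


print(lookup("JHN", 3, 16))
print(lookup_range("GEN", 1, 2, 3))
```

Note that nothing in this structure records paragraphs, poetry lines, or section titles, which is exactly the limitation of this view.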
The print publication view looks at the Bible as a collection of books with various styles of paragraphs and titles and various text styles, where these styles are used to present different types of text (normal Scripture text, publisher’s headings, God’s proper Name in the Old Testament, etc.) in some way that looks different. In addition, there are chapter and verse markings, which may be drop-cap numbers for chapters, superscript numbers for verse numbers, or other options. One possible way to use this view is to directly author Scripture files in a standard word processor or desktop publishing program’s format.
The deep overlapping hierarchy view tries to represent several hierarchies sort of in parallel, including the book/chapter/verse, book/section/title+(paragraph|stanza/verse), nested quotations, special text attributes, alternate book/chapter/verse marks, etc. This is probably the most complex way to look at Scripture texts, but one that can accommodate both of the previously-mentioned views. OSIS and XSEM are examples of formats using this view.
The flat attribute view really doesn’t encode a hierarchy explicitly, but rather implicitly by simply marking the beginning and sometimes the end of various entities, such as paragraphs, poetry lines, books, chapters, verses, and special text attributes. USFM is the best example of this sort of view. (USFM may be extended in the near future to allow nesting of character attributes, which would make it a little less than perfectly "flat", but as of this writing it really is flat.)
Computer science teaches us about all the nice things you can do with strictly-nested tree structures. Bible texts are not strictly-nested tree structures. There are several ways to look at Scriptures as a tree of sorts, but there are always overlapping elements to deal with. Verses don’t always align with paragraphs or poetry elements. At least one chapter break occurs in the middle of a sentence in some translations. Quotations frequently cross verse and paragraph boundaries. Text attributes such as Old Testament quotation and words of Jesus Christ sometimes overlap. When divisions are defined to include a title and following text (poetry or prose), they sometimes overlap at different levels of an outline-style hierarchy. None of this seems to be a problem at all when dealing with printed text, but it causes some interesting responses among computer scientists trained in the wonders of the tree structure. The image of a man holding a hammer and a screw comes to mind.
Why do software writers like to put things in a tree structure, anyway? Primarily because of the software tools they have at hand, including XML analysis and manipulation tools. XML forces all data to be shaped as properly nested elements, another form of a tree. This is actually a nicer external shape to be forced into than the rectangular tables of most database engines, but even those are usually linked to form trees of sorts. Tree manipulation tools are what the programmer is holding and knows how to use, in many cases.
There are multiple valid ways of dealing with some of these overlapping elements, each with their own advantages and disadvantages:
Ignore all but one tree structure, and make it dominant. This is commonly done with the book/chapter/verse structure in the most simple of Bible study programs, letting the poetry and prose formatting and other structures slip into oblivion.
Make alternate tree-structured indexes of the same flat data. I’m not aware of an open specification doing this, but would be interested to know about it if it exists.
Pick one structure as dominant and use milestones (empty elements to mark beginnings and possibly ends) to represent the other structures. XSEM is an excellent example of this sort of approach, with the book/division/paragraph or stanza/verse structure being dominant, and chapter and verse markers as well as quotations and other items handled with milestones.
Let the encoder decide which structure should be dominant, if any, and encode the others with milestones. Compliant decoders have to deal with any valid approach. OSIS does this.
Use an inherently “flat” structure like USFM to encode the elements, and let the reader interpret the results.
Encode the text directly in a markup that doesn’t require every element to be strictly nested.
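The milestone idea in the list above can be sketched in a few lines of Python. The event names (`verse`, `quote_start`, `quote_end`) are hypothetical and the text is placeholder wording. The sketch keeps the text flat, treats every structural mark as a point rather than a container, and then derives character ranges for each structure, so a quotation that crosses a verse boundary causes no trouble.

```python
# A flat stream of events: text runs plus empty milestone markers.
EVENTS = [
    ("verse", 1),
    ("text", "He said, "),
    ("quote_start", None),
    ("text", "first part "),
    ("verse", 2),              # the verse boundary falls inside the quotation
    ("text", "second part"),
    ("quote_end", None),
    ("text", " and left."),
]


def spans_from_milestones(events):
    """Return (plain_text, verse_spans, quote_spans) as character offsets."""
    parts, pos = [], 0
    verse_spans = {}           # verse number -> [start, end)
    quote_spans = []
    current_verse, quote_start = None, None
    for kind, value in events:
        if kind == "text":
            parts.append(value)
            pos += len(value)
        elif kind == "verse":
            if current_verse is not None:
                verse_spans[current_verse][1] = pos  # close the previous verse
            current_verse = value
            verse_spans[value] = [pos, None]
        elif kind == "quote_start":
            quote_start = pos
        elif kind == "quote_end":
            quote_spans.append((quote_start, pos))
            quote_start = None
    if current_verse is not None:
        verse_spans[current_verse][1] = pos          # close the last verse
    return "".join(parts), {v: tuple(s) for v, s in verse_spans.items()}, quote_spans


text, verses, quotes = spans_from_milestones(EVENTS)
print(verses)
print(quotes)
print(text[quotes[0][0]:quotes[0][1]])
```

Each derived set of ranges can then be turned into its own tree or index if a particular tool needs one.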
Handling of quotations and their punctuation has become a religious issue for some people involved in file format definitions used to represent Scriptures. My biases will shortly become evident, but I’ll try to represent alternate viewpoints accurately, as well.
First, I would like to point out that quotation marks are not found in the original manuscripts of the Holy Scriptures, nor are they found in all translations of the Holy Bible. Rather, quotation marks are inserted by Bible translators based on context when they are required by the target language. Modern English and many other languages require the use of punctuation marks. Although it is usually clear where the punctuation marks belong, there are some places in Scripture where it is not so clear where a direct quotation ends, most notably among the Prophets. In these places, there may be some variations in interpretation by the translators. Nevertheless, it is the translators’ responsibility to prayerfully mark quotations in whatever way is most appropriate for the target language. Rules for marking quotations vary by language. There are various sorts of quotation punctuation, different rules for if and how open quotations at paragraph and stanza boundaries are handled, and stylistic choices that may be made in some languages that may or may not be equivalent. There are also rules for nesting of quotations. In English, there is an alternation of double and single quotation marks.
In English, "straight quote marks" and “typographic quote marks” are linguistically equivalent, but the latter are usually considered to look better in print. Another stylistic decision might be to indent an extended quotation (such as a letter) instead of offsetting it with quotation marks. The obvious and traditional way to handle this is for the translator to write in the proper quotation punctuation directly, as part of the text.
There are some possible benefits of using some kind of markup instead of the actual punctuation when drafting a Bible translation or encoding an existing Bible translation. They are:
1. It is harder to type typographic quotation marks than to type straight ones on traditional keyboards, so some people sometimes substitute something else, then globally search and replace later. The most common example I have seen is "<< < > >>" for "“‘’”". This is just a simple typing shortcut, but it has been included in some Bible file format specifications. This isn’t the only way to handle that issue. Straight quotation marks can be converted to typographic quotation marks on the fly, based on if there is a letter (or number) to the immediate left of it or not. This “smart quote” handling is the way Microsoft Word works by default. Another way is to customize your keyboard layout to generate the actual quotation marks to be used directly if the application you are typing into doesn’t offer a smart quotes option.
2. Markup can remove the ambiguity between the character ' or the character ’ when used as an apostrophe and when used as a right single quote. This distinction makes no difference for simple display as text, but it may be useful in the case of checking the balancing of quotation marks if the text is consistently marked up that way.
3. Markup of quotation start and end points can be used to generate the actual punctuation, provided that the appropriate rules and style of punctuation for a particular translation are available. It turns out that completely specifying such rules for any language that you might encounter, complete with when exceptions might be desired for style reasons, is hard. I’ve never seen it done properly for all possible languages. It can be done fairly simply for English and similar languages when stylistic variations are not important to preserve. Anyway, this has the potential of making initial generation of correct punctuation less tedious (depending on how tedious the markup is to draft), and also of allowing easy automation of insertion of quotations into quotations in another document. (The latter is kind of an obscure issue, one that is probably better handled by using indentation than punctuation, anyway, rendering that point somewhat irrelevant.) There are those who religiously believe that quotation punctuation should always be generated on the fly from markup, according to rules decided by whoever is interpreting the markup (probably a programmer, not the original Bible translators). I say “religiously” because their arguments don’t rely purely on logic. There are also those (like me) who religiously believe that any markup that doesn’t allow the Bible translators to completely specify how and where quotation punctuation should go is defective and should not be used. (I believe the Bible translators are more to be trusted with punctuation than others who may not even understand the target language.) The original OSIS specification was firmly in the former camp, although I understand that a revision that is about to be published will tolerate the latter viewpoint as well.
4. Markup of quotations can be used to search for words or phrases within quotations by a specific person. This is kind of an academic, hypothetical capability that could be implemented in Bible study software in the future. It could also be done without marking up every quotation with who the speaker is in every translation (which is a LOT of work), because the quoted people are the same in every translation. Therefore, my opinion is that anyone wanting to implement this feature should seriously consider using a list of quotations external to the actual Scriptures, then keying off of simpler markup that just shows starting and ending points of quotations (or even actual quotation punctuation instead of markup) within the text to find the ranges to search.
5. A more sophisticated markup of quotations allows generation or regeneration of quotation punctuation from markup at will using a process that knows the rules for this particular translation, but leaves the results in the markup (i. e. as an attribute or as text contained within special elements) in such a way that it is easy for a display process that knows nothing about the punctuation rules for this language and translation to simply use the quotation punctuation as generated.
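The on-the-fly conversion of straight quotation marks mentioned in the first point can be sketched as follows. This is only an approximation written for illustration, not the rule any particular word processor uses: it assumes a straight quote is an opening mark when it begins the text or follows white space, an opening bracket, or another opening quote, and a closing mark otherwise. That is slightly broader than a strict letter-or-number test, so that a quote following a period still closes.

```python
OPENERS = {'"': "\u201c", "'": "\u2018"}   # left double, left single
CLOSERS = {'"': "\u201d", "'": "\u2019"}   # right double, right single


def smarten(text):
    """Replace straight quotes with typographic ones using the left neighbor."""
    out = []
    for ch in text:
        if ch in OPENERS:
            prev = out[-1] if out else " "          # start of text acts like a space
            opening = prev.isspace() or prev in "([{\u201c\u2018"
            out.append(OPENERS[ch] if opening else CLOSERS[ch])
        else:
            out.append(ch)
    return "".join(out)


print(smarten('He said, "Go."'))
print(smarten("don't"))
```

The second call also shows the ambiguity from the second point: the apostrophe comes out as the same character as a right single quote, so a program that checks quote balancing cannot tell the two apart without extra markup.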
The three main approaches to handling quotation punctuation are:
1. Just include the punctuation as part of the text, with no special markup. This is the easiest approach for Bible translators, and meets all needs from drafting through publication in print or in electronic books.
2. Use markup to generate the quotation punctuation on-the-fly. This is best for those who religiously hold to this philosophy, but has serious practical implementation drawbacks, and incurs a substantial risk that punctuation will be displayed that is contrary to the desires of the original translators. This, in turn, brings up copyright compliance issues, especially if it is the only method in use.
3. Use markup to contain or represent the punctuation verbatim (either manually inserted or inserted by a process specified by the initial translator or encoder). This might make it easier to process the text in various ways, including some possible future uses in Bible study software searches.
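For a sense of what the second approach involves in the simplest case, here is a Python sketch that generates English quotation marks from start and end markers by alternating double and single marks with nesting depth. The marker strings `{q}` and `{/q}` are hypothetical. The sketch deliberately ignores everything that makes the general problem hard: other languages, reopened quotations at paragraph boundaries, and stylistic exceptions chosen by translators.

```python
OPEN, CLOSE = "{q}", "{/q}"   # hypothetical start and end markers

# English convention assumed here: double marks outside, single marks inside.
ENGLISH_PAIRS = [("\u201c", "\u201d"), ("\u2018", "\u2019")]


def punctuate(tokens, pairs=ENGLISH_PAIRS):
    """Replace quotation markers with marks chosen by nesting depth."""
    out, depth = [], 0
    for tok in tokens:
        if tok == OPEN:
            out.append(pairs[depth % len(pairs)][0])
            depth += 1
        elif tok == CLOSE:
            if depth == 0:
                raise ValueError("end marker without a matching start marker")
            depth -= 1
            out.append(pairs[depth % len(pairs)][1])
        else:
            out.append(tok)
    if depth:
        raise ValueError("quotation left open at end of text")
    return "".join(out)


TOKENS = ["He said, ", OPEN, "She told me, ", OPEN, "Go", CLOSE, CLOSE]
print(punctuate(TOKENS))
```

Under the third approach, the output of a process like this would be stored back in the markup, so that a display program never has to know the rules.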
The oldest manuscripts of the Holy Scriptures don’t really have character styles and rich text, for the most part. There are two notable exceptions. One is the use of a special holy pen and ink reserved for writing God’s proper Name in the Old Testament (יהוה) when copying Scriptures. The other, not really visible in surviving Greek manuscripts, is Paul’s own handwriting in the closing of his letters. (Obviously, Paul’s own handwriting is not on the handwritten copies of those letters made many years after Paul went home to be with the Lord.)
Modern languages, including those with newly-invented writing systems, have been influenced by the availability of character attributes, such as bold, italics, bold italics, small capitals, colored text, highlighting, alternate typefaces, different sizes, etc. These have been available for a long time, to one degree or another, but the rise of computer technology has made these easy to use. Traditions have arisen in Bible publication, such as the use of red letters for direct quotes of Jesus, small capitals for representations of God’s proper Name and for inscriptions, etc. I have an edition of a recent English translation of the Holy Bible that does all of that, plus uses slant text (not true italic) for Old Testament quotes in the New Testament. There are also different paragraph styles applied to text to format it as normal poetry and prose in the target language, instead of one long stream of writing like the original languages had. The inference of this structure from the text is subject to interpretation in many places, but it does help the text look better and communicate more clearly in target languages that always use such structures. There are also other novel uses of text attributes. For example, the Folopa New Testament uses italics to indicate words that were borrowed from another language. If a borrowed word has Folopa affixes attached to it, then the word transitions from italic to normal text in the middle of a word. The creativity of Bible translators seems to be unbounded. No matter if you personally like such use of physical text attributes or not, they are in use.
The commonly-agreed-on philosophy of Bible file formats is that they should represent the meaning of a text attribute, and let a display process (e.g., typesetting, conversion to HTML, display by a Bible study program, etc.) decide what, if anything, to do with that attribute in terms of physical text attributes, where appropriate. For example, direct quotations of Jesus Christ may be marked as such, and it would be up to the publisher to render such text as red, render it the same color as the rest of the text, or render those sections in some other variation in font attributes, like bold or a different typeface. This gets into another religious disagreement. There are some people who believe that it is wrong to display direct quotations of Jesus differently, since the whole Book is inspired by the Holy Spirit, anyway, and some people get the strange idea that only the red text is to be believed. Others believe that it is a useful help to the reader to see the red text, and that Bibles so marked sell better. If such markup is used, it should be accurate. In the case of the traditional red letter edition, the meaning of when Jesus is speaking or not is clear from the context, regardless of color printing or not.
In some cases, some meaning is conveyed by the text attributes. For example, one strong English tradition is to render (יהוה) as LORD or Lord with small capitals (instead of Lord), or when used in conjunction with Lord, as GOD (instead of God). That is kind of subtle, and gets lost when reading aloud, but it does give the reader a way to know which Name was used. In cases where meaning is conveyed by markup that isn’t otherwise conveyed by the text, then I believe it is important to preserve it in a display process.
Text attributes are also used to provide a separation between the canonical text and extra helps, as they are displayed. For example, publisher’s section titles should be displayed differently to avoid the appearance that these are part of the actual Bible text. When encoding a Scripture portion where certain character styles have been chosen for use by the translators and publishers, it is a good thing to be able to encode those that have been chosen. A near universal set of such styles can be found in the USFM specification. (Additional cases are probably rare enough that it is reasonable to handle those as exceptions rather than adding them to the standard list.)
When character styles are used extensively in the same Scripture portion, certain logical attributes will overlap. For example, if you mark direct quotes of Jesus Christ, Old Testament quotes in the New Testament, and supplied words, you could find places where all three attributes apply. If text attributes are chosen that do not interfere with each other to represent those attributes, then that kind of stacking of styles can easily be printed. For example, using red, small caps, and italics, respectively, in the previous example would work fine. (That is exactly how some editions of the New American Standard Bible are printed.) Representing such style stacks is hard to do if character style regions are not allowed to stack or at least nest. It is also hard to properly represent such style stacks in terms of named character styles in a word processor or desktop publishing program unless that program supports stacking of named character styles. An alternative is to not allow style stacking exactly, but to create combination pre-stacked styles for every possible combination. (This could be a lot of extra styles, but it turns out that the list isn’t too scary, because not all combinations can actually occur in the Holy Bible.)
The most elegant solution for handling style stacking situations, in my opinion, is to allow named styles to stack, and have the display processes deal with those stacks on the fly. This isn’t too painful when using style sheets that specify one or more character attributes corresponding to a particular named style, with everything else defaulting to the next underlying character or paragraph style attributes. An alternate workable solution is to not allow character styles to stack, but have lots more character styles. (Have fun counting which ones are needed—especially when someone wants to invent a new one for borrowed words or for helps sections.) Currently, I see more of the latter in use than of the former, unfortunately. OSIS seems to be able to handle character style stacking well enough, but USFM will have to be extended to support character style stacking. (Such a change is being considered as I write this.)
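Resolving a stack of named styles on the fly can be quite short. In the Python sketch below, the style names and the attribute set are hypothetical, following the red, small caps, and italics example given earlier. Each named style specifies only the attributes it cares about, and everything else defaults to whatever lies underneath it in the stack.

```python
# Hypothetical style sheet: each named style overrides only some attributes.
STYLESHEET = {
    "paragraph":      {"color": "black", "italic": False, "small_caps": False},
    "words_of_jesus": {"color": "red"},
    "ot_quote":       {"small_caps": True},
    "supplied":       {"italic": True},
}


def resolve(stack, stylesheet=STYLESHEET):
    """Merge a bottom-to-top stack of style names into effective attributes."""
    effective = {}
    for name in stack:
        effective.update(stylesheet[name])   # later styles override earlier ones
    return effective


print(resolve(["paragraph", "supplied"]))
print(resolve(["paragraph", "words_of_jesus", "ot_quote", "supplied"]))
```

Adding a new style, such as one for borrowed words, means adding one entry to the style sheet rather than a new pre-stacked style for every combination it might take part in.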
Another alternative to solving the style stacking problem is to look the Bible translators all in the eye and tell them they can’t use more than one text style at a time, or that they just can’t use any that weren’t in the original manuscripts. This may not promote acceptance of your favorite Bible file format, but it would solve the problem if they actually acquiesced to your request.
Up until now, we have been looking at broader issues, but there are some important issues regarding the form the Scripture files take at a lower level. This includes what mapping between characters and code points (binary numbers) is used, and what kind of approach is used to separate text and markup.
In the early days of computing, several standards for encoding text were used. These normally only handled the basic Latin alphabet, common punctuation, numerals, some control (non-printing) characters, and perhaps a few assorted symbols. The most significant of these was the 7-bit American Standard Code for Information Interchange (ASCII). Soon we needed more than 128 code points, and another bit was added, then “code pages” to select alternate character sets for the upper 128 characters. It became common practice to use a custom encoding and a corresponding custom font to represent text in languages other than English and languages with a similar alphabet. These encodings became a virtual Tower of Babel on their own, until Unicode came along with a much larger number of code points allowed. Unicode started out with 16 bits per character, but its code space later grew beyond that, to just over a million possible code points (which fit comfortably in 32 bits). The most popular way of representing Unicode text today is UTF-8, where the ASCII characters (the most commonly used ones in majority languages based on the Latin alphabet) are represented with just 8 bits, and all other characters are represented with two to four bytes, as necessary to represent the given character.
Using Unicode may make a document take up slightly more space than it would with a custom code page and corresponding custom font, but it helps get rid of lots of confusion and makes the text easier to interpret after pulling it from archival storage many years later. I like Unicode, and encourage its use.
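To make the variable width of UTF-8 concrete, the following small Python sketch encodes a few sample characters and reports how many bytes each one takes. The choice of sample characters and the helper name `utf8_length` are mine, purely for illustration.

```python
# Sample characters, written as escapes so the source file's own encoding does not matter.
samples = {
    "A": "Latin capital letter A",
    "\u00e9": "Latin small letter e with acute",
    "\u05d0": "Hebrew letter alef",
    "\u0915": "Devanagari letter ka",
    "\U00010330": "Gothic letter ahsa (beyond the original 16-bit range)",
}


def utf8_length(ch):
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


for ch, name in samples.items():
    print(f"U+{ord(ch):04X} {name}: {utf8_length(ch)} byte(s)")
```

The ASCII letter takes one byte, the accented Latin and Hebrew letters take two, the Devanagari letter takes three, and the Gothic letter takes four.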
One of the earliest Bible text markups to become widely used is called “Standard Format.” In this format, markers usually start at the beginning of a line (with some exceptions in later variants of this format), always start with a backslant (\), and are separated from the text they mark by a space. Sometimes (depending on the marker) they have an argument or attribute that follows the marker, also separated by a space. In Unified Standard Format Markup (USFM), the most recent and most well-thought-out version of Standard Format, some markers that are intended to set off character styles also have end markers that are the same as the begin markers, except that they are followed by a * instead of a space.
Standard Format can be (and is) used for dictionary data and other simple database items as well as Scripture text. USFM is restricted to Scripture text and helps commonly found bound together with a printed Bible. Standard Format is easy to parse with a computer program, and is easier to edit by hand and more tolerant of errors than XML. It also does not have any problem with overlapping entities found in Scripture texts. Standard Format isn’t the only similar format. Rich Text Format (RTF) and TeX both use backslant characters to indicate the beginning of markup elements.
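The marker rules just described are simple enough that a tokenizer fits in a few lines. The Python sketch below is only an illustration under simplifying assumptions: it treats any backslant followed by letters (and optional digits) as a marker, recognizes a trailing * as an end marker, and makes no attempt to separate marker arguments (such as a verse number) from the text that follows, since that depends on the particular marker. The function name `tokenize` and the sample line are made up.

```python
import re

# A marker is a backslant, a name, and then either "*" (end marker) or an optional space.
MARKER = re.compile(r"\\([A-Za-z]+[0-9]*)(\*| ?)")


def tokenize(text):
    """Split Standard Format text into (kind, value) tokens."""
    tokens = []
    pos = 0
    for m in MARKER.finditer(text):
        if m.start() > pos:
            tokens.append(("text", text[pos:m.start()]))
        kind = "end" if m.group(2) == "*" else "start"
        tokens.append((kind, m.group(1)))
        pos = m.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens


sample = r"\v 1 Some \add supplied\add* words"
for token in tokenize(sample):
    print(token)
```

Note that the verse number "1" comes out as part of the text token after the `v` marker; a real reader would need a table of which markers take arguments.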
Extensible Markup Language (XML) really isn’t a language but a metalanguage for describing languages where data is tree-structured. It defines clear ways of separating markup, attributes, and data. It is entirely a text format and does not allow nonprintable binary data. Strict nesting of elements is required, with no overlapping allowed. If the structural rules of XML are not followed, compliant XML readers are supposed to stop processing and reject the data rather than attempt recovery. XML’s insistence on strict nesting makes it seem unsuitable for Scripture texts at first glance, but with a little “cheating” by using empty elements as markers for the start and end of overlapping entities, XML can be made to work well for representing Scripture texts. Because XML has become a popular standard for a variety of reasons, there are lots of third-party XML parsing, writing, and transformation software tools available. Therefore, it seems logical to try to take advantage of these in making a Scripture file format specification.
To represent anything properly in XML, you define a language using a schema (or alternatively, a DTD) and document its use. That has been done several times in the case of Scripture texts, resulting in several incompatible schemas, including Zefania, XSEM, OSIS, and USFX. By “incompatible,” I mean that you cannot losslessly convert all possible valid documents from one to another and back again. The basic canonical Scripture text and book, chapter, and verse markers would survive, but auxiliary material, helps, metadata, formatting information, and (in some cases) punctuation may not survive the transformation. If the things that are lost are not important to you, then adequate converters between these formats, as well as between these and probable future Scripture schemas, are possible. The best way to represent Scriptures in XML depends strongly on your philosophies and what you intend to do with the encoded text.
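Here is a small Python sketch of the empty-element “cheat,” using only the standard library’s `xml.etree.ElementTree`. The element names (`book`, `p`, `v`, `qs`, `qe`) are invented for this example and do not come from OSIS, USFX, or any other real schema. A quotation begins in one paragraph and ends in the next, which strict nesting would forbid, so its start and end are marked with empty milestone elements that share an id.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: <v/> marks a verse start, <qs/> and <qe/> mark quote start and end.
DOC = """<book>
<p><v n="1"/>He said, <qs id="q1"/>This quotation starts in one paragraph</p>
<p><v n="2"/>and ends in the next.<qe id="q1"/> Then he left.</p>
</book>"""


def events(elem):
    """Yield elements and text chunks in document order."""
    yield elem
    if elem.text:
        yield elem.text
    for child in elem:
        yield from events(child)
        if child.tail:
            yield child.tail


def quoted_text(xml_text, quote_id):
    """Collect the text between the matching start and end milestones."""
    inside = False
    pieces = []
    for item in events(ET.fromstring(xml_text)):
        if isinstance(item, str):
            if inside:
                pieces.append(item)
        elif item.get("id") == quote_id:
            if item.tag == "qs":
                inside = True
            elif item.tag == "qe":
                inside = False
    # Normalize the whitespace left over from line breaks between paragraphs.
    return " ".join("".join(pieces).split())


print(quoted_text(DOC, "q1"))
```

The document is still well-formed XML, so ordinary XML tools accept it, but the pairing of start and end milestones is something the application has to check for itself, because the XML parser knows nothing about it.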
The naïve programmer’s view of the Holy Bible is a database of verses. That is exactly what you get with most Bible study programs. That is also why it is hard for some programmers to even think about preserving line breaks in poetry and text attributes. However, some sophisticated programmers have upgraded their idea of a Scripture database to include rich formatted text for each verse. (I’m almost surprised at the high-priced Bible study programs that still just return unformatted text for each verse.) The database underlying the Scripture modules as installed may be some sort of SQL database (used by Translation Editor), Microsoft Jet (used by e-Sword), a cross-platform custom database (used by The Sword Project), etc. Just about any database can be made to work. The real question is what is stored in the database, and how flexible the key structure is at handling slight variations in versification, verse bridges, and intelligent side-by-side display of translations with slightly different versification. Does it handle just the 66 canonical books, or does it handle deuterocanonical books as well? Is the stored text richly formatted? Can you reconstruct proper poetry and prose from a range of verses?
Scripture file formats based on a database are as proprietary as the underlying database, and generally limited in usefulness to just one program, or maybe one suite of programs from the same vendor. Of course, if a function to import and export to an open text-based format is available, this need not be a problem.
The less demanding you are about what features of a Scripture text you wish to encode, the more choices you have in existing text and binary formats. A common choice is a simple one-verse-per-line format, with the Scripture reference given at the beginning of each line. Of course, without additional markup, this doesn’t do much for you if you want to encode footnotes, poetry and prose formatting, etc., but it is there. One old format that was invented in the absence of knowledge of Standard Format is the General Bible Format (GBF). It uses markers delimited with angle brackets, but it is not XML. It is a good example of a minimalist markup that handles book/chapter/verse, poetry & prose, headings of various sorts, words of Jesus Christ, some support for helps, footnotes, a minimal amount of metadata, the canonical text (of course), and not much else. It is still in use for at least four projects that I’m aware of, but it will probably be replaced with an XML format of some sort eventually.
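To show how little is needed to read a one-verse-per-line file, here is a minimal Python sketch. The exact layout is an assumption on my part (a book abbreviation, a space, chapter:verse, a space, then the text), since such files vary; the function name `parse_line` is likewise made up.

```python
import re

# Assumed layout: "<book> <chapter>:<verse> <text>", one verse per line.
LINE = re.compile(r"^(\S+) (\d+):(\d+) (.*)$")


def parse_line(line):
    """Split one verse line into (book, chapter, verse, text)."""
    m = LINE.match(line.rstrip("\n"))
    if m is None:
        raise ValueError(f"not a verse line: {line!r}")
    book, chapter, verse, text = m.groups()
    return book, int(chapter), int(verse), text


print(parse_line("Jhn 11:35 Jesus wept.\n"))
```

Everything after the reference is an undifferentiated string, which is exactly why this format cannot carry footnotes or poetry formatting without additional markup.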
Where small size of Scripture files is important, or where non-text elements are included as helps, binary formats may be preferable to text-only formats like XML or USFM. There are several ways to do this, like using tagged packets, binary indexes to data, etc. Most examples of this kind of format are proprietary.
The main disadvantages of text formats are (1) they are space hogs compared to similar binary formats that have been designed for efficiency, and (2) they may be too easy for people to edit or convert to other formats. (The latter is also an advantage if you want to edit or convert to another format, but if you are a publisher or a software developer trying to get a copyright owner to allow you limited use of a copyrighted Scripture text, it could be a problem that prevents you from using that text at all.) One simple way to convert a space hog into a smaller format is to simply compress the format with a standard file compression algorithm of some sort. A good example of the effectiveness of that approach is the Open Document Text format, which is basically XML compressed with ZIP compression.
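As a quick demonstration of compressing a text format with a standard algorithm, the Python sketch below runs the standard library’s `zlib` module (an implementation of DEFLATE, the method normally used inside ZIP files) over a made-up XML sample. The sample markup is invented and highly repetitive, so the ratio it shows is better than real Scripture text would achieve; treat it as an illustration of the round trip, not as a measurement.

```python
import zlib

# Build a made-up, repetitive XML sample (real Scripture text compresses less well).
verses = "".join(
    f'<v n="{i}">Placeholder text for verse {i}.</v>\n' for i in range(1, 501)
)
xml_text = "<chapter>\n" + verses + "</chapter>\n"

raw = xml_text.encode("utf-8")
packed = zlib.compress(raw, 9)          # 9 = best compression, slowest
restored = zlib.decompress(packed).decode("utf-8")

print(f"uncompressed: {len(raw)} bytes, compressed: {len(packed)} bytes")
```

The decompressed text is identical to the original, so nothing about the underlying text format has to change to gain the space savings.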
For use on a modern PC, compression built into the Scripture file format is probably not necessary. On a handheld computer or other small device, it might be of significant benefit.
Cryptography offers two main benefits to Scripture file format use: tamper detection through digital signatures and point of sale control through encryption. The former is a boost to confidence in digital texts. The latter may be a condition of getting some copyright owners to allow you to even use their text in a digital product.
Digital signatures could be built into a Scripture file format. Alternatively, it may make sense to simply take advantage of external digital signatures made with software like GNU Privacy Guard. Not being one to like reinventing wheels, I recommend the latter. I routinely use GNU Privacy Guard to digitally sign Scripture files distributed from http://eBible.org.
Encryption for point of sale purposes can be handled externally, too, although some copyright owners like the idea of the text remaining unreadable on disk, even after the sale. That pretty much implies that the encryption must be embedded in the product that displays the copyrighted text. The encryption algorithm(s) used could be open and publicly known as long as the key(s) remain secret. It turns out that keeping such keys secret against a determined attacker is a hard problem, usually requiring some kind of secure hardware... but for securing Bible study program Scripture files, the electronic equivalent of a standard door lock is normally sufficient. Those can be bypassed, but most people won’t. By its very nature, application of encryption to a Scripture file renders it unsuitable for archive use, but application of a well-designed digital signature, especially an external one that does not change the format of the signed file at all, does no harm.
Most people would say that it is good for a Scripture file format to be maximally compatible with a variety of applications that generate, check, publish, display, and facilitate study of Scriptures. They would also like enough flexibility and extendability to handle unforeseen situations and to encompass a wide variety of uses. These virtues are at odds with each other. The more flexible a format is, the harder it is to make other applications compatible with it, because that flexibility must be programmed many times. The more extensible a format, the more likely someone in the future won’t know what to do with an extension. When many alternatives are provided to encode the same thing, confusion is more likely when interchanging or converting data.
Compatibility and archive suitability are essentially the same thing, except that in the case of an archive, the compatibility is with some probably unknown future application. The keys to compatibility are simplicity, clarity of documentation (making differing interpretations of the same document less likely), appropriate limitation of scope, and mechanisms that make clear how use of extensions should be handled (i. e. required internal documentation of what they are and what they mean, setting proper expectations of how such extensions should be handled by existing software, etc.).
Simplicity is extremely important. By simplicity, I don’t mean that complex situations or richly-formatted text cannot be handled, but that the way such things are handled is no more complex than necessary. Simplicity in Scripture file formats yields simpler programs to read and write those formats and fewer compatibility problems. This, in turn, yields more reliable software delivered sooner. Sometimes to maintain simplicity, it is better to declare a particular application outside of the scope of the specification, or at least document such uses to be in an extension to the specification that not all users of the basic specification need to support. Two extremes with respect to simplicity while still being able to handle mostly the same features of the same Scripture texts are USFM and OSIS. USFM is almost as simple as it could be and still handle all of the situations it handles. OSIS, on the other hand, has all kinds of excess baggage in the form of both encoding artifacts that could be made much simpler and still do the same job and items that could be encoded that most people would never use. (Even if USFM were extended to handle everything OSIS could, like extra metadata, it would still be much simpler.)
Ease of conversion of existing Scripture files to any proposed new format is a major consideration when the object is to improve on a format that has been in use for any significant length of time. If such conversion cannot be done quickly and losslessly, it is likely that the new format will not be considered worth the effort to convert to.
There is a strong desire among many to have a common Bible file interchange format standard. It seems to me that the first XML schema that is good enough is likely to fill this void, regardless of whether it is the best or not. Once it becomes reasonably widely used, it will be difficult to replace with a better standard because of the large mass of texts encoded in it and software written for it. This is reminiscent of the VHS vs. Betamax video tape cassette format battle. Betamax was technically superior (if slightly more costly to manufacture), but VHS won, because it got to a critical mass of tapes recorded and machines in use first. Of course, that victory was not forever, as DVD has taken over the same place of honor, and that, in turn, will be displaced in the future.
One significant difference between the video tape scenario and file format specifications is that automatic converters can be made to change one format to another, and if that conversion is reversible with acceptable quality, then an upgrade path is easier to follow than it is with a warehouse full of the wrong kind of video cassettes. Therefore, it is not necessary that just one standard prevail, like it was with video cassette tapes. However, having too many specifications coexisting at once could cause some confusion, unless different standards were confined to different domains or niches of application. It may not be the winner-take-all of the video cassette standard, but there is still a perceived need for a common Bible file format standard for interchange between different domains, times, users, and programs. It may be that a poor standard is better than no standard. Almost any decent standard looks good until competition arises. Sometimes the “marketplace” of Bible file format users may surprise you, just as the mass decision of lower cost over quality made the difference for VHS.
Probably a better model for data formats is found in the area of word processing. At one time, WordPerfect ruled the market. WordPerfect documents were the standard format used by lawyers and others who had a large inventory of documents to maintain, edit, and adapt. The majority of word processor users were used to WordPerfect. This is no longer true. What happened? Microsoft worked hard to come up with a word processing program that was not only better in many ways, but which could also read and write WordPerfect files. They even wrote special help for WordPerfect users, telling them how to do things the new way. Apparently, it worked. I don’t think it was just aggressive marketing. I think Microsoft did something smart. Today, I watch with a degree of amusement as OpenOffice.org Writer (backed by Sun Microsystems and a team of open source developers) is using the same technique (except for the big marketing campaigns) in a bid to compete with Microsoft Word. The conversions from WordPerfect to Microsoft Word document format were not perfect, but good enough. OpenOffice.org Writer still needs some improvement in its conversion quality, but the idea is good, and its conversions are good enough for the most common documents. There are also other competing word processing programs, all of which can import and export a variety of formatted text document formats, including competitors’ formats. As a result, multiple document standards and multiple programs peacefully coexist, with none taking 100% market share.
Data convertibility and compatibility between different formats eases the monopoly pressure, and also reduces the risks associated with choosing a standard.
In the case of Bible file formats, I think that it is probably better to have at least two, with conversion between them made easy: a format optimized for drafting or initial encoding of Scripture translations, and a format optimized for display using a Bible study program. The former would major on the text format as poetry and prose, with verse markers added as milestones. The latter would be strictly compartmentalized by verses, but still include formatting information to display a range of one or more contiguous verses with proper formatting.
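To sketch what conversion from the first kind of format to the second might involve, the following Python example takes a paragraph-oriented token stream with verse milestones and regroups it into per-verse records. The token kinds (`para`, `verse`, `text`), the style names, and the function `split_by_verse` are all hypothetical, invented only for this illustration. Each verse record restates the paragraph style in force, so that a single verse can be displayed on its own.

```python
def split_by_verse(tokens):
    """Regroup paragraph-oriented tokens into per-verse records."""
    records = {}
    style = None       # paragraph style currently in force
    verse = None       # verse currently being filled
    need_para = False  # True when a style marker is due before the next text
    for kind, value in tokens:
        if kind == "para":
            style = value
            need_para = True
        elif kind == "verse":
            verse = value
            records[verse] = []
            need_para = True  # every verse record restates its paragraph style
        elif kind == "text" and verse is not None:
            if need_para:
                records[verse].append(("para", style))
                need_para = False
            records[verse].append(("text", value))
    return records


# Prose paragraph "p" that turns into a poetry line "q" in the middle of verse 2.
tokens = [
    ("para", "p"), ("verse", 1), ("text", "Prose begins. "),
    ("verse", 2), ("text", "It continues"),
    ("para", "q"), ("text", "into poetry, "),
    ("verse", 3), ("text", "line two."),
]
by_verse = split_by_verse(tokens)
for number, record in by_verse.items():
    print(number, record)
```

This sketch deliberately loses one thing: a verse record no longer says whether the verse began a new paragraph or continued one already in progress. A real display-oriented format would need an extra flag for that, or the conversion back to the drafting format could not be exact.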
Being able to convert freely between different ways of representing the same Scripture texts is a good thing, in theory. In practice, conversion between Bible file format specifications is plagued by the same kinds of problems as conversion between word processing format standards. The results can be similar. The basic text and verse markings get converted, but formatting, supplemental material, and variations in versification may get lost, changed, or converted to another kind of attribute.
Challenges to be overcome in converting from one Bible file format to another include:
Incompatibilities due to differences in philosophies underlying different specifications
Inconsistencies in character encodings used (hacked fonts and custom character sets)
Different granularity of data fields, e.g. one field for the whole copyright notice vs. separate fields for copyright dates, owner(s), license and permission notices, trademark notices, etc.
Features supported in one format but not another
Features that don’t map one-to-one, so that conversion is ambiguous
Reliance on external information not present in the encoded text
One potentially large barrier to conversion quality is possibly not so obvious: the differences in philosophies behind the various specifications. For example, if one specification cannot handle stacked character attributes, it is hard to convert to that format from a format that does handle those, without risking losing information. If one specification handles Words of Jesus Christ as a character attribute only and doesn’t allow that attribute to cross verse or paragraph boundaries (like USFM), while another handles Words of Jesus as a special case of a quotation that is marked only at the beginning and end of the quote and is also expected to generate quotation punctuation on the fly, then round-trip conversion gets complicated (not to mention language-dependent).
The simpler the markup involved and the more standard the versification, the more likely it is that a Scripture file will survive a conversion to another standard unscathed. More complex cases will almost always lose some kind of formatting or peripheral data.
Different people have different ideas of what a standard is. To some people, a specification is not a standard unless it has been officially accepted by some international standards approval body or maybe a national government. Others consider a specification to be a standard if it is endorsed and promoted by just one organization, like Microsoft and the Rich Text Format or Adobe and the Portable Document Format. To others, a standard is any specification that is widely used. The most useful standards are those that are widely agreed on, widely used, and that have a clear keeper of the standard. In some ways, it matters little who the keeper of the standard is, as long as users of the standard don’t object. The keeper of the standard has a responsibility to clearly publish the official standard and to respond to requests for clarification of the standard as well as requests for amendments or enhancements. That doesn’t mean that all requested changes will be made, but that they will at least be considered.
Considering changes to a data format standard of any sort, including Scripture file formats, should involve wise evaluation of the impact of changes on existing data archives, existing software, and existing documentation, with careful weighing of the benefits and costs of such changes and of possible alternate changes.
Maintenance of some standards involves serious economic implications and sometimes involves power politics. Sometimes a specification wins acceptance in a marketplace long before it is accepted as a standard by any official standards body. To my way of thinking, wide acceptance is more important than official blessing. If the International Organization for Standardization (ISO) or the United Nations were to promote a standard for the production of perpetual motion machines, it might be very official, but also very useless. On the other hand, if a previously obscure group promotes a standard for hyperlinked text documents spanning the world over the Internet and it gains the support of several software developers, and if they grow and adapt the standard based on user feedback, it could (and did) have a profound impact. Standards don’t do anything by themselves. They just make it easier (and sometimes possible) to do some other things.
A lot of incompatibilities and costly implementation problems can be avoided by wise choice of Scripture file formats to use. My main interest in Bible file formats is in representing Scripture texts for initial translation and publication in both print and electronic formats, and also for use in Bible study programs. Compatibility and stability for archive purposes are very important considerations, as are ease of use for both end users and programmers and preservation of important attributes and (of course) the Sacred Text itself. The closest thing to that ideal that has reasonably wide acceptance is USFM. There have been several efforts to improve upon USFM, and such an effort will almost certainly succeed in the future, in spite of serious problems with the efforts made public at the time I am writing this.