As the JSTOR archive expands to include new content in new subject areas, we face new challenges in our continuing efforts to provide "faithful replications" of the journal content as it was originally published. While JSTOR is primarily an English-language archive, we are finding that as the breadth of our content expands, we are increasingly working with languages other than English. Additionally, the metadata that we create for this non-English journal literature includes non-alphabetic characters that must be captured and displayed. Until two years ago, JSTOR's efforts in this area relied primarily on LaTeX, a markup language intended to represent mathematical formulae and scientific notation. JSTOR still employs LaTeX where appropriate. However, with the introduction of the Arts & Sciences II and Language & Literature collections, both of which contain numerous non-ASCII characters in their metadata, JSTOR decided to pursue an approach to non-ASCII characters that was based on a more broadly accepted standard. After careful internal discussion and consultation with various participating libraries and publishers, JSTOR adopted the Unicode standard, which is widely recognized and has been successfully implemented within the academic and publishing communities.
What is Unicode? [1]
Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use. These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
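The conflict between legacy encodings is easy to reproduce. The following is a minimal sketch in Python, using three codecs chosen purely for illustration (Latin-1, Windows-1251, and code page 437); the helper names are hypothetical and are not part of any JSTOR system.

```python
# Illustrative only: the codecs and helper names are assumptions chosen
# to demonstrate the conflict described above.
RAW = bytes([0xE9])                       # a single byte with value 233
CODECS = ("latin-1", "cp1251", "cp437")   # three pre-Unicode encodings


def decode_everywhere(raw, codecs=CODECS):
    """Decode the same bytes under each codec: same number, different characters."""
    return {codec: raw.decode(codec) for codec in codecs}


def encode_everywhere(char, codecs=("latin-1", "cp437")):
    """Encode the same character under each codec: same character, different numbers."""
    return {codec: char.encode(codec).hex() for codec in codecs}


if __name__ == "__main__":
    print(decode_everywhere(RAW))   # one byte read three ways
    print(encode_everywhere("é"))   # one character stored two ways
```

Under these three codecs, byte 233 decodes to "é", "й", and "Θ" respectively, while "é" is stored as `e9` in Latin-1 and `82` in code page 437. Data that moves between systems without a record of its encoding can therefore change meaning silently.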
Unicode provides a unique number for every character, no matter the platform, the program, or the language. The Unicode standard is organic and is continually expanding and offering new character sets to its users. For example, one among a number of proposals currently under consideration by the Unicode consortium is a hieroglyphics character set.
Unicode at JSTOR
As with any change of this kind, implementing Unicode required a significant number of changes behind the scenes at JSTOR. We also recognized that, while Unicode was a step forward in displaying characters, search functionality was still limited to ASCII characters. Therefore, along with the implementation of Unicode came a decision to accompany Unicode representations with corresponding transliterations that are searchable using standard US keyboard characters. By using standardized transliteration schemes, JSTOR embraced a long-standing approach to representing non-Latin characters that gives scholars a familiar way to use the JSTOR metadata in their search strategies. One inherent challenge of this approach, however, was that many non-Latin orthographic systems have more than one standardized transliteration scheme, so JSTOR needed to select one and then apply it consistently.
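The dual Unicode/transliteration idea can be sketched as a record that carries both forms, with search running only against the ASCII form. This is a hypothetical illustration: the `TitleRecord` class, the field names, the sample titles, and their simplified romanizations are all assumptions, not JSTOR's actual metadata format or chosen transliteration scheme.

```python
from dataclasses import dataclass


@dataclass
class TitleRecord:
    """Hypothetical metadata record holding both forms of a title."""
    unicode_title: str     # displayed to the reader in the original script
    transliteration: str   # ASCII form, searchable from a US keyboard


# Sample data for illustration; the romanizations are simplified and not
# tied to any particular standardized scheme.
RECORDS = [
    TitleRecord("Война и мир", "Voina i mir"),
    TitleRecord("Преступление и наказание", "Prestuplenie i nakazanie"),
]


def search(query, records=RECORDS):
    """Return records whose transliteration contains the ASCII query."""
    needle = query.casefold()
    return [r for r in records if needle in r.transliteration.casefold()]


if __name__ == "__main__":
    for hit in search("mir"):
        print(hit.unicode_title, "/", hit.transliteration)
```

The sketch also shows why consistency matters: a query only matches if the searcher and the indexer romanize the title the same way.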
Since adopting the dual Unicode/transliteration approach, JSTOR has developed processes for handling many ongoing challenges. During the pre-digitization preparation phase of a journal, the JSTOR Production Librarians review the journal's back run and identify any orthographic systems or characters not yet encountered that will appear in the JSTOR metadata. When these are located, we search through the extensive Unicode encoding charts to identify the character system if it is not obvious from the context in which the text appears. Then, we document the appropriate character system in our indexing guidelines. In cases of individual non-alphabetic characters, we also provide the precise Unicode values and transliterations to be captured in the metadata files.
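Looking up a character in the Unicode charts can be approximated in software. The sketch below uses Python's standard `unicodedata` module to report a character's code point and official name; taking the first word of the name as a guess at the writing system is a rough heuristic introduced here for illustration, not a documented JSTOR procedure.

```python
import unicodedata


def describe(char):
    """Return the code point (U+XXXX form) and official Unicode name of a character."""
    return f"U+{ord(char):04X}", unicodedata.name(char, "UNKNOWN")


def script_guess(char):
    """Heuristic: the first word of the character name often names the script."""
    return unicodedata.name(char, "UNKNOWN").split()[0]


if __name__ == "__main__":
    for ch in "بжñ":
        code_point, name = describe(ch)
        print(code_point, name, "->", script_guess(ch))
```

For the three sample characters this reports ARABIC, CYRILLIC, and LATIN. The heuristic is only a first pass, since a name prefix is not the same thing as the formal Unicode script property.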
However, when new-to-JSTOR orthographic systems are encountered, a more robust process kicks in. Once the orthographic system is identified, the Production Librarians contact JSTOR's Publisher Relations, Library Relations and User Services units for help in locating consultants within the JSTOR community who have the language expertise needed to select the most widely accepted and practical transliteration scheme. After consulting with these experts, we choose a transliteration scheme. This choice is captured in the indexing guidelines as early as possible so that we may ensure that our indexing vendors have the necessary skill sets available to them as they process these journals. Still, it is not unusual for JSTOR to field questions from indexing vendors who wish to ensure that they are employing Unicode standards and transliteration schemes in a manner consistent with JSTOR production standards. If JSTOR is unable to answer these questions with in-house staff, we turn to a variety of knowledgeable and helpful experts in the University of Michigan and Princeton University communities, where JSTOR's Production units are located.
Unexpected Outcomes: The Case of Alif
JSTOR's attention to detail also extends to our quality assurance checks on the post-digitization data that is returned from our indexing vendors. In cases where significant content has been identified in languages where we do not have in-house expertise, JSTOR hires consulting staff to review and revise both the Unicode encoding and the accompanying transliterations. In some cases, these reviews reveal unexpected challenges which then need to be addressed. This was the case for Alif: Journal of Comparative Poetics, a title in the Language & Literature Collection that contains many articles in Arabic. The review of Alif showed that abstracts written primarily in Arabic (a script that reads from right to left) contained discrete words in English or French (languages whose orthographic systems read from left to right). Web browsers therefore had difficulty determining the direction in which the text should be presented and read, and lines of text were being displayed out of order. A study of the Unicode standard yielded a solution in the form of "bidirectional" tags, which could be inserted into the metadata in order to force the text to display correctly.
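The source does not say which bidirectional controls were inserted, so the following is only a sketch of one possible approach: wrap a string in Unicode directional embedding characters according to whichever direction dominates its text. The function names and the sample string are assumptions made for illustration.

```python
import unicodedata

# Unicode directional formatting characters (explicit embeddings).
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

# Illustrative sample: an English word followed by Arabic text.
SAMPLE = "Alif: مجلة البلاغة المقارنة"


def dominant_direction(text):
    """Compare counts of strongly right-to-left and left-to-right characters."""
    rtl = sum(unicodedata.bidirectional(c) in ("R", "AL") for c in text)
    ltr = sum(unicodedata.bidirectional(c) == "L" for c in text)
    return "rtl" if rtl > ltr else "ltr"


def embed(text):
    """Wrap text so a renderer treats it as one run in its dominant direction."""
    opener = RLE if dominant_direction(text) == "rtl" else LRE
    return f"{opener}{text}{PDF}"


if __name__ == "__main__":
    marked = embed(SAMPLE)
    print(dominant_direction(SAMPLE), [f"U+{ord(c):04X}" for c in (marked[0], marked[-1])])
```

Without such a hint, a renderer may infer the base direction from the first strong character (here the Latin "A") and lay out a mostly Arabic line as left-to-right. Later versions of the Unicode standard also define directional isolate characters, which are generally recommended over embeddings for new work.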
Still, in other cases, the use of Unicode may not be practical or appropriate. As we prepare for the upcoming Music Collection, for example, we have learned that Unicode cannot be used to display more sophisticated musical notations, where various pitches are represented. In such cases, a standing team of staff representing the various units of the JSTOR organization considers the particular challenge at hand and recommends a strategy for research and resolution. Individual team members follow up with their units and with external contacts until a satisfactory resolution is identified and adopted. As the JSTOR archive grows and as the content in the archive becomes more complex, we will continue to strive to ensure that participants have an accessible product that is faithful to the original publication. No doubt, new and more difficult production challenges will arise as new collections are added and new disciplines included. However, we welcome those challenges in the anticipation that their resolution will yield a better resource for the libraries and scholars that have come to rely on the JSTOR archive.
For more information on character display within JSTOR, go to the JSTOR website at
http://www.jstor.org/help/generic.html#character, or email support@jstor.org.
Last updated on September 8, 2006
©2000-2007 JSTOR