Scholarly editing

Our digitization service provider, Editura GmbH, runs OCR on the page images and supplies us with XML files. More specifically, our project consistently follows the P5 Guidelines of the Text Encoding Initiative (TEI).

This allows us to guarantee platform-independent handling of the journal's enormous holdings. Encoding, storage, and lossless exchange of data are all standardized.

Markup Language

At first glance, both XML and the TEI standard resemble HTML, which is more familiar because every web page uses it. However, they are considerably more strictly regulated: XML files have to be well-formed and, where a schema applies, valid against it. Nonetheless, XML is extensible, which means that it does not consist of a predefined set of tags. The ways of extending it are strictly regulated and documented. The reason is that one is mostly dealing with data that will be handled by different people over a long period of time.
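The difference between well-formed and malformed XML can be illustrated with a minimal Python sketch. It uses only the standard library module xml.etree.ElementTree; the function name is_well_formed and the two sample strings are assumptions made for illustration and are not part of the project's tooling. The sketch checks well-formedness only; validation against a schema would require a separate validating tool.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text can be parsed as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Raised for unclosed tags, wrongly nested tags, and similar errors.
        return False


# Hypothetical samples: the second one closes its tags in the wrong order.
good = '<p>A <hi rend="roman">term</hi> in a paragraph.</p>'
bad = '<p>A <hi rend="roman">term</p></hi>'

print(is_well_formed(good))
print(is_well_formed(bad))
```

A browser would typically still display the second string if it were HTML; an XML parser rejects it outright, which is the stricter regulation described above.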

So-called ›tags‹ structure all relevant components of a text (e.g. titles and authors, but also paragraphs and page breaks) and thereby make them directly accessible for later use. Without these markers, the digitized text would be just a sequence of undifferentiated ›bits‹.
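How tags make the components of a text accessible can be sketched in a few lines of Python. The fragment below is invented for illustration and is not taken from the journal; it uses the TEI element names head, p and pb without a namespace, which is a simplification of real TEI files.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment (real TEI files declare a namespace).
sample = """<text>
  <head>On a new dye</head>
  <p>First paragraph.<pb n="12"/>Continued on the next page.</p>
  <p>Second paragraph.</p>
</text>"""

root = ET.fromstring(sample)

# Each component can be addressed directly by its tag.
title = root.findtext("head")
paragraphs = ["".join(p.itertext()) for p in root.iter("p")]
page_breaks = [pb.get("n") for pb in root.iter("pb")]

print(title)
print(len(paragraphs))
print(page_breaks)
```

Without the markup, the same text would offer no reliable way to tell where the title ends, where a paragraph begins, or where a new page starts.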

A markup language is a set of predefined rules for encoding text. Markup is always used to make explicit to the machine what is implicit to a person. Since markup is a metalanguage, it has to be distinguished from the ordinary language of the text. It also has to be defined which markup is valid at which point, which markup is required, and what it means. All this information is kept in an external file, the so-called ODD (›One Document Does it all‹).

An example from the ›Dingler‹

The »Polytechnische Journal« makes extensive use of emphasis for certain components of the text. In the first half of the nineteenth century, letter-spacing in particular was used for this purpose. There can be various reasons for doing so. More specifically, Dingler spaces out words or sentences in four different cases: 1) names of persons, 2) certain important terms, 3) internal as well as external links, and 4) words to be emphasized linguistically.

The automatically OCRed text only distinguishes between normal and wide letter-spacing. The specific function of the spacing cannot be determined automatically. In a second, manual step, it has to be decided why the editors spaced out a term or phrase in a given case.

Two examples illustrate this. Step 1 is the encoding supplied automatically by our digitization service provider, Editura GmbH. Step 2 has to be carried out by hand at a later time, because it requires a content-based decision:

1st example: terms in a foreign language

Step 1: <hi rend="roman">Rouge vegetal</hi>

Step 2: <term xml:lang="fr">Rouge vegetal</term>

 
2nd example: spaced-out names of persons

Step 1: <hi rend="roman wide-spaced">Anna</hi>

Step 2: <name type="person" rend="roman wide-spaced">Anna</name>
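The two examples can be mirrored in a short Python sketch. It assumes that the editor's content-based decision has already been made by hand and is passed in as a small dictionary; the function name apply_decision and the dictionary keys kind and lang are hypothetical and do not describe the project's actual workflow. The sketch only carries out the mechanical rewriting from the step 1 encoding to the step 2 encoding.

```python
import xml.etree.ElementTree as ET

# Namespace that ElementTree uses internally for attributes such as xml:lang.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"


def apply_decision(step1_xml: str, decision: dict) -> ET.Element:
    """Turn a step-1 <hi> element into its step-2 form.

    `decision` is a hypothetical record of the editor's manual judgment.
    """
    element = ET.fromstring(step1_xml)
    if decision["kind"] == "foreign_term":
        # As in the 1st example: <hi rend="..."> becomes <term xml:lang="...">.
        element.tag = "term"
        element.attrib.pop("rend", None)
        element.set(XML_NS + "lang", decision["lang"])
    elif decision["kind"] == "person":
        # As in the 2nd example: the rend attribute is kept, type is added.
        element.tag = "name"
        element.set("type", "person")
    else:
        raise ValueError("unknown decision kind: " + repr(decision["kind"]))
    return element


term = apply_decision('<hi rend="roman">Rouge vegetal</hi>',
                      {"kind": "foreign_term", "lang": "fr"})
name = apply_decision('<hi rend="roman wide-spaced">Anna</hi>',
                      {"kind": "person"})

print(ET.tostring(term, encoding="unicode"))
print(ET.tostring(name, encoding="unicode"))
```

The part that cannot be automated is the content of the dictionary: whether a spaced-out word is a name, a term, a link, or plain emphasis remains an editorial judgment.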