Contributed by Jonathan Pool on 2010-02-03.
As of this writing, the PanLex database contains asserted lexical facts acquired from about 600 sources, mainly machine-readable bi- and multilingual dictionaries, vocabularies, thesauri, and standards.
Individual editors have harvested the assertions from the sources, in some cases with item-by-item inspection and judgment, and in other cases with ad-hoc rules encoded into scripts.
In principle, assertions from a machine-readable lexical resource could be harvested with a simple script that merely converts one format to another. A few (roughly 5%) of the resources are processable in this way.
However, the vast majority of resources are difficult to process, mainly because their formats have one or more problematic properties.
I summarize here some recurrent obstacles to the straightforward harvesting of assertions:
Source-target delimitation. Some resources append translation targets to translation sources without delimitation, requiring it to be inferred.
Lemma series delimitation. In many entries, resources translate a single source lemma into multiple target expressions but do not clearly delimit the target expressions. A common delimiter is the comma, but many resources use this punctuation mark both as an inter-lemma delimiter and as an intra-lemma character. For example, an entry in HanDeDict contains the target list “Kreditkartensklave, jemand, der auf Kosten seiner Kreditkarte lebt”, in which the first comma separates two translations while the second comma is a punctuation mark inside the second translation.
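The comma ambiguity can be illustrated with a minimal Python sketch. The function names and the word-count threshold are hypothetical and are not part of any PanLex tool. The sketch shows how a naive comma split over-segments the HanDeDict example, and how a crude heuristic might flag such entries for manual inspection instead of trusting the split.

```python
def naive_split(targets: str) -> list[str]:
    """Split a target list on every comma, ignoring intra-lemma commas."""
    return [part.strip() for part in targets.split(",") if part.strip()]


def needs_review(targets: str, max_words: int = 3) -> bool:
    """Flag a target list if any comma-separated piece is suspiciously long.

    Assumption: a piece with more than `max_words` words is likely a clause
    or a definition, so its commas may not be delimiters at all.
    """
    return any(len(part.split()) > max_words for part in naive_split(targets))


example = "Kreditkartensklave, jemand, der auf Kosten seiner Kreditkarte lebt"
parts = naive_split(example)
# The naive split yields three pieces, although the entry has only two translations.
print(parts)
print(needs_review(example))
```

A heuristic of this kind only identifies entries that need judgment; it does not decide where the true boundaries lie.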
Meaning distinction. When a resource translates a lemma into multiple target expressions, it often doesn’t consistently indicate whether the translations are synonymous. Some resources separate synonyms with commas and distinct meanings with semicolons, but most apply no such rule consistently.
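For the minority of resources that do follow the semicolon-and-comma convention consistently, the grouping can be sketched as follows. This is an illustrative sketch only: the function name and the sample entry are assumptions, and the code would misgroup any resource that applies the convention inconsistently.

```python
def group_meanings(targets: str) -> list[list[str]]:
    """Group target expressions into meanings.

    Assumption: semicolons separate distinct meanings and commas separate
    synonyms within one meaning.
    """
    meanings = []
    for chunk in targets.split(";"):
        synonyms = [s.strip() for s in chunk.split(",") if s.strip()]
        if synonyms:
            meanings.append(synonyms)
    return meanings


# Hypothetical entry: two meanings, the first with two synonyms.
print(group_meanings("bank, shore; bench"))
```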
Attribute specification. Very few sources explicitly name the attributes whose values they provide. For example, after a lemma there may be a parenthesized string, and from entry to entry its attribute may silently vary, being a word class (part of speech) here, a domain descriptor there, and an argument type elsewhere.
PDF corruption. Many sources are available only as PDF files that are difficult to convert to plain text, because they are bitmap page images, because they are laid out in visual rather than logical order (e.g., crossing column boundaries on each line), or because, when exported as HTML or XML, they become corrupted with extra spaces, deleted spaces, character-order reversals, and other errors.
Character encoding. Many resources rely on fonts with non-Unicode encodings. Only some such fonts have been equipped with Unicode conversion maps.
Nonlemmatic case. In languages with letter case, some resources use nonlemmatic case conventions, such as using entirely upper-case letters or beginning every lemma with an upper-case letter, thus eliminating case-indicated distinctions (such as “turkey” versus “Turkey”).
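Lost case distinctions cannot be restored mechanically, but a script can at least detect that a resource appears to follow such a convention. The following Python sketch is a hypothetical check, not an existing PanLex routine, and its labels are invented for illustration.

```python
def case_convention(headwords: list[str]) -> str:
    """Guess the case convention used by a list of headwords."""
    # Ignore headwords that contain no letters at all.
    cased = [h for h in headwords if any(c.isalpha() for c in h)]
    if cased and all(h.isupper() for h in cased):
        return "all_upper"
    if cased and all(h[0].isupper() for h in cased):
        return "initial_upper"
    return "mixed_or_lemmatic"


print(case_convention(["TURKEY", "APPLE"]))
print(case_convention(["Turkey", "Apple"]))
print(case_convention(["turkey", "Turkey"]))
```

The result is only a signal for review: in German, for example, nouns are legitimately capitalized, so a noun-only list would be reported as "initial_upper" even though its case is lemmatic.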
Nonlemmatic inflection. Some resources, such as library subject-heading lists and thesauri, pluralize some lemmata, particularly count nouns.
Arguments. Resources vary in their treatment of arguments. A lemma may be realized in several different ways in different resources (e.g., “find fault” versus “find fault with” versus “find fault with sb” versus “find fault w. sb” versus “find fault with someone”).
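One way to reduce this variation is to expand known abbreviations and strip trailing argument placeholders. The sketch below is a minimal illustration in Python. The abbreviation table and placeholder set are assumptions chosen to cover the example above, and a real resource would need its own lists.

```python
# Hypothetical tables; each resource would require its own.
ABBREVIATIONS = {"w.": "with"}
PLACEHOLDERS = {"sb", "sb.", "someone", "somebody", "sth", "sth.", "something"}


def normalize_arguments(lemma: str) -> str:
    """Expand abbreviations, then drop trailing argument placeholders."""
    tokens = [ABBREVIATIONS.get(token, token) for token in lemma.split()]
    while tokens and tokens[-1].lower() in PLACEHOLDERS:
        tokens.pop()
    return " ".join(tokens)


for variant in ("find fault with sb", "find fault w. sb", "find fault with someone"):
    print(normalize_arguments(variant))
```

This maps three of the variants to “find fault with”, but it cannot tell that “find fault” denotes the same lemma without its preposition; that decision still requires lexical knowledge.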
Ellipses. Lemmata that contain separated elements, such as “both … and”, are composed variously.
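If the variation is limited to how the gap itself is written, a small normalization step may suffice. The Python sketch below assumes, hypothetically, that the variants are three periods or the single ellipsis character, with or without surrounding spaces; resources that mark the gap in other ways would not be covered.

```python
import re

# Three ASCII periods or the single ellipsis character, with optional spaces.
_ELLIPSIS = re.compile(r"\s*(?:\.{3}|…)\s*")


def normalize_ellipsis(lemma: str) -> str:
    """Rewrite every gap marker as a spaced ellipsis character."""
    return _ELLIPSIS.sub(" … ", lemma).strip()


for variant in ("both ... and", "both…and", "both … and"):
    print(normalize_ellipsis(variant))
```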
Personal names. Resources vary in their treatment of personal names, in some cases inverting surnames and forenames.
Definitions. PanLex distinguishes lemmata from definitions. Resources often don’t. One entry may have a lemmatic target while the next entry’s target is a long description in the target language of the meaning of the source lemma. Resources also rarely distinguish thorough from underspecified (e.g., “a kind of mushroom”) definitions. Finally, translations often combine lemmata and definitions in various ways, such as “trance (typ., induced by hashish during funerals)”.