Lecture Summary (01/23/2007)

This time the topic was "Computational Lexicography".

The lecture started with a review of lexicographic principles. A good lexicon needs a certain quantity, meaning completeness of coverage. This includes the number of entries (extensional coverage) and the number of types of lexical information (intensional coverage). Another aspect is, of course, the quality of the lexicon. The information given has to be correct (types of lexical information) and the structure must be consistent (macrostructure, microstructure, mesostructure).
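To make the two kinds of coverage concrete, here is a minimal sketch in Python. It was not part of the lecture; the toy lexicon, its field names and the two function names are assumptions made only for illustration.

```python
# Hypothetical toy lexicon: headword -> {type of lexical information: value}.
# The entries and field names are invented for this illustration.
lexicon = {
    "football": {"pos": "noun", "definition": "a ball game"},
    "like": {"pos": "verb"},
}


def extensional_coverage(lex):
    """Number of entries (headwords) in the lexicon."""
    return len(lex)


def intensional_coverage(lex):
    """Number of distinct types of lexical information used in any entry."""
    return len({info_type for entry in lex.values() for info_type in entry})


print(extensional_coverage(lexicon), intensional_coverage(lexicon))
```

In this toy lexicon the entry "like" lacks a definition, which is the kind of gap an internal consistency check would report.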

After that, Mr Gibbon showed us a model to illustrate the so-called lexicographic workflow cycle.

The process starts with the data acquisition:

  • Recordings
  • Text collection
  • Concordance
  • Dictionaries
  • ...

The next step is the lexicon construction:

  • Metadata
  • Information retrieval
  • Linguistic analyses

The cycle continues with the access to data:

  • Traditional print media
  • Hyperlexicon: CD, internet
  • Software with lexicon components: word processing; speech processing

Finally the lexical evaluation takes place:

  • Internal: consistency; completeness
  • External: utility for the users

Then we took a closer look at lexical data acquisition, especially at concordances. Mr Gibbon introduced the term KWIC (KeyWord In Context) concordance and called it a special kind of preliminary, corpus-based dictionary. In a KWIC concordance, each word in a text corpus is paired with its contexts of occurrence in this corpus. A short example of a KWIC concordance with right-hand contexts: in the (very small) text corpus "I like football", the word "I" would be paired with "like", and "like" would be paired with "football". Of course, Mr Gibbon chose a more complex example to illustrate this concept: he showed us an extract from "Notes from a Small Island" by Bill Bryson, followed by the corresponding KWIC concordance. We also learned that Google can be regarded as a special form of KWIC concordance.
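The small "I like football" example can be reproduced with a few lines of Python. This is only a sketch: the function name is an assumption, and the right-hand context is limited to the single following word, as in the example.

```python
def right_context_pairs(text):
    """Pair each word with the word that follows it (its right-hand context)."""
    tokens = text.split()
    # zip stops at the shorter sequence, so the last word gets no pair.
    return list(zip(tokens, tokens[1:]))


pairs = right_context_pairs("I like football")
print(pairs)  # [('I', 'like'), ('like', 'football')]
```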

The process of creating a KWIC concordance basically consists of six steps:

  1. Corpus creation: creating a corpus of texts in electronic format
  2. Tokenisation: removing punctuation marks, converting capital letters to lower case and breaking the text into context units
  3. Keyword list extraction: listing all words of the corpus alphabetically and removing duplicate words
  4. Context collation: pairing the keywords with their respective left and right contexts
  5. Search: searching for a keyword in context in the corpus
  6. Output format
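
The six steps above can be sketched in Python as follows. This is not the implementation shown in the lecture. The function names, the context width of two words, the bracketed output layout and the mini corpus are all assumptions, and for simplicity the whole corpus is treated as one context unit, so contexts may cross sentence boundaries.

```python
import re
from collections import defaultdict


def tokenise(text):
    """Step 2: lower-case the text, drop punctuation, split into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_list(tokens):
    """Step 3: alphabetical list of the words, duplicates removed."""
    return sorted(set(tokens))


def collate_contexts(tokens, width=2):
    """Step 4: map each keyword to its (left context, right context) pairs."""
    concordance = defaultdict(list)
    for i, token in enumerate(tokens):
        left = " ".join(tokens[max(0, i - width):i])
        right = " ".join(tokens[i + 1:i + 1 + width])
        concordance[token].append((left, right))
    return concordance


def search(concordance, keyword):
    """Step 5: return all contexts of one keyword (empty list if absent)."""
    return concordance.get(keyword.lower(), [])


def format_kwic(keyword, contexts, width=20):
    """Step 6: one line per occurrence, left context right-aligned."""
    return [f"{left:>{width}} [{keyword}] {right}" for left, right in contexts]


# Step 1: a hypothetical mini corpus in electronic form.
corpus = "I like football. You like tennis."
tokens = tokenise(corpus)
concordance = collate_contexts(tokens)
for line in format_kwic("like", search(concordance, "like")):
    print(line)
```

Building the whole concordance once in step 4 makes each later search a simple lookup, which fits the idea of collecting all information about a word in one place.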

Afterwards, Mr Gibbon showed us the process of computing a KWIC concordance. Finally, we found out that KWIC concordances make the search for lexical information more efficient by putting the information about each word in one place.

