WISPR - Welsh and Irish Speech Processing Resources
Welsh and Irish Speech Processing Resources


Optese: Optimal Text Selection

Author: Ivan A. Uemlianin
Contact: i.uemlianin@bangor.ac.uk
Copyright: 2005, University of Wales, Bangor



This software was produced as part of the WISPR project funded by Interreg IIIA European Union Programme and the Welsh Language Board.


Optese implements the 'greedy-set-cover' algorithm from Cormen et al. (1990) [CORMEN90] for optimal text selection. Given a list of sentences in two formats - orthographic and phonemic - optese will return a minimal list of sentences (orthographic) which contains all of the diphones in the input list.

Set up


optese.tar.gz includes all files, and can be downloaded from here. The individual components can also be downloaded from here.


As long as you have python, there is no installation necessary. The script optese.py should just run as it is.


Calling optese.py -h will output this brief usage reminder:

* Optese: Optimal Text Selection

** Usage: optese.py (options) (input_prefix)

-c fn: name of config file containing parameters
-c: print default parameters
-h: print this message and exit

-f n: min frequency for input units
-e n: min examples to collect of each unit
-g fn: got these diphones from earlier iteration

-v: verbose mode
-o fn: prefix for output filenames
-p: give phonetised text output (as well as orthography)
-n: run not_covered.py

** Default values for Optese parameters:

min_examples_to_get = 1 # Get at least this many of every unit
min_coverage = 1 # Gather only units which have
# at least this many examples
get_not_covered = True # Do produce 'units not covered' report
phon_out = True # Do output phonetised text
verbose = True # Do print progress indicators
in_prefix = corpus
phon_in_fn = corpus_phon.txt
orth_in_fn = corpus_orth.txt
out_prefix = subCorpus
phon_out_fn = subCorpus_phon.txt
orth_out_fn = subCorpus_orth.txt
unit_out_fn = subCorpus_units.txt

Calling optese.py -h also shows the format necessary for config files: the general idea is one 'name = value' statement per line, and anything after a '#' is ignored. The statements should follow Python syntax: for example, strings should be quoted.

When optese.py is called without arguments it will look for the files corpus_orth.txt (list of sentences) and corpus_phon.txt (sentences in 'phonetized' form: see below) in the current directory. It will out put progress indicators looking something like this:

$ ../optese.py
Reading corpus ... Done: read in 9321 sentences.
Running OTS ...

1 next: 3893 len(slist): 9320 len(to_remove): 302
2 next: 8630 len(slist): 9319 len(to_remove): 189
3 next: 5964 len(slist): 9318 len(to_remove): 169
4 next: 8655 len(slist): 9317 len(to_remove): 131
5 next: 1336 len(slist): 9316 len(to_remove): 124


1993 next: 9179 len(slist): 7 len(to_remove): 1
1994 next: 2320 len(slist): 6 len(to_remove): 1
1995 next: 8120 len(slist): 5 len(to_remove): 1
1996 next: 4689 len(slist): 4 len(to_remove): 1
Done: found 9268 diphones; using 1996 sentences.
Getting not covered ... Done.

Depending on your machine, a corpus of 9321 sentences should be processed in a few minutes.

The output will comprise these files:

List of selected sentences (orthographic).
List of selected sentences (phonemic).
List of chosen units, with their counts. The line format is count unit, as in: 61   #1@-ng.
List of units not found in the input. As this list is likely to be quite large, it is formatted in html for ease of navigation. A table of contents lists 'basic' (i.e., context-independent) monophones and diphones. The body of the report is two tables showing, for each 'basic' unit, which context-dependent units are missing. Anything with more than half of its context-dependent units missing is highlighted in red in both the toc and the body of the report. An example is here.

Format of 'phonetized' sentences

Here is an example of an orthographic sentence (of Welsh) and its phonemic representation:

Ewch i fyny, ac fei cewch ar unwaith.


As well as the phones themselves there are:

  • Stress markers: numbers 1-4, before vowels;
  • Word separators: '#';
  • Phrase separators: '/';
  • Utterance boundaries: '$'.


(This section may not be very readable;)

command-line user interface


  • orth text input
  • phon text input
  • min freq for input units
  • min examples to collect of each unit


  • verbose/quiet
  • orth text output
  • phon text output
  • got these units
  • haven't got these units - inc. units which were present but below inclusion threshold?


  • min_examples_to_get > 1


  • definitions: - naive/generalised units (old-style optese) - context-sensitive units (hpi)
  • optese
  • not_covered


[CORMEN90]Cormen, Thomas, H., Charles E. Leiserson, and Ronald L. Rivest. (1990). Introduction to Algorithms. Cambridge, Ma.: MIT Press.