WISPR - Welsh and Irish Speech Processing Resources


SpeechCluster README

Author: Ivan A. Uemlianin
Contact: i.uemlianin@bangor.ac.uk
Copyright: 2005, University of Wales, Bangor
Date: 2005/12/16



This work was carried out as part of the project 'Welsh and Irish Speech Processing Resources' (WISPR; Prys et al., 2004). WISPR is funded by the Interreg IIIA European Union Programme and the Welsh Language Board. I should also like to acknowledge support and feedback from other members of the WISPR team, in particular Briony Williams and Aine Ni Bhrian.


Speech technology research & development includes a lot of very exciting challenges. It also includes a lot of tedious and repetitive data manipulation. Like most people, I write scripts to help me process the data. As the scripts are reused and extended, I continuously refactor them, and SpeechCluster and its associated tools are the result.

There is already a wide range of free software tools available to help the speech technologist in their work. Unfortunately, although each tool can perform its task extremely well, these tools are often not as good at talking to each other. As well as alleviating repetitive tasks, SpeechCluster can act as an abstraction layer above these various tools. I have found SpeechCluster useful when working with the following:

Title         Address
Emu           http://emu.sourceforge.net/
Festival      http://www.cstr.ed.ac.uk/projects/festival/
Festvox       http://festvox.org
HTK           http://htk.eng.cam.ac.uk/
Praat         http://www.fon.hum.uva.nl/praat/
Sphinx/Train  http://cmusphinx.sourceforge.net/html/cmusphinx.php

I find the scripts very helpful in my work. They are available here under a BSD-type open-source licence. If you find them useful, if you write a script that uses SpeechCluster, or if you think of a useful extension, please let me know.

If you use SpeechCluster or its associated tools in published research, please use this citation in your references section:

Uemlianin, I. A. (2005). "SpeechCluster: A speech database builder's multitool." Lesser Used Languages & Computer Linguistics proceedings. European Academy, Bozen/Bolzano, Italy.


SpeechCluster and the associated command-line tools are all written in Python. Most Linux distributions come with Python as standard. If Python is not installed on your system, it is very simple to install (especially on Windows and MacOS X systems). The Python home page is here: http://www.python.org.

That is the only requirement.

Note: at the moment, the only audio format supported by SpeechCluster is Microsoft's RIFF format (.wav). Furthermore, SpeechCluster assumes audio signals are mono. The Unix utility sox is therefore useful for converting other audio into mono wav files; most Linux distributions include it. The sox homepage is: http://sox.sourceforge.net.
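A quick way to check whether a file meets the mono requirement before handing it to the tools is sketched below. It uses only Python's standard wave module; the function name is_mono_wav is made up for this example and is not part of SpeechCluster.

```python
import wave


def is_mono_wav(path):
    """Return True if the RIFF/WAVE file at `path` has exactly one channel.

    Hypothetical helper for illustration only; not part of SpeechCluster.
    """
    with wave.open(path, "rb") as wav:
        return wav.getnchannels() == 1
```

A stereo file can then be mixed down with sox, for example with something like `sox stereo.wav -c 1 mono.wav` (check the sox documentation for your version).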


SpeechCluster.tar.gz includes SpeechCluster.py and all the command-line tools below. It can be downloaded from here. The individual tools can be downloaded from here.

Further work

There is no development plan as such (at all, in fact): SpeechCluster accompanies me in my work and develops accordingly. Having said that, here are some of the things I am 'looking forward to':

  • supporting more audio formats, particularly those not supported by sox (e.g., esps' audio format);
  • having really comprehensive (rather than ad hoc) coverage of the formats currently supported;
  • developing a 'Corpus' class, to handle bodies of data (this will probably include interfacing with Emu);
  • improving SpeechCluster's interactions with PyHTK.


SpeechCluster.py is a Python module containing some object classes - Segment, SegmentationTier and SpeechCluster - which represent speech segmentation and the associated audio. SpeechCluster can read and write a number of label file formats, and wav format audio. The command-line tools below cover the most common use-cases (so far).

Supported label formats include:

Format     As used by                        SpeechCluster name
.TextGrid  Praat                             TextGrid
.lab       Emu, Festival                     lab, esps
.lab       HTK (n.b.: different from above)  htk-lab
.mlf       HTK                               htk-mlf
.txt       HTK                               htk-grm

SpeechCluster and the tools below can read/write/convert any of these formats in any direction.

Command-line tools

These tools can be used as they are, or they can be taken as example use-cases for SpeechCluster. If none of these tools fits your exact requirements, you may be able to change the code of the nearest fit, or even to write your own tool.


segFake.py does 'fake autosegmentation' of a speech audio file. At the moment it assumes one utterance per file, with bounding silences. segFake detects utterance onset and offset, and spreads the given labels evenly over the intervening time.

The chances of getting any label boundary correct are of course virtually zero, but I have found it quicker and easier to correct one of these than to start labelling from scratch. Correcting a 'fake' transcription is also less error-prone, as the labels to use are already provided and don't need to be specified by the user.
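The 'spreading' step can be pictured with the short sketch below. It is not segFake's actual code: the function name spread_labels and the (label, start, end) tuple layout are assumptions made for illustration, and the onset and offset times are taken as given rather than detected from the audio.

```python
def spread_labels(labels, onset, offset):
    """Divide the interval [onset, offset] equally among the labels.

    Returns a list of (label, start, end) tuples, times in seconds.
    Illustrative sketch only; segFake's real implementation may differ.
    """
    if not labels:
        return []
    if offset <= onset:
        raise ValueError("offset must be later than onset")
    step = (offset - onset) / len(labels)
    return [
        (label, onset + i * step, onset + (i + 1) * step)
        for i, label in enumerate(labels)
    ]
```

For example, spreading the labels m, ai, hh over an utterance running from 1.0s to 2.5s gives each label a 0.5s interval.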


segFake.py -f <filename> -o (TextGrid | esps | htklab ) <phones>

e.g.: segFake.py -f amser012.wav -o TextGrid m ai hh ii n y n j o n b y m m y n y d w e d i yy n y b o r e

segFake.py -d <dirname> -t <transcription filename> -o (TextGrid | esps | htklab)

e.g.: segFake.py -d wav -t trans.txt -o TextGrid

Transcription files should be one transcription per line, of the form:

(amser012 "m ai hh i n y n j o n b y m m y n y d w e d i yy n @ b o r e.")

n.b.: This format is based on the format of Festival's prompt files. The quotes around the transcription are necessary, but the final punctuation is not. In fact, segFake strips off the final punctuation mark (if it's there) before segmenting.
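For anyone writing their own tool around this transcription format, a line can be parsed roughly as follows. This is a sketch and not the parser segFake uses: the name parse_prompt_line and the exact set of punctuation marks stripped are assumptions.

```python
import re

# One transcription per line: (name "label label label ...")
_PROMPT_LINE = re.compile(r'^\(\s*(\S+)\s+"(.*)"\s*\)\s*$')


def parse_prompt_line(line):
    """Split a Festival-style prompt line into (name, list of labels).

    A single trailing punctuation mark inside the quotes is dropped.
    Hypothetical helper; the punctuation set below is an assumption.
    """
    match = _PROMPT_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a valid transcription line: %r" % line)
    name, text = match.groups()
    text = text.strip()
    if text and text[-1] in ".,;:!?":
        text = text[:-1]
    return name, text.split()
```

Lines without the surrounding quotes are rejected with a ValueError, in keeping with the note above that the quotes are necessary.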


segInter.py interpolates labels into a segmented but unlabelled segment tier [in Praat]. For example, you label a file phonemically, mark the word boundaries but don't type in the words themselves. If you have the text available, you can use segInter to fill in the word tier. This can save you a lot of typing and fiddling about.


segInter.py [-l <level>] -f <label filename> <labels>

e.g.: segInter.py -f amser035 Mae hi ychydig wedi chwarter i hanner nos

segInter.py [-l <level>] -d <dir> -i <transcription filename>

e.g.: segInter.py -d lab -i amser.data


  • only TextGrid label format is supported

  • the default level/tierName is 'Word'

  • labels do not have to be quoted on the command-line

  • transcription files should be one transcription per line, of the form:

    (amser012 "m ai hh i n y n j o n b y m m y n y d w e d i yy n @ b o r e.")

    n.b.: This format is based on the format of Festival's prompt files. The quotes around the transcription are necessary, but the final punctuation is not. In fact, segFake will strip off the final punctuation mark (if it's there) before segmenting.

  • segInter assumes that the first and last segments in the TextGrid are silence, and adds Word-level silences accordingly (i.e. you don't have to specify them explicitly).
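The interpolation, including the implicit first and last silences, can be sketched as below. This is an illustration under stated assumptions, not segInter's code: the function name, the (start, end) interval tuples, and the silence label 'sil' are all hypothetical choices.

```python
def interpolate_labels(intervals, words, silence="sil"):
    """Attach labels to an unlabelled tier of (start, end) intervals.

    The first and last intervals are assumed to be silence, so only the
    words in between need to be supplied. Returns (label, start, end)
    tuples. Illustrative sketch only.
    """
    labels = [silence] + list(words) + [silence]
    if len(labels) != len(intervals):
        raise ValueError(
            "expected %d words for %d intervals, got %d"
            % (len(intervals) - 2, len(intervals), len(words))
        )
    return [(label, start, end) for label, (start, end) in zip(labels, intervals)]
```

Raising an error on a count mismatch is a design choice for this sketch; it catches a missing or extra word boundary before any labels are written.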


segMerge.py merges label files into one multi-tiered label file, for example to compare different segmentations of a speech file.

n.b.: Currently only works on TextGrids (and takes the first tier of multi-tiered TextGrids).


segMerge.py <fn1> <tierName> <fn2> <tierName> <fn3> <tierName> ...

for example:

segMerge.py eg1.TextGrid Me eg2.TextGrid Them eg3.TextGrid Fake ...


segReplace is a label converter: it replaces the labels inside label files according to a user-supplied mapping.


segReplace -r <replaceDict filename> <segfilename>
segReplace -r <replaceDict filename> -d <dirname>

n.b.: segReplace changes labels in place, so keep a back-up of your old versions!

ReplaceDict Format

The replaceDict file should contain the following (a python dictionary):

replaceDict = {'oldLabel1': 'newLabel1',
               'oldLabel2': 'newLabel2',
               'oldLabel3': 'newLabel3',
               'oldLabel4': 'newLabel4',
               ... }


  • Quote marks are required;
  • If an oldLabel has '!!merge' as its newLabel, segments with that label are merged with the previous segment (i.e., the segment is removed, and the previous label's end time is extended).
  • oldLabels can be longer than a single label. Currently they can be no longer than two labels, e.g., 't sh' --> 'ch'.
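The three rules above can be sketched on a plain list of (label, start, end) tuples as follows. This is not segReplace's own code: the function name and tuple layout are assumptions, and checking two-label keys before single-label keys is a choice made for this sketch.

```python
MERGE = "!!merge"


def apply_replacements(segments, replace_dict):
    """Return a new segment list with labels replaced per replace_dict.

    segments: time-ordered list of (label, start, end) tuples.
    Two-label keys such as 't sh' are tried before single labels.
    A '!!merge' replacement extends the previous segment's end time.
    Illustrative sketch only.
    """
    result = []
    i = 0
    while i < len(segments):
        label, start, end = segments[i]
        new_label, consumed = None, 1
        if i + 1 < len(segments):
            next_label, _, next_end = segments[i + 1]
            pair = label + " " + next_label
            if pair in replace_dict:
                new_label, end, consumed = replace_dict[pair], next_end, 2
        if new_label is None:
            new_label = replace_dict.get(label, label)
        if new_label == MERGE:
            if result:
                prev_label, prev_start, _ = result[-1]
                result[-1] = (prev_label, prev_start, end)
            else:
                # Nothing to merge into: keep the segment unchanged.
                result.append((label, start, end))
        else:
            result.append((new_label, start, end))
        i += consumed
    return result
```

Unlike segReplace, this sketch returns a new list rather than changing anything in place, so the original labels are left untouched.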


segSwitch.py converts between label file formats, either on single files, or a directory at a time.


segSwitch -i <infilename> -o <outfilename>
segSwitch -i <infilestem>.mlf -o <outFormat>
segSwitch -d <dirname> -o <outFormat>

Formats supported

Format                 File Extension(s)
esps                   .esps, .lab, .seg
Praat TextGrid         .TextGrid
htk label file         .htk-lab
htk master label file  .htk-mlf
htk transcription      .htk-grm

n.b.: currently, segSwitch will only convert into, not out of, htk-grm format.


splitAll.py takes a directory full of paired speech audio and label files (e.g., wav and TextGrid), and splits each wave/labelfile pair into paired subsections, according to various split parameters such as number of units or silence (where "units" can be phones, words, silences, etc.).


splitAll.py -n <integer> -t <tierName> [-l <label>] inDir outDir

inDir should contain pairs of speech audio and label files (e.g., wav and TextGrid). splitAll will split each pair into shorter paired segments, based on the parameters given.


splitAll.py -n 5 -t Phone in/ out/          # into 5 phone chunks

splitAll.py -n 1 -t Word in/ out/           # by each word

splitAll.py -n 1 -t Phone -l sil in/ out/   # by each silence

splitAll.py -n 5 -t Second in/ out/         # into 5 sec chunks
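The 'n units per chunk' case can be pictured with the sketch below, which works out the time span of each chunk from a tier of (label, start, end) tuples. It is an illustration only: the function name is hypothetical, and how splitAll itself treats a short final chunk or the -l option is not covered here.

```python
def chunk_spans(segments, n):
    """Group consecutive segments into chunks of n units.

    segments: time-ordered list of (label, start, end) tuples.
    Returns one (start, end) span per chunk; the last chunk may hold
    fewer than n units. Illustrative sketch only.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    spans = []
    for i in range(0, len(segments), n):
        chunk = segments[i:i + n]
        spans.append((chunk[0][1], chunk[-1][2]))
    return spans
```

Each span could then be used to cut the matching stretch out of both the wav file and the label file, which keeps the pairs aligned.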


segDiBo adds explicit diphone boundaries to label files, ready for use in Festival diphone synthesis. It also outputs pitchmark (pm) files. segDiBo'd label files (fstem_dibo.ext) and pm files are output into the given data directory.


segDiBo.py -d <dataDirectory>


trim.py trims leading and trailing silence from wav files, and adjusts any associated label files accordingly.


trim.py -p 1.5 example.wav   # trims example.wav leaving 1.5s padding
trim.py -p 1.5 example       # as above, adjusts any seg files found too

trim.py -d testdir           # trims all files in testdir, including any seg files,
                             # leaving .5s padding
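The arithmetic behind trimming with padding, and the matching adjustment of label times, can be sketched as follows. This is not trim.py's code: both function names are hypothetical, the speech onset and offset are taken as given, and dropping labels that fall wholly outside the kept region is an assumption.

```python
def trim_bounds(onset, offset, duration, padding=0.5):
    """Return the (start, end) of the region to keep, in seconds.

    Keeps `padding` seconds of silence either side of the speech,
    without running past the ends of the file.
    """
    start = max(0.0, onset - padding)
    end = min(duration, offset + padding)
    return start, end


def shift_segments(segments, start, end):
    """Re-time (label, start, end) segments for a file cut to [start, end].

    Segments are clipped to the kept region and shifted so the new file
    begins at 0.0; segments left with no duration are dropped.
    """
    shifted = []
    for label, seg_start, seg_end in segments:
        new_start = max(seg_start, start) - start
        new_end = min(seg_end, end) - start
        if new_end > new_start:
            shifted.append((label, new_start, new_end))
    return shifted
```

With speech from 2.0s to 5.0s in a 10s file and 1.5s padding, the kept region runs from 0.5s to 6.5s, and every label time moves 0.5s earlier.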


Prys, Delyth, Briony Williams, Bill Hicks, Dewi Jones, Ailbhe Ní Chasaide, Christer Gobl, Julie Berndsen, Fred Cummins, Máire Ní Chiosáin, John McKenna, Rónán Scaife, Elaine Uí Dhonnchadha (2004). WISPR: Speech Processing Resources for Welsh and Irish. Pre-Conference Workshop on First Steps for Language Documentation of Minority Languages, 4th Language Resources and Evaluation Conference (LREC), Lisbon, Portugal, 24-30 May 2004.

Uemlianin, Ivan A. (2005). SpeechCluster: A speech database builder's multitool. Lesser Used Languages & Computer Linguistics proceedings. European Academy, Bozen/Bolzano, Italy.