Welsh and Irish Speech Processing Resources

Author: Ivan A. Uemlianin
Contact: i.uemlianin@bangor.ac.uk
Copyright: 2005, University of Wales, Bangor
Date: 2005/12/16
This work was carried out as part of the project 'Welsh and Irish Speech Processing Resources' (WISPR; Prys et al., 2005). WISPR is funded by the Interreg IIIA European Union Programme and the Welsh Language Board. I should also like to acknowledge support and feedback from other members of the WISPR team, in particular Briony Williams and Aine Ni Bhrian.
Speech technology research & development includes a lot of very exciting challenges. It also includes a lot of tedious and repetitive data manipulation. Like most people, I write scripts to help me process the data. As the scripts are reused and extended, I continuously refactor them, and SpeechCluster and its associated tools are the result.
There is already a wide range of free software tools available to help the speech technologist in their work. Unfortunately, although each tool can perform its task extremely well, these tools are often not as good at talking to each other. As well as alleviating repetitive tasks, SpeechCluster can act as an abstraction layer above these various tools. I have found SpeechCluster useful when working with the following:
Title | Address |
---|---|
Emu | http://emu.sourceforge.net/ |
Festival | http://www.cstr.ed.ac.uk/projects/festival/ |
Festvox | http://festvox.org |
HTK | http://htk.eng.cam.ac.uk/ |
Praat | http://www.fon.hum.uva.nl/praat/ |
Sphinx/Train | http://cmusphinx.sourceforge.net/html/cmusphinx.php |
I find the scripts very helpful in my work. They are available here under a BSD-type open-source licence. If you find them useful, if you write a script that uses SpeechCluster, or if you think of a useful extension, please let me know.
If you use SpeechCluster or its associated tools in published research, please use this citation in your references section:
Uemlianin, I. A. (2005). "SpeechCluster: A speech database builder's multitool." Lesser Used Languages & Computer Linguistics proceedings. European Academy, Bozen/Bolzano, Italy.
SpeechCluster and the associated command-line tools are all written in Python. Most Linux distributions come with Python as standard. If Python is not installed on your system, it is very simple to install (especially on Windows and Mac OS X systems). The Python home page is here: http://www.python.org.
That is the only requirement.
Note: at the moment, the only audio format supported by SpeechCluster is Microsoft's RIFF format (.wav). Furthermore, SpeechCluster assumes audio signals are mono. For this reason, the Unix utility sox will be useful; Linux distributions will have it. The sox home page is: http://sox.sourceforge.net.
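If a recording is stereo, sox can downmix it (e.g. `sox stereo.wav -c 1 mono.wav`). The same downmix can be done in pure stdlib Python; the sketch below averages the two 16-bit channels (it assumes 16-bit stereo input, and native little-endian samples, which holds on typical machines):

```python
import array
import io
import wave

def to_mono(wav_bytes):
    """Downmix a 16-bit stereo RIFF wav to mono by averaging the two
    channels, a pure-stdlib stand-in for `sox stereo.wav -c 1 mono.wav`."""
    with wave.open(io.BytesIO(wav_bytes)) as w:
        assert w.getsampwidth() == 2, "16-bit samples assumed"
        params = w.getparams()
        frames = array.array('h', w.readframes(w.getnframes()))
    if params.nchannels == 1:
        return wav_bytes                      # already mono
    assert params.nchannels == 2, "mono or stereo input assumed"
    # interleaved L, R, L, R ... -> mean of each pair
    mono = array.array('h', ((frames[i] + frames[i + 1]) // 2
                             for i in range(0, len(frames), 2)))
    out = io.BytesIO()
    with wave.open(out, 'wb') as w:
        w.setparams(params._replace(nchannels=1))
        w.writeframes(mono.tobytes())
    return out.getvalue()

# build a tiny interleaved stereo file in memory: left = 1000, right = 3000
buf = io.BytesIO()
with wave.open(buf, 'wb') as w:
    w.setnchannels(2)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(array.array('h', [1000, 3000] * 4).tobytes())
mono_bytes = to_mono(buf.getvalue())
```

For batch work over a directory of recordings, sox itself is still the simpler choice.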
SpeechCluster.tar.gz includes SpeechCluster.py and all the command-line tools below. It can be downloaded from here. The individual tools can be downloaded from here.
There is no development plan as such (at all, in fact): SpeechCluster accompanies me in my work and develops accordingly. Having said that, here are some of the things I am 'looking forward to':
SpeechCluster.py is a Python module containing the object classes Segment, SegmentationTier and SpeechCluster, which represent speech segmentations and their associated audio. SpeechCluster can read and write a number of label file formats, and wav format audio. The command-line tools below cover the most common use-cases (so far).
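The three classes can be pictured with a self-contained sketch. The class names follow the description above, but the attributes and methods shown (start/end times in seconds, a dict of tiers, a duration() helper) are illustrative guesses, not SpeechCluster's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    """One labelled interval: a label plus start/end times in seconds."""
    label: str
    start: float
    end: float

@dataclass
class SegmentationTier:
    """A named sequence of non-overlapping Segments (e.g. 'Phone', 'Word')."""
    name: str
    segments: list = field(default_factory=list)

    def duration(self):
        return self.segments[-1].end if self.segments else 0.0

@dataclass
class SpeechCluster:
    """Bundles an audio file with one or more segmentation tiers."""
    wav_path: str
    tiers: dict = field(default_factory=dict)

    def add_tier(self, tier):
        self.tiers[tier.name] = tier

# build a tiny two-segment Phone tier and attach it to a cluster
phones = SegmentationTier('Phone', [Segment('sil', 0.0, 0.35),
                                    Segment('m', 0.35, 0.42)])
sc = SpeechCluster('amser012.wav')
sc.add_tier(phones)
```

The point of the design is that label-format reading/writing hangs off the one data model, so each tool only has to manipulate tiers and segments.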
Supported Label formats include:
Format | As used by | SpeechCluster name |
---|---|---|
.TextGrid | Praat | TextGrid |
.lab | Emu, Festival | lab, esps |
.lab | HTK (n.b.: different from above) | htk-lab |
.mlf | HTK | htk-mlf |
.txt | HTK | htk-grm |
SpeechCluster and the tools below can read, write and convert any of these formats in any direction (with one exception, noted under segSwitch below: htk-grm can be written but not read).
These tools can be used as they are, or they can be taken as example use-cases for SpeechCluster. If none of these tools fits your exact requirements, you may be able to change the code of the nearest fit, or even to write your own tool.
segFake.py does 'fake autosegmentation' of a speech audio file. At the moment it assumes one utterance per file, with bounding silences. segFake detects utterance onset and offset, and spreads the given labels evenly over the intervening time.
The chances of getting any label boundary correct are of course virtually zero, but I have found it quicker and easier to correct one of these than to start labelling from scratch. Correcting a 'fake' transcription is also less error-prone, as the labels to use are already provided and don't need to be specified by the user.
segFake.py -f <filename> -o (TextGrid | esps | htklab) <phones>
e.g.: segFake.py -f amser012.wav -o TextGrid m ai hh ii n y n j o n b y m m y n y d w e d i yy n y b o r e
segFake.py -d <dirname> -t <transcription filename> -o (TextGrid | esps | htklab)
e.g.: segFake.py -d wav -t trans.txt -o TextGrid
Transcription files should be one transcription per line, of the form:
(amser012 "m ai hh i n y n j o n b y m m y n y d w e d i yy n @ b o r e.")
n.b.: This format is based on the format of Festival's prompt files. The quotes around the transcription are necessary, but the final punctuation is not. In fact, segFake strips off the final punctuation mark (if it's there) before segmenting.
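The even-spreading step is simple to picture. In this sketch the utterance onset and offset are passed in as arguments (the real segFake detects them from the audio), and the output is plain (start, end, label) triples rather than a label file:

```python
def fake_segment(labels, onset, offset, file_dur):
    """Spread `labels` evenly between the utterance onset and offset,
    bracketing them with 'sil' segments, as segFake does. Times are in
    seconds; returns a list of (start, end, label) triples."""
    step = (offset - onset) / len(labels)
    segs = [(0.0, onset, 'sil')]                      # leading silence
    for i, lab in enumerate(labels):
        segs.append((onset + i * step, onset + (i + 1) * step, lab))
    segs.append((offset, file_dur, 'sil'))            # trailing silence
    return segs

segs = fake_segment(['m', 'ai', 'hh'], onset=0.25, offset=1.0, file_dur=1.25)
```

As the text above says, almost no boundary will land in the right place, but every segment carries the right label, so hand-correction is just boundary-dragging.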
segInter.py interpolates labels into a segmented but unlabelled segment tier [in Praat]. For example, you label a file phonemically, mark the word boundaries but don't type in the words themselves. If you have the text available, you can use segInter to fill in the word tier. This can save you a lot of typing and fiddling about.
segInter.py [-l <level>] -f <label filename> <labels>
e.g.: segInter.py -f amser035 Mae hi ychydig wedi chwarter i hanner nos
segInter.py [-l <level>] -d <dir> -i <transcription filename>
e.g.: segInter.py -d lab -i amser.data
- only TextGrid label format is supported
- the default level/tierName is 'Word'
- labels do not have to be quoted on the command-line
- transcription files should be one transcription per line, of the form:

  (amser012 "m ai hh i n y n j o n b y m m y n y d w e d i yy n @ b o r e.")

  n.b.: This format is based on the format of Festival's prompt files. The quotes around the transcription are necessary, but the final punctuation is not. In fact, segInter will strip off the final punctuation mark (if it's there) before segmenting.
- segInter assumes that the first and last segments in the textGrid are silence, and adds Word-level silences accordingly (i.e. you don't have to specify them explicitly).
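The fill-in step amounts to zipping the given words onto the empty intervals, leaving the bounding silences alone. This is a sketch of the idea, not segInter's code; TextGrid reading/writing is omitted and intervals are plain (start, end, text) triples:

```python
def interpolate_labels(intervals, words, silence='sil'):
    """Fill an unlabelled word tier: `intervals` is a list of
    (start, end, text) triples with empty text. The first and last
    intervals are assumed to be silence (as segInter assumes); the
    intervals in between receive `words` in order."""
    assert len(intervals) == len(words) + 2, \
        "expect one interval per word, plus the two bounding silences"
    out = [(intervals[0][0], intervals[0][1], silence)]
    for (start, end, _), word in zip(intervals[1:-1], words):
        out.append((start, end, word))
    out.append((intervals[-1][0], intervals[-1][1], silence))
    return out

# a hand-segmented but unlabelled word tier, then filled in
tier = [(0.0, 0.2, ''), (0.2, 0.5, ''), (0.5, 0.8, ''), (0.8, 1.0, '')]
filled = interpolate_labels(tier, ['Mae', 'hi'])
```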
segMerge.py merges label files into one multi-tiered label file, for example to compare different segmentations of a speech file.
n.b.: Currently only works on textGrids (and takes first tier of multi-tiered textGrids).
segMerge.py <fn1> <tierName> <fn2> <tierName> <fn3> <tierName> ...
for example:
segMerge.py eg1.TextGrid Me eg2.TextGrid Them eg2.TextGrid Fake ...
segReplace.py converts the labels within label files, rewriting each label according to a user-supplied replacement dictionary.
segReplace -r <replaceDict filename> <segfilename>
segReplace -r <replaceDict filename> -d <dirname>
n.b.: segReplace changes labels in place, so keep a back-up of your old versions!
The replaceDict file should contain a Python dictionary mapping each label to be replaced to its replacement.
segSwitch.py converts between label file formats, either on single files, or a directory at a time.
segSwitch -i <infilename> -o <outfilename>
segSwitch -i <infilestem>.mlf -o <outFormat>
segSwitch -d <dirname> -o <outFormat>
Format | File Extension(s) |
---|---|
esps | .esps, .lab, .seg |
Praat TextGrid | .TextGrid |
htk label file | .htk-lab |
htk master label file | .htk-mlf |
htk transcription | .htk-grm |
n.b.: currently, segSwitch will only convert into, not out of, htk-grm format.
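At heart a conversion like esps to htk-lab is a small time-and-label rewrite. The sketch below assumes the common conventions: an esps/xlabel body of `<endTime> <colour> <label>` lines with times in seconds, and HTK label lines of `<start> <end> <label>` with times in 100 ns units. Header handling is omitted, and segSwitch's actual code may differ:

```python
def esps_to_htk(lab_text):
    """Convert an esps/xlabel-style label body to HTK lab lines.
    esps lines give only each segment's end time; HTK lines need
    start and end, so each segment starts where the previous ended."""
    out, start = [], 0.0
    for line in lab_text.strip().splitlines():
        end_s, _colour, label = line.split()
        end = float(end_s)
        out.append(f"{int(start * 1e7)} {int(end * 1e7)} {label}")  # 100 ns units
        start = end
    return "\n".join(out)

htk = esps_to_htk("0.25 121 sil\n0.50 121 m\n")
```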
splitAll.py takes a directory full of paired speech audio and label files (e.g., wav and TextGrid), and splits each wave/labelfile pair into paired subsections, according to various split parameters such as number of units or silence (where "units" can be phones, words, silences, etc.).
splitAll.py -n <integer> -t <tierName> [-l <label>] inDir outDir
inDir should contain pairs of speech audio and label files (e.g., wav and TextGrid). splitAll will split each pair into shorter paired segments, based on the parameters given.
splitAll.py -n 5 -t Phone in/ out/ # into 5 phone chunks
splitAll.py -n 1 -t Word in/ out/ # by each word
splitAll.py -n 1 -t Phone -l sil in/ out/ # by each silence
splitAll.py -n 5 -t Second in/ out/ # into 5 sec chunks
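The label-side of the two splitting modes above can be sketched as list chunking. This is only the tier arithmetic; cutting the audio at the resulting boundaries, and splitAll's exact semantics for -l, are not shown:

```python
def chunk_segments(segments, n):
    """Split a label tier into chunks of n units (phones, words, ...),
    as with `splitAll -n`; segments are (start, end, label) triples."""
    return [segments[i:i + n] for i in range(0, len(segments), n)]

def split_at_label(segments, label='sil'):
    """Start a new chunk after every segment carrying `label`
    (one plausible reading of `splitAll ... -l sil`)."""
    chunks, cur = [], []
    for seg in segments:
        cur.append(seg)
        if seg[2] == label:
            chunks.append(cur)
            cur = []
    if cur:
        chunks.append(cur)
    return chunks

segs = [(0.0, 0.1, 'm'), (0.1, 0.3, 'ai'), (0.3, 0.4, 'hh'),
        (0.4, 0.6, 'i'), (0.6, 0.7, 'n')]
chunks = chunk_segments(segs, 2)
sil_chunks = split_at_label([(0.0, 0.1, 'm'), (0.1, 0.2, 'sil'),
                             (0.2, 0.3, 'ai')])
```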
segDiBo adds explicit diphone boundaries to label files, ready for use in festival diphone synthesis. It also outputs pitchmark (pm) files. segDiBo'd label files (fstem_dibo.ext) and pm files are output into the given data directory.
segDiBo.py -d <dataDirectory>
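Diphone boundaries conventionally fall mid-phone, so that each diphone unit runs from the middle of one phone to the middle of the next (the steady centre of a phone is its most stable cut point). The sketch below shows that convention only; segDiBo's actual placement rules, and its pitchmark output, may differ:

```python
def diphone_units(phones):
    """phones: list of (start, end, label) triples.
    Returns (start, end, 'a-b') triples, one per adjacent phone pair,
    each unit running midpoint-to-midpoint."""
    mids = [(start + end) / 2 for start, end, _ in phones]
    return [(mids[i], mids[i + 1], f"{phones[i][2]}-{phones[i + 1][2]}")
            for i in range(len(phones) - 1)]

units = diphone_units([(0.0, 0.25, 'sil'), (0.25, 0.5, 'm'), (0.5, 1.0, 'ai')])
```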
trim.py trims beginning and end silence from wav files, and adjusts any associated label files accordingly.
trim.py -p 1.5 example.wav # trims example.wav leaving 1.5s padding
trim.py -p 1.5 example # as above, adjusts any seg files found too
trim.py -d testdir # trims all files in testdir, including any seg files,
# leaving .5s padding
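The label adjustment can be sketched as follows: given a tier whose first and last segments are silence, keep `pad` seconds of each bounding silence and re-time everything against the new start. This shows only the time arithmetic, not trim.py's actual code; cutting the audio itself is omitted:

```python
def trim_times(segments, pad=0.5):
    """segments: (start, end, label) triples with bounding silences.
    Returns the (new_start, new_end) window kept from the original
    file, plus the segments re-timed relative to new_start."""
    speech_start = segments[0][1]      # end of leading silence
    speech_end = segments[-1][0]       # start of trailing silence
    new_start = max(segments[0][0], speech_start - pad)
    new_end = min(segments[-1][1], speech_end + pad)
    shifted = [(max(s, new_start) - new_start,
                min(e, new_end) - new_start, lab)
               for s, e, lab in segments
               if e > new_start and s < new_end]
    return (new_start, new_end), shifted

# 2 s of leading and trailing silence, trimmed to 0.5 s padding
segs = [(0.0, 2.0, 'sil'), (2.0, 3.0, 'm'), (3.0, 5.0, 'sil')]
window, shifted = trim_times(segs, pad=0.5)
```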
Prys, Delyth, Briony Williams, Bill Hicks, Dewi Jones, Ailbhe Ní Chasaide, Christer Gobl, Julie Berndsen, Fred Cummins, Máire Ní Chiosáin, John McKenna, Rónán Scaife, Elaine Uí Dhonnchadha. WISPR: Speech Processing Resources for Welsh and Irish. Pre-Conference Workshop on First Steps for Language Documentation of Minority Languages, 4th Language Resources and Evaluation Conference (LREC), Lisbon, Portugal, 24-30 May 2004.
Uemlianin, Ivan A. (2005). SpeechCluster: A speech database builder's multitool. Lesser Used Languages & Computer Linguistics proceedings. European Academy, Bozen/Bolzano, Italy.