ASR Tools Minicourse, 2002
Mark Hasegawa-Johnson
- PDF version of this course
- Bash, Sed, Awk
- Installation
- Reading Assignment
- A bash/sed example
- Homework Assignment
- Perl
- Reading
- An Example
- Language Modeling
- Homework
- Training Monophone Models Using HTK
- Installation
- Reading
- Creating Label and Script Files
- Creating Acoustic Feature Files
- HMM Training
- Testing
- Words and Triphone
- Reading
- Cepstral Mean Subtraction; Single-Pass Retraining
- Dictionaries
- Transcription
- Creation of MFCC
- Bigram Grammar
- Monophone HMM
- Word-Internal Triphone
- Tied-State Triphone
- Prosody
- Reading
- Prosody-Dependent Transcription
- Prosody-Dependent Dictionary
- Prosody-Dependent HMM
- Prosody-Dependent Speech Recognition
Bash, Sed, Awk
Installation
If you are on a unix system, bash, sed, gawk, and perl are
probably already installed. If not, ask your system
administrator.
If you are on Windows, download the Cygwin setup program from
http://www.cygwin.com. The Cygwin installer can be run as many
times as you like; anything already installed on your PC will not
be re-installed. In the screen that asks you which pieces of the
OS you want to install, be sure to select (DOC)$\rightarrow$(man),
(Interpreters)$\rightarrow$(gawk,perl), and
(TEXT)$\rightarrow$(less). I also recommend
(Math)$\rightarrow$(bc), a simple text-based calculator, and
(Network)$\rightarrow$(inetutils,openssh). You can also install a
complete X-windows server and set of clients from
(XFree86)$\rightarrow$(fvwm,lesstif,XFree86-base,XFree86-startup,etc.),
allowing you to (1) install X-based programs on your PC, and (2)
run X-based programs on any unix computer on the network, with I/O
coming from windows on your PC. Setting up X requires a little
extra work; see http://xfree86.cygwin.com.
If cygwin is installed on your computer in the directory
c:/cygwin, it will create a stump of a unix hierarchy starting in
that directory. For example, the directory
c:/cygwin/usr/local/bin is available under cygwin as
/usr/local/bin. Arbitrary directories elsewhere on your c: drive
are available as /cygdrive/c/\ldots.
In order to use cygwin effectively, you need to set the
environment variables HOME (to specify your home directory),
DISPLAY (if you are using X-windows), and most importantly, PATH
(to specify the directories that should be searched for useful
programs). Under bash, the entries in PATH are separated by
colons; the list should include at least
/bin:/usr/bin:/usr/local/bin:/usr/X11R6/bin.
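As a minimal sketch, these variables might be set in a startup file such as \verb:~/.bashrc: along the following lines. The fallback values /home/yourname and :0 are placeholders invented for illustration, not values from this course; bash separates the entries of PATH with colons.

```shell
# Placeholder fallbacks (assumptions): used only if the variable is not already set.
export HOME=${HOME:-/home/yourname}
export DISPLAY=${DISPLAY:-:0}
# Directories are searched in the order listed; entries are separated by colons.
export PATH=/bin:/usr/bin:/usr/local/bin:/usr/X11R6/bin:${PATH}
```

Typing 'echo \$PATH' at the prompt afterward shows what bash will actually search.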
In order to use bash, sed, awk, and perl, you will also need a
good ASCII text editor. You can download Xemacs from
http://www.xemacs.org.
Reading Assignments
Manual pages are available, among other places, at
http://www.gnu.org/manual. If your computer is
set up correctly, you can also read the bash man page by typing
'man bash' at the cygwin/bash prompt.
\begin{itemize}
\item Read the bash manual page, sections: (Basic Shell
Features)$\rightarrow$(Shell Syntax, Shell Commands, Shell
Parameters, Shell Expansions). (Shell
Builtins)$\rightarrow$(Bourne Shell Builtins, Bash Conditional
Expressions, Shell Arithmetic, Shell Scripts). Alternatively, you
can try reading the tutorial chapter in the O'Reilly bash book.
\item Read the sed manual page, or the sed tutorial chapter in the
O'Reilly 'sed and awk' book.
\item You may eventually want to learn gawk, but it's not
required. The section of the gawk man page called 'Getting Started
with awk' is pretty good. So are the tutorial chapters in the
O'Reilly 'sed and awk' book.
\end{itemize}
A bash/sed example
Why not use C all the time? The answer is that some tasks are
easier to perform with other programming languages:
\begin{itemize}
\item Manipulate file hierarchies: use bash and sed.
\item (Simple manipulation of tabular text files: gawk)
\item Manipulate text files: use perl.
\item Manipulate non-text files: use C.
\end{itemize}
perl can do any of these things, but isn't very efficient for
numerical calculations. C can also do any of the things listed,
but perl has many builtin tools for string manipulation, so it's
worthwhile to learn perl. gawk is easier than perl for simple
manipulation of tabular text; it's up to you whether or not you
want to try learning it.
bash is a POSIX-compliant command interpreter, meaning that, as in
the DOS shell, you can type in a program name, and the program will
run. Unlike the DOS shell, bash is also a pretty good programming
language (not as good as BASIC or perl, but better than the DOS shell
or tcsh).
For example, suppose you want to search through the entire
\${HOME}/timit/train hierarchy\footnote{There are several copies of
TIMIT floating around the lab. You can also buy your own copy for
\$100 from http://www.ldc.upenn.edu, or download individual files
from that web site for free.}, apply the C program ``extract'' to
every WAV file in order to create an MFC file, create a file with
extension TRP containing only the third column of each PHN file,
and put all of the resulting files into a parallel directory
hierarchy under \${HOME}/newfiles/train (which doesn't exist
yet). You could do all that by entering the
following, either at the bash command prompt or in a shell script:
\begin{verbatim}
# Create the top of the new hierarchy (-p also creates ${HOME}/newfiles)
if [ ! -e ${HOME}/newfiles/train ]; then
  mkdir -p ${HOME}/newfiles/train;
fi
for dr in dr{1,2,3,4,5,6,7,8}; do
  if [ ! -e ${HOME}/newfiles/train/${dr} ]; then
    echo mkdir ${HOME}/newfiles/train/${dr};
    mkdir ${HOME}/newfiles/train/${dr};
  fi
  for spkr in `ls ${HOME}/timit/train/${dr}`; do
    if [ ! -e ${HOME}/newfiles/train/${dr}/${spkr} ]; then
      echo mkdir ${HOME}/newfiles/train/${dr}/${spkr};
      mkdir ${HOME}/newfiles/train/${dr}/${spkr};
    fi
    cd ${HOME}/timit/train/${dr}/${spkr};
    for file in `ls`; do
      case ${file} in
        *.wav | *.WAV )
          # Change only the extension, not a "wav" elsewhere in the name
          MFCfile=${HOME}/newfiles/train/${dr}/${spkr}/`echo ${file} | sed 's/\.wav$/.mfc/;s/\.WAV$/.MFC/'`;
          echo Converting file ${PWD}/${file} into file ${MFCfile};
          extract ${file} ${MFCfile};;
        *.phn | *.PHN )
          TRPfile=${HOME}/newfiles/train/${dr}/${spkr}/`echo ${file} | sed 's/\.phn$/.trp/;s/\.PHN$/.TRP/'`;
          echo Extracting third column of file ${PWD}/${file} into file ${TRPfile};
          gawk '{print $3}' ${file} > ${TRPfile};;
      esac
    done
  done
done
\end{verbatim}
Once you have created the entire new hierarchy, you can list the
whole hierarchy using
\begin{verbatim}
ls -R ${HOME}/newfiles | less
\end{verbatim}
You may have noticed by now that bash suffers from cryptic syntax.
bash inherits syntax from 'sh', a command interpreter written at
AT\&T in the days when every ASCII character had to be chiseled on
stone tablets in triplicate; thus bash uses characters
economically. Three rules will help you to use bash effectively:
\begin{enumerate}
\item Keep in mind that ', `, and " mean very different things. \{
and \$\{ mean very different things. $[$ standing alone is a
synonym for the command 'test'.
\item When trying to figure out how bash parses a line, you need
to follow the seven steps of command expansion in the order that
bash follows them: brace expansion, tilde expansion, variable
expansion, command substitution, arithmetic expansion, word
splitting, and filename expansion. No, really, I'm serious.
Trying to read bash as pseudo-English leads only to frustration.
\item When writing your own bash scripts,
trial and error is usually the fastest method. Use the 'echo'
command frequently, with appropriate variable expansions at each
level, so you can see what bash thinks it is doing.
\end{enumerate}
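The first rule is easiest to absorb by experiment. The following sketch (the variable names are arbitrary) assigns the same kind of text with each of the three quote characters, and then echoes the results so you can see what bash did with each one.

```shell
name=world
single='hello ${name}'                  # single quotes: nothing is expanded
double="hello ${name}"                  # double quotes: ${name} is expanded
shout=`echo hello | tr 'a-z' 'A-Z'`     # backquotes: replaced by the command's output
echo "${single}"
echo "${double}"
echo "${shout}"
# [ is the same command as test; both lines below perform the same comparison
if [ "${double}" = "hello world" ]; then echo "same"; fi
if test "${double}" = "hello world"; then echo "same"; fi
```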
About gawk: the only thing you absolutely need to know about gawk
is that if you type the following command at the bash prompt,
the file foo.txt will contain the M'th, N'th, and P'th columns
from the file bar.txt (where M, N, and P are replaced by column numbers):
\begin{verbatim}
gawk '{printf("%s\t%s\t%s\n",$M,$N,$P)}' bar.txt > foo.txt
\end{verbatim}
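As a concrete instance, the sketch below uses M=1, N=3, and P=4 on a two-line table whose contents are made up for illustration. It calls awk rather than gawk only so that it runs on systems without gawk; gawk accepts the same program.

```shell
# pick_columns: copy columns 1, 3, and 4 of stdin to stdout, tab-separated
pick_columns() {
  awk '{printf("%s\t%s\t%s\n",$1,$3,$4)}'
}
# Made-up four-column input, for illustration only
printf 'a1 b1 c1 d1\na2 b2 c2 d2\n' | pick_columns
```

With files instead of a pipe, the equivalent call is 'pick\_columns < bar.txt > foo.txt'.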
Homework Assignment
Create a bash shell script that does the following things. Note
that if you don't have the Switchboard transcription files, you
can download them from
http://www.isip.msstate.edu/projects/switchboard/.
\begin{itemize}
\item Find the `word' transcription file corresponding to the `A'
side of every conversation in Switchboard. Use gawk to copy the
2nd, 3rd, and 4th columns of that file to a new file that has the
same basic filename, but resides in the directory
\${HOME}/switchboard/A. Do NOT create subdirectories under
this directory --- your goal is to get all of the side-A word
transcriptions into the same directory.
\item Do the same thing to the side-B transcriptions. Put them
into \${HOME}/switchboard/B.
\end{itemize}
\newpage
Perl
Reading
\begin{itemize}
\item Chapter 1 of {\it Programming Perl} by Larry Wall and
Randal Schwartz (this book is sometimes available on-line at
perl.com; in the perl community, it is called ``The Camel Book''
in order to distinguish it from all of the other perl books
available). Larry Wall (the author of perl, as well as the author
of the book) is an occasional linguist with a good sense of humor.
This chapter is possibly the best written introduction to perl
data structures and control flow, and contains better
documentation on blocks, loops, and control flow statements (if,
unless, while, until) than the man pages.
\item The manual pages are available in HTML format on-line at
http://www.perl.com/doc/. Download the gzipped tar file, and
unpack it on your own PC, so that you can keep it open while you
program. You can read the manual pages using ``man'' under
cygwin, but it is much easier to navigate this complicated
document set using HTML. Before programming, you should read
Chapter 1 of the Camel Book, plus the {\tt perldata} man page and
the first halves of the {\tt perlref} and {\tt perlsub} manual
pages. While programming, you should have an HTML browser open so
that you can easily look for useful information in the three
manual pages just listed, and also in the {\tt perlsyn, perlop,
perlfunc}, and {\tt perlre} pages.
\item The perl motto is ``There is more than one way to do it.''
It is easy to write cryptic perl; it is somewhat more difficult to
write legible perl. The {\tt perlstyle} file contains a few
suggestions useful in the quest to make your code more readable.
\item {\bf Language Modeling}. Chen and Goodman (Computer Speech
and Language, 1998) performed extensive experiments using a
variety of N-gram interpolation methods, and developed an improved
method based on their experiments. Chen and XXX (IEEE Trans. SAP,
2000) contains a very readable review of the Chen-Goodman
interpolation method, followed by enlightening comparison to a new
maximum-entropy method.
\end{itemize}
An Example
The following file, {\tt timit\_durations.pl}, computes the mean
and standard deviation of the duration and log duration of every
phoneme in the TIMIT database.
\begin{verbatim}
#!/usr/bin/perl
#
# Compute mean and variance of phoneme durations
# and logdurations in TIMIT.
#
# Usage (in a bash shell):
# timit_durations.pl d:/timit/TIMIT > timit_durations.txt
#
# Creates a text table in file timit_durations.txt
# showing, for each phoneme, the number of times it was seen,
# the mean and standard deviation of the duration in milliseconds,
# and the mean and standard deviation of the log duration (in log ms).
#
# Status messages are printed to STDERR (usually the terminal).
#
# Mark Hasegawa-Johnson, 6/18/2002
#
###############################################
# Subroutine to peruse a directory tree $_[0]
#
sub peruse_tree {
# Top directory is whatever was given as $_[0]
# If I can't open it, die with an error message
opendir(TOPDIR, $_[0]) || die "Can't opendir $_[0]: $!";
print STDERR "Reading from directory ",$_[0],"\n";
my(@filelist) = readdir(TOPDIR);
closedir(TOPDIR);
# Read entries of TOPDIR
foreach $filename (@filelist) {
# If last character of the filename is ., ignore it
if ( $filename =~ /\.$/ ) { }
# If it's any other directory,call peruse_tree on it
elsif ( -d "$_[0]/$filename" ) {
peruse_tree("$_[0]/$filename");
}
# If the filename ends in s[ix]\d+\.phn (case-insensitive),
# then call the function read_phones
elsif ( $filename =~ /s[ix]\d+\.phn$/i ) {
read_phones("$_[0]/$filename");
}
}
}
###################################################
# Subroutine to convert TIMIT phone labels into Switchboard labels
#
sub timit2switchboard {
my(@data) = @_;
# Process each record, while there are records left
for( my($n)=0; $n <= $#data; $n++ ) {
# Merge some TIMIT labels into Switchboard superclasses
if ( $data[$n][2] eq 'ax-h' ) { $data[$n][2] = 'ax'; }
if ( $data[$n][2] eq 'axr' ) { $data[$n][2] = 'er'; }
if ( $data[$n][2] eq 'ux' ) { $data[$n][2] = 'uw'; }
if ( $data[$n][2] eq 'ix' ) { $data[$n][2] = 'ih'; }
if ( $data[$n][2] eq 'dx' ) { $data[$n][2] = 't'; }
if ( $data[$n][2] eq 'nx' ) { $data[$n][2] = 'n'; }
if ( $data[$n][2] eq 'hv' ) { $data[$n][2] = 'hh'; }
# If this segment is a closure,
# look to see if it is followed by a release.
# If so, combine the two segments.
if ( my($stop) = ($data[$n][2] =~ /([bdgptk])cl/) ) {
# If next segment is the right stop release,
# or if next segment is jh or ch and this segment is tcl or dcl,
# set label of this segment equal to next segment,
# set end time of this segment equal to end of next segment,
# and delete the next segment.
if ( ($data[$n+1][2] eq $stop) ||
(($data[$n+1][2] =~ /[cj]h/) && ($stop =~ /[td]/)) ) {
$data[$n][2] = $data[$n+1][2];
$data[$n][1] = $data[$n+1][1];
splice(@data, $n+1, 1);
}
# Otherwise, this must be an unreleased stop,
# so best thing to do is just fix the phoneme label
else {
$data[$n][2] = $stop;
}
}
}
# Return the modified @data array
return(@data);
}
###################################################
# Subroutine to read phoneme data
#
sub read_phones {
# Initialize @data to null
my(@data) = ();
# Open the INPUTFILE or die with an error message
open(INPUTFILE,$_[0]) || die "Unable to open input file $_[0]: $!";
# Read in all lines from the INPUTFILE
foreach $_ (<INPUTFILE>) {
# Read the next line, and store it in a private array; next line if failure
chomp;
my(@record) = split;
# Push a reference to this new record onto the @data list
push(@data, \@record );
}
close(INPUTFILE);
# Convert phone labels into Switchboard labels
@data = timit2switchboard(@data);
# Process each record separately
foreach $record (@data) {
my($label) = $$record[2];
# Compute duration in milliseconds, assuming 16kHz sampling rate
my($duration) = ($$record[1] - $$record[0]) / 16;
my($logd) = log($duration);
# Increment the global counters $PHONES_SEEN and $ACC{$label}{'n'}
$PHONES_SEEN++;
$ACC{$label}{'n'}++;
# Add duration, square, logd, and logd^2 to accumulators
# Note that these accumulators are global.
# If this particular label has never before been seen,
# perl automagically creates $ACC{$label}{'sum'}, and gives
# it an initial value of zero. Very convenient.
# After that, the values keep on accumulating until the
# top-level script is finished.
$ACC{$label}{'sum'} += $duration;
$ACC{$label}{'sumsq'} += ( $duration * $duration );
$ACC{$label}{'sumlog'} += $logd;
$ACC{$label}{'sumsqlog'} += ( $logd * $logd );
}
}
###########################################
# Main Program
#
# Accumulate duration information
# from all directories specified
# on the command line
#
foreach $arg (@ARGV) {
peruse_tree($arg);
}
# When finished, print out a table
# Print the header of the phoneme table
print "LABEL\tN\tMEAN\tSTD\tMEANLOG\tSTDLOG\n";
# Phonemes are sorted in alphabetical order
foreach $label ( sort keys(%ACC)) {
# Get the hash reference contained in $ACC{$label}
$hr = $ACC{$label};
# $n is the number of examples of this phoneme observed
# Mean is sum of durations divided by number of tokens seen
# Mean log is sum of log durations divided by number of tokens seen
# Std is sqrt( (sumsq of durations - mean*sum) / (n-1) )
# Stdlog is same as above, but with logs
$n = $$hr{'n'};
$mean = $$hr{'sum'}/$n;
$meanlog = $$hr{'sumlog'}/$n;
$std = sqrt( ($$hr{'sumsq'} - $$hr{'sum'} * $mean) / ($n-1));
$stdlog = sqrt( ($$hr{'sumsqlog'} - $$hr{'sumlog'} * $meanlog) / ($n-1) );
# Print a line of the output table
printf "%s\t%6d\t%6.0f\t%6.0f\t%6.2f\t%6.3f\n",$label,$n,$mean,
$std,$meanlog,$stdlog;
}
\end{verbatim}
Language Modeling
A probabilistic grammar of language $L$ may be considered
useful if it satisfies one of the following two objectives:
\begin{enumerate}
\item Specifies the probability of observing any particular string
of words, $W=[w_1,\ldots,w_M]$ in language $L$.
\item Specifies the various ways in which the meanings of words
$[w_1,\ldots,w_M]$ may be combined in order to compute a sentence
meaning, and specifies the probability that any one of the
acceptable sentence meanings is what the talker was actually
trying to say.
\end{enumerate}
An N-gram grammar is a stochastic automaton designed to satisfy
grammar objective number 1 in the most efficient manner possible:
\begin{equation}
p(W) = \prod_{m=1}^{M} p(w_m|w_{m-N+1},\ldots,w_{m-1})
\end{equation}
where the words $w_{2-N},\ldots,w_{0}$ are defined to be the
special symbol ``SENTENCE\_START.'' If the length of the N-gram,
$N$, is larger than the length of the sentence, $M$, a correct
N-gram specifies the probability of the sentence exactly. In
practice, most N-grams are either bigrams ($N=2$) or trigrams
($N=3$), although a few sites have experimented with
variable-length N-grams.
The maximum-likelihood estimate of the bigram probability
$p(w_m|w_{m-1})$ given any training corpus is
\begin{equation}
p_{ML}(w_m|w_{m-1}) = \frac{C(w_{m-1},w_m)}{\sum_{w} C(w_{m-1},w)}
\end{equation}
where the ``count'' $C(w_{m-1},w_m)$ is the number of times that
the given word sequence was observed in the training corpus.
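The maximum-likelihood estimate can be sketched in a few lines of awk. The function name, the output layout, and the three-sentence corpus below are invented for illustration; each input line is treated as one sentence whose first word follows SENTENCE\_START.

```shell
# bigram_ml: read one sentence per line; print "w_prev w_next count p_ML"
bigram_ml() {
  awk '{
    prev = "SENTENCE_START"
    for (i = 1; i <= NF; i++) {
      count[prev " " $i]++     # C(w_{m-1}, w_m)
      total[prev]++            # sum over w of C(w_{m-1}, w)
      prev = $i
    }
  }
  END {
    for (pair in count) {
      split(pair, w, " ")
      printf "%s %s %d %.3f\n", w[1], w[2], count[pair], count[pair] / total[w[1]]
    }
  }' | sort
}
# Toy corpus (made up): "the" is followed by "cat" twice and by "dog" once
printf 'the cat sat\nthe dog sat\nthe cat ran\n' | bigram_ml
```

In this toy corpus the output contains the line 'the cat 2 0.667', and any bigram that never occurred, such as 'dog ran', simply has no line at all, which is the generalization problem discussed next.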
Because of the infinite productivity of human language, there are
always an infinite number of perfectly reasonable word sequences
that will not be observed in any finite-sized training corpus
(typical language model training corpora contain 250,000,000
words). In order to allow the model to generalize to new
observations, higher-order N-grams may be interpolated with
lower-order N-grams. There are a number of ways to do this; one
method is using an arbitrary fixed reduction of the word count, as
follows:
\begin{equation}
p_{I}(w_m|w_{m-1}) = \left\{\begin{array}{ll}
\left(\frac{C(w_{m-1},w_m)-D}{C(w_{m-1})}\right) +
\left(\frac{DN_{1+}(w_{m-1}\bullet)}{C(w_{m-1})}\right)p_I(w_m) & C(w_{m-1},w_m)\ge 1 \\
\left(\frac{DN_{1+}(w_{m-1}\bullet)}{C(w_{m-1})}\right)p_I(w_m) & C(w_{m-1},w_m)=0
\end{array}\right.
\label{eq:interpolation}\end{equation}
where $D\le 1$ is an adjustable parameter, and $p_I(w_m)$ is any
valid unigram probability estimate. Typical interpolation methods
either use the maximum likelihood estimate
$p_{ML}(w_m)=C(w_m)/\sum_w C(w)$, or interpolate between the
$p_{ML}(w_m)$ and a ``0-gram'' distribution that assumes all words
to be equally likely. The term $N_{1+}(w_{m-1}\bullet)$ is the
number of distinct words that follow $w_{m-1}$ at least once in the
training data. This term is
necessary to make sure that
\begin{equation}
1 = \sum_{w_m} p_I(w_m|w_{m-1})
\end{equation}
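To check the normalization, sum equation~\ref{eq:interpolation} over all $w_m$, assuming that $C(w_{m-1})=\sum_{w}C(w_{m-1},w)$ and that the unigram estimate $p_I(w_m)$ itself sums to one. Exactly $N_{1+}(w_{m-1}\bullet)$ words have a nonzero bigram count, so the discounts removed from the first term add up to $DN_{1+}(w_{m-1}\bullet)$, which is exactly the mass redistributed by the second term:
\begin{eqnarray*}
\sum_{w_m} p_I(w_m|w_{m-1}) &=&
\frac{\sum_{w_m:C(w_{m-1},w_m)\ge 1}\left(C(w_{m-1},w_m)-D\right)}{C(w_{m-1})}
+ \left(\frac{DN_{1+}(w_{m-1}\bullet)}{C(w_{m-1})}\right)\sum_{w_m}p_I(w_m) \\
&=& \frac{C(w_{m-1})-DN_{1+}(w_{m-1}\bullet)}{C(w_{m-1})}
+ \frac{DN_{1+}(w_{m-1}\bullet)}{C(w_{m-1})} = 1
\end{eqnarray*}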
Kneser and Ney (1995) demonstrated that
equation~\ref{eq:interpolation} gives best results if the
lower-order probability $p_I(w_m)$ is chosen so that
$p_I(w_m|w_{m-1})$ satisfies the following equation:
\begin{equation}
C(w_m) = \sum_{w_{m-1}} p_{I}(w_m|w_{m-1}) C(w_{m-1})
\label{eq:marginalLM}\end{equation}
Equation~\ref{eq:marginalLM} says that the higher-order
interpolated N-gram $p_I(w_m|w_{m-1})$ should be designed so that
the database count $C(w_m)$ is equal to its expected value given
the count $C(w_{m-1})$. Kneser and Ney demonstrated
that one interpolation formula that satisfies
equation~\ref{eq:marginalLM} is
\begin{equation}
p_{I}(w_m|w_{m-1}) = \left\{\begin{array}{ll}
\left(\frac{C(w_{m-1},w_m)-D}{C(w_{m-1})}\right) +
\left(\frac{DN_{1+}(w_{m-1}\bullet)}{C(w_{m-1})}\right)
\left(\frac{N_{1+}(\bullet w_m)}{N_{1+}(\bullet\bullet)}\right) & C(w_{m-1},w_m)\ge 1 \\
\left(\frac{DN_{1+}(w_{m-1}\bullet)}{C(w_{m-1})}\right)
\left(\frac{N_{1+}(\bullet w_m)}{N_{1+}(\bullet\bullet)}\right) & C(w_{m-1},w_m) = 0 \\
\end{array}\right.
\label{eq:kneser-ney}\end{equation} where $N_{1+}(\bullet w_m)$ is the number of distinct
words that precede $w_m$ at least once in the training data, and $N_{1+}(\bullet\bullet)$ is
the total number of distinct bigrams observed in the training data, i.e.,
\begin{equation}
N_{1+}(\bullet\bullet) = \sum_{w_{m-1}}N_{1+}(w_{m-1}\bullet) = \sum_{w_m}N_{1+}(\bullet w_m)
\end{equation}
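All three of these type counts can be read off a table of nonzero bigram counts. The sketch below assumes an input format invented for illustration (one line per distinct bigram, 'w\_prev w\_next count'), and the toy counts are made up.

```shell
# continuation_counts: input has one line per distinct bigram with count >= 1
continuation_counts() {
  awk '{
    follow[$1]++     # N_{1+}(w_prev .): distinct words seen after w_prev
    precede[$2]++    # N_{1+}(. w_next): distinct words seen before w_next
    types++          # N_{1+}(. .): total number of distinct bigrams
  }
  END {
    for (w in follow)  printf "follow %s %d\n", w, follow[w]
    for (w in precede) printf "precede %s %d\n", w, precede[w]
    printf "bigrams %d\n", types
  }' | sort
}
# Toy bigram counts (made up)
printf 'the cat 2\nthe dog 1\ncat sat 1\ncat ran 1\ndog sat 1\n' | continuation_counts
```

Here 'sat' is preceded by two distinct words out of five distinct bigrams, so its Kneser-Ney lower-order probability would be $2/5$, regardless of how many times 'sat' occurs in total.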
Chen and Goodman demonstrated two improvements to the Kneser-Ney
algorithm. First, they showed that the Kneser-Ney probabilities
may be interpolated down to the 0-gram probability. Second, they
showed that the discount parameter $D$ should depend on the
database count $C(w_{m-1},w_m)$, i.e.
\begin{equation}
D(w_{m-1},w_m) = \left\{\begin{array}{ll}
0 & C(w_{m-1},w_m) = 0 \\
D_1 & C(w_{m-1},w_m) = 1 \\
D_2 & C(w_{m-1},w_m) = 2 \\
D_{3+} & C(w_{m-1},w_m) \ge 3
\end{array}\right.
\label{eq:Dparameter}
\end{equation}
Chen and Goodman suggest several empirical and theoretical methods
for choosing the parameters $D_1,D_2,D_{3+}$; their figure 11
suggests that for a small corpus (1 million words), the best
values are approximately $D_1=0.6, D_2=1.0, D_{3+}=1.4$.
The top-level Chen-Goodman-Kneser-Ney probability $p(w_m|w_{m-1})$
is calculated according to
\begin{eqnarray}
p_{CGKN}(w_m|w_{m-1}) &=&
\left(\frac{C(w_{m-1},w_m)-D(w_{m-1},w_m)}{C(w_{m-1})}\right)
\\ &+&
\left(\frac{D_1N_{1}(w_{m-1}\bullet)+D_2N_2(w_{m-1}\bullet)+D_{3+}N_{3+}(w_{m-1}\bullet)}{C(w_{m-1})}
\right)p_{CGKN}(w_m)
\label{eq:chen-goodman-top}
\end{eqnarray}
where $N_1(w_{m-1}\bullet)$ is the number of distinct words that
follow $w_{m-1}$ exactly once in the training data,
$N_2(w_{m-1}\bullet)$ is the number that follow exactly twice, and
$N_{3+}(w_{m-1}\bullet)$ is the number that follow three or more
times. The lower-level probability $p_{CGKN}(w_m)$ is based on
$N_{1+}(\bullet w_m)$, just as in the Kneser-Ney probability formula,
but Chen and Goodman showed that this lower-level probability may
in turn be interpolated, thus
\begin{eqnarray}
p_{CGKN}(w_m) &=&
\left(\frac{N_{1+}(\bullet w_m)-D(w_m)}{N_{1+}(\bullet\bullet)}\right) \\
&+&
\left(\frac{D_1N_{1}(\bullet)+D_2N_2(\bullet)+D_{3+}N_{3+}(\bullet)}{N_{1+}(\bullet\bullet)}
\right)p_{CGKN}(\bullet)
\label{eq:chen-goodman-mid}
\end{eqnarray}
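where $D(w_m)$ is the discount of equation~\ref{eq:Dparameter}
selected by the value of $N_{1+}(\bullet w_m)$ rather than by a
bigram count, $N_1(\bullet)$ is the number of words for which
$N_{1+}(\bullet w_m)=1$ (words observed after exactly one distinct
predecessor), $N_2(\bullet)$ and $N_{3+}(\bullet)$ are defined
analogously, and $p_{CGKN}(\bullet)$ is the 0-gram
probability (all words equally likely).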
Equations~\ref{eq:chen-goodman-top} and~\ref{eq:chen-goodman-mid}
are relatively complicated, but notice that all terms in these two
equations can be computed from the bigram counts $C(w_{m-1},w_m)$.
In order to estimate the Chen-Goodman-Kneser-Ney probability,
then, it is sufficient to find the count $C(w_{m-1},w_m)$ of every
bigram pair that appears in the training database, and then add up
those numbers in appropriate ways.
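As one example of ``adding up in appropriate ways,'' the sketch below computes the interpolation weight in front of $p_{CGKN}(w_m)$ in equation~\ref{eq:chen-goodman-top} for every history word. The input format (one line per distinct bigram, 'w\_prev w\_next count') and the toy counts are invented for illustration; the discounts are the approximate small-corpus values quoted above, not values tuned for any particular corpus.

```shell
# backoff_weights: print "w_prev weight", where
# weight = (D1*N1(w_prev .) + D2*N2(w_prev .) + D3*N3+(w_prev .)) / C(w_prev)
backoff_weights() {
  awk -v D1=0.6 -v D2=1.0 -v D3=1.4 '{
    total[$1] += $3                  # C(w_prev)
    if ($3 == 1)      n1[$1]++       # followers seen exactly once
    else if ($3 == 2) n2[$1]++       # followers seen exactly twice
    else              n3[$1]++       # followers seen three or more times
  }
  END {
    for (w in total)
      printf "%s %.3f\n", w, (D1 * n1[w] + D2 * n2[w] + D3 * n3[w]) / total[w]
  }' | sort
}
# Toy bigram counts (made up)
printf 'the cat 2\nthe dog 1\ncat sat 1\ncat ran 1\ndog sat 1\n' | backoff_weights
```

For 'the' the weight is $(0.6+1.0)/3\approx 0.533$: one follower was seen once and one was seen twice, out of three tokens.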
Homework
Use perl to accumulate sufficient statistics from the Switchboard
corpus for the estimation of a Chen-Goodman-smoothed bigram
language model. Your code should have two distinct sections: (1)
First, find the bigram counts $C(w_{m-1},w_m)$ for every
word pair that occurs in the training data, and then (2) manipulate the bigram
counts in order to calculate $p_{CGKN}(w_m|w_{m-1})$.
``Words'' that begin and end with square brackets (e.g.,
``[laughter],'' ``[silence]'') should be merged into a single
category (perhaps ``[silence]''). Partial-word utterances that use
square brackets and possibly a dash should be converted into the
full-word code before being entered into your database count (e.g.
``[com]puter'' becomes ``computer'', ``[be]cau[se]-'' becomes
``because'', ``-[a]bout'' becomes ``about'').
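The homework itself should be done in perl, but the cleanup rules are easy to prototype first. The sketch below expresses them as a sed command wrapped in a hypothetical helper called clean\_word, and uses ``[silence]'' as the merged category; a perl version would use the same regular expressions.

```shell
# clean_word: apply the transcription cleanup rules to a single word
clean_word() {
  printf '%s\n' "$1" | sed \
    -e '/^\[[^]]*]$/!{s/^-//;s/-$//;s/[][]//g;}' \
    -e 's/^\[[^]]*]$/[silence]/'
}
# First expression: unless the whole word is one bracketed token, strip a
# leading or trailing dash and then delete every square bracket.
# Second expression: a word that is one bracketed token becomes [silence].
for w in '[laughter]' '[com]puter' '[be]cau[se]-' '-[a]bout'; do
  clean_word "$w"
done
```

Only leading and trailing dashes are removed, so a hyphen inside a word is left alone.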
Notice that there are more than 30,000 distinct words in
Switchboard. If you try to represent $C(w_{m-1},w_m)$ or
$p(w_m|w_{m-1})$ as a fully enumerated table, you will wind up
with a table of 900 million entries. Don't do that. Instead,
$C(w_{m-1},w_m)$ should include entries for only the bigram pairs
that have a nonzero count in the database (about 2M entries). The
bigram probability $p(w_m|w_{m-1})$ of any bigram with zero count
is composed of two terms: $p_{CGKN}(w_m)$ (depends only on $w_m$),
and a term that depends only on $w_{m-1}$. Store these two terms
as separate output tables, with about 30K entries for each of
these two tables.
If you have extra time, consider training your language model on
part of the Switchboard corpus, and then testing it on the
remaining part. Check to see whether your cross-entropy measure
is comparable to the cross-entropy measures that Chen and Goodman
obtained on Switchboard (Figs. 3 and 4 show bigram and trigram
Jelinek-Mercer smoothing; Figs. 5 and 6 show the advantage
relative to Jelinek-Mercer of a number of different algorithms).
\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% HTK Section
%
Training Monophone Models Using HTK
Installation
Download HTK from http://htk.eng.cam.ac.uk/. You should download
the standard distribution (gzipped tar file). You may also wish to
download the samples.tar.gz file, which contains a demo you can
run to test your installation. You may also wish to download the
pre-compiled PDF copy of the HTKBook.
Compile HTK as specified in the README file. Under Windows, you
will need to use a DOS prompt to compile, because the VCVARS32.bat
file will not run under cygwin.
Add the bin.win32 directory to your path (or appropriate other bin
directory, if you are on unix). In order to test your
distribution, move to the samples/HTKDemo directory, and (assuming
you are in a cygwin window by now) type ./runDemo.
Readings
\begin{enumerate}
\item Primary readings are from the HTKBook. Before you begin,
read sections 3.1.5-3.4 and 6.2. Before you start creating
acoustic features, read sections 5.1-5.2, 5.4, 5.6, 5.8-5.11, and
5.16. Before you start training your HMMs, read sections 7.1-7.2
and 8.1-8.5.
\item Those who do not already know HMMs may wish to read either
HTKBook chapter 1, or read Rabiner (IEEE ASSP Magazine, January
1986) and Juang et al. (IEEE Trans. Information Theory
32(2):307-309, 1986). Even those who already know HMMs may be
interested in the discussion of HTK's token-passing algorithm in
section 1.6 of the HTKBook.
\end{enumerate}
Creating Label and Script Files
A {\em script} file in HTK is a list of speech or feature files to
be processed. HTK's feature conversion program, HCopy, expects an
ordered list of pairs of input and output files. HTK's training
and test programs, including HCompV, HInit, HRest, HERest, and
HVite, all expect a single-column ordered list of acoustic feature
files. For example, if the file TRAIN2.scp contains
\begin{verbatim}
d:/timit/TIMIT/TRAIN/DR8/MTCS0/SI1972.WAV data/MTCS0SI1972.MFC
d:/timit/TIMIT/TRAIN/DR8/MTCS0/SI2265.WAV data/MTCS0SI2265.MFC
...
\end{verbatim}
then the command line ``HCopy -S TRAIN2.scp \ldots'' will convert
SI1972.WAV and put the result into data/MTCS0SI1972.MFC (assuming
that the ``data'' directory already exists). Likewise, the command
``HInit -S TRAIN1.scp \ldots'' works if TRAIN1.scp contains
\begin{verbatim}
data/MTCS0SI1972.MFC
data/MTCS0SI2265.MFC
...
\end{verbatim}
The long names of files in the ``data'' directory are necessary
because TIMIT files are not fully specified by the sentence
number. The sentence SX3.PHN, for example, was uttered by talkers
FAJW0, FMBG0, FPLS0, MILB0, MEGJ0, MBSB0, and MWRP0. If you
concatenate talker name and sentence number, as shown above, the
resulting filename is sufficient to uniquely specify the TIMIT
sentence of interest.
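The naming rule can be sketched with awk, splitting the path on slashes. The helper name scp\_line is invented for illustration, and the sketch assumes upper-case .WAV extensions as in the listing above.

```shell
# scp_line: turn one TIMIT waveform path into one line of an HCopy script file
scp_line() {
  echo "$1" | awk -F/ '{
    name = $(NF-1) $NF             # talker directory + sentence filename
    sub(/\.WAV$/, ".MFC", name)    # change only the extension
    print $0, "data/" name
  }'
}
scp_line d:/timit/TIMIT/TRAIN/DR8/MTCS0/SI1972.WAV
```

Printing only the second column of the same output gives the corresponding line of the single-column script file.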
A {\em master label file} (MLF) in HTK contains information about
the order and possibly the time alignment of all training files or
all test files. The MLF must start with the seven characters
``\#!MLF!\#'' followed by a newline. After the global header line
comes the name of the first file, enclosed in double-quote
characters ("); the filename should have extension .lab, and the
path should be replaced by ``*''. The next several lines give the
phonemes from the first file, and the first file entry ends with a
period by itself on a line. For example:
\begin{verbatim}
#!MLF!#
"*/SI1972.lab"
0 1362500 sil
1362500 1950000 p
...
21479375 22500000 sil
.
"*/SI1823.lab"
...
\end{verbatim}
In order to use the initialization programs HInit and HRest, the
start time and end time of each phoneme must be specified in units
of 100ns (10 million per second). In TIMIT, the start times and
end times are specified in units of samples (16,000 per second),
so the TIMIT PHN files need to be converted. The times shown
above in 100ns increments, for example, correspond to the
following sample times in file SI1972.PHN:
\begin{verbatim}
0 2180 h#
2180 3120 p
...
\end{verbatim}
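Since there are $10^7$ units of 100ns per second and 16,000 samples per second, each sample time is multiplied by $10^7/16000=625$. A minimal awk sketch of the conversion (the function name is invented for illustration; label changes are a separate step):

```shell
# samples_to_htk: convert start and end times from samples to 100ns units
samples_to_htk() {
  awk '{ print $1 * 625, $2 * 625, $3 }'
}
printf '0 2180 h#\n2180 3120 p\n' | samples_to_htk
```

The two lines printed are '0 1362500 h\#' and '1362500 1950000 p', matching the times in the MLF excerpt.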
Notice that the ``h\#'' symbol in SI1972.PHN has been changed into
``sil''. TIMIT phoneme labels are too specific; for example, it
is impossible to distinguish ``pau'' (pause) from ``h\#''
(sentence-initial silence) or from ``tcl'' (/t/ stop closure) on
the basis of short-time acoustics alone. For this reason, when
converting .PHN label files into entries in a MLF, you should also
change phoneme labels as necessary in order to eliminate
non-acoustic distinctions. Some possible label substitutions are
pau:sil (silence), h\#:sil, tcl:sil, pcl:sil, kcl:sil, bcl:vcl
(voiced closure), dcl:vcl, gcl:vcl, ax-h:axh, axr:er, ix:ih,
ux:uw, nx:n, hv:hh. The segments /q/ (glottal stop) and /epi/
(epenthetic silence) can be deleted entirely.
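One way to apply these substitutions is a lookup table in awk, sketched below. The function name and the input layout ('start end label', as in a .PHN file) are assumptions for illustration; the table holds exactly the substitutions suggested above, and you may prefer a different set.

```shell
# map_labels: rewrite the third column using the substitution table,
# and drop q and epi segments entirely
map_labels() {
  awk 'BEGIN {
    pairs_str = "pau:sil h#:sil tcl:sil pcl:sil kcl:sil bcl:vcl dcl:vcl gcl:vcl"
    pairs_str = pairs_str " ax-h:axh axr:er ix:ih ux:uw nx:n hv:hh"
    n = split(pairs_str, pairs, " ")
    for (i = 1; i <= n; i++) { split(pairs[i], kv, ":"); map[kv[1]] = kv[2] }
  }
  $3 == "q" || $3 == "epi" { next }
  { if ($3 in map) $3 = map[$3]; print }'
}
printf '0 2180 h#\n2180 3120 p\n3120 3500 q\n3500 4000 ix\n' | map_labels
```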
All of the conversions described above can be done using a single
perl script that searches through the TIMIT/TRAIN hierarchy. Every
time it finds a file that matches the pattern \verb:S[IX]\d+\.PHN:
(note: this means it should ignore files SA1.PHN and SA2.PHN), it
should add necessary entries to the files TRAIN1.scp, TRAIN2.scp,
and TRAIN.mlf, as shown above. When the program is done searching
the TIMIT/TRAIN hierarchy, it should search TIMIT/TEST, creating
the files TEST1.scp, TEST2.scp, and TEST.mlf.
Finally, just in case you are not sure what phoneme labels you
wound up with after all of that conversion, the TRAIN.mlf file can
be parsed as follows to get your phoneme set:
\begin{verbatim}
awk '/[\.!]/{next;}{print $3}' TRAIN.mlf | sort | uniq > monophones
\end{verbatim}
The first block of awk code skips over any line containing a
period or exclamation point. The second block of awk code looks
at remaining lines, and prints out the third column of any such
lines. The unix sort and uniq commands sort the resulting phoneme
stream, and throw away duplicates.
Creating Acoustic Feature Files
Create a configuration file similar to the one in HTKBook page 32.
Add the modifier \verb:SOURCEFORMAT=NIST: in order to tell HTK
that the TIMIT waveforms are in NIST format.
I also recommend a few changes to the output features, as follows.
First, compute the real energy (MFCC\_E) instead of the cepstral
pseudo-energy (MFCC\_0). Second, set {\tt ENORMALISE} to {\tt T}
(or just delete the {\tt ENORMALISE} entry).
Third, because the TIMIT sampling rate (16kHz) is higher than the
sampling rate considered in Chapter 3 (probably 8kHz, though it is
never specified), you should use more mel-frequency channels, a
longer lifter, and a longer cepstral feature vector. How many
more? Well, the human auditory system distinguishes about 26
critical bands below 4kHz, but only about 6 more critical bands
between 4kHz and 8kHz; since MFCC warps the frequency axis to
imitate human hearing, you only need to increase NUMCHANS from 26
to about 32. Increasing NUMCHANS causes an increase in the
pseudo-temporal resolution of the cepstral vector, so you should
increase all of the parameters NUMCHANS, CEPLIFTER and NUMCEPS by
about the same percentage.
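The arithmetic is simple enough to do by hand, but here is a sketch. The starting values NUMCHANS = 26, CEPLIFTER = 22, and NUMCEPS = 12 are assumed to be the ones in the HTKBook tutorial configuration; check them against your own copy before using the results.

```shell
# scale: multiply an assumed Chapter 3 value by 32/26, rounded to the nearest integer
scale() {
  awk -v old="$1" 'BEGIN { printf "%.0f\n", old * 32 / 26 }'
}
echo "NUMCHANS = $(scale 26)"
echo "CEPLIFTER = $(scale 22)"
echo "NUMCEPS = $(scale 12)"
```

Under those assumptions the scaled values come out to 32, 27, and 15.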
Use {\tt HCopy} to convert TIMIT waveform files into MFCC, as
specified on page 33 of the HTK book. Convert both the TRAIN and
TEST corpora of TIMIT.
HMM Training
Use a text editor to create a prototype HMM with three emitting
states (five states total), and with three mixtures per emitting
state (see Fig. 7.3). Be sure that your mean and variance vectors
contain the right number of acoustic features: three times the
number of cepstral coefficients, plus three energy coefficients.
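If you would rather generate the prototype than type it, something like the following sketch may help. It prints a five-state, left-to-right model whose vector size is three times (number of cepstra plus one energy term). The function name {\tt make\_proto}, the initial transition probabilities, and the equal mixture weights are assumptions; the file layout should be checked against Fig. 7.3 of the HTK Book before you rely on it.

```shell
#!/bin/bash
# make_proto NUMCEPS NMIX: print a 5-state left-to-right prototype HMM
# (3 emitting states, NMIX diagonal-covariance mixtures per state).
make_proto() {
  awk -v nceps="$1" -v nmix="$2" '
    # Return a row of n copies of val, each preceded by a space.
    function row(val, n,    i, s) {
      s = ""
      for (i = 1; i <= n; i++) s = s " " val
      return s
    }
    BEGIN {
      dim = 3 * (nceps + 1)   # (cepstra + energy) x (static, delta, acceleration)
      printf "~o <VecSize> %d <MFCC_E_D_A>\n", dim
      print "~h \"proto\""
      print "<BeginHMM>"
      print "<NumStates> 5"
      for (s = 2; s <= 4; s++) {
        printf "<State> %d <NumMixes> %d\n", s, nmix
        for (m = 1; m <= nmix; m++) {
          printf "<Mixture> %d %.6f\n", m, 1.0 / nmix   # equal weights (assumed)
          printf "<Mean> %d\n%s\n", dim, row("0.0", dim)
          printf "<Variance> %d\n%s\n", dim, row("1.0", dim)
        }
      }
      # Illustrative starting transition matrix; training re-estimates it.
      print "<TransP> 5"
      print " 0.0 1.0 0.0 0.0 0.0"
      print " 0.0 0.6 0.4 0.0 0.0"
      print " 0.0 0.0 0.6 0.4 0.0"
      print " 0.0 0.0 0.0 0.7 0.3"
      print " 0.0 0.0 0.0 0.0 0.0"
      print "<EndHMM>"
    }'
}
# Example: make_proto 12 3 > proto
```

With 12 cepstra the vector size comes out to 39; if you increase NUMCEPS, the same function keeps the mean and variance vectors consistent with it.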
Change your configuration file: eliminate the {\tt SOURCEFORMAT}
specifier, and change {\tt TARGETKIND} to {\tt MFCC\_E\_D\_A}.
Use {\tt HCompV} as specified on page 34 to create the files
hmm0/proto and hmm0/vFloors. Next, use your text editor to
separate hmm0/macros (as shown in Fig. 3.7) from the rest of the
file hmm0/proto (the first line of hmm0/proto should now read
\verb:~h "proto":).
Because your .lab files specify the start and end times of each
phoneme in TIMIT, you can use HInit and HRest to initialize your
HMMs before running HERest. Generally, the better you initialize
an HMM, the better it will perform, so it is often a good idea to
use HInit and HRest if you have relevant labeled training data.
Run HInit as shown on page 120, i.e., if \$phn is the name of some
phoneme, type something like
\begin{verbatim}
mkdir hmm1;
HInit -I TRAIN.mlf -S TRAIN1.scp -H hmm0/macros -C config -T 1 -M hmm1 -l $phn hmm0/proto;
sed "s/proto/$phn/" hmm1/proto > hmm1/$phn;
\end{verbatim}
Hint: once you have the lines above working for one phoneme label, put them
inside a for loop to do the other phonemes.
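One cautious way to build that loop is to print the commands first and run them only after you have looked them over. The sketch below does exactly that: it reads a phone list (one label per line) and prints the same two commands shown above for every phone. The function name {\tt hinit\_commands} is hypothetical; the file names are the ones used in this section.

```shell
#!/bin/bash
# hinit_commands PHONELIST: print (do not run) the HInit commands for every
# phone in PHONELIST.  Pipe the output to bash once it looks right.
hinit_commands() {
  echo "mkdir -p hmm1"
  while read -r phn; do
    [ -z "$phn" ] && continue          # skip blank lines
    echo "HInit -I TRAIN.mlf -S TRAIN1.scp -H hmm0/macros -C config -T 1 -M hmm1 -l $phn hmm0/proto"
    echo "sed 's/proto/$phn/' hmm1/proto > hmm1/$phn"
  done < "$1"
}
# Example: hinit_commands monophones | bash
```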
Re-estimate the phoneme models using HRest, as shown on page 123. Again,
once you have the command working for one phoneme, put it inside
a for loop. HRest will iterate until the log likelihood converges
(use the -T 1 option if you want to see a running tally of the log
likelihood), or until it has attempted 20 training iterations in a
row without convergence. If you want to allow HRest to iterate
more than 20 times per phoneme (and if you have enough time),
specify the -i option (I used -i 100).
Once you have used HRest, you may wish to combine all of the
trained phoneme files into a single master macro file (MMF).
Assuming that all of your phoneme filenames are 1-3 characters in
length, and that the newest versions are in the directory hmm2,
they can be combined by typing
\begin{verbatim}
cat hmm2/? hmm2/?? hmm2/??? > hmm2/hmmdefs
\end{verbatim}
Now run the embedded re-estimation function HERest to update all
of the phoneme files at once. HERest improves on HRest because it
allows for the possibility that transcribed phoneme boundaries may
not be precisely correct. HERest can also be used to train a
recognizer even if the start and end times of individual phonemes
are not known.
Unfortunately, HERest only performs one training iteration each
time the program is called, so it is wise to run HERest several
times in a row. Try running it ten times in a row (moving from
directory hmm2 to hmm3, then hmm3 to hmm4, and so on up to hmm12).
Hint: put this inside a for loop.
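A dry-run sketch of that loop follows. It prints one {\tt HERest} call per iteration, reading models from one directory and writing them to the next. The function name {\tt herest\_commands} is hypothetical, and the sketch assumes that a copy of the macros file sits next to hmmdefs in the starting directory (for example, copied from hmm0); adjust the {\tt -H} arguments if your layout differs.

```shell
#!/bin/bash
# herest_commands FIRST LAST: print (do not run) the HERest commands that move
# from directory hmmFIRST to hmmLAST, one training iteration per step.
herest_commands() {
  local i
  for ((i = $1; i < $2; i++)); do
    echo "mkdir -p hmm$((i + 1))"
    echo "HERest -C config -I TRAIN.mlf -S TRAIN1.scp -H hmm$i/macros -H hmm$i/hmmdefs -M hmm$((i + 1)) monophones"
  done
}
# Example: herest_commands 2 12 | bash     (ten iterations, hmm2 to hmm12)
```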
Testing
In order to use HVite and HResults to test your recognizer, you
first need to create a ``dictionary'' and a ``grammar.''
For now, the ``grammar'' can just specify that a sentence may
contain any number of phonemes:
\begin{verbatim}
$phone = aa | ae | ... | zh ;
( <$phone> )
\end{verbatim}
Parse your grammar using HParse as specified on page 27.
The ``dictionary'' essentially specifies that each phoneme equals
itself:
\begin{verbatim}
aa aa
ae ae
...
\end{verbatim}
Because the dictionary is so simple, you don't need to parse it
using HDMan. You can ignore all of the text associated with Fig.
3.3 in the book.
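Both files can be generated from the phoneme list instead of being typed by hand. The following is a minimal sketch, assuming the list has one phoneme per line (as in the {\tt monophones} file created earlier); the function names are made up for this example.

```shell
#!/bin/bash
# make_grammar PHONELIST: print a grammar that allows any sequence of phones.
make_grammar() {
  awk 'NF { list = list (n++ ? " | " : "") $1 }
       END { printf "$phone = %s ;\n( <$phone> )\n", list }' "$1"
}

# make_dict PHONELIST: print a dictionary in which each phone is its own word.
make_dict() {
  awk 'NF { print $1, $1 }' "$1" | sort -u
}
# Example: make_grammar monophones > gram; make_dict monophones > dict
```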
Run HVite as specified in section 3.4.1 of the book; instead of
``tiedlist,'' you should use your own list of phonemes (perhaps
you called it ``monophones''). You may have to specify -C config,
so that HVite knows to compute delta-cepstra and accelerations.
The -p option specifies the bonus that HVite gives itself each
time it inserts a new word. Start with a value of -p 0. Use -T 1
to force HVite to show you the words it is recognizing as it
recognizes them. If there are too many deletions, increase -p; if
there are too many insertions, decrease -p.
When you are done, use HResults to analyze the results:
\begin{verbatim}
HResults -I TEST.mlf monophones recout.mlf
\end{verbatim}
You should get roughly 55-60\% correct, and your recognition
accuracy should be somewhere in the range 40-60\%. These terms are
defined as follows:
\[
\mbox{CORRECTNESS} = 100\times\frac{\mbox{NREF} - \mbox{SUBSTITUTIONS} - \mbox{DELETIONS}}{\mbox{NREF}}
\]
\[
\mbox{ACCURACY} = 100\times\frac{\mbox{NREF} - \mbox{SUBSTITUTIONS} - \mbox{DELETIONS} - \mbox{INSERTIONS}}{\mbox{NREF}}
\]
Correctness is equal to the percentage of the reference labels
(NREF) that were correctly recognized. Correctness does not
penalize for insertion errors. Accuracy is a more comprehensive
measure of recognizer quality, but it has many counter-intuitive
properties: for example, Accuracy is not always between 0 and 100
percent. Recent papers often use the terms Precision and Recall
instead, where Recall is defined to equal Correctness, and
Precision is the percentage of the recognized labels that are
correct, i.e.,
\[
\mbox{PRECISION} = 100\times\frac{\mbox{NRECOGNIZED} - \mbox{SUBSTITUTIONS} - \mbox{INSERTIONS}}{\mbox{NRECOGNIZED}}
\]
\[
\mbox{NRECOGNIZED} = \mbox{NREF} - \mbox{DELETIONS} + \mbox{INSERTIONS}
\]
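The four definitions above are easy to check by hand with a small awk function. The function name {\tt score} and the error counts in the example call are made up for illustration; they are not results from any recognizer.

```shell
#!/bin/bash
# score NREF SUBS DELS INS: print Correctness, Accuracy and Precision (percent)
score() {
  awk -v n="$1" -v s="$2" -v d="$3" -v i="$4" 'BEGIN {
    hits = n - s - d          # reference labels recognized correctly
    nrec = n - d + i          # NRECOGNIZED: labels output by the recognizer
    printf "CORRECTNESS %.2f\n", 100 * hits / n
    printf "ACCURACY %.2f\n",    100 * (hits - i) / n
    printf "PRECISION %.2f\n",   100 * (nrec - s - i) / nrec
  }'
}

score 1000 250 150 100
```

Calling {\tt score 100 10 10 120} shows the counter-intuitive behavior mentioned above: with more insertions than correct labels, Accuracy comes out negative.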
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
\newpage
Words and Triphones
In this section, you will use the TIMIT monophone HMMs trained in
the previous lecture as the starting point for a clustered
triphone recognizer designed to transcribe the words spoken by a
talker in the BU Radio News corpus.
Readings
{\em The HTK Book,} chapters 10, 12, and sections from 14 about
HBuild, HLStats, HHEd and HLEd.
Cepstral Mean Subtraction; Single-Pass Retraining
If $\vec{x}_t$ is a log-spectral vector or a cepstral vector, the
frequency response of the microphone and the room will influence
only the average value of $\vec{x}_t$. It is possible to reduce
the dependence of your recognizer on any particular microphone by
subtracting the average value of $\vec{x}_t$, averaged over an
entire utterance, before training or testing the recognizer, i.e.,
\begin{equation}
\vec{y}_t = \vec{x}_t - \frac{1}{T}\sum_{t=1}^{T} \vec{x}_t
\label{eq:cms}\end{equation}
Equation~\ref{eq:cms} is called cepstral mean subtraction, or CMS.
In HTK, CMS is implemented automatically if you append ``\_Z'' to
the feature specification. For example, you can save the features
as type {\tt MFCC\_E}, then use a configuration file during
training and testing that specifies a feature vector of type {\tt
MFCC\_E\_D\_A\_Z}.
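To see what Equation~\ref{eq:cms} does, it can help to apply it to a toy utterance outside of HTK. The sketch below assumes a plain-text file with one frame per line and one coefficient per column (not an HTK binary feature file); the function name {\tt cms} is hypothetical.

```shell
#!/bin/bash
# cms [FILE]: subtract the per-column mean, taken over all frames, from every
# frame.  Input: text, one frame per line, one coefficient per column.
cms() {
  awk '{ for (j = 1; j <= NF; j++) { x[NR, j] = $j; sum[j] += $j }
         if (NF > d) d = NF }
       END { for (t = 1; t <= NR; t++) {
               line = ""
               for (j = 1; j <= d; j++)
                 line = line (j > 1 ? " " : "") sprintf("%.3f", x[t, j] - sum[j] / NR)
               print line } }' "$@"
}
# Example: printf '1 10\n3 20\n' | cms
```

After the subtraction every column averages to zero, which is why a constant offset added by the microphone or the room disappears.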
HTK offers a method called ``single-pass retraining'' (HTKBook
section 8.X) that uses models trained with one type of feature
vector (for example, {\tt MFCC\_E\_D\_A}) in order to rapidly
train models with a different feature vector type (for example,
{\tt MFCC\_E\_D\_A\_Z}). In theory, you need to have available
files of both data types, but since HTK can implement CMS on the
fly when opening each feature file, there is no need to regenerate
the training data. Just create a script file with two columns ---
the ``old'' and ``new'' feature files, which in this case are the
same file:
\begin{verbatim}
data/SI1972.MFC data/SI1972.MFC
...
\end{verbatim}
Then create a configuration file with entries {\tt HPARM1} and
{\tt HPARM2}, as specified in section 8.X of the HTKBook, and call
{\tt HERest} with the {\tt -r} option, exactly as specified in
that section. Compare the hmmdefs files for the old and new file
types. You should notice that the feature file type listed at the
top of each file has changed. You should also notice that the
mean vectors of each Gaussian mixture have changed a lot, but the
variance vectors have not changed as much.
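The two input files for this step can be produced mechanically. The sketch below prints a paired script file from a one-column script, and prints two configuration lines in the {\tt HPARM1}/{\tt HPARM2} style. The function names are hypothetical, and the exact form of the configuration lines is an assumption that should be checked against the HTKBook section cited above.

```shell
#!/bin/bash
# pair_script [SCP]: turn a one-column script file into "old new" pairs that
# name the same feature file twice.
pair_script() {
  awk 'NF { print $1, $1 }' "$@"
}

# retrain_config OLDKIND NEWKIND: print the two parameter-kind lines
# (layout assumed; verify against the HTK Book).
retrain_config() {
  printf 'HPARM1: TARGETKIND = %s\nHPARM2: TARGETKIND = %s\n' "$1" "$2"
}
# Example: pair_script TRAIN1.scp > TRAIN_pairs.scp
#          retrain_config MFCC_E_D_A MFCC_E_D_A_Z > config_retrain
```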
Dictionaries
In order to recognize words using sub-word recognition models, you
need a pronunciation dictionary. Pronunciation dictionaries for
talker F1A in the Radio News corpus are provided in
F1A/RADIO/F1A.PRN and F1A/LABNEWS/F1ALAB.PRN.
These dictionaries contain a number of diacritics that will be
useful later, but are not useful now. Use sed, awk, or perl to
get rid of the characters * and | (syllable markers), and the
notation ``+1'' or ``+2'' in any transcription line. In order to
reduce the number of homonyms, you may also wish to convert all
capital letters to lower-case (so that ``Rob'' and ``rob'' are
not distinct), and also eliminate apostrophes (so that ``judges''
and ``judges' '' are not distinct). You will also wish to make a
few label substitutions in order to map radio news phonemes into
the TIMIT phonemes defined last week: axr becomes er, pau and h\#
become sil, and every stop consonant (b,d,g,p,t,k) gets split into
two consecutive TIMIT-style phones: a closure followed by a stop
release.
As an example, the radio news dictionaries might contain the
following entry for ``Birdbrain's''
\begin{verbatim}
Birdbrain's b axr+1 d * b r ey n z
\end{verbatim}
Assuming that your TIMIT-based phoneme set includes er but not
axr, you would wish to automatically translate this entry to read
\begin{verbatim}
birdbrains vcl b er vcl d vcl b r ey n z
\end{verbatim}
or, by adding a rule that deletes the stop release when the
following segment is another consonant, you might get
\begin{verbatim}
birdbrains vcl b er vcl vcl b r ey n z
\end{verbatim}
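The first of these two translations can be sketched in awk as follows. The function name {\tt convert\_dict} is hypothetical, and the label {\tt cl} for a voiceless closure is an assumption (only {\tt vcl} appears in the example above); substitute whatever closure labels your TIMIT phoneme set uses. The rule that deletes stop releases before consonants is not implemented here.

```shell
#!/bin/bash
# convert_dict [FILE...]: map Radio News dictionary entries onto a TIMIT-style
# phone set.  Lower-cases the word, drops apostrophes, syllable markers and
# +1/+2 marks, and splits each stop into closure + release.
convert_dict() {
  awk -v q="'" '{
    word = tolower($1); gsub(q, "", word)
    out = word
    for (i = 2; i <= NF; i++) {
      p = $i
      gsub(/[*|]/, "", p)               # syllable markers
      sub(/\+[12]$/, "", p)             # +1 / +2 notation
      if (p == "") continue
      if (p == "axr") p = "er"
      else if (p == "pau" || p == "h#") p = "sil"
      else if (p ~ /^[bdg]$/) p = "vcl " p   # voiced closure + release
      else if (p ~ /^[ptk]$/) p = "cl " p    # "cl" is an assumed label
      out = out " " p
    }
    print out
  }' "$@"
}
# Example: convert_dict F1A/RADIO/F1A.PRN F1A/LABNEWS/F1ALAB.PRN | sort | uniq > F1A.dict
```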
Notice that there is another alternative: instead of modifying the
dictionary to match your HMM definitions, you could modify your
HMM definitions to match the dictionary. Specifically, er could be
relabeled as axr, sil could be relabeled as h\#, and you could
concatenate your stop closure and stop release states in order to
create new stop consonant models. You could even create models of
'*' and '|' with no emitting states.
Once you have converted your dictionaries, you should concatenate
them together, then apply the unix utilities sort and uniq to the
result, e.g., {\tt convert\_dict.pl F1A/RADIO/F1A.PRN
F1A/LABNEWS/F1ALAB.PRN | sort | uniq > F1A.dict}. HTK utilities
will not work unless the words in the dictionary are
orthographically sorted (alphabetic, all capitalized words before
all lower-case words).
Transcriptions
Create master label files using almost the same perl script that
you used for TIMIT, but with .WRD-file inputs instead of .PHN-file
inputs. Also, every waveform file in RADIO NEWS is uniquely
named, so you don't need to concatenate the directory and
filename. The resulting master label files should look something
like this, although the start times and end times are completely
optional:
\begin{verbatim}
#!MLF!#
"*/F1AS01P1.lab"
0 2100000 a
2100000 4700000 cape
4700000 7600000 cod
\end{verbatim}
In order to train the grammar, you need a word-level master label
file, as shown above. In order to train the HMMs, though, you
need a phoneme-level master label file. The phone-level MLF can
be computed from the dictionary + word-level-MLF using the {\tt
HLEd} command (see section 12.8 in the HTK Book). Create a file
called {\tt expand.hled} that contains just one command,
\begin{verbatim}
EX
\end{verbatim}
Then type
\begin{verbatim}
HLEd -d F1A.dict -l '*' -i phone_level.mlf expand.hled word_level.mlf
\end{verbatim}
If {\tt HLEd} fails, the most likely cause is that your master
label file contains entries with times but no words. {\tt
HLEd} will, unfortunately, not tell you where those entries
are. Try printing out all lines that have fewer than three columns
using a command like
\begin{verbatim}
gawk 'NF<3{print}' word_level.mlf
\end{verbatim}
Scan the output to make sure that you don't have phantom ``words''
with start times and end times but no word labels.
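A slightly stricter filter saves some scanning, because header lines, file names, and terminating periods also have fewer than three columns. The sketch below reports only lines that consist of exactly two numbers, together with the label file they belong to. The function name {\tt check\_mlf} is hypothetical.

```shell
#!/bin/bash
# check_mlf [FILE]: report MLF lines that have a start and end time but no label.
check_mlf() {
  awk '/^"/ { file = $0 }                       # remember the current label file
       NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
         printf "%s line %d: %s\n", file, NR, $0
       }' "$@"
}
# Example: check_mlf word_level.mlf
```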
Creation of MFCCs
Create a two-column and a one-column script file for your training
data, and the same for your test data, just as you did for TIMIT.
The two-column script file will look something like:
\begin{verbatim}
d:/radio_news/F1A/RADIO/S01/F1AS01P1.SPH data/F1AS01P1.MFC
d:/radio_news/F1A/RADIO/S01/F1AS01P2.SPH data/F1AS01P2.MFC
\end{verbatim}
You may use any subset of the data for training, and any other
subset for test. I trained speaker-dependent HMMs using the
F1A/RADIO directory, and tested using the F1A/LABNEWS directory.
You may get better recognition results if your training and test
sets each include some RADIO data and some LABNEWS data.
Use {\tt HCopy} to convert waveforms to MFCCs.
Bigram Grammar
Construct a list of your entire vocabulary, including both
training and test sets, using
\begin{verbatim}
awk '{print $1}' F1A.dict | sort | uniq > wordlist
\end{verbatim}
Seeding your grammar with words from the test set is cheating, but
for datasets this small, it may be the only way to avoid huge
numbers of out-of-vocabulary errors.
Given a master label file for your entire training data, the
command HLStats will compute a backed-off bigram language model
for you, and HBuild will convert the bigram file into a format
that can be used by other HTK tools. See sections 12.4 and 12.5
in the HTKBook for examples; note that you will need to specify
both the -I and -S options to HLStats.
Monophone HMMs
If your dictionary matches the labels on your TIMIT monophone
models, you should be able to use the TIMIT models now to perform
recognition on the radio news corpus. Try it:
\begin{verbatim}
HVite -C config_recog -H timit/macros -H timit/hmmdefs -S (one-column test script) \
-l '*' -i recout1.mlf -t 250.0 -w (HBuild output file) \
-p 5 -s 3 F1A.dict monophones
HResults -I (test MLF) monophones recout1.mlf
\end{verbatim}
The {\tt -p} and {\tt -s} options set the word insertion penalty
and the grammar weight, respectively. These parameters are
described in section 13.3 of the HTKBook. Adjusting these
parameters can cause huge changes in your recognition performance;
5 might or might not be a good value.
In any case, your results will probably be pretty horrible. Radio
news was recorded using different microphones than TIMIT, by
different talkers. You can account for these differences by
adapting the models (using {\tt HEAdapt}) or by re-estimating them
(using {\tt HERest}) --- you probably have enough data to use
re-estimation instead of adaptation.
Re-estimate your models using {\tt HERest}, and then run {\tt
HVite} again. Your results should improve somewhat, but may still
be disappointing. How can you improve your results still further?
Word-Internal Triphones
In order to use word-internal triphones, you need to augment your
transcriptions using a special word-boundary ``phoneme'' label.
The {\tt sp} (short pause) phoneme is intended to represent zero
or more frames of silence between words.
Add {\tt sp} to the end of every entry in your dictionary using
awk or perl. {\em After} you have added {\tt sp} to the end of
every entry, add another entry of the form
\begin{verbatim}
silence sil
\end{verbatim}
The ``silence'' model must not end with an {\tt sp}.
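In awk, the two dictionary changes just described might look like the sketch below. The function name {\tt add\_sp} is hypothetical, and the final sort in the C locale is an assumption about how you want the result ordered.

```shell
#!/bin/bash
# add_sp [DICT]: append "sp" to every entry, then add the silence entry
# (which must not end in sp), and re-sort the result.
add_sp() {
  { awk 'NF { print $0, "sp" }' "$@"; echo "silence sil"; } | LC_ALL=C sort -u
}
# Example: add_sp F1A.dict > F1A_sp.dict
```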
Now you need to augment your HMM definitions, exactly as listed in
sections 3.2.2 and 3.2.3 of the HTK book. This consists of four
steps. First, add {\tt sp} to the end of your {\tt monophones}
list file. Second, edit your hmmdefs file with a text editor, in
order to create the {\tt sp} model by copying the middle state of
the {\tt sil} model. Third, use {\tt HHEd} with the script given
in 3.2.2. Finally, use {\tt HVite} in forced-alignment mode, in
order to create a new reference transcription of the training
data. Be sure to use the {\tt -b silence} option to add silences
to the beginning and end of each transcription; otherwise your
sentence will end with an {\tt sp} model, and that will cause {\tt
HERest} to fail.
Now that you have your word-boundary marker, you are ready to
create word-internal triphones. Use {\tt HLEd} exactly as in
section 3.3.1 of the HTKBook. Because of the small size of this
database, the test set may contain triphones missing from the
training data. In order to accommodate missing triphones,
concatenate the monophone and triphone files, so that any missing
triphones can at least be modeled using monophones:
\begin{verbatim}
sort monophones triphones | uniq > allphones
\end{verbatim}
Finally, use {\tt HHEd} as in section 3.3.1 of the HTKBook, but
use the {\tt allphones} list instead of the {\tt triphones} list
to specify your set of output phones.
Re-estimate your triphone models a few times using {\tt HERest}.
{\tt HERest} will complain that some triphones are observed only
one or two times in the training data. I guess we need a larger
training database.
Test the result using {\tt HVite}. The presence of monophones in
your phoneme list will confuse {\tt HVite}. In order to force the
use of triphones whenever possible, your config file should
contain the entries
\begin{verbatim}
FORCECXTEXP = T
ALLOWXWRDEXP = F
\end{verbatim}
Your recognition performance with triphones should be better than
it was with monophones.
Tied-State Triphones
Because of the sparsity of the training data, many triphone models
are not well trained. The problem can be alleviated somewhat by
using the same parameters in multiple recognition models. This
process is called ``parameter tying.''
Chapter 10 of the HTKBook describes many, many different methods
of parameter tying, all of which are frequently used in practical
recognition systems. I suggest using the data-driven clustering
method for the current exercise (section 10.4), although
tree-based clustering (section 10.5) might work almost as well.
Run {\tt HERest} with the {\tt -s} option, in order to generate a
file called {\tt stats\_file}. Then create an {\tt HHEd} script
that starts with the command {\tt RO (threshold) stats\_file}
where {\tt (threshold)} specifies the minimum expected number of
times a state should be visited in order to count for parameter
tying (I used 20).
Use perl, awk, or even just bash to add commands of the following
form to your {\tt HHEd} script:
\begin{verbatim}
TC 100.0 "aaS2" {(aa,*-aa,aa+*,*-aa+*).state[2]}
TC 100.0 "aaS3" {(aa,*-aa,aa+*,*-aa+*).state[3]}
TC 100.0 "aaS4" {(aa,*-aa,aa+*,*-aa+*).state[4]}
\end{verbatim}
You can be more general, if you like. For example, the following
command would allow HTK to consider tying together the first state
of {\tt aa} with the last state of any phoneme that precedes {\tt
aa}:
\begin{verbatim}
TC 100.0 "aaS2" {(aa,*-aa,aa+*,*-aa+*).state[2],(*-*+aa,*+aa).state[4]}
\end{verbatim}
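The basic three-line pattern can be generated for every phoneme with plain bash, as in the sketch below. The function name {\tt tc\_commands} is hypothetical, and the default threshold of 100.0 is simply the value from the examples above. You will probably want to leave {\tt sil} and {\tt sp} out of the input list, since {\tt sp} does not have three emitting states.

```shell
#!/bin/bash
# tc_commands PHONELIST [THRESHOLD]: print one TC command per emitting state
# (2, 3, 4) for every phone listed, one phone per line, in PHONELIST.
tc_commands() {
  local thr=${2:-100.0} phn s
  while read -r phn; do
    [ -z "$phn" ] && continue          # skip blank lines
    for s in 2 3 4; do
      echo "TC $thr \"${phn}S$s\" {($phn,*-$phn,$phn+*,*-$phn+*).state[$s]}"
    done
  done < "$1"
}
# Example: tc_commands monophones >> tie.hed
```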
Run {\tt HHEd} in order to perform data-driven tying (use the {\tt
-T} option to see what {\tt HHEd} is doing). Use {\tt HERest} to
re-estimate the models a few times, then test using {\tt HVite}
and {\tt HResults}. Your performance may still not be wonderful,
but it should be better than you obtained without parameter tying.
For reference, the NIST Hub-4 and Hub-5 competitions (1997 and
1998, respectively) used large databases of broadcast news
training data similar to the Radio News corpus. Typical
performance of the competition systems was in the range of 30-40\%
word error rate (word error rate = 100 - accuracy). The winning
system was trained using HTK, plus a large amount of external
code.
\newpage
Prosody
Recent papers talk about three aspects of prosody that
might be modeled by a speech recognition system:
\begin{itemize}
\item
Lexical stress: Lexically unstressed vowels may be transcribed and
modeled as a type of schwa (ax, ix, or axr), or as some type of full
vowel. The status of /ax/ as a distinct vowel has much empirical
support. /ix/ and /ax/ may not be distinct in practical systems. It
is possible to argue that /er/ is always reduced, so that /er/ and
/axr/ are not really distinct.
Several studies have examined the distinction between unreduced
unstressed vowels and stressed vowels. Greenberg (1999) found that
stressed vowels are longer and have higher energy than unstressed
vowels with the same phoneme label, but he did not control for accent
placement. van Kuijk et al. (1999) compared accented stressed,
unaccented stressed, unreduced unstressed, and reduced vowels, and
found no acoustic difference between the middle two categories.
Consonant reduction has apparently never been studied in speech
recognition.
\item
Pitch accent. When a pitch accent is placed on a word, it is usually
placed on or near the lexically stressed syllable. The duration,
energy, or spectral distinctiveness of the lexically stressed syllable
may then be increased. Articulatory studies (Fougeron et al.)
indicate that consonants in accented syllables are produced more
distinctively than consonants in unaccented syllables.
\item
Phrase boundaries. The rhyme of the final syllable of a word
preceding an intermediate or intonational phrase boundary is
lengthened relative to comparable phonemes in the sentence (Wightman
et al., 1992). In the Switchboard database, the duration histogram of
words preceding a silence or disfluency has a mode that is one
standard deviation higher than the duration histograms of other words
in the database. This increased duration may be accompanied by a
decrease in energy, and possibly by other spectral changes.
\end{itemize}
Reading
\begin{itemize}
\item Wightman, Shattuck-Hufnagel, and Price,
``Segmental durations in the vicinity of prosodic phrase boundaries.''
{\em J. Acoust. Soc. Am.} 91(3):1707-1717, 1992.
\item Fougeron and Keating,
``Articulatory strengthening at edges of prosodic domains.''
{\em J. Acoust. Soc. Am.} 101:3728-3740, 1997.
\item Steven Greenberg and Leah Hitchcock,
``Stress-Accent and Vowel Quality in The Switchboard Corpus.'' {\em
NIST Large Vocabulary Continuous Speech Recognition Workshop,} May
2001.
\end{itemize}
Prosody-Dependent Transcriptions
Write a perl script that reads in WRD transcription files and either
TON or BRK files from one talker's data in the radio news corpus.
Your script should create an HTK master label file containing either
break-index-dependent or accent-dependent transcriptions. For
example, a break-index-dependent transcription might append the number
``4'' after every word with a break index of at least 4, e.g.
\begin{verbatim}
#!MLF!#
"*/F2BS01P1.lab"
0 0 endsil
0 1700000 a
1700000 6600000 nineteen
6600000 12500000 eighteen4
12500000 15600000 state
15600000 24300000 constitutional
24300000 30400000 amendment4
\end{verbatim}
An accent-dependent transcription might append an exclamation point
after every word containing a pitch accent, e.g.
\begin{verbatim}
#!MLF!#
"*/F2BS01P1.lab"
0 0 endsil
0 1700000 a
1700000 6600000 nineteen!
6600000 12500000 eighteen!
12500000 15600000 state!
15600000 24300000 constitutional
24300000 30400000 amendment!
\end{verbatim}
F2B has about 170 sentences transcribed with prosody, while F1A and
M1B have only about 75 sentences transcribed. Choose a talker with as
many sentences as possible transcribed for prosody.
Create a prosody-independent MLF by stripping out the prosodic symbols
from your prosody-dependent MLF. The prosody-independent recognizer
will serve as a reference model, so that you can tell whether any
advantage is obtained by modeling prosody.
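Stripping the symbols is a one-line job in awk. The sketch below touches only three-column label lines, so the MLF header and the file-name lines pass through unchanged. The function name {\tt strip\_prosody} is hypothetical, and the sketch assumes that no ordinary word in your vocabulary ends in ``4'' or ``!''.

```shell
#!/bin/bash
# strip_prosody [MLF]: remove a trailing "4" or "!" from the word label of
# every three-column line; all other lines are copied unchanged.
strip_prosody() {
  awk 'NF == 3 { sub(/[4!]$/, "", $3) } { print }' "$@"
}
# Example: strip_prosody accent.mlf > plain.mlf
```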
Prosody-Dependent Dictionary
Write a perl script that reads in the dictionaries provided in the
Radio News corpus, and produces an HTK-format dictionary with both
prosody-dependent and prosody-independent versions of each word.
If you are studying break indices, the phonemes in the final syllable
of each pre-boundary word should be special pre-boundary phonemes,
e.g.,
\begin{verbatim}
abilities ax b ih l ax t iy z sp
abilities4 [abilities] ax b ih l ax t4 iy4 z4 sp
\end{verbatim}
If you are studying accent, the phonemes in the lexically stressed
syllable of each accented word should be special, e.g.,
\begin{verbatim}
abilities ax b ih l ax t iy z sp
abilities! [abilities] ax b! ih! l! ax t iy z sp
\end{verbatim}
Notice that in both cases, the output of HVite should be specified to
be independent of prosody, using the square-bracket notation.
Use HLEd, together with your dictionary, to create prosody-dependent
and prosody-independent monophone-level MLF files.
Prosody-Dependent HMMs
Train a set of monophone models including the `sp' model (or copy
models from the last section).
Create an HHEd script that duplicates your monophone models to create
prosody-dependent models. If you are studying break indices, your
HHEd script might contain the line
\begin{verbatim}
DP "" 1 "4"
\end{verbatim}
If you are studying accent, your script might contain
\begin{verbatim}
DP "" 1 "!"
\end{verbatim}
Train both prosody-dependent and prosody-independent monophone models
by running HERest 3-5 times on each set of models.
Use HLEd to split the prosody-dependent and prosody-independent
monophone MLF files into triphone MLFs. Use HHEd to split the trained
HMM macro files. Train both prosody-dependent and prosody-independent
triphone models by running HERest 3-5 times on each model set.
Use HHEd to perform data-driven tying on the triphone models. For
example, if you are studying accent, your HHEd script for tying the
prosody-dependent HMMs might contain commands of the form
\begin{verbatim}
TC 100.0 "aaS2" {(aa,aa!,*-aa,*-aa!,aa+*,aa!+*,*-aa+*,*-aa!+*).state[2]}
\end{verbatim}
Train both prosody-dependent and prosody-independent clustered
triphone HMMs by running HERest 3-5 times on each model set.
Prosody-Dependent Speech Recognition
Train a prosody-dependent backoff bigram model by running HLStats on
your prosody-dependent transcription, and convert the result into a
wordnet using HBuild. The result is a language model that combines
information about both word sequence and the sequence of stresses or
phrase boundaries. For example, if you are studying accent, the
trained bigram file might contain entries of the following form. As
shown in these examples, you may find that accented words are more
likely to follow unaccented words, and vice versa.
\begin{verbatim}
-2.5682 a state
-2.0911 a state!
-1.3424 boston! city
-1.8195 boston! city!
\end{verbatim}
Train a prosody-independent backoff bigram model in the usual way.
You should now have two language models (prosody-dependent and
prosody-independent) and six sets of HMMs (PD and PI versions of
monophone, triphone, and clustered triphone models). Run HVite six
times.
After running HVite using prosody-dependent models, be sure to check
the output transcription. Because of the way the dictionary was
defined, prosodic markings should not show up in the output
transcription. Use HResults to compare all six recognition
transcripts with the true prosody-independent transcription. Does the
recognizer's knowledge of prosody help it to achieve better word
recognition accuracy?
Here is a different experiment that you can run using the same models:
try to determine how well the recognition models track just the
prosody of the utterance. Create another dictionary with output
symbols set to show just the prosody, and not the word content, e.g.
\begin{verbatim}
abilities [0] ax b ih l ax t iy z sp
abilities! [!] ax b! ih! l! ax t iy z sp
\end{verbatim}
Process your prosody-dependent MLF in order to show the same
accented/unaccented distinction, e.g.
\begin{verbatim}
#!MLF!#
"*/F2BS01P1.lab"
0 0 endsil
0 1700000 0
1700000 6600000 !
6600000 12500000 !
12500000 15600000 !
15600000 24300000 0
24300000 30400000 !
\end{verbatim}
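The accent-only transcription above can be derived from the accent-dependent MLF with a short awk filter. This is a sketch under the conventions of the example: accented words end in ``!'', and the {\tt endsil} label is left as it is. The function name {\tt prosody\_only} is hypothetical.

```shell
#!/bin/bash
# prosody_only [MLF]: replace each word label by "!" (accented) or "0"
# (unaccented); header, file-name and endsil lines are copied unchanged.
prosody_only() {
  awk 'NF == 3 && $3 != "endsil" { $3 = ($3 ~ /!$/) ? "!" : "0" }
       { print }' "$@"
}
# Example: prosody_only accent.mlf > accent_only.mlf
```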
Run HVite again using the new dictionary, then run HResults using the
new transcription file. How well does your recognizer identify pitch
accent placement? (Recognition results for break index may be better
than recognition results for pitch accent placement, just because
pitch is not part of the acoustic feature vector...)
\end{document}