University of Moratuwa

DSP BASED SPEECH TRAINING OF HEARING IMPAIRED CHILDREN

Submitted to the Department of Electronic & Telecommunication Engineering in partial fulfillment of the requirements for the Degree of Master of Engineering

D.G. WEERASINGHE

April 2002


Declaration

The work presented in this dissertation has not been submitted for the fulfillment of any other degree.

D.G. Weerasinghe (Candidate)

Dr. (Mrs.) Dileeka Dias (Supervisor)


ABSTRACT

A study of several digital signal processing (DSP) techniques for use in the development of a computer-based speech trainer for hearing impaired children is presented. Children with congenital hearing impairments have difficulty in speaking, and even in making the basic sounds associated with speech. Speech therapists use specialized training methods to teach such children. In most third world countries, the dearth of qualified speech therapists and other facilities hinders the speech development of many children in need of such training. The speech trainer described in this dissertation was developed to alleviate this problem. The training tool helps a child, with initial guidance from an adult, to master the pronunciation of the initial sounds taught in a speech therapy programme, in a game-like environment, using only a PC with multimedia facilities.

Three DSP techniques were studied for application to the trainer. The objective was to identify whether an utterance by a trainee was acceptable in comparison with an utterance by a normal speaker. The three techniques were based on spectral analysis, formant analysis and neural networks. The results of the spectral technique were found to be superior, and it was selected for use in the development of the training tool. In its current state, the training tool can guide children in pronouncing the five vowel sounds, the first step in a speech therapy course.
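To illustrate the idea behind the selected spectral technique, the following minimal Matlab sketch compares the percentage distribution of an utterance's power across frequency bands against a template from normal speakers. All numeric values (sampling rate, band edges, template percentages and tolerance) are hypothetical placeholders, not the values derived in Chapter 3.

% Minimal sketch of the band-power comparison behind the spectral method.
% All numbers below are placeholders, not the values derived in Chapter 3.

fs = 8000;                                   % assumed sampling rate (Hz)
x  = randn(1, fs);                           % stand-in for a recorded utterance

w  = 0.54 - 0.46*cos(2*pi*(0:length(x)-1)/(length(x)-1));  % Hamming window
X  = abs(fft(x .* w)).^2;                    % power spectrum
f  = (0:length(X)-1) * fs / length(X);       % frequency axis (Hz)

edges = [0 500 1000 2000 4000];              % hypothetical band edges (Hz)
bandPower = zeros(1, length(edges)-1);
for k = 1:length(edges)-1
    bandPower(k) = sum(X(f >= edges(k) & f < edges(k+1)));
end
pct = 100 * bandPower / sum(bandPower);      % percentage of power per band

template = [40 30 20 10];                    % hypothetical template for one vowel
tol = 10;                                    % hypothetical tolerance (percentage points)

if all(abs(pct - template) <= tol)
    disp('Utterance accepted');
else
    disp('Utterance rejected');
end

In the actual trainer, the template percentages are derived from recordings of normal speakers (see Table 3.1), and the accept/reject decision follows the per-vowel algorithms and flow charts of Chapter 3.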
List of Figures

1.1 Block diagram of the speech trainer 3
3.1 Spectrograms for normal and hearing impaired speakers for /el/ sound 11
3.2 Flow chart for /a:/ sound 25
3.3 Flow chart for /ae/ sound 26
3.4 Flow chart for /u:/ sound 27
3.5 Flow chart for /o/ sound 28
3.6 Flow chart for /el/ sound 29
3.7 Plot of target values and a correct attempt of a speaker 30
3.8 Plot of target values and an incorrect attempt of a speaker 30
4.1 Vocal tract for /a:/ sound 35
4.2 Vocal tract for /ae/ sound 35
4.3 Vocal tract for /u:/ sound 36
4.4 Vocal tract for /o/ sound 36
4.5 Vocal tract for /el/ sound 37
4.6 Location of formants for /a:/ sound 38
4.7 Location of formants for /ae/ sound 38
4.8 Location of formants for /u:/ sound 39
4.9 Location of formants for /o/ sound 39
4.10 Location of formants for /el/ sound 40
4.11 Area for the location of /a:/ sound 43
4.12 Area for the location of /ae/ sound 44
4.13 Area for the location of /u:/ sound 44
4.14 Area for the location of /o/ sound 45
4.15 Area for the location of /el/ sound 45
5.1 Structure of a 3-layer feed-forward neural network 48
5.2 Inputs, outputs and weights of the network 49
5.3 Unipolar sigmoidal function 51
5.4 Bipolar sigmoidal function 51
5.5 Flow chart for back-propagation learning algorithm 53
7.1 Display sequence with all color balloons for correct pronunciation 62
7.2 Display with colorless balloons for incorrect pronunciation 63
7.3 Flow chart for visual interface operation 64
7.4 Project workspace of a modal dialog box 65
7.5 Self-learn speech trainer 66
7.6 Active dialog box 67


List of Tables

2.1 Different types of symbols used for vowels 7
3.1 Total power in each frequency band and the percentage 13
3.2 Percentages of correct decisions 14
3.3 Approach to the best possible algorithm for /a:/ sound 15
3.4 Approach to the best possible algorithm for /ae/ sound 16
3.5 Approach to the best possible algorithm for /u:/ sound 17
3.6 Approach to the best possible algorithm for /o/ sound 18
3.7 Approach to the best possible algorithm for /el/ sound 19
3.8 Extracted characteristics for /a:/ sound 20
3.9 Extracted characteristics for /ae/ sound 21
3.10 Extracted characteristics for /u:/ sound 22
3.11 Extracted characteristics for /o/ sound 23
3.12 Extracted characteristics for /el/ sound 24
3.13 Percentages of correct decisions (improved algorithms) 31
3.14 Results for new samples 32
4.1 Comparison of formant frequencies 40
4.2 Possible area for /a:/ sound with percentages of correct decisions 41
4.3 Possible area for /ae/ sound with percentages of correct decisions 41
4.4 Possible area for /u:/ sound with percentages of correct decisions 42
4.5 Possible area for /o/ sound with percentages of correct decisions 42
4.6 Possible area for /el/ sound with percentages of correct decisions 43
4.7 Summary of results for percentages in formant method 46
5.1 Summary of test results for neural method 55
5.2-5.5 Test results for hearing impaired samples 57
5.6 Test results for combination of best results for each vowel 58
5.7 Summary of results for neural method 59
6.1 Comparison of accuracies of test results 60


CONTENTS

Abstract I
List of figures II
List of tables III

Chapter 1 Introduction 1
1.1 Research background 1
1.2 Overview of the work 2
1.2.1 Speech training 2
1.2.2 Methods used for speech signal processing 2
1.2.3 Basic operation 3

Chapter 2 Speech Processing 5
2.1 Speech 5
2.1.1 Organs of speech 5
2.1.2 Speech production 5
2.1.3 Hearing and perception 5
2.1.4 Features of speech 6
2.1.5 Speech as symbols 6
2.2 Speech processing techniques 7
Chapter 3 Spectral Analysis 10
3.1 Spectrogram analysis of speech signals 10
3.2 Spectrographic speech processing 10
3.3 Evaluation of speech characteristics extraction and improvements 11
3.4 Best possible algorithms and flow charts 19
3.5 Coding into Matlab and testing 32
3.6 Real time speech recording 32

Chapter 4 Formant Estimation 34
4.1 Formant frequencies 34
4.2 Formant estimation 34
4.3 Vowel recognition using formants 37
4.4 Average formant values 40
4.5 Specific regions of vowels 40

Chapter 5 Neural Network Analysis 47
5.1 Neural network approach for vowel recognition 47
5.2 Selection of a suitable neural network 47
5.3 Designing a multi-layered neural network for vowel recognition 47
5.4 Selection of sigmoidal as activation function 50
5.5 Training procedure of the network 53
5.6 Testing the neural network 54

Chapter 6 Analysis of Results 60
6.1 Comparison of methods used 60
6.2 Comparison of results obtained 60
6.3 Method selected for the speech trainer 61
6.4 Possible improvements to formant estimation method 61
6.5 Possible improvements to neural network method 61

Chapter 7 Visual Interface 62
7.1 Training methodology and visual indication of results 62
7.2 Visual interface design 63
7.3 Conversion of Matlab into Visual C++ 64
7.4 Designing dialog boxes in Visual C++ 65
7.5 Connecting files to the dialog box 65
7.6 Operation of the speech trainer 66
7.7 Viewing a video clip 67
7.8 Training a vowel sound 67

Chapter 8 Conclusion 68
8.1 Problems encountered 68
8.2 Further improvements and future work 68
8.3 Summary 69

References 70

Appendix (A)
(i) Matlab code to find power and percentages of power in each frequency band for normal speakers 72
(ii) Matlab code to find percentages of power in each frequency band for hearing impaired speakers 73
(iii) Matlab code for comparison of template values and speaker utterances according to selected algorithms and flow charts 73
(iv) Matlab code for real time speech recording and comparison 75
Appendix (B) Power variation according to the number of frequency bands 77
Appendix (C) Percentages of power for normal and hearing impaired speaker samples 82
Appendix (D) Test results of the algorithm and results according to a normal listener 111
Appendix (E) Graphical representation of target values and the speaker performance 121
Appendix (F) Percentages of power for new samples 131
Appendix (G) Matlab source code for formant analysis 134
Appendix (H)
(i) Initial weights applied for the neural network 136
(ii) Weight values after training the network 137
Appendix (I) Matlab source code for training and testing neural network 140
Appendix (J) Matlab source code for visual indication of results 142
Appendix (K) Visual C++ source code for the speech trainer 146


Acknowledgement

The author is indebted to:

Dr. Dileeka Dias, for all the valuable guidance, advice, encouragement and inspiration, and most of all for proposing the title of this dissertation;
Prof. I. J. Dayawansa and Dr. Nishantha Nanayakkara, for the valuable comments and suggestions provided at the progress review sessions;
Prof. N. Ratnayaka, Director of Postgraduate Research Studies;
the Asian Development Bank, for financial support;
Dr. Gihan Dias and the members of the Research Group;
Mr. Ruwan Gajaweera;
Koojana, Janaka, Namunu and Sankassa;
Mr. Jayantha Perera and Mr. Philip Terrence;
D.D. Sumanapala and Thushara;
and my parents, wife, daughter and son.