INEXACT MATCHING OF PROPER NAMES 
IN SINHALA 
 
 
 
 
M.SC. IN COMPUTER SCIENCE 
 
 
 
 
S. C. FERNANDO 
 
 UNIVERSITY OF MORATUWA, SRI LANKA 
DECEMBER 2007
  
INEXACT MATCHING OF PROPER NAMES 
IN SINHALA 
 
 
 
S. C. FERNANDO 
 
 
This dissertation was submitted to the 
Department of Computer Science and Engineering 
of the University of Moratuwa 
in partial fulfilment of the requirements for the 
Degree of M.Sc. in Computer Science 
specializing in Software Architecture 
 
 
Department of Computer Science and Engineering 
University of Moratuwa, Sri Lanka 
December 2007
ii 
 
DECLARATION 
I, S. C. Fernando hereby declare that the work included in this dissertation in part or 
whole has not been submitted for any other academic qualification at any institution. 
 
 
Prof. Gihan Dias       S. C. Fernando 
Supervisor 
iii 
ABSTRACT 
With the advancement of technology, the need for maintaining national data and information 
becomes important. Most of these data and information have to be maintained in the local 
languages because majority of the Sri Lankans are still not very conversant in English. 
Therefore when public organizations embrace IT, their data including personal data has to be 
maintained in local languages. When data and information are available in the local language, 
searching and retrieving them using the local language become essential. 
Proper nouns have an inherent problem because a given proper noun, for example a name can 
be spelt in several different ways. This problem becomes more prominent when a name from 
one language origin is spelt using another language. For example, the Sinhala name  ශාඛා can 
be spelt in several ways such as  සාකා,  සාඛා or  ශාකා using Sinhala itself. Therefore, one 
who would search an information store for a proper name may not encounter a match, if a 
different spelling is used to search from that being stored. 
This research was to provide a solution to the problem mentioned above using Sinhala 
language. That is to build a rule based search application that would take a Sinhala input 
string, search an information store and retrieve matching results even if they were stored with 
a different spelling. 
This was achieved by building a rule base to replace characters of a key word with different 
characters in order to generate a set of words with different spelling. Then this set of words is 
searched in the information store and results are displayed. Rules were organized in different 
levels so that the user can select the level of character replacement, thus it would retrieve 
matches with a slight spelling difference or retrieve matches with drastic spelling differences. 
A special rule set was built for matching Tamil names written in Sinhala. The user has option 
to independently enable/disable this rule set. An application, which uses a general-purpose 
rule engine to process rules was designed and implemented to demonstrate this technology. 
This application consist of a web based user interface and a sample database as the 
information store. This was designed in a layered architecture such that future expansions and 
component reuse can be done. All character replacement rules are declared in text files, so 
changes and updates to the rule base can be done without modifying the system. 
It is shown that the application, with the rule base that was built, will provide a solution to the 
proper name search problem stated above. This system can be integrated with future 
information systems in government and business organisations. 
iv 
ACKNOWLEDGEMENTS 
I would like to express my gratitude to all those who helped me to successfully complete the 
M. Sc. program and this research project.  
First and foremost, I offer my deepest gratitude to my supervisor Prof. Gihan Dias, who 
envisioned this research idea. Despite his busy schedules, he persistently spent adequate time 
to review the progress, guide and direct me throughout the research period.  His advice 
regarding various aspects of multilingual computing was invaluable in completing this 
research project successfully. 
I would also like to thank Dr. Sanath Jayasena, my co-supervisor for his continual reviews 
that helped a lot in completing the research and this dissertation in a timely manner. The 
weekly activities and progress reviews organized by him helped to keep my focus on the 
research. 
I am grateful to the other staff members of the Department of Computer Science and 
Engineering and visiting lecturers who had given useful advice and direction either during 
progress reviews or during lectures. Contents in some of the course modules in the M. Sc. 
programme were directly relevant to this research and the dissertation. My thanks go to all the 
lecturers who conducted those modules. 
My sincere thanks go to the researchers in LTRL of UCSC who had published valuable 
resources relevant to multilingual computing in their site. Some of these resources were 
instrumental in completing this research project. I also thank the past researchers and authors 
of materials that I reviewed for my research. 
I am indebted to the open source community for providing powerful software tools, 
documentations and submitting forum posts, which were crucial in completing this research 
project successfully. JBossRules (Drools), MySQL, Apache Tomcat, Log4J and Java 
technologies were essential software for this research project. 
Completion of this research project would not have been possible without the relentless 
support from my family; I thank them for their understanding and support. I also wish to thank 
my employer for giving adequate time off from office, to do my study work. Finally, my 
thanks go to my colleagues, friends and all the others who helped. 
v 
TABLE OF CONTENTS 
 
DECLARATION  ............................................................................................................ II 
ABSTRACT  ........................................................................................................... III 
ACKNOWLEDGEMENTS ............................................................................................... IV 
TABLE OF CONTENTS ................................................................................................... V 
LIST OF FIGURES ...................................................................................................... VIII 
LIST OF ABBREVIATIONS AND ACRONYMS .................................................................. IX 
CHAPTER 1. INTRODUCTION ..................................................................................... 1 
1.1 Background .............................................................................................. 1 
1.2 Data and Information in Local Language .................................................. 1 
1.3 Complications due to Spelling Variations of Proper Nouns ....................... 2 
1.4 Research Objectives, Scope and Deliverables ............................................ 3 
CHAPTER 2. LITERATURE REVIEW ............................................................................ 5 
2.1 Unicode/Sinhala and Other Technical Literature ....................................... 5 
2.2 Purely Linguistic Literature ...................................................................... 7 
2.3 Literature in Multilingual Computing Domain .......................................... 8 
CHAPTER 3. METHODOLOGY AND APPROACH ......................................................... 12 
3.1 Methodology .......................................................................................... 12 
3.1.1 Iterative Methodology ..................................................................... 12 
3.1.2 Risk Mitigation ................................................................................ 12 
3.1.3 Prototyping ..................................................................................... 12 
3.1.4 Parallel Tasks ................................................................................. 13 
3.2 High-Level Approach ............................................................................. 13 
3.2.1 Selection of Technique for the Main Logic Engine ........................... 14 
vi 
3.2.2 High-Level Approach of the Main Logic Engine .............................. 14 
3.3 Detailed Approach .................................................................................. 16 
3.3.1 Need of a Rule Engine ..................................................................... 16 
3.3.2 Break up of a Word and Unit of Replacement .................................. 17 
3.3.3 Data Structure to Store Input Word, Characters and Replacements . 18 
3.3.4 Identification and Derivation of Rules ............................................. 19 
3.3.5 Capture and Display of Sinhala Strings ........................................... 20 
3.3.6 Database and Data Access .............................................................. 21 
3.3.7 List of Concerns .............................................................................. 22 
CHAPTER 4. TECHNICAL DESIGN AND IMPLEMENTATION ........................................ 24 
4.1 Process Flow .......................................................................................... 24 
4.1.1 Validate Input Word ........................................................................ 24 
4.1.2 Tokenize Input String... .................................................................... 25 
4.1.3 Apply Token Replacements .............................................................. 25 
4.1.4 Generate List of Strings ................................................................... 25 
4.1.5 Search the Database... ..................................................................... 26 
4.1.6 Apply Display Logic ........................................................................ 26 
4.2 High-Level Design ................................................................................. 26 
4.2.1 UI – Facade .................................................................................... 28 
4.2.2 Common Utilities ............................................................................ 28 
4.2.3 Word Generator .............................................................................. 29 
4.2.4 Data Access .................................................................................... 29 
4.3 Detailed Design and Implementation....................................................... 30 
4.3.1 User Interface ................................................................................. 30 
4.3.2 Main Logic Engine .......................................................................... 33 
4.3.3 Rule Definition and Rule Sets .......................................................... 39 
4.3.4 Database Design ............................................................................. 40 
vii 
CHAPTER 5. TESTING, EVALUATION AND RESULTS.................................................. 43 
5.1 Component Level and Integrated Testing ................................................ 43 
5.1.1 Unit Testing using JUnit .................................................................. 43 
5.1.2 Printing Debug Logs Using Log4J ................................................... 44 
5.2 Generation of Test Data .......................................................................... 45 
5.3 Rule Testing and Evaluation ................................................................... 45 
5.4 Third Party Evaluation ............................................................................ 48 
5.4.1 Evaluation Method .......................................................................... 48 
5.4.2 Results analysis ............................................................................... 48 
5.5 User Experience Evaluation and Improvements....................................... 49 
5.5.1 Part word matching ......................................................................... 49 
5.5.2 Multi-Word Search .......................................................................... 49 
5.5.3 Input Word Retained in the Search Box ........................................... 49 
5.6 Performance Evaluation .......................................................................... 50 
CHAPTER 6. CONCLUSION AND FUTURE WORK ....................................................... 51 
6.1 Conclusion ............................................................................................. 51 
6.2 Future Work ........................................................................................... 52 
6.2.1 User friendly Rule Authoring Interface ............................................ 52 
6.2.2 Expanding in to Other Languages ................................................... 52 
6.2.3 Combining with Other Multilingual Applications ............................. 52 
6.2.4 Inverted Index Lookup instead of a Database................................... 53 
6.2.5 Improved Intelligence ...................................................................... 53 
6.2.6 Auto Correction or Search Assist ..................................................... 53 
6.2.7 Performance Improvements ............................................................. 54 
REFERENCES  .......................................................................................................... 55 
APPENDIX A. CHARACTER REPLACEMENT RULES ..................................................... 58 
APPENDIX B. THIRD PARTY EVALUATION RESULTS.................................................. 63
viii 
 
LIST OF FIGURES 
 
Figure 1: High-level Input, Output Process Flow ...................................................... 24 
Figure 2: Component Design .................................................................................... 27 
Figure 3: User Interface – Welcome Page ................................................................. 30 
Figure 4: User Interface – Results Page .................................................................... 31 
Figure 5: High-level Call Sequence .......................................................................... 34 
Figure 6: Illustration of Word Holder Data Structure ................................................ 35 
Figure 7: Replacement Characters and Word Generation .......................................... 36 
Figure 8: Database Schema Design ........................................................................... 41 
 
ix 
 
LIST OF ABBREVIATIONS AND ACRONYMS 
 
DBMS Database Management System 
DSL Domain Specific Language 
HMM Hidden Markov Model 
HTML Hyper Text Mark-up Language 
IPA International Phonetic Alphabet 
IT  Information Technology 
JDBC Java Database Connectivity 
JSP Java Server Pages 
JSTL Java Server Pages Standard Tag Library 
LTRL Language Technology Research Laboratory 
OOV Out of Vocabulary 
POC Proof of Concept 
SDK Standard Development Kit 
UCSC University of Colombo School of Computing 
UI  User Interface 
UTF Unicode Transformation Format 
XML eXtensible Mark-up Language 
ZWJ Zero Width Joiner