INEXACT MATCHING OF PROPER NAMES IN SINHALA M.SC. IN COMPUTER SCIENCE S. C. FERNANDO UNIVERSITY OF MORATUWA, SRI LANKA DECEMBER 2007 INEXACT MATCHING OF PROPER NAMES IN SINHALA S. C. FERNANDO This dissertation was submitted to the Department of Computer Science and Engineering of the University of Moratuwa in partial fulfilment of the requirements for the Degree of M.Sc. in Computer Science specializing in Software Architecture Department of Computer Science and Engineering University of Moratuwa, Sri Lanka December 2007 ii DECLARATION I, S. C. Fernando hereby declare that the work included in this dissertation in part or whole has not been submitted for any other academic qualification at any institution. Prof. Gihan Dias S. C. Fernando Supervisor iii ABSTRACT With the advancement of technology, the need for maintaining national data and information becomes important. Most of these data and information have to be maintained in the local languages because majority of the Sri Lankans are still not very conversant in English. Therefore when public organizations embrace IT, their data including personal data has to be maintained in local languages. When data and information are available in the local language, searching and retrieving them using the local language become essential. Proper nouns have an inherent problem because a given proper noun, for example a name can be spelt in several different ways. This problem becomes more prominent when a name from one language origin is spelt using another language. For example, the Sinhala name ශාඛා can be spelt in several ways such as සාකා, සාඛා or ශාකා using Sinhala itself. Therefore, one who would search an information store for a proper name may not encounter a match, if a different spelling is used to search from that being stored. This research was to provide a solution to the problem mentioned above using Sinhala language. That is to build a rule based search application that would take a Sinhala input string, search an information store and retrieve matching results even if they were stored with a different spelling. This was achieved by building a rule base to replace characters of a key word with different characters in order to generate a set of words with different spelling. Then this set of words is searched in the information store and results are displayed. Rules were organized in different levels so that the user can select the level of character replacement, thus it would retrieve matches with a slight spelling difference or retrieve matches with drastic spelling differences. A special rule set was built for matching Tamil names written in Sinhala. The user has option to independently enable/disable this rule set. An application, which uses a general-purpose rule engine to process rules was designed and implemented to demonstrate this technology. This application consist of a web based user interface and a sample database as the information store. This was designed in a layered architecture such that future expansions and component reuse can be done. All character replacement rules are declared in text files, so changes and updates to the rule base can be done without modifying the system. It is shown that the application, with the rule base that was built, will provide a solution to the proper name search problem stated above. This system can be integrated with future information systems in government and business organisations. iv ACKNOWLEDGEMENTS I would like to express my gratitude to all those who helped me to successfully complete the M. Sc. program and this research project. First and foremost, I offer my deepest gratitude to my supervisor Prof. Gihan Dias, who envisioned this research idea. Despite his busy schedules, he persistently spent adequate time to review the progress, guide and direct me throughout the research period. His advice regarding various aspects of multilingual computing was invaluable in completing this research project successfully. I would also like to thank Dr. Sanath Jayasena, my co-supervisor for his continual reviews that helped a lot in completing the research and this dissertation in a timely manner. The weekly activities and progress reviews organized by him helped to keep my focus on the research. I am grateful to the other staff members of the Department of Computer Science and Engineering and visiting lecturers who had given useful advice and direction either during progress reviews or during lectures. Contents in some of the course modules in the M. Sc. programme were directly relevant to this research and the dissertation. My thanks go to all the lecturers who conducted those modules. My sincere thanks go to the researchers in LTRL of UCSC who had published valuable resources relevant to multilingual computing in their site. Some of these resources were instrumental in completing this research project. I also thank the past researchers and authors of materials that I reviewed for my research. I am indebted to the open source community for providing powerful software tools, documentations and submitting forum posts, which were crucial in completing this research project successfully. JBossRules (Drools), MySQL, Apache Tomcat, Log4J and Java technologies were essential software for this research project. Completion of this research project would not have been possible without the relentless support from my family; I thank them for their understanding and support. I also wish to thank my employer for giving adequate time off from office, to do my study work. Finally, my thanks go to my colleagues, friends and all the others who helped. v TABLE OF CONTENTS DECLARATION ............................................................................................................ II ABSTRACT ........................................................................................................... III ACKNOWLEDGEMENTS ............................................................................................... IV TABLE OF CONTENTS ................................................................................................... V LIST OF FIGURES ...................................................................................................... VIII LIST OF ABBREVIATIONS AND ACRONYMS .................................................................. IX CHAPTER 1. INTRODUCTION ..................................................................................... 1 1.1 Background .............................................................................................. 1 1.2 Data and Information in Local Language .................................................. 1 1.3 Complications due to Spelling Variations of Proper Nouns ....................... 2 1.4 Research Objectives, Scope and Deliverables ............................................ 3 CHAPTER 2. LITERATURE REVIEW ............................................................................ 5 2.1 Unicode/Sinhala and Other Technical Literature ....................................... 5 2.2 Purely Linguistic Literature ...................................................................... 7 2.3 Literature in Multilingual Computing Domain .......................................... 8 CHAPTER 3. METHODOLOGY AND APPROACH ......................................................... 12 3.1 Methodology .......................................................................................... 12 3.1.1 Iterative Methodology ..................................................................... 12 3.1.2 Risk Mitigation ................................................................................ 12 3.1.3 Prototyping ..................................................................................... 12 3.1.4 Parallel Tasks ................................................................................. 13 3.2 High-Level Approach ............................................................................. 13 3.2.1 Selection of Technique for the Main Logic Engine ........................... 14 vi 3.2.2 High-Level Approach of the Main Logic Engine .............................. 14 3.3 Detailed Approach .................................................................................. 16 3.3.1 Need of a Rule Engine ..................................................................... 16 3.3.2 Break up of a Word and Unit of Replacement .................................. 17 3.3.3 Data Structure to Store Input Word, Characters and Replacements . 18 3.3.4 Identification and Derivation of Rules ............................................. 19 3.3.5 Capture and Display of Sinhala Strings ........................................... 20 3.3.6 Database and Data Access .............................................................. 21 3.3.7 List of Concerns .............................................................................. 22 CHAPTER 4. TECHNICAL DESIGN AND IMPLEMENTATION ........................................ 24 4.1 Process Flow .......................................................................................... 24 4.1.1 Validate Input Word ........................................................................ 24 4.1.2 Tokenize Input String... .................................................................... 25 4.1.3 Apply Token Replacements .............................................................. 25 4.1.4 Generate List of Strings ................................................................... 25 4.1.5 Search the Database... ..................................................................... 26 4.1.6 Apply Display Logic ........................................................................ 26 4.2 High-Level Design ................................................................................. 26 4.2.1 UI – Facade .................................................................................... 28 4.2.2 Common Utilities ............................................................................ 28 4.2.3 Word Generator .............................................................................. 29 4.2.4 Data Access .................................................................................... 29 4.3 Detailed Design and Implementation....................................................... 30 4.3.1 User Interface ................................................................................. 30 4.3.2 Main Logic Engine .......................................................................... 33 4.3.3 Rule Definition and Rule Sets .......................................................... 39 4.3.4 Database Design ............................................................................. 40 vii CHAPTER 5. TESTING, EVALUATION AND RESULTS.................................................. 43 5.1 Component Level and Integrated Testing ................................................ 43 5.1.1 Unit Testing using JUnit .................................................................. 43 5.1.2 Printing Debug Logs Using Log4J ................................................... 44 5.2 Generation of Test Data .......................................................................... 45 5.3 Rule Testing and Evaluation ................................................................... 45 5.4 Third Party Evaluation ............................................................................ 48 5.4.1 Evaluation Method .......................................................................... 48 5.4.2 Results analysis ............................................................................... 48 5.5 User Experience Evaluation and Improvements....................................... 49 5.5.1 Part word matching ......................................................................... 49 5.5.2 Multi-Word Search .......................................................................... 49 5.5.3 Input Word Retained in the Search Box ........................................... 49 5.6 Performance Evaluation .......................................................................... 50 CHAPTER 6. CONCLUSION AND FUTURE WORK ....................................................... 51 6.1 Conclusion ............................................................................................. 51 6.2 Future Work ........................................................................................... 52 6.2.1 User friendly Rule Authoring Interface ............................................ 52 6.2.2 Expanding in to Other Languages ................................................... 52 6.2.3 Combining with Other Multilingual Applications ............................. 52 6.2.4 Inverted Index Lookup instead of a Database................................... 53 6.2.5 Improved Intelligence ...................................................................... 53 6.2.6 Auto Correction or Search Assist ..................................................... 53 6.2.7 Performance Improvements ............................................................. 54 REFERENCES .......................................................................................................... 55 APPENDIX A. CHARACTER REPLACEMENT RULES ..................................................... 58 APPENDIX B. THIRD PARTY EVALUATION RESULTS.................................................. 63 viii LIST OF FIGURES Figure 1: High-level Input, Output Process Flow ...................................................... 24 Figure 2: Component Design .................................................................................... 27 Figure 3: User Interface – Welcome Page ................................................................. 30 Figure 4: User Interface – Results Page .................................................................... 31 Figure 5: High-level Call Sequence .......................................................................... 34 Figure 6: Illustration of Word Holder Data Structure ................................................ 35 Figure 7: Replacement Characters and Word Generation .......................................... 36 Figure 8: Database Schema Design ........................................................................... 41 ix LIST OF ABBREVIATIONS AND ACRONYMS DBMS Database Management System DSL Domain Specific Language HMM Hidden Markov Model HTML Hyper Text Mark-up Language IPA International Phonetic Alphabet IT Information Technology JDBC Java Database Connectivity JSP Java Server Pages JSTL Java Server Pages Standard Tag Library LTRL Language Technology Research Laboratory OOV Out of Vocabulary POC Proof of Concept SDK Standard Development Kit UCSC University of Colombo School of Computing UI User Interface UTF Unicode Transformation Format XML eXtensible Mark-up Language ZWJ Zero Width Joiner