dc.contributor.author Algiriyage, N
dc.date.accessioned 2015-09-16T04:57:59Z
dc.date.available 2015-09-16T04:57:59Z
dc.date.issued 2015-09-16
dc.identifier.uri http://dl.lib.mrt.ac.lk/handle/123/11341
dc.description.abstract With the evolution of the Internet and the continuous growth of the global information infrastructure, the amount of data collected online from transactions and events has increased drastically. Web server access log files collect substantial data about web visitor access patterns. Data mining techniques can be applied to such data (a practice known as Web Mining) to reveal a lot of useful information about navigational patterns. In this research we analyze the patterns of web crawlers and human visitors through web server access log files. The objectives of this research are to detect web crawlers, identify suspicious crawlers, detect Googlebot impersonation, and profile human visitors. During human visitor profiling we group similar web visitors into clusters based on their browsing patterns and profile them. We show that web crawlers can be identified and successfully classified using heuristics. We evaluated our proposed methodology using seven test crawler scenarios. We found that approximately 53.25% of web crawler sessions were from "known" crawlers and 34.16% exhibited suspicious behavior. We present an effective methodology to detect fake Googlebot crawlers by analyzing web access logs. We propose using Markov chain models to learn profiles of real and fake Googlebots based on their patterns of web resource access sequences. We calculated log-odds ratios for a given set of crawler sessions, and our results show that the higher the log-odds score, the higher the probability that a given sequence comes from the real Googlebot. Experimental results show that, at a threshold log-odds score, we can distinguish the real Googlebot from the fake. For the purpose of human visitor profiling, an improved similarity measure is proposed and used as the distance measure in agglomerative hierarchical clustering on a data set from an e-commerce web site. To generate profiles, frequent item set mining is applied over the clusters.
Our results show that proper visitor clustering can be achieved with the improved similarity measure. en_US
dc.language.iso en en_US
dc.subject Keywords: access logs, crawlers, web users, web usage mining en_US
dc.title Detecting access patterns through analysis of web logs en_US
dc.type Thesis-Abstract en_US
dc.identifier.faculty Engineering en_US
dc.identifier.degree MSc. en_US
dc.identifier.department Department of Computer Science & Engineering en_US
dc.date.accept 2015
dc.identifier.accno 109008 en_US

