In webrobot detection it is important to find features that are common characteristics of diverse robots, in order to differentiate them from humans. Existing approaches employ fairly simple features (e.g. an empty referrer field, the interval between successive requests), which often fail to reflect webrobots' behaviour accurately; false alarms may therefore occur unacceptably often. In this paper we propose a fresh approach that expresses the behaviour of interactive users and various webrobots in terms of sequences of request types, called request patterns. Previous proposals have primarily targeted the detection of text crawlers, but our approach works well on many other webrobots, such as image crawlers, email collectors and link checkers. In an empirical evaluation on more than 1 billion requests collected at ***, our approach achieved 94% accuracy in webrobot detection, estimated by the F-measure. A decision tree algorithm proposed by Tan and Kumar was also applied to the same data; the comparison shows that the proposed approach is more accurate, and that real-time detection of webrobots is feasible.
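As an illustration only, the sketch below shows one plausible way to encode a session as a sequence of request types (a "request pattern") and to compute the F-measure used as the evaluation metric. The request-type categories, the request_type helper and the example counts are hypothetical assumptions, not the paper's actual definitions or results.

```python
# Illustrative sketch: encode a session as a request-type sequence and
# score a detector with the F-measure. The type categories below are
# assumptions for illustration, not the paper's feature definitions.

def request_type(path: str) -> str:
    """Map a requested URL path to a coarse request type (hypothetical categories)."""
    if path.endswith((".jpg", ".png", ".gif")):
        return "IMG"
    if path.endswith((".css", ".js")):
        return "RES"
    if path == "/robots.txt":
        return "ROBOTS"
    return "HTML"

def request_pattern(paths):
    """A session's request pattern: the sequence of request types it generates."""
    return [request_type(p) for p in paths]

def f_measure(tp: int, fp: int, fn: int) -> float:
    """F-measure: the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: an image crawler's session tends to produce long runs of IMG requests,
# whereas a human browsing session mixes HTML pages with embedded resources.
print(request_pattern(["/robots.txt", "/a.png", "/b.jpg", "/c.gif"]))
# ['ROBOTS', 'IMG', 'IMG', 'IMG']

# F-measure on hypothetical confusion-matrix counts (not the paper's data).
print(round(f_measure(tp=90, fp=10, fn=10), 2))  # 0.9
```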