Implementation of Anti-Crawler System Based on Spark

Yisong Wang; Dongmei Zhang¹

Publication Date: 2022/01/08

Abstract: With the advent of the data age, extracting and using data has become a major challenge. Crawlers are designed to collect website information in batches. However, malicious crawlers, such as those used for ticket grabbing, interfere with the normal business and operation of websites, which has made anti-crawler technology a research topic in its own right. Anti-crawler measures began on the front end; systems based on big data have since emerged and greatly improved anti-crawler efficiency. The purpose of this work is to develop such an anti-crawler system. Following a study of anti-crawler strategies and technologies, the system's functions were determined to be data classification, data landing, data processing, data access, and sensitive-IP representation. The goal is to meet the anti-crawler needs of ticketing websites, ensure normal business operations, and improve user satisfaction. The system is built with Spark, Redis, Kafka, and Nginx + Lua, with IntelliJ IDEA as the development tool. After development, the system underwent functional and performance testing. It is simple and convenient to use, offers good accuracy and scalability, and meets the development requirements.

Keywords: anti-crawler; Hadoop; Spark; Redis; Kafka; Nginx

DOI: No DOI Available

PDF: https://ijirst.demo4.arinfotech.co/assets/upload/files/IJISRT21NOV546_(1).pdf
