|
Abstract
|
Near-universal Internet access in recent years has brought both benefits and drawbacks. Users can reach a vast amount of information and communicate with other people easily. On the other hand, they face many threats. Many people are victimized over the Internet by malicious software and deceptive systems, and users frequently encounter spam emails, spam websites, and phishing malware. Because Internet use has become indispensable, it is important to develop systems that protect users from malicious software. In this study, eight prominent machine learning algorithms were therefore used to identify spam URLs in a large dataset. Carefully selected features were added to improve the results, because appropriate feature extraction strongly affects the performance of these methods. Since the dataset contains only the URL and a label indicating whether it is spam, additional features such as the URL length and the number of digits it contains had to be derived. With these additional features, the machine learning algorithms have more information on which to base their decisions. The experiments showed that tree-based machine learning algorithms give better results. For all methods, detection success was higher for the non-spam class because of the class distribution in the dataset. Random Forest achieved the highest detection success for the non-spam class, 96.3%. The same method achieved 94.2% accuracy across spam and non-spam URL detection when the combined and weighted results were used.
|
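The feature derivation described above can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the function name `extract_url_features` and the feature keys are hypothetical, and only the URL length and digit count are named in the abstract. The dot count and hostname length are added purely as illustrative assumptions.

```python
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Derive simple numeric features from a raw URL string (illustrative only)."""
    # Prefix a scheme when none is present so that urlparse can locate the hostname.
    parsed = urlparse(url if "://" in url else "http://" + url)
    return {
        # Features named in the abstract.
        "url_length": len(url),
        "num_digits": sum(ch.isdigit() for ch in url),
        # Assumed extra features, not taken from the source.
        "num_dots": url.count("."),
        "hostname_length": len(parsed.hostname or ""),
    }


features = extract_url_features("http://example.com/page123")
print(features)
```

Each URL in the dataset would be mapped to such a feature dictionary, and the resulting numeric columns could then be passed to a classifier such as Random Forest.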