I want to write an online application that:
- reads the URL from the address bar of the browser
- extracts its lexical features (like n-grams)
- extracts its host-based features (fetches DNS records online: the A and PTR records and their TTL fields)
- classifies the URL as malicious or benign (using machine learning)
Can anyone help me with steps 1 and 3?
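As a starting point for the feature-extraction steps listed above, here is a minimal Python sketch using only the standard library. It assumes the URL has already been captured in the browser (for example from `window.location.href` in an extension) and passed to this code as a string. The function names `char_ngrams`, `lexical_features`, and `host_features` and the particular features chosen are hypothetical, picked for illustration only. The standard `socket` module does not expose TTL values, so that field would need a dedicated DNS library and is left out here.

```python
import socket
from collections import Counter
from urllib.parse import urlsplit


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def lexical_features(url, n=3):
    """Illustrative lexical features; the feature set is an assumption."""
    host = urlsplit(url).hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "ngrams": char_ngrams(url.lower(), n),
    }


def host_features(url):
    """Look up IPv4 addresses (A records) and reverse names (PTR).

    Needs network access when the URL has a host. TTL is not
    available through the socket module, so it is not returned.
    """
    result = {"a_records": [], "ptr_names": []}
    host = urlsplit(url).hostname
    if not host:
        return result
    try:
        infos = socket.getaddrinfo(
            host, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        return result  # name did not resolve
    result["a_records"] = sorted({info[4][0] for info in infos})
    for address in result["a_records"]:
        try:
            result["ptr_names"].append(socket.gethostbyaddr(address)[0])
        except (socket.herror, socket.gaierror):
            pass  # no reverse record for this address
    return result
```

The dictionaries returned by these functions could then be turned into a feature vector for whichever classifier is used in the last step.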
I don’t believe this application is something you can accomplish as described, because you can’t really determine a site’s content from its URL alone.
See something like the Mozilla Phishing Protection design documentation and the Google Safe Browsing spec instead.