I want to make something like Readability, which extracts only the article text from any page and removes everything else…
I am using file_get_contents to get a webpage and this works fine.
After I get that, how can I extract out just the main article text using PHP?
Are there any plugins or is there a way to do it?
There are many libraries that help you parse HTML, and more than a few questions on SO that cover them (such as this one), but that’s not your biggest problem.
Your issue is going to be how to determine what exactly is the main article. You could potentially determine what element has the most <p> tags as children, but there’s no reason I can’t make a CMS that doesn’t use <p> tags at all.
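
To make the "most <p> children" idea concrete, here is a rough sketch using PHP's built-in DOM extension. The function name findLikelyArticleNode is made up for this example, and the heuristic is only the naive one described above, not how Readability itself works. It will pick the wrong element on pages that don't mark up their text with <p> tags.

```php
<?php
// Hypothetical helper (not from any library): returns the element that has
// the most direct <p> children, or null if the page has no <p> tags at all.
function findLikelyArticleNode(string $html): ?DOMElement
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid HTML; collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $best = null;
    $bestCount = 0;

    // Visit every element that has at least one <p> as a direct child.
    foreach ($xpath->query('//*[p]') as $node) {
        // count(p) is evaluated relative to the current candidate element.
        $count = (int) $xpath->evaluate('count(p)', $node);
        if ($count > $bestCount) {
            $best = $node;
            $bestCount = $count;
        }
    }

    return $best;
}

// Usage with a small inline page; in practice $html would come from
// file_get_contents() as in the question.
$html = '<html><body>'
      . '<div id="nav"><p>Home</p></div>'
      . '<div id="post"><p>One</p><p>Two</p><p>Three</p></div>'
      . '</body></html>';

$node = findLikelyArticleNode($html);
if ($node !== null) {
    echo $node->getAttribute('id'), "\n"; // prints "post"
}
```

Once you have the node, $node->textContent gives you its text without markup. A null result means the heuristic found nothing to go on, which is exactly the CMS-without-<p>-tags case.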