Is there a function or method I could call in Python that would tell me if the data is RSS or HTML?
Filetypes should generally be determined out-of-band. For example, if you are fetching the file from a web server, the place to look would be the Content-Type header of the HTTP response. If you’re fetching a local file, the filesystem would have a way of determining filetype; on Windows that’d be looking at the file extension.
If none of that is available, you’d have to resort to content sniffing. This is never wholly reliable, and RSS is particularly annoying because there are multiple incompatible versions of it, but about the best you could do would probably be:
Attempt to parse the content with an XML parser. If it fails, the content isn’t well-formed XML so can’t be RSS.
Look at the document.documentElement.namespaceURI. If it’s http://www.w3.org/1999/xhtml, you’ve got XHTML. If it’s http://www.w3.org/1999/02/22-rdf-syntax-ns#, you’ve got RSS (of one flavour).
If the document.documentElement.tagName is rss, you’ve got RSS (of a slightly different flavour).
If the file couldn’t be parsed as XML, it could well be HTML (or some tag-soup approximation of it). It’s conceivable it might also be broken RSS, in which case most feed tools would reject it. If you still need to detect this case, you’d be reduced to looking for strings like <html or <rss or <rdf:RDF near the start of the file. This would be even more unreliable.
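For what it’s worth, here is a rough Python sketch of the approach described above, using only the standard library’s xml.dom.minidom. The function names kind_from_content_type and sniff_kind are made up for illustration, the media types listed are just the common ones I’d expect to see, and the 1024-byte window for the string fallback is an arbitrary choice. The check for a bare html root element with no namespace is my own addition to the steps above. Treat it as a starting point, not a reliable detector.

```python
import re
import xml.dom.minidom
from xml.parsers.expat import ExpatError

XHTML_NS = "http://www.w3.org/1999/xhtml"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def kind_from_content_type(content_type):
    """Out-of-band check: map a Content-Type header value to 'rss', 'html', or None.

    The media types below are assumed common values, not an exhaustive list.
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/rss+xml":
        return "rss"
    if media_type in ("text/html", "application/xhtml+xml"):
        return "html"
    return None  # e.g. text/xml or application/xml: fall back to sniffing


def sniff_kind(data):
    """Content sniffing: guess 'rss', 'html', or 'unknown' from raw bytes."""
    try:
        root = xml.dom.minidom.parseString(data).documentElement
    except ExpatError:
        root = None  # not well-formed XML

    if root is not None:
        if root.namespaceURI == XHTML_NS:
            return "html"  # XHTML
        if root.namespaceURI == RDF_NS:
            return "rss"  # RDF-based RSS
        if root.tagName == "rss":
            return "rss"  # plain <rss> root element
        if root.tagName.lower() == "html":
            return "html"  # assumption: well-formed HTML without a namespace
        return "unknown"

    # Last resort: look for telltale strings near the start of the data.
    # This is even less reliable than the checks above.
    head = data[:1024].lower()
    if re.search(rb"<(rss|rdf:rdf)\b", head):
        return "rss"  # probably broken RSS
    if re.search(rb"<(html|!doctype html)\b", head):
        return "html"  # probably tag-soup HTML
    return "unknown"
```

You’d call kind_from_content_type first when a header is available, and only fall back to sniff_kind(data) when it returns None or there is no header at all.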