I want to parse HTML, PDF, CSV and plain-text files. Which of these file types is the easiest and most efficient to parse?
I ask because I would like to parse PDF, HTML, CSV and text files with common parsing code, if possible.
Suppose HTML turns out to be the easiest and most efficient to parse. Then:
I would write the parsing code for HTML and try to convert PDF files to HTML (if possible), so that the HTML parsing code also works for PDF files.
In the same way I would convert CSV and text files to HTML, so that a single HTML parser handles HTML, PDF, CSV and text files.
So: (1) Which file type is the easiest and most efficient to parse (PDF, CSV, HTML, text)?
(2) Is it possible to convert these files (PDF, text, HTML, CSV) to each other?
For example, if HTML is the easiest to parse: PDF to HTML, text to HTML and CSV to HTML.
You cannot parse all of the above file types with the same parser code.
The simplest format is plain text, and CSV and HTML are both text files. That does not mean they are simple to parse, though. It really depends on how they are formatted.
PDF files are binary in nature, so they will require a different parser.
In general, the more structured the data, the easier the parsing (so, CSV would be easiest and probably fastest).
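To illustrate why structured data is the easy case, here is a minimal sketch that reads CSV text with Python's standard `csv` module. Python is used only for illustration; the function name `parse_csv` and the sample data are made up for this example.

```python
import csv
import io


def parse_csv(text):
    """Return the rows of a CSV document as a list of dicts keyed by header."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


# Made-up sample: the quoted field contains a comma, which a naive
# split(",") would break on but a real CSV parser handles.
sample = 'name,comment\nAda,"likes commas, a lot"\nBob,plain\n'
rows = parse_csv(sample)
```

Even here the quoting rules are the reason to prefer a library over splitting lines by hand.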
I would suggest using an existing parser instead of writing your own.
There are libraries around that will parse CSV and some other types of structured text (tab-delimited, for example); see FileHelpers.
For HTML parsing there is the HTML Agility Pack.
There are numerous PDF parsers, both free and commercial.
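Although one parser cannot handle every format, the per-format parsers can sit behind a common entry point. The following is a rough sketch of that idea using only Python's standard library rather than the libraries named above. The names `parse_document`, `PARSERS` and `_TextCollector` are hypothetical, and PDF is deliberately left unregistered because it would need a separate third-party parser.

```python
import csv
import io
from html.parser import HTMLParser
from pathlib import Path


class _TextCollector(HTMLParser):
    """Collects the non-blank text nodes of an HTML document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def parse_html(text):
    collector = _TextCollector()
    collector.feed(text)
    collector.close()
    return collector.chunks


def parse_csv(text):
    return list(csv.reader(io.StringIO(text)))


def parse_text(text):
    return text.splitlines()


# Hypothetical registry: one parser per text-based format.
# PDF is binary and is not covered here.
PARSERS = {
    ".html": parse_html,
    ".htm": parse_html,
    ".csv": parse_csv,
    ".txt": parse_text,
}


def parse_document(filename, text):
    """Pick a parser from the file extension and apply it to the text."""
    suffix = Path(filename).suffix.lower()
    try:
        parser = PARSERS[suffix]
    except KeyError:
        raise ValueError(f"no parser registered for {suffix!r}") from None
    return parser(text)
```

With this layout the calling code stays the same for every format, while each format keeps the parser that suits it, so nothing has to be converted to HTML first.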