We are dynamically creating PDFs using iText in our application. The content of the PDF is entered by the user in the web application, on a screen that has a Rich Text Editor (RTE).
Specifically, the steps are:
- The user goes to an "add PDF content" page.
- The add page has a Rich Text Editor where he can enter the PDF content.
- Sometimes the user copies content from an existing Word document and pastes it into the RTE.
- Once he submits the content, the PDF is created.
The RTE is used because we have some other pages where we need to show the content with bold, italics, etc.
But we don’t want this RTE markup in the generated PDF.
We use a Java utility to remove the RTE markup from the content before generating the PDF.
This normally works, but when the content is copied from a Word document, the HTML and CSS styles applied by Word are not removed by the Java utility we are using.
How can I generate the PDF without any HTML or CSS in it?
Here is the code:
Paragraph paragraph = new Paragraph(Util.removeHTML(content), font);
And the removeHTML method is as below:
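public static String removeHTML(String htmlString) {
    if (htmlString == null) {
        return "";
    }
    // Strings are immutable, so the result of replace() has to be assigned back.
    htmlString = htmlString.replace("\"", "'");
    // Strip anything that looks like a tag.
    htmlString = htmlString.replaceAll("\\<.*?>", "");
    // Drop non-breaking-space entities.
    htmlString = htmlString.replaceAll("&nbsp;", "");
    return htmlString;
}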
And below is the additional content that shows up in the PDF when I copy/paste from a Word document:
<w:LsdException Locked="false" Priority="10" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Title" />
<w:LsdException Locked="false" Priority="11" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Subtitle" />
<w:LsdException Locked="false" Priority="22" SemiHidden="false"
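The sample above shows tags that continue onto a second line. One likely reason they survive removeHTML, assuming the pasted content really contains line breaks inside tags, is that `.` in a Java regex does not match line terminators unless DOTALL mode is enabled. The sketch below compares the two modes; the class and method names are made up for illustration, and it is not a complete HTML cleaner (for example, the text inside a pasted style block would still be left behind).

```java
import java.util.regex.Pattern;

// Illustrative only: compares tag stripping with and without DOTALL.
final class TagStripDemo {
    // Default mode: '.' stops at line terminators, so a tag that
    // spans two lines never matches.
    private static final Pattern TAG_DEFAULT = Pattern.compile("<.*?>");

    // DOTALL mode: '.' also matches line terminators.
    private static final Pattern TAG_DOTALL =
            Pattern.compile("<.*?>", Pattern.DOTALL);

    static String stripDefault(String html) {
        return TAG_DEFAULT.matcher(html).replaceAll("");
    }

    static String stripDotAll(String html) {
        return TAG_DOTALL.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String pasted =
                "Hello <w:LsdException Locked=\"false\"\n  Name=\"Title\" /> world";
        System.out.println(stripDefault(pasted)); // multi-line tag is left in place
        System.out.println(stripDotAll(pasted));  // multi-line tag is removed
    }
}
```

The same effect can be had inline by prefixing the pattern with `(?s)`. For markup as messy as a Word paste, a real HTML parser is usually more robust than a regex.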
Please help!
Thanks.
Our application is similar: we have a Rich Text Editor (TinyMCE), and our output is a PDF generated via iText. We want the HTML to be as clean as possible, ideally using only the HTML tags supported by iText’s HTMLWorker. TinyMCE can do some of this, but there are still situations where an end user can submit HTML which is really screwed up and which can break iText’s ability to generate a PDF.
We’re using a combination of jSoup and jTidy, plus CSSParser to filter out unwanted CSS styles entered in HTML “style” attributes. HTML entered into TinyMCE is scrubbed by this service, which cleans up any paste-from-Word markup (if the user didn’t use the Paste From Word button in TinyMCE) and gives us HTML that translates well for iText’s HTMLWorker.
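To give a rough idea of what filtering a “style” attribute means, here is a minimal sketch in plain Java. It is not our service and does not use CSSParser: the class name, the whitelist contents, and the naive split on `;` are all assumptions made for illustration, and a real CSS parser handles cases this does not (for example, values that themselves contain a semicolon).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: keeps whitelisted declarations from a style value.
final class StyleFilter {
    // Hypothetical whitelist; adjust to what your PDF renderer supports.
    private static final Set<String> ALLOWED =
            Set.of("color", "font-size", "font-weight", "text-align", "width");

    static String filter(String style) {
        List<String> kept = new ArrayList<>();
        // Naive split: assumes no ';' appears inside a value.
        for (String declaration : style.split(";")) {
            int colon = declaration.indexOf(':');
            if (colon < 0) {
                continue; // not a "name: value" pair
            }
            String name = declaration.substring(0, colon)
                    .trim().toLowerCase(Locale.ROOT);
            String value = declaration.substring(colon + 1).trim();
            if (ALLOWED.contains(name) && !value.isEmpty()) {
                kept.add(name + ": " + value);
            }
        }
        return String.join("; ", kept);
    }
}
```

For example, `StyleFilter.filter("mso-style-name: Title; color: red")` drops the Word-specific declaration and keeps only `color: red`.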
I also found issues with table widths in iText’s HTMLWorker parser (5.0.6): if the table width is given in the style attribute, HTMLWorker ignores it and sets the table width to 0, so the code below includes some logic to work around that. We use the following libs: a
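The general idea of the table width workaround is to read the width out of the style value so it can be set as a plain `width` attribute on the table instead. The sketch below shows only the extraction step using a regex; the class and method names are hypothetical, and it assumes HTMLWorker accepts a `width` attribute given as a percentage or a bare number, which should be checked against the iText version in use.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: pulls a table width out of a style attribute value.
final class TableWidthFix {
    // Matches "width: 80%" or "width:400px" at the start of the value or
    // after a ';', so "max-width" and "border-width" are not picked up.
    private static final Pattern WIDTH = Pattern.compile(
            "(?i)(?:^|;)\\s*width\\s*:\\s*([0-9.]+)\\s*(%|px)?");

    /** Returns e.g. "80%" or "400", or null if the style has no width. */
    static String widthFromStyle(String style) {
        Matcher m = WIDTH.matcher(style);
        if (!m.find()) {
            return null;
        }
        // Keep the percent sign, drop a "px" unit (assumed to be the default).
        return "%".equals(m.group(2)) ? m.group(1) + "%" : m.group(1);
    }
}
```

The returned value would then be written to the table element as its `width` attribute with whatever HTML parser is doing the rest of the cleanup.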
Below is some code from a Groovy service we built to scrub the HTML, keep only the tags and style attributes supported by iText, and fix the table issue. A few assumptions made in the code are specific to our application. This is working really well for us at the moment.