The Archive Base

The Archive Base Latest Questions

Editorial Team
Asked: May 22, 2026

We generate PDFs dynamically with iText in our application. The user enters the PDF content in the web application, on a screen that has a Rich Text Editor (RTE).

The specific steps are:

  1. The user goes to an "add PDF content" page.
  2. The page has a Rich Text Editor where the user enters the PDF content.
  3. Sometimes the user copies content from an existing Word document and pastes it into the RTE.
  4. Once the user submits the content, the PDF is created.

The RTE is used because some other pages need to show the content with formatting such as bold and italics.

But we don’t want this RTE markup in the generated PDF.

We use a Java utility method to strip the RTE markup from the content before generating the PDF.

This normally works, but when the content is copied from a Word document, the HTML and CSS styles that Word adds are not removed by the utility.

How can I generate the PDF without any HTML or CSS in it?

Here is the code:

Paragraph paragraph = new Paragraph(Util.removeHTML(content), font);

The removeHTML method is:

public static String removeHTML(String htmlString) {
    if (htmlString == null)
        return "";
    htmlString.replace("\"", "'");
    htmlString = htmlString.replaceAll("\\<.*?>", "");
    htmlString = htmlString.replaceAll("&nbsp;", "");
    return htmlString;
}

Below is the extra content that shows up in the PDF when I copy and paste from a Word document:

<w:LsdException Locked="false" Priority="10" SemiHidden="false
UnhideWhenUsed="false" QFormat="true" Name="Title" />
<w:LsdException Locked="false" Priority="11" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Subtitle" />
<w:LsdException Locked="false" Priority="22" SemiHidden="false"

Please help!

Thanks.
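
A note on the removeHTML method in the question. One likely reason the `w:LsdException` tags survive is that each of them spans two lines: in Java regular expressions `.` does not match line terminators unless DOTALL mode is enabled, so `<.*?>` cannot match a tag that is broken across lines. In addition, the result of `htmlString.replace("\"", "'")` is discarded, because Java strings are immutable. The following is a minimal sketch under those assumptions. The class name `HtmlStripper` is hypothetical, replacing `&nbsp;` with a space (instead of removing it) is a choice made here, and a regex remains only a rough stand-in for a real HTML parser (for example, a `>` inside an attribute value would confuse it).

```java
import java.util.regex.Pattern;

class HtmlStripper {
    // (?s) turns on DOTALL so '.' also matches newlines; (?i) ignores case.
    private static final Pattern COMMENTS = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern STYLE_BLOCKS = Pattern.compile("(?is)<style\\b.*?</style>");
    private static final Pattern TAGS = Pattern.compile("(?s)<.*?>");

    static String removeHTML(String htmlString) {
        if (htmlString == null) {
            return "";
        }
        // Drop comments first: Word wraps its <xml> metadata in conditional comments.
        String text = COMMENTS.matcher(htmlString).replaceAll("");
        // Drop <style> blocks including their CSS text, not just the tags.
        text = STYLE_BLOCKS.matcher(text).replaceAll("");
        // Drop every remaining tag, even when it spans several lines.
        text = TAGS.matcher(text).replaceAll("");
        // Assumption: a non-breaking space becomes a plain space.
        text = text.replace("&nbsp;", " ");
        // Strings are immutable, so the replaced value must be kept.
        return text.replace("\"", "'");
    }

    public static void main(String[] args) {
        String pasted = "<w:LsdException Locked=\"false\" Priority=\"10\"\n"
                + "UnhideWhenUsed=\"false\" Name=\"Title\" />Hello&nbsp;<b>World</b>";
        System.out.println(removeHTML(pasted));
    }
}
```

The call in the question would stay the same apart from the class name: `new Paragraph(HtmlStripper.removeHTML(content), font)`.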

1 Answer

  1. Editorial Team
    Added an answer on May 22, 2026 at 1:00 am

    Our application is similar: we have a Rich Text Editor (TinyMCE), and our output is a PDF generated with iText. We want the HTML to be as clean as possible, ideally using only the HTML tags supported by iText’s HTMLWorker. TinyMCE can do some of this, but an end user can still submit badly malformed HTML that can break iText’s ability to generate a PDF.

    We’re using a combination of jsoup, JTidy, and CSSParser: jsoup and JTidy clean the HTML, and CSSParser filters unwanted CSS out of HTML “style” attributes. HTML entered into TinyMCE is scrubbed by this service, which cleans up any paste-from-Word markup (if the user didn’t use the Paste From Word button in TinyMCE) and gives us HTML that translates well for iText’s HTMLWorker.
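
The style-attribute filtering described above can be sketched in plain Java. This is only an illustration: it splits on `;` and `:` instead of using CSSParser, so it assumes simple `name: value` declarations with no semicolons inside values, and the names `StyleFilter` and `filterStyle` are hypothetical. The list of allowed properties is the one used in the service code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class StyleFilter {
    // Properties the service keeps; every other declaration is dropped.
    private static final List<String> ALLOWED = Arrays.asList(
            "padding-left", "text-decoration", "text-align", "font-weight", "font-style");

    /** Returns the style reduced to the allowed properties, or "" if none remain. */
    static String filterStyle(String style) {
        if (style == null) {
            return "";
        }
        StringBuilder kept = new StringBuilder();
        for (String declaration : style.split(";")) {
            int colon = declaration.indexOf(':');
            if (colon < 0) {
                continue; // not a "name: value" pair
            }
            String name = declaration.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = declaration.substring(colon + 1).trim();
            if (ALLOWED.contains(name) && !value.isEmpty()) {
                kept.append(name).append(": ").append(value).append(";");
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        // Word-specific and colour declarations are dropped, the rest is kept.
        System.out.println(filterStyle("mso-list: l0; font-weight: bold; color: red; text-align: right"));
        // prints: font-weight: bold;text-align: right;
    }
}
```

When the result is empty, the caller would remove the `style` attribute altogether rather than write an empty one.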

    I also found an issue with table widths in iText’s HTMLWorker parser (5.0.6): if the table width is in the style attribute, HTMLWorker ignores it and sets the table width to 0, so the code below includes some logic to fix that. We use the following libs:

    com.itextpdf:itextpdf:5.0.6                 // used to generate PDFs
    org.jsoup:jsoup:1.5.2                       // used for cleaning HTML, primary cleaner
    net.sf.jtidy:jtidy:r938                     // used for cleaning HTML, secondary cleaner
    net.sourceforge.cssparser:cssparser:0.9.5   // used to parse out unwanted HTML "style" attribute values
    

    Below is some code from a Groovy service we built to scrub the HTML, keep only the tags and style attributes supported by iText, and fix the table issue. A few assumptions in the code are specific to our application. This is working really well for us at the moment.

    import com.steadystate.css.parser.CSSOMParser
    import org.jsoup.Jsoup
    import org.jsoup.nodes.Document
    import org.jsoup.safety.Cleaner
    import org.jsoup.safety.Whitelist
    import org.jsoup.select.Elements
    import org.w3c.css.sac.InputSource
    import org.w3c.dom.css.CSSStyleDeclaration
    import org.w3c.tidy.Tidy
    
    class HtmlCleanerService {
    
        static transactional = true
    
        def cleanHTML(def html) {
    
            // clean with JSoup which should filter out most unwanted things and
            // ensure good html syntax
            html = soupClean(html);
    
            // run through JTidy to remove repeated nested tags, clean anything JSoup left out
            html = tidyClean(html);
    
            return html;
        }
    
        def tidyClean(def html) {
            Tidy tidy = new Tidy() 
            tidy.setAsciiChars(true)
            tidy.setDropEmptyParas(true)
            tidy.setDropProprietaryAttributes(true)
            tidy.setPrintBodyOnly(true)
    
            tidy.setEncloseText(true)
            tidy.setJoinStyles(true)
            tidy.setLogicalEmphasis(true)
            tidy.setQuoteMarks(true)
            tidy.setHideComments(true)
            tidy.setWraplen(120)
    
            // (makeClean || dropFontTags) = replaces presentational markup by style rules
            tidy.setMakeClean(true)     // remove presentational clutter.
            tidy.setDropFontTags(true)  
    
            // word2000 = drop style & class attributes and empty p, span elements
            // draconian cleaning for Word2000
            tidy.setWord2000(true)      
            tidy.setMakeBare(true)      // remove Microsoft cruft.
            tidy.setRepeatedAttributes(org.w3c.tidy.Configuration.KEEP_FIRST) // keep first or last duplicate attribute
    
            // TODO ? tidy.setForceOutput(true)
    
            def reader = new StringReader(html);
            def writer = new StringWriter();
    
            // hide output from stderr
            tidy.setShowWarnings(false)
            tidy.setErrout(new PrintWriter(new StringWriter()))
    
            tidy.parse(reader, writer); // run tidy, providing an input and output stream
            return writer.toString()
        }
    
        def soupClean(def html) {
    
            // clean the html
            Document dirty = Jsoup.parseBodyFragment(html);
            Cleaner cleaner = new Cleaner(createWhitelist());
            Document clean = cleaner.clean(dirty);
    
            // now hunt down all style attributes and ensure we only have those that render with iTextPDF
            Elements styledNodes = clean.select("[style]"); // every element that has a style attribute
            styledNodes.each { element ->
                def style = element.attr("style");
                def tag = element.tagName().toLowerCase()
                def newstyle = ""
                CSSOMParser parser = new CSSOMParser();
                InputSource is = new InputSource(new StringReader(style))
                CSSStyleDeclaration styledeclaration = parser.parseStyleDeclaration(is)
                boolean hasProps = false
                for (int i=0; i < styledeclaration.getLength(); i++) {
                    def propname = styledeclaration.item(i)
                    def propval = styledeclaration.getPropertyValue(propname)
                    propval = propval ? propval.trim() : ""
    
                    if (["padding-left", "text-decoration", "text-align", "font-weight", "font-style"].contains(propname)) {
                        newstyle = newstyle + propname + ": " + propval + ";"
                        hasProps = true
                    }
    
                    // standardize table widths, itextPDF won't render tables if there is only width in the
                    // style attribute.  Here we ensure the width is in its own attribute, and change the value so
                    // it is in percentage and no larger than 100% to avoid end users from creating really goofy
                    // tables that they can't edit properly because they have made the width too large.
                    //
                    // width of the display area in the editor is about 740px, so let's ensure everything
                    // is relative to that
                    //
                    // TODO could get into trouble with nested tables and widths within as we assume
                    // one table (e.g. could have nested tables both with widths of 500)
                    if (tag.equals("table") && propname.equals("width")) {
                        if (propval.endsWith("%")) {
                            // ensure it is <= 100%
                            propval = propval.replaceAll(~"[^0-9]", "")
                            propval = Math.min(100, propval.toInteger())
                        }
                        else {
                            // else we have measurement in px or assumed px, clean up and
                            // get integer value, then calculate a percentage.
                            // Multiply by 100 before dividing: casting (px / 740) to int
                            // first would truncate every width below 740px to 0.
                            propval = propval.replaceAll(~"[^0-9]", "")
                            propval = Math.min(100, (int) (propval.toInteger() * 100 / 740))
                        }
                        element.attr("width", propval + "%")
                    }
                }
                if (hasProps) {
                    element.attr("style", newstyle)
                } else {
                    element.removeAttr("style")
                }
    
            }
    
            return clean.body().html();
        }
    
        /**
         * Returns a JSoup whitelist suitable for sane HTML output and iTextPDF 
         */
        def createWhitelist() {
            Whitelist wl = new Whitelist();
    
            // iText supported tags
            wl.addTags(
                "br", "div", "p", "pre", "span", "blockquote", "q", "hr",
                "h1", "h2", "h3", "h4", "h5", "h6",
                "u", "strike", "s", "strong", "sub", "sup", "em", "i", "b",
                "ul", "ol", "li",
                "table", "tbody", "td", "tfoot", "th", "thead", "tr"
                );
    
            // iText attributes recognized which we care about
            // padding-left (div/p/span indentation)
            // text-align (for table right/left align)
            // text-decoration (for span/div/p underline, strikethrough)
            // font-weight (for span/div/p bolder etc)
            // font-style (for span/div/p italic etc)
            // width (for tables)
            // colspan/rowspan (for tables)
    
            ["span", "div", "p", "table", "ul", "ol", "pre", "td", "th"].each { tag ->
                ["style", "padding-left", "text-decoration", "text-align", "font-weight", "font-style"].each { attr ->
                    wl.addAttributes(tag, attr)
                }
            }
    
            ["td", "th"].each { tag ->
                ["colspan", "rowspan", "width"].each { attr ->
                    wl.addAttributes(tag, attr)
                }
            }
            wl.addAttributes("table", "width", "style", "cellpadding")
    
            // img support
            // wl.addAttributes("img", "align", "alt", "height", "src", "title", "width")
    
    
            return wl
        }
    }
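
The table-width rule in `soupClean` can be checked in isolation. The sketch below is a plain-Java restatement, not the service's own code: `TableWidth` and `toPercent` are hypothetical names, and the 740px editor width is the assumption stated in the service's comments. It multiplies by 100 before dividing so that integer division does not truncate the ratio to zero. Like the Groovy code, it strips every non-digit, so a value such as "50.5%" is read as 505 and capped.

```java
class TableWidth {
    // Assumed width of the editor's display area in pixels.
    private static final int EDITOR_WIDTH_PX = 740;

    /**
     * Converts a CSS width such as "500px", "500" or "120%" to a percentage
     * string capped at 100%. Returns null when the value contains no digits
     * (for example "auto"), so the caller can skip the width attribute.
     */
    static String toPercent(String cssWidth) {
        if (cssWidth == null) {
            return null;
        }
        String value = cssWidth.trim();
        String digits = value.replaceAll("[^0-9]", "");
        if (digits.isEmpty()) {
            return null;
        }
        if (digits.length() > 6) {
            return "100%"; // far too large to parse safely; cap it
        }
        int number = Integer.parseInt(digits);
        int percent = value.endsWith("%")
                ? number                          // already a percentage
                : number * 100 / EDITOR_WIDTH_PX; // px, or unitless assumed px
        return Math.min(100, percent) + "%";
    }

    public static void main(String[] args) {
        System.out.println(toPercent("500px")); // 67%
        System.out.println(toPercent("120%"));  // 100%
        System.out.println(toPercent("auto"));  // null
    }
}
```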
    
© 2021 The Archive Base. All Rights Reserved