Class HtmlTool


  • @DefaultKey("htmlTool")
    public class HtmlTool
    extends org.apache.velocity.tools.generic.SafeConfig
    An Apache Velocity tool that provides utility methods to manipulate HTML code using the jsoup HTML5 parser.

    The methods utilise CSS selectors to refer to specific elements for manipulation.

    Since:
    1.0
    Author:
    Andrius Velykis, Christophe Friederich
    See Also:
    jsoup HTML parser, jsoup CSS selectors
    • Field Detail

      • DEFAULT_SLUG_SEPARATOR

        public static final String DEFAULT_SLUG_SEPARATOR
        Default separator used to generate the slug for a heading name.
        See Also:
        Constant Field Values
    • Constructor Detail

      • HtmlTool

        public HtmlTool()
    • Method Detail

      • configure

        protected void configure​(org.apache.velocity.tools.generic.ValueParser values)
        Overrides:
        configure in class org.apache.velocity.tools.generic.SafeConfig
        See Also:
        SafeConfig.configure(ValueParser)
      • normaliseWhitespace

        @Nullable
        public String normaliseWhitespace​(@Nullable
                                          String html)
        Normalises the whitespace within this string: multiple spaces collapse to a single space, and all whitespace characters (e.g. newline, tab) are converted to a simple space.
        Parameters:
        html - html content to normalise.
        Returns:
        Returns normalised string.
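
        The collapsing rule described above can be approximated with a single regular expression. This is a minimal sketch that uses only the JDK; the class name WhitespaceSketch is hypothetical and the code is not the HtmlTool or jsoup implementation. It assumes that a null input is passed through unchanged, as the @Nullable annotations suggest.

```java
import java.util.regex.Pattern;

// Illustrative stand-in only; not the HtmlTool or jsoup source.
final class WhitespaceSketch {

    // One or more whitespace characters (space, tab, newline, ...).
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Collapses every run of whitespace into a single space; null stays null (assumption).
    static String normaliseWhitespace(String html) {
        if (html == null) {
            return null;
        }
        return WHITESPACE_RUN.matcher(html).replaceAll(" ");
    }

    public static void main(String[] args) {
        // prints: <p>Hello, world</p>
        System.out.println(normaliseWhitespace("<p>Hello,\n\t  world</p>"));
    }
}
```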
      • split

        public List<String> split​(@Nonnull
                                  String content,
                                  @Nonnull
                                  String separatorCssSelector)
        Splits the given HTML content into partitions based on the given separator selector. The separators themselves are dropped from the results.
        Parameters:
        content - body HTML content to split (cannot be empty or null).
        separatorCssSelector - CSS selector for separators (cannot be empty or null).
        Returns:
        a list of HTML partitions split on separator locations, but without the separators.
        Since:
        1.0
        See Also:
        split(String, String, JoinSeparator)
      • splitOnStarts

        public List<String> splitOnStarts​(@Nonnull
                                          String content,
                                          @Nonnull
                                          String separatorCssSelector)
        Splits the given HTML content into partitions based on the given separator selector. The separators are kept as first elements of the partitions.

        Note that the first part is removed if the split was successful. This is because the first part does not include the separator.

        Parameters:
        content - HTML content to split
        separatorCssSelector - CSS selector for separators
        Returns:
        a list of HTML partitions split on separator locations (except the first one), with separators at the beginning of each partition
        Since:
        1.0
        See Also:
        split(String, String, JoinSeparator)
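
        The difference between split and splitOnStarts is easier to see on a simplified model. The sketch below is an assumption-laden illustration, not the library code: a flat list of string tokens stands in for sibling HTML elements, a Predicate stands in for the CSS selector, and the class name SplitSketch is hypothetical. Nested elements and the handling of empty partitions are not modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model: a flat list of tokens stands in for sibling HTML elements.
final class SplitSketch {

    // Drops separators; each run of tokens between them becomes one partition.
    static List<List<String>> split(List<String> tokens, Predicate<String> isSeparator) {
        List<List<String>> parts = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            if (isSeparator.test(token)) {
                parts.add(current);
                current = new ArrayList<>();
            } else {
                current.add(token);
            }
        }
        parts.add(current);
        return parts;
    }

    // Keeps each separator as the first token of its partition and drops the
    // leading part (the one before the first separator) when a split happened.
    static List<List<String>> splitOnStarts(List<String> tokens, Predicate<String> isSeparator) {
        List<List<String>> parts = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            if (isSeparator.test(token)) {
                parts.add(current);
                current = new ArrayList<>();
            }
            current.add(token);
        }
        parts.add(current);
        return parts.size() > 1 ? new ArrayList<>(parts.subList(1, parts.size())) : parts;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("intro", "h2:A", "p1", "h2:B", "p2");
        Predicate<String> heading = t -> t.startsWith("h2:");
        System.out.println(split(tokens, heading));         // [[intro], [p1], [p2]]
        System.out.println(splitOnStarts(tokens, heading)); // [[h2:A, p1], [h2:B, p2]]
    }
}
```

        In this model, content without any separator comes back as a single partition from both methods.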
      • split

        public List<String> split​(@Nonnull
                                  String content,
                                  @Nonnull
                                  String separatorCssSelector,
                                  String separatorStrategy)
        Splits the given HTML content into partitions based on the given separator selector. The separators are either dropped or joined with the partition before or after them, depending on the indicated separator strategy.
        Parameters:
        content - HTML content to split
        separatorCssSelector - CSS selector for separators
        separatorStrategy - strategy to drop or keep separators, one of "after", "before" or "no"
        Returns:
        a list of HTML partitions split on separator locations.
        Since:
        1.0
        See Also:
        split(String, String, JoinSeparator)
      • split

        public List<String> split​(@Nonnull
                                  String content,
                                  @Nonnull
                                  String separatorCssSelector,
                                  @Nonnull
                                  HtmlTool.JoinSeparator separatorStrategy)
        Splits the given HTML content into partitions based on the given separator selector. The separators are either dropped or joined with the partition before or after them, depending on the indicated separator strategy.

        Note that the splitting algorithm tries to resolve nested elements so that the returned partitions are self-contained HTML elements. The nesting is normally contained within the first applicable partition.

        Parameters:
        content - Body HTML content to split
        separatorCssSelector - CSS selector for separators
        separatorStrategy - strategy to drop or keep separators
        Returns:
        a list of HTML partitions split on separator locations. If no splitting occurs, returns the original content as the single element of the list
        Since:
        1.0
      • reorderToTop

        public String reorderToTop​(String content,
                                   String selector,
                                   int amount)
        Reorders elements in HTML content so that selected elements are found at the top of the content. Can be limited to a certain amount, e.g. to bring just the first of selected elements to the top.
        Parameters:
        content - HTML content to reorder
        selector - CSS selector for elements to bring to top of the content
        amount - Maximum number of elements to reorder
        Returns:
        HTML content with reordered elements, or the original content if no such elements found.
        Since:
        1.0
      • reorderToTop

        public String reorderToTop​(String content,
                                   String selector,
                                   int amount,
                                   String wrapRemaining)
        Reorders elements in HTML content so that selected elements are found at the top of the content. Can be limited to a certain amount, e.g. to bring just the first of selected elements to the top.
        Parameters:
        content - HTML content to reorder
        selector - CSS selector for elements to bring to top of the content
        amount - Maximum number of elements to reorder
        wrapRemaining - HTML to wrap the remaining (non-reordered) part
        Returns:
        HTML content with reordered elements, or the original content if no such elements found.
        Since:
        1.0
      • extract

        @Nonnull
        public HtmlTool.ExtractResult extract​(String content,
                                              String selector,
                                              int amount)
        Extracts HTML elements from the main HTML content. The result consists of the extracted HTML elements and the remainder of HTML content, with these elements removed. Can be limited to a certain amount, e.g. to extract just the first of selected elements.
        Parameters:
        content - HTML content to extract elements from
        selector - CSS selector for elements to extract
        amount - Maximum number of elements to extract
        Returns:
        HTML content of the extracted elements together with the remainder of the original content. If no elements are found, the remainder contains the original content.
        Since:
        1.0
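
        The extracted/remainder contract and the amount limit can be illustrated on a simplified model. This is a sketch only: string tokens stand in for HTML elements, a Predicate stands in for the CSS selector, and the names ExtractSketch and Result are hypothetical (the real result type is HtmlTool.ExtractResult). Any special meaning the library may give to particular amount values is not modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model: tokens stand in for HTML elements.
final class ExtractSketch {

    // Pair of lists mirroring the "extracted" and "remainder" halves of the result.
    static final class Result {
        final List<String> extracted = new ArrayList<>();
        final List<String> remainder = new ArrayList<>();
    }

    // Moves at most `amount` matching tokens into `extracted`, in document order;
    // everything else (including matches beyond the limit) stays in `remainder`.
    static Result extract(List<String> tokens, Predicate<String> matches, int amount) {
        Result result = new Result();
        for (String token : tokens) {
            if (result.extracted.size() < amount && matches.test(token)) {
                result.extracted.add(token);
            } else {
                result.remainder.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = extract(List.of("img1", "p", "img2", "img3"), t -> t.startsWith("img"), 2);
        System.out.println(r.extracted); // [img1, img2]
        System.out.println(r.remainder); // [p, img3]
    }
}
```

        Concatenating extracted and then remainder in this model gives the ordering that reorderToTop describes.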
      • setAttr

        public String setAttr​(String content,
                              String selector,
                              String attributeKey,
                              String value)
        Sets attribute to the given value on elements in HTML.
        Parameters:
        content - HTML content to set attributes on
        selector - CSS selector for elements to modify
        attributeKey - Attribute name
        value - Attribute value
        Returns:
        HTML content with modified elements. If no elements are found, the original content is returned.
        Since:
        1.0
      • parse

        public org.jsoup.nodes.Document parse​(@Nonnull
                                              String content)
        Parses an HTML body fragment into a document, placing the fragment inside the <body> element.
        Parameters:
        content - body HTML fragment (cannot be null).
        Returns:
        the parsed document, whose <body> element holds the content
      • getAttr

        public List<String> getAttr​(String content,
                                    String selector,
                                    String attributeKey)
        Retrieves the value of an attribute from elements in HTML. Returns the attribute values of all elements matching the selector, since there can be more than one.
        Parameters:
        content - HTML content to read attributes from
        selector - CSS selector for elements to find
        attributeKey - Attribute name
        Returns:
        Attribute values for all matching elements. If no elements are found, an empty list is returned.
        Since:
        1.0
      • addClasses

        @Nonnull
        public String addClasses​(@Nonnull
                                 String baseClass,
                                 @Nonnull
                                 String additionalClasses)
        Adds given class names to a base class name.
        Parameters:
        baseClass - Base class name
        additionalClasses - Additional class names
        Returns:
        Combined class names
      • addClasses

        @Nonnull
        public String addClasses​(@Nonnull
                                 String baseClass,
                                 @Nonnull
                                 String... additionalClasses)
        Adds given class names to a base class name.
        Parameters:
        baseClass - Base class name
        additionalClasses - Additional class names
        Returns:
        Combined class names
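
        One plausible reading of "combined class names" is a space-separated class attribute value. The sketch below follows that reading; it is an assumption, not the library source. The class name AddClassesSketch and the choice to skip blank names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in only; not the HtmlTool source.
final class AddClassesSketch {

    // Joins the base class and any non-blank additional names with single spaces.
    static String addClasses(String baseClass, String... additionalClasses) {
        List<String> names = new ArrayList<>();
        addIfPresent(names, baseClass);
        for (String extra : additionalClasses) {
            addIfPresent(names, extra);
        }
        return String.join(" ", names);
    }

    // Skips null or blank names so the result never contains doubled spaces.
    private static void addIfPresent(List<String> names, String name) {
        if (name != null && !name.trim().isEmpty()) {
            names.add(name.trim());
        }
    }

    public static void main(String[] args) {
        // prints: btn btn-primary active
        System.out.println(addClasses("btn", "btn-primary", " ", "active"));
    }
}
```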
      • addClass

        public String addClass​(String content,
                               String selector,
                               List<String> classNames,
                               int amount)
        Adds given class names to the elements in HTML.
        Parameters:
        content - HTML content to modify
        selector - CSS selector for elements to add classes to
        classNames - Names of classes to add to the selected elements
        amount - Maximum number of elements to modify
        Returns:
        HTML content with modified elements. If no elements are found, the original content is returned.
        Since:
        1.0
      • addClass

        public String addClass​(String content,
                               String selector,
                               List<String> classNames)
        Adds given class names to the elements in HTML.
        Parameters:
        content - HTML content to modify
        selector - CSS selector for elements to add classes to
        classNames - Names of classes to add to the selected elements
        Returns:
        HTML content with modified elements. If no elements are found, the original content is returned.
        Since:
        1.0
      • addClass

        public String addClass​(String content,
                               String selector,
                               String className)
        Adds given class to the elements in HTML.
        Parameters:
        content - HTML content to modify
        selector - CSS selector for elements to add the class to
        className - Name of class to add to the selected elements
        Returns:
        HTML content with modified elements. If no elements are found, the original content is returned.
        Since:
        1.0
      • wrap

        public String wrap​(String content,
                           String selector,
                           String wrapHtml,
                           int amount)
        Wraps elements in HTML with the given HTML.
        Parameters:
        content - HTML content to modify
        selector - CSS selector for elements to wrap
        wrapHtml - HTML to use for wrapping the selected elements
        amount - Maximum number of elements to modify
        Returns:
        HTML content with modified elements. If no elements are found, the original content is returned.
        Since:
        1.0
      • remove

        public String remove​(String content,
                             String selector)
        Removes elements from HTML.
        Parameters:
        content - HTML content to modify
        selector - CSS selector for elements to remove
        Returns:
        HTML content with removed elements. If no elements are found, the original content is returned.
        Since:
        1.0
      • replace

        public String replace​(String content,
                              String selector,
                              String replacement)
        Replaces elements in HTML.
        Parameters:
        content - HTML content to modify
        selector - CSS selector for elements to replace
        replacement - HTML replacement (must parse to a single element)
        Returns:
        HTML content with replaced elements. If no elements are found, the original content is returned.
        Since:
        1.0
      • replaceAll

        public String replaceAll​(String content,
                                 Map<String,​String> replacements)
        Replaces elements in HTML.
        Parameters:
        content - HTML content to modify
        replacements - Map of CSS selectors to their replacement HTML texts. CSS selectors find elements to be replaced with the HTML in the mapping. The HTML must parse to a single element.
        Returns:
        HTML content with replaced elements. If no elements are found, the original content is returned.
        Since:
        1.0
      • replaceWith

        public String replaceWith​(String content,
                                  String selector,
                                  String newElement)
        Replaces all elements in HTML that match the selector while preserving the content of each replaced element.
        Parameters:
        content - HTML content to modify
        selector - CSS selector for elements to replace
        newElement - HTML replacement (must parse to a single element)
        Returns:
        HTML content with replaced elements. If no elements are found, the original content is returned.
        Since:
        2.0
      • text

        public List<String> text​(@Nullable
                                 String content,
                                 @Nonnull
                                 String selector)
        Retrieves text content of the selected elements in HTML. Renders the element's text as it would be displayed on the web page (including its children).
        Parameters:
        content - HTML content with the elements
        selector - CSS selector for elements to extract contents
        Returns:
        A list of element texts as rendered for display. An empty list if no elements are found.
        Since:
        1.0
      • headingAnchorToId

        public String headingAnchorToId​(String content)
        Transforms the given HTML content by moving anchor (<a name="myheading">) names to IDs for heading elements.

        The anchors are used to indicate positions within an HTML page. In HTML5, however, the name attribute is no longer supported on the <a> tag. The positions within pages are indicated using the id attribute instead, e.g. <h1 id="myheading">.

        The method finds anchors inside, immediately before or after the heading tags and uses their name as heading id instead. The anchors themselves are removed.

        Parameters:
        content - HTML content to modify
        Returns:
        HTML content with modified elements. Anchor names are used for adjacent headings, and anchor tags are removed. If no elements are found, the original content is returned.
        Since:
        1.0
      • concat

        public static List<String> concat​(List<String> elements,
                                          String text,
                                          boolean append)
        Utility method to concatenate a String to a list of Strings. The text can be either appended or prepended.
        Parameters:
        elements - list of elements to append/prepend the text to
        text - the given text to append/prepend
        append - if true, text will be appended to the elements. If false, it will be prepended
        Returns:
        list of elements with the text appended/prepended
        Since:
        1.0
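
        A minimal sketch of the behaviour described above, assuming the text is attached to every element of the list (rather than added as a new list entry) and that a new list is returned. The class name ConcatSketch is hypothetical and the code is not the library source.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in only; not the HtmlTool source.
final class ConcatSketch {

    // Returns a new list in which `text` is appended or prepended to each element.
    static List<String> concat(List<String> elements, String text, boolean append) {
        List<String> result = new ArrayList<>(elements.size());
        for (String element : elements) {
            result.add(append ? element + text : text + element);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(concat(List.of("intro", "usage"), "#", false));    // [#intro, #usage]
        System.out.println(concat(List.of("intro", "usage"), ".html", true)); // [intro.html, usage.html]
    }
}
```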
      • ensureHeadingIds

        public String ensureHeadingIds​(String pageType,
                                       String currentPage,
                                       String content,
                                       String idSeparator)
        Transforms the given HTML content by adding IDs to all heading elements (h1-6) that do not have one.

        IDs on heading elements are used to indicate positions within an HTML page in HTML5. If a heading tag without an id is found, its "slug" is generated automatically based on the heading contents and used as the ID.

        Note that the algorithm also modifies existing IDs that have symbols not allowed in CSS selectors, e.g. ":", ".", etc. The symbols are removed.

        Parameters:
        pageType - The type of page.
        currentPage - The name of current page.
        content - HTML content to modify.
        idSeparator - the separator used when generating the slug ID.
        Returns:
        a String representing HTML content with all heading elements having id attributes. If all headings already had IDs, the original content is returned.
        Since:
        1.0
      • fixTableHeads

        public String fixTableHeads​(String content)
        Fixes table heads: wraps rows with <th> (table heading) elements into <thead> element if they are currently in <tbody>.
        Parameters:
        content - HTML content to modify
        Returns:
        HTML content with all table heads fixed. If all heads were correct, the original content is returned.
        Since:
        1.0
      • slug

        public static String slug​(String input)
        Creates a slug (Latin text with no whitespace or other symbols) for a longer text (e.g. for use in URLs). Uses "-" as a whitespace separator.
        Parameters:
        input - text to generate the slug from
        Returns:
        the slug of the given text that contains alphanumeric symbols and "-" only
        Since:
        1.0
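
        Slug generation of this kind is commonly done with Unicode decomposition followed by character filtering. The sketch below shows that approach using only the JDK; it is an assumption about how such a slug could be built, not the library source. The class name SlugSketch is hypothetical, and lowercasing the result is an extra assumption that the description above does not state.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

// Illustrative stand-in only; not the HtmlTool source.
final class SlugSketch {

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    // Anything that is not an ASCII letter, a digit or "-".
    private static final Pattern NON_SLUG = Pattern.compile("[^A-Za-z0-9-]");

    static String slug(String input) {
        // 1. Replace each run of whitespace with the "-" separator.
        String hyphenated = WHITESPACE.matcher(input.trim()).replaceAll("-");
        // 2. Decompose accented letters so the base letter survives step 3.
        String decomposed = Normalizer.normalize(hyphenated, Normalizer.Form.NFD);
        // 3. Drop everything except alphanumerics and "-".
        String filtered = NON_SLUG.matcher(decomposed).replaceAll("");
        // 4. Lowercase (assumption, see lead-in).
        return filtered.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        System.out.println(slug("Getting Started"));   // getting-started
        System.out.println(slug("Caf\u00e9 au lait")); // cafe-au-lait
    }
}
```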
      • headingTree

        public List<? extends HtmlTool.IdElement> headingTree​(String content,
                                                              List<String> sections)
        Reads all headings in the given HTML content as a hierarchy. Subsequent smaller headings are nested within bigger ones, e.g. <h2> is nested under preceding <h1>.

        Only headings with IDs are included in the hierarchy. The result elements contain ID and heading text for each heading. The hierarchy is useful to generate a Table of Contents for a page.

        Parameters:
        content - HTML content to extract heading hierarchy from
        sections - list of all sections
        Returns:
        a list of top-level heading items (with id and text). The remaining headings are nested within these top-level items. Empty list if no headings are in the content.
        Since:
        1.0
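
        The nesting rule described above (smaller headings nested under the preceding bigger one, headings without IDs skipped) can be sketched with a stack of currently open headings. This is an illustration under stated assumptions, not the library source: the input is an already extracted list of headings in document order, and the names HeadingTreeSketch, Node and buildTree are hypothetical (the real result type is HtmlTool.IdElement).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative stand-in only; not the HtmlTool source.
final class HeadingTreeSketch {

    // Hypothetical node type holding the heading level, ID, text and nested headings.
    static final class Node {
        final int level; // 1 for <h1> ... 6 for <h6>
        final String id;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(int level, String id, String text) {
            this.level = level;
            this.id = id;
            this.text = text;
        }
    }

    // Nests each heading under the nearest preceding heading with a smaller level number.
    static List<Node> buildTree(List<Node> headingsInDocumentOrder) {
        List<Node> roots = new ArrayList<>();
        Deque<Node> open = new ArrayDeque<>(); // stack of ancestors of the current position
        for (Node heading : headingsInDocumentOrder) {
            if (heading.id == null || heading.id.isEmpty()) {
                continue; // only headings with IDs are included
            }
            // Close headings of the same or a deeper level.
            while (!open.isEmpty() && open.peek().level >= heading.level) {
                open.pop();
            }
            if (open.isEmpty()) {
                roots.add(heading);
            } else {
                open.peek().children.add(heading);
            }
            open.push(heading);
        }
        return roots;
    }

    public static void main(String[] args) {
        List<Node> roots = buildTree(List.of(
                new Node(1, "intro", "Introduction"),
                new Node(2, "install", "Installation"),
                new Node(2, "usage", "Usage"),
                new Node(1, "faq", "FAQ")));
        System.out.println(roots.size());                 // 2
        System.out.println(roots.get(0).children.size()); // 2
    }
}
```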