Static Public Member Functions
static String	stripTags (String html)

static String	stripEntities (String html)

static String	stripComments (String html)

static String	stripElement (String html, String name)

static String	stripBlockElements (String html)

static String	stripHtml (String html)

Detailed Description

Cleans simple, valid HTML 4/5 into plain text. For simplicity, this class assumes the HTML is already valid; it does not validate the HTML itself. For example, the stripEntities(String) method removes HTML entities but does not check that the removed entity was valid.

Look at the "See Also" section for useful classes and methods for implementing this class.

See also: String::replaceAll(String, String); Pattern::DOTALL; Pattern::CASE_INSENSITIVE; StringEscapeUtils::unescapeHtml4(String)

Author: CS 272 Software Development (University of San Francisco)

Version: Spring 2024

Member Function Documentation

◆ stripBlockElements()

static String edu.usfca.cs272.HtmlCleaner.stripBlockElements ( String html )

A simple (but less efficient) approach for removing comments and certain block elements from the provided html. The block elements removed include: head, style, script, noscript, iframe, and svg.

Parameters

html	valid HTML 4 text

Returns: text clean of any comments and certain HTML block elements
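One way to read the "simple (but less efficient)" description is one regex pass for comments followed by one pass per element name. The sketch below assumes that reading; the class name, the BLOCK_ELEMENTS list, and the regular expressions are illustrative assumptions, not the class's actual implementation.

```java
import java.util.List;

class StripBlockElementsSketch {
    // Element names taken from the description above.
    private static final List<String> BLOCK_ELEMENTS =
            List.of("head", "style", "script", "noscript", "iframe", "svg");

    // Assumed helper: removes comments, including multi-line ones.
    static String stripComments(String html) {
        return html.replaceAll("(?s)<!--.*?-->", "");
    }

    // Assumed helper: removes one element and everything inside it.
    static String stripElement(String html, String name) {
        String regex = "(?is)<%1$s\\b[^>]*>.*?</%1$s\\s*>".formatted(name);
        return html.replaceAll(regex, "");
    }

    // One full pass over the text per element name: simple, but it
    // rescans the input several times, hence "less efficient".
    static String stripBlockElements(String html) {
        String cleaned = stripComments(html);
        for (String name : BLOCK_ELEMENTS) {
            cleaned = stripElement(cleaned, name);
        }
        return cleaned;
    }

    public static void main(String[] args) {
        String html = "<head><title>T</title></head><body><!-- c --><p>Hi</p>"
                + "<script>x()</script></body>";
        System.out.println(stripBlockElements(html));
    }
}
```

Ordinary tags such as body and p are left in place here; removing them is the job of stripTags(String).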

◆ stripComments()

static String edu.usfca.cs272.HtmlCleaner.stripComments ( String html )

Replaces all HTML comments with an empty string. For example:

A<!-- B -->C

...and this HTML:

A<!--
B -->C

...will both become "AC" after stripping comments.

(View this comment as HTML in the Javadoc view.)

Parameters

html	valid HTML 4 text

Returns: text without any HTML comments

See also: String::replaceAll(String, String)
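A minimal sketch of the two examples above, assuming a single String::replaceAll call is used. The class name and the regular expression are assumptions. The inline flag (?s) is equivalent to Pattern::DOTALL and is what lets the pattern cross the line break in the second example.

```java
class StripCommentsSketch {
    // (?s) turns on DOTALL so '.' also matches line breaks.
    // The reluctant quantifier .*? stops at the first "-->" instead of
    // swallowing everything up to the last comment in the text.
    private static final String COMMENT_REGEX = "(?s)<!--.*?-->";

    static String stripComments(String html) {
        return html.replaceAll(COMMENT_REGEX, "");
    }

    public static void main(String[] args) {
        System.out.println(stripComments("A<!-- B -->C"));   // prints AC
        System.out.println(stripComments("A<!--\nB -->C"));  // prints AC
    }
}
```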

◆ stripElement()

static String edu.usfca.cs272.HtmlCleaner.stripElement ( String html, String name )

Replaces everything between the element tags, and the element tags themselves, with an empty string. For example, consider the HTML code:

<style type="text/css">
  body { font-size: 10pt; }
</style>

When the "style" element is removed, all of the above code is replaced with an empty string.

(View this comment as HTML in the Javadoc view.)

Parameters

html	valid HTML 4 text
name	name of the HTML element (like "style" or "script")

Returns: text without that HTML element

See also: String::formatted(Object...); String::format(String, Object...); String::replaceAll(String, String)
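The sketch below shows one way String::formatted and String::replaceAll could be combined for this method. The class name and the regular expression are assumptions rather than the documented implementation. The pattern also assumes that attribute values contain no ">" character, which holds for simple, valid HTML like the example above.

```java
import java.util.regex.Pattern;

class StripElementSketch {
    static String stripElement(String html, String name) {
        // (?is)            CASE_INSENSITIVE and DOTALL as inline flags
        // <name\b[^>]*>    opening tag with optional attributes; \b keeps
        //                  "head" from matching "<header>"
        // .*?              element body, matched reluctantly
        // </name\s*>       closing tag, allowing whitespace before '>'
        String regex = "(?is)<%1$s\\b[^>]*>.*?</%1$s\\s*>"
                .formatted(Pattern.quote(name));
        return html.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String html = "<style type=\"text/css\">\n"
                + "  body { font-size: 10pt; }\n"
                + "</style>";
        // The whole element is removed, leaving an empty string.
        System.out.println(stripElement(html, "style").isEmpty());
    }
}
```

Pattern.quote is used so that an element name is always treated as literal text inside the pattern.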

◆ stripEntities()

static String edu.usfca.cs272.HtmlCleaner.stripEntities ( String html )

Replaces all HTML 4 entities with their Unicode character equivalent or, if unrecognized, replaces the entity code with an empty string. Should also work for entities that use decimal syntax like &#8211; or hexadecimal syntax like &#x2013; for the – symbol.

For example, 2010&ndash;2012 will become 2010–2012 and &gt;&dash;x will become >x with the unrecognized &dash; entity getting removed. (The &dash; entity is valid HTML 5, but not valid HTML 4.)

(View this comment as HTML in the Javadoc view.)

See also: StringEscapeUtils::unescapeHtml4(String); String::replaceAll(String, String)

Parameters

html	valid HTML 4 text

Returns: text with all HTML entities converted or removed
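The convert-or-remove behavior can be illustrated without any external dependency. The sketch below is a stand-in, not the documented implementation: it swaps the library call listed under "See also" for a tiny, hypothetical lookup table (NAMED) and handles numeric references by hand, all in a single pass. A library-based version would likely unescape first and then remove whatever entity codes remain, and that can differ in edge cases such as &amp;lt;.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StripEntitiesSketch {
    // Hypothetical, deliberately tiny table; the full HTML 4 entity
    // list is much longer.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"",
            "nbsp", "\u00A0", "ndash", "\u2013");

    // Decimal (&#8211;), hexadecimal (&#x2013;), or named (&ndash;).
    private static final Pattern ENTITY = Pattern.compile(
            "&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    static String stripEntities(String html) {
        // quoteReplacement keeps decoded '$' or '\' from being read as
        // replacement syntax.
        return ENTITY.matcher(html).replaceAll(
                match -> Matcher.quoteReplacement(decode(match.group(1))));
    }

    // Returns the decoded character, or "" when the entity is unknown.
    private static String decode(String body) {
        if (!body.startsWith("#")) {
            return NAMED.getOrDefault(body, "");
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        try {
            int codePoint = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1));
            return Character.isValidCodePoint(codePoint)
                    ? Character.toString(codePoint)
                    : "";
        } catch (NumberFormatException e) {
            return ""; // number too large to be a code point
        }
    }

    public static void main(String[] args) {
        System.out.println(stripEntities("2010&ndash;2012")); // 2010–2012
        System.out.println(stripEntities("&gt;&dash;x"));     // >x
    }
}
```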

◆ stripHtml()

static String edu.usfca.cs272.HtmlCleaner.stripHtml ( String html )

Removes all HTML tags and certain block elements from the provided text.

See also: #stripBlockElements(String); #stripTags(String)

Parameters

html	valid HTML 4 text

Returns: text clean of any HTML tags and certain block elements
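The order of the two steps matters. The sketch below assumes stripHtml simply chains the two methods listed under "See also"; the helper bodies and class name are illustrative assumptions. Stripping tags first would delete the script tags but leave the script body behind as text.

```java
class StripHtmlSketch {
    // Assumed helper: comments first, then one pass per block element.
    static String stripBlockElements(String html) {
        String cleaned = html.replaceAll("(?s)<!--.*?-->", "");
        String[] names = {"head", "style", "script", "noscript", "iframe", "svg"};
        for (String name : names) {
            cleaned = cleaned.replaceAll(
                    "(?is)<" + name + "\\b[^>]*>.*?</" + name + "\\s*>", "");
        }
        return cleaned;
    }

    // Assumed helper: removes any remaining tag, keeping its text content.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    // Block elements must go first so their contents are removed with them.
    static String stripHtml(String html) {
        return stripTags(stripBlockElements(html));
    }

    public static void main(String[] args) {
        String html = "<p>Hi</p><script>alert(1);</script>";
        System.out.println(stripHtml(html)); // prints Hi
        System.out.println(stripTags(html)); // prints Hialert(1);
    }
}
```

Entity handling is left out of this sketch because the description above names only tags and block elements.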

◆ stripTags()

static String edu.usfca.cs272.HtmlCleaner.stripTags ( String html )

Replaces all HTML tags with an empty string. For example, the HTML A<b>B</b>C will become ABC.

(View this comment as HTML in the Javadoc view.)

Parameters

html	valid HTML 4 text

Returns: text without any HTML tags

See also: String::replaceAll(String, String)
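A minimal sketch of the example above, assuming one String::replaceAll call with a negated character class. The class name and regular expression are assumptions. The pattern relies on the input being valid HTML with no ">" inside attribute values.

```java
class StripTagsSketch {
    static String stripTags(String html) {
        // '<', then anything that is not '>', then '>'. A negated
        // character class also matches line breaks, so tags that span
        // several lines are handled without the DOTALL flag.
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("A<b>B</b>C")); // prints ABC
    }
}
```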

The documentation for this class was generated from the following file:

C:/USFCS/CS272/projects/project-Ravneetsb/src/main/java/edu/usfca/cs272/HtmlCleaner.java
