VIBE v5.1.6
Search Engine
edu.usfca.cs272.HtmlCleaner Class Reference

Static Public Member Functions

static String stripTags (String html)
 
static String stripEntities (String html)
 
static String stripComments (String html)
 
static String stripElement (String html, String name)
 
static String stripBlockElements (String html)
 
static String stripHtml (String html)
 

Detailed Description

Cleans simple, validating HTML 4/5 into plain text. For simplicity, this class assumes its input already validates; it does not validate the HTML itself. For example, the stripEntities(String) method removes HTML entities but does not check that the removed entity was valid.

Look at the "See Also" section for useful classes and methods for implementing this class.

See also
String::replaceAll(String, String)
Pattern::DOTALL
Pattern::CASE_INSENSITIVE
StringEscapeUtils::unescapeHtml4(String)
Author
CS 272 Software Development (University of San Francisco)
Version
Spring 2024

Member Function Documentation

◆ stripBlockElements()

static String edu.usfca.cs272.HtmlCleaner.stripBlockElements ( String html)
static

A simple (but less efficient) approach for removing comments and certain block elements from the provided html. The block elements removed include: head, style, script, noscript, iframe, and svg.

Parameters
html - valid HTML 4 text
Returns
text clean of any comments and certain HTML block elements
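The multi-pass approach described above can be sketched as follows. This is only an illustration, not the class's actual implementation: the class name StripBlockElementsSketch, the BLOCKS list, and the regular expressions in the two helper methods are assumptions. It makes one pass for comments and one pass per element name, which is simple but rescans the text several times.

```java
import java.util.List;
import java.util.regex.Pattern;

class StripBlockElementsSketch {
    // Element names taken from the description above.
    private static final List<String> BLOCKS =
            List.of("head", "style", "script", "noscript", "iframe", "svg");

    // Assumed helper: removes comments, letting '.' match newlines via (?s).
    static String stripComments(String html) {
        return html.replaceAll("(?s)<!--.*?-->", "");
    }

    // Assumed helper: removes one element, matching tag names case-insensitively.
    // The \b keeps "head" from matching the start of "<header>".
    static String stripElement(String html, String name) {
        String regex = String.format("(?is)<%1$s\\b[^>]*>.*?</%1$s\\s*>", Pattern.quote(name));
        return html.replaceAll(regex, "");
    }

    // One pass for comments, then one pass per block element.
    static String stripBlockElements(String html) {
        String text = stripComments(html);
        for (String name : BLOCKS) {
            text = stripElement(text, name);
        }
        return text;
    }

    public static void main(String[] args) {
        String html = "<head><title>T</title></head><body>A<!-- c --><script>x()</script>B</body>";
        System.out.println(stripBlockElements(html)); // prints <body>AB</body>
    }
}
```

Ordinary tags such as body are left in place here; only comments and the listed elements are removed.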

◆ stripComments()

static String edu.usfca.cs272.HtmlCleaner.stripComments ( String html)
static

Replaces all HTML comments with an empty string. For example, this HTML:

A<!-- B -->C

...and this HTML:

A<!--
B -->C

...will both become "AC" after stripping comments.

(View this comment as HTML in the Javadoc view.)

Parameters
html - valid HTML 4 text
Returns
text without any HTML comments
See also
String::replaceAll(String, String)
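A minimal sketch of one way to get the behavior described above with String::replaceAll. The class name StripCommentsSketch and the exact regular expression are assumptions, not the class's actual code. The (?s) flag is the inline form of Pattern::DOTALL, which is what lets the second, multi-line example match.

```java
class StripCommentsSketch {
    // (?s) lets '.' match line breaks; the reluctant .*? stops at the
    // first "-->" so two separate comments are not merged into one match.
    static String stripComments(String html) {
        return html.replaceAll("(?s)<!--.*?-->", "");
    }

    public static void main(String[] args) {
        System.out.println(stripComments("A<!-- B -->C"));   // prints AC
        System.out.println(stripComments("A<!--\nB -->C"));  // prints AC
    }
}
```

A greedy .* would instead remove everything from the first comment's opening to the last comment's closing, including any text between them.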

◆ stripElement()

static String edu.usfca.cs272.HtmlCleaner.stripElement ( String html,
String name )
static

Replaces the element's opening tag, its closing tag, and everything between them with an empty string. For example, consider the HTML code:

<style type="text/css">
  body { font-size: 10pt; }
</style>

When the "style" element is removed, all of the above code is replaced with an empty string.

(View this comment as HTML in the Javadoc view.)

Parameters
html - valid HTML 4 text
name - name of the HTML element (like "style" or "script")
Returns
text without that HTML element
See also
String::formatted(Object...)
String::format(String, Object...)
String::replaceAll(String, String)
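As a sketch of how String::format and String::replaceAll might be combined here, the element name can be substituted into a regular expression template. The class name StripElementSketch, the template itself, and the choice to match tag names case-insensitively are assumptions for illustration only.

```java
import java.util.regex.Pattern;

class StripElementSketch {
    static String stripElement(String html, String name) {
        // (?i) ignores case in the tag name, (?s) lets '.' span lines.
        // %1$s inserts the quoted element name in both the opening and closing tag;
        // \b stops "style" from matching a longer name such as "styles".
        String regex = String.format("(?is)<%1$s\\b[^>]*>.*?</%1$s\\s*>", Pattern.quote(name));
        return html.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String html = "<style type=\"text/css\">\n  body { font-size: 10pt; }\n</style>";
        System.out.println(stripElement(html, "style").isEmpty()); // prints true
    }
}
```

Pattern.quote guards against element names that happen to contain regex metacharacters; for plain names like "style" it has no visible effect.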

◆ stripEntities()

static String edu.usfca.cs272.HtmlCleaner.stripEntities ( String html)
static

Replaces all HTML 4 entities with their Unicode character equivalent or, if unrecognized, replaces the entity code with an empty string. Should also work for numeric entities that use decimal syntax like &#8211; or hexadecimal syntax like &#x2013;, both of which stand for the – (en dash) symbol.

For example, 2010&ndash;2012 will become 2010–2012 and &gt;&dash;x will become >x with the unrecognized &dash; entity getting removed. (The &dash; entity is valid HTML 5, but not valid HTML 4.)

(View this comment as HTML in the Javadoc view.)

See also
StringEscapeUtils::unescapeHtml4(String)
String::replaceAll(String, String)
Parameters
html - valid HTML 4 text
Returns
text with all HTML entities converted or removed
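The convert-or-remove behavior can be illustrated with a standard-library-only sketch. This is a stand-in, not the class's actual approach: the "See also" list suggests StringEscapeUtils::unescapeHtml4 does the real conversion. The class name StripEntitiesSketch, the tiny NAMED table (a full HTML 4 table has a few hundred entries), and the ENTITY pattern are all hypothetical.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StripEntitiesSketch {
    // Hypothetical, deliberately tiny table of named HTML 4 entities.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "nbsp", "\u00A0", "ndash", "\u2013");

    // Group 1: decimal digits, group 2: hexadecimal digits, group 3: entity name.
    private static final Pattern ENTITY =
            Pattern.compile("&(?:#([0-9]+)|#[xX]([0-9a-fA-F]+)|([A-Za-z][A-Za-z0-9]*));");

    static String stripEntities(String html) {
        Matcher matcher = ENTITY.matcher(html);
        StringBuilder text = new StringBuilder();
        while (matcher.find()) {
            String replacement;
            if (matcher.group(1) != null) {
                replacement = decode(matcher.group(1), 10);
            } else if (matcher.group(2) != null) {
                replacement = decode(matcher.group(2), 16);
            } else {
                // Unrecognized names fall back to an empty string.
                replacement = NAMED.getOrDefault(matcher.group(3), "");
            }
            matcher.appendReplacement(text, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(text);
        return text.toString();
    }

    // Converts a numeric reference to its character, or "" if out of range.
    private static String decode(String digits, int radix) {
        try {
            int codePoint = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint)) : "";
        } catch (NumberFormatException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(stripEntities("2010&ndash;2012")); // prints 2010–2012
        System.out.println(stripEntities("&gt;&dash;x"));     // prints >x
    }
}
```

Matcher.quoteReplacement matters because a decoded character such as $ or \ would otherwise be interpreted as replacement syntax.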

◆ stripHtml()

static String edu.usfca.cs272.HtmlCleaner.stripHtml ( String html)
static

Removes all HTML tags and certain block elements from the provided text.

See also
stripBlockElements(String)
stripTags(String)
Parameters
html - valid HTML 4 text
Returns
text clean of any HTML tags and certain block elements
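The order of the two steps matters, which the following sketch tries to show. It is an assumption-laden illustration: the class name StripHtmlSketch is made up, and both helpers are compact stand-ins for stripBlockElements(String) and stripTags(String) rather than the class's real code (the block-element helper here uses a single alternation with a backreference instead of multiple passes).

```java
class StripHtmlSketch {
    // Stand-in: removes comments and the listed block elements in one pass.
    // \1 refers back to whichever element name matched in the opening tag.
    static String stripBlockElements(String html) {
        return html.replaceAll(
                "(?is)<!--.*?-->|<(head|style|script|noscript|iframe|svg)\\b[^>]*>.*?</\\1\\s*>", "");
    }

    // Stand-in: removes anything that looks like a tag.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    // Block elements must go first, while their tags still mark their boundaries.
    static String stripHtml(String html) {
        return stripTags(stripBlockElements(html));
    }

    public static void main(String[] args) {
        String html = "<p>Hello</p><script>var x = 1;</script><!-- note --> world";
        System.out.println(stripHtml(html)); // prints Hello world
        System.out.println(stripTags(html)); // prints Hellovar x = 1; world
    }
}
```

Calling only the tag-stripping step leaves the script body behind as text, which is why the block elements are removed before the remaining tags.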

◆ stripTags()

static String edu.usfca.cs272.HtmlCleaner.stripTags ( String html)
static

Replaces all HTML tags with an empty string. For example, the html A<b>B</b>C will become ABC.

(View this comment as HTML in the Javadoc view.)

Parameters
html - valid HTML 4 text
Returns
text without any HTML tags
See also
String::replaceAll(String, String)
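A minimal sketch of tag removal with String::replaceAll, assuming (as the class description does) that the input is valid HTML and that comments and script-like elements have already been removed. The class name StripTagsSketch and the regular expression are assumptions.

```java
class StripTagsSketch {
    // A tag is treated as '<', any run of characters other than '>', then '>'.
    // The negated class [^>] also matches line breaks, so tags that span
    // multiple lines are removed without needing Pattern.DOTALL.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("A<b>B</b>C"));               // prints ABC
        System.out.println(stripTags("<a\n href=\"x\">link</a>")); // prints link
    }
}
```

This pattern would be fooled by a literal > inside an attribute value, which is one reason the input is expected to be simple HTML.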

The documentation for this class was generated from the following file: