Cleans simple, valid HTML 4/5 into plain text. For simplicity, this class assumes its input is already valid HTML; it does not validate the HTML itself. For example, the stripEntities(String) method removes HTML entities but does not check that a removed entity was valid.
Look at the "See Also" section for useful classes and methods for implementing this class.
- See also
- String::replaceAll(String, String)
- Pattern::DOTALL
- Pattern::CASE_INSENSITIVE
- StringEscapeUtils::unescapeHtml4(String)
- Author
- CS 272 Software Development (University of San Francisco)
- Version
- Spring 2024
static String edu.usfca.cs272.HtmlCleaner.stripElement(String html, String name)
Replaces the element's opening and closing tags, and everything between them, with an empty string. For example, consider the HTML code:
<style type="text/css">
body { font-size: 10pt; }
</style>
If removing the "style" element, all of the above code will be removed, and replaced with an empty string.
(View this comment as HTML in the Javadoc view.)
- Parameters
- html: valid HTML 4 text
- name: name of the HTML element (like "style" or "script")
- Returns
- text without that HTML element
- See also
- String::formatted(Object...)
- String::format(String, Object...)
- String::replaceAll(String, String)
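One way the behavior described for stripElement could be realized is sketched below. This is an illustration under stated assumptions, not the course's reference solution: the class name StripElementSketch and the exact regular expression are hypothetical. It builds the pattern with String.format and applies it with String.replaceAll, using the inline flags (?is) in place of Pattern.CASE_INSENSITIVE and Pattern.DOTALL.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for HtmlCleaner; the class name and regex are assumptions.
class StripElementSketch {
    /**
     * Removes every occurrence of the named element, including its tags
     * and everything between them.
     */
    static String stripElement(String html, String name) {
        // (?i) ignores case so <STYLE> matches "style";
        // (?s) lets . match newlines so multi-line elements are removed.
        // \b stops "style" from matching a longer tag name such as "styles".
        // .*? is reluctant, so matching stops at the first closing tag.
        String regex = String.format(
                "(?is)<%1$s\\b[^>]*>.*?</%1$s\\s*>", Pattern.quote(name));
        return html.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String html = "<p>keep</p><style type=\"text/css\">\n"
                + "body { font-size: 10pt; }\n"
                + "</style><p>also</p>";
        System.out.println(stripElement(html, "style")); // <p>keep</p><p>also</p>
    }
}
```

Because the method assumes valid HTML, the sketch does not try to handle unclosed or nested elements of the same name.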
static String edu.usfca.cs272.HtmlCleaner.stripEntities(String html)
Replaces all HTML 4 entities with their Unicode character equivalent or, if unrecognized, replaces the entity code with an empty string. Should also work for entities that use decimal or hexadecimal syntax, like &#8211; for the – symbol or &#x2013; for the – symbol.
For example, 2010&ndash;2012 will become 2010–2012 and &gt;&dash;x will become >x with the unrecognized &dash; entity getting removed. (The &dash; entity is valid HTML 5, but not valid HTML 4.)
(View this comment as HTML in the Javadoc view.)
- See also
- StringEscapeUtils::unescapeHtml4(String)
- String::replaceAll(String, String)
- Parameters
-
- Returns
- text with all HTML entities converted or removed
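The convert-or-remove behavior described for stripEntities can be illustrated with a self-contained sketch. The "See also" entries point to StringEscapeUtils::unescapeHtml4(String) from Apache Commons Text; to avoid that dependency here, the sketch below swaps in a different technique: a single regex pass with a deliberately tiny, hypothetical table of named entities. The class name StripEntitiesSketch, the NAMED map, and the decode helper are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in; the actual class likely relies on StringEscapeUtils.
class StripEntitiesSketch {
    // Small illustrative subset of the HTML 4 named entities (assumption).
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">",
            "quot", "\"", "nbsp", "\u00A0", "ndash", "\u2013");

    // Group 1: hexadecimal digits, group 2: decimal digits, group 3: entity name.
    private static final Pattern ENTITY = Pattern.compile(
            "&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));");

    /** Converts recognized entities and removes unrecognized ones. */
    static String stripEntities(String html) {
        Matcher matcher = ENTITY.matcher(html);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String replacement;
            if (matcher.group(1) != null) {
                replacement = decode(matcher.group(1), 16);
            } else if (matcher.group(2) != null) {
                replacement = decode(matcher.group(2), 10);
            } else {
                // Names missing from the table are replaced with an empty string.
                replacement = NAMED.getOrDefault(matcher.group(3), "");
            }
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    // Turns a numeric reference into its character, or "" if out of range.
    private static String decode(String digits, int radix) {
        try {
            int codePoint = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint)) : "";
        } catch (NumberFormatException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(stripEntities("2010&ndash;2012"));
        System.out.println(stripEntities("&gt;&dash;x"));
    }
}
```

With this table, the input &gt;&dash;x yields >x because &dash; is not among the recognized names. A full implementation would need the complete HTML 4 entity set, which is what the referenced library method provides.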