VIBE v5.1.6
Search Engine
edu.usfca.cs272.FileStemmer Class Reference

Static Public Member Functions

static String clean (String text)
 
static String[] split (String text)
 
static String[] parse (String text)
 
static void addStems (String line, Stemmer stemmer, Collection<? super String > stems)
 
static ArrayList< String > listStems (String line, Stemmer stemmer)
 
static ArrayList< String > listStems (String line)
 
static ArrayList< String > listStems (Path input) throws IOException
 
static TreeSet< String > uniqueStems (String line, Stemmer stemmer)
 
static TreeSet< String > uniqueStems (String line)
 
static TreeSet< String > uniqueStems (Path input) throws IOException
 
static ArrayList< TreeSet< String > > listUniqueStems (Path input) throws IOException
 

Static Public Attributes

static final Pattern SPLIT_REGEX = Pattern.compile("(?U)\\s+")
 
static final Pattern CLEAN_REGEX = Pattern.compile("(?U)[^\\p{Alpha}\\s]+")
 

Detailed Description

Utility class for parsing, cleaning, and stemming text and text files into collections of processed words.

Author
CS 272 Software Development (University of San Francisco)
Ravneet Singh Bhatia
Version
Spring 2024

Member Function Documentation

◆ addStems()

static void edu.usfca.cs272.FileStemmer.addStems ( String line,
Stemmer stemmer,
Collection<? super String > stems )
static

Parses the line into cleaned and stemmed words and adds them to the provided collection.

Parameters
line: the line of words to clean, split, and stem
stemmer: the stemmer to use
stems: the collection to add the stems to
See also
#parse(String)
Stemmer::stem(CharSequence)
Collection::add(Object)
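
The sketch below illustrates why the `stems` parameter is typed as `Collection<? super String>`: the same method can fill a list (duplicates kept, parse order) or a sorted set (duplicates dropped). It is not the class's actual implementation. The nested `Stemmer` interface and the `TOY` stemmer are hypothetical stand-ins declared only so the example compiles without external libraries, and the cleaning step is a simplified substitute for `parse(String)`.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Locale;
import java.util.TreeSet;
import java.util.regex.Pattern;

class AddStemsSketch {
    /** Hypothetical stand-in for the documented Stemmer type; only stem(CharSequence) is assumed. */
    interface Stemmer {
        CharSequence stem(CharSequence word);
    }

    static final Pattern SPLIT_REGEX = Pattern.compile("(?U)\\s+");
    static final Pattern CLEAN_REGEX = Pattern.compile("(?U)[^\\p{Alpha}\\s]+");

    /** Toy stemmer for illustration only: drops one trailing "s". Not a real English stemmer. */
    static final Stemmer TOY = word -> {
        String s = word.toString();
        return s.length() > 1 && s.endsWith("s") ? s.substring(0, s.length() - 1) : s;
    };

    static void addStems(String line, Stemmer stemmer, Collection<? super String> stems) {
        // Simplified stand-in for parse(String): remove non-letters, lowercase, split on whitespace.
        String cleaned = CLEAN_REGEX.matcher(line).replaceAll("").toLowerCase(Locale.ROOT).strip();
        if (cleaned.isEmpty()) {
            return;
        }
        for (String word : SPLIT_REGEX.split(cleaned)) {
            stems.add(stemmer.stem(word).toString());
        }
    }

    public static void main(String[] args) {
        ArrayList<String> list = new ArrayList<>();
        TreeSet<String> set = new TreeSet<>();
        addStems("Cats chase cats!", TOY, list);
        addStems("Cats chase cats!", TOY, set);
        System.out.println(list); // [cat, chase, cat]
        System.out.println(set);  // [cat, chase]
    }
}
```

Because the wildcard is a lower bound, a `Collection<Object>` would be accepted as well.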

◆ clean()

static String edu.usfca.cs272.FileStemmer.clean ( String text)
static

Cleans the text by removing any non-alphabetic characters (e.g. non-letters like digits, punctuation, symbols, and diacritical marks like the umlaut) and converting the remaining characters to lowercase.

Parameters
text: the text to clean
Returns
cleaned text
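
A minimal sketch of how such a cleaning step could work is shown here. The use of `java.text.Normalizer` with NFD decomposition is an assumption (this page does not show the method body); it is one common way to separate a diacritical mark from its base letter so that `CLEAN_REGEX` can remove the mark. `Locale.ROOT` is likewise an assumption, chosen to keep lowercasing independent of the default locale.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

class CleanSketch {
    // Same pattern as the documented CLEAN_REGEX attribute.
    static final Pattern CLEAN_REGEX = Pattern.compile("(?U)[^\\p{Alpha}\\s]+");

    static String clean(String text) {
        // Assumption: NFD splits an accented letter into a base letter plus a combining mark,
        // so the regex can delete the mark while the base letter survives.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String lettersOnly = CLEAN_REGEX.matcher(decomposed).replaceAll("");
        return lettersOnly.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(clean("Hello, World!")); // hello world
        System.out.println(clean("na\u00EFve"));    // naive
    }
}
```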

◆ listStems() [1/3]

static ArrayList< String > edu.usfca.cs272.FileStemmer.listStems ( Path input) throws IOException
static

Reads a file line by line and parses each line into cleaned and stemmed words using the default stemmer for English.

Parameters
input: the input file to parse and stem
Returns
a list of stems from the file in parsed order
Exceptions
IOException: if unable to read or parse the file
See also
SnowballStemmer
ALGORITHM::ENGLISH
StandardCharsets::UTF_8
#listStems(String, Stemmer)

◆ listStems() [2/3]

static ArrayList< String > edu.usfca.cs272.FileStemmer.listStems ( String line)
static

Parses the line into a list of cleaned and stemmed words using the default stemmer for English.

Parameters
line: the line of words to parse and stem
Returns
a list of cleaned and stemmed words in parsed order
See also
SnowballStemmer::SnowballStemmer(ALGORITHM)
ALGORITHM::ENGLISH
#listStems(String, Stemmer)

◆ listStems() [3/3]

static ArrayList< String > edu.usfca.cs272.FileStemmer.listStems ( String line,
Stemmer stemmer )
static

Parses the line into a list of cleaned and stemmed words.

Parameters
line: the line of words to clean, split, and stem
stemmer: the stemmer to use
Returns
a list of cleaned and stemmed words in parsed order
See also
#parse(String)
Stemmer::stem(CharSequence)
#addStems(String, Stemmer, Collection)

◆ listUniqueStems()

static ArrayList< TreeSet< String > > edu.usfca.cs272.FileStemmer.listUniqueStems ( Path input) throws IOException
static

Reads a file line by line, parses each line into unique, sorted, cleaned, and stemmed words using the default stemmer for English, and adds the set of unique sorted stems to a list per line in the file.

Parameters
input: the input file to parse and stem
Returns
a list where each item is the set of unique sorted stems parsed from a single line of the input file
See also
SnowballStemmer
ALGORITHM::ENGLISH
StandardCharsets::UTF_8
#uniqueStems(String, Stemmer)
Exceptions
IOException: if the path is null or invalid
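
The per-line behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the class's code: the nested `Stemmer` interface and `TOY` stemmer are hypothetical stand-ins, the word splitting is simplified, and the sketch takes the stemmer as an extra parameter (the documented method takes only a `Path`). The `demo()` helper is also hypothetical and exists only to exercise the method on a temporary file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

class ListUniqueStemsSketch {
    /** Hypothetical stand-in for the documented Stemmer type. */
    interface Stemmer {
        CharSequence stem(CharSequence word);
    }

    /** Toy stemmer for illustration only: drops one trailing "s". */
    static final Stemmer TOY = word -> {
        String s = word.toString();
        return s.length() > 1 && s.endsWith("s") ? s.substring(0, s.length() - 1) : s;
    };

    static ArrayList<TreeSet<String>> listUniqueStems(Path input, Stemmer stemmer) throws IOException {
        ArrayList<TreeSet<String>> result = new ArrayList<>();
        // try-with-resources closes the reader even if readLine() throws.
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                TreeSet<String> stems = new TreeSet<>(); // one sorted set per line
                for (String word : line.toLowerCase(Locale.ROOT).split("\\s+")) {
                    if (!word.isEmpty()) {
                        stems.add(stemmer.stem(word).toString());
                    }
                }
                result.add(stems); // blank lines contribute an empty set
            }
        }
        return result;
    }

    /** Hypothetical helper: writes a small temporary file and runs the sketch on it. */
    static List<TreeSet<String>> demo() {
        try {
            Path file = Files.createTempFile("stems", ".txt");
            try {
                Files.write(file, List.of("dogs cats dogs", "", "birds"), StandardCharsets.UTF_8);
                return listUniqueStems(file, TOY);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [[cat, dog], [], [bird]]
    }
}
```

Keeping one set per line preserves the line structure of the file, which a single merged set would lose.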

◆ parse()

static String[] edu.usfca.cs272.FileStemmer.parse ( String text)
static

Parses the text into an array of clean words.

Parameters
text: the text to clean and split
Returns
an array of String objects
See also
#clean(String)
#split(String)
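
As a rough illustration of cleaning followed by splitting, the sketch below composes the two documented patterns. The method body is an assumption (including the NFD normalization and the trimming before the split), not the class's actual code. Note that removed characters are deleted rather than replaced by a space, so hyphenated words collapse into one token.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.Locale;
import java.util.regex.Pattern;

class ParseSketch {
    static final Pattern SPLIT_REGEX = Pattern.compile("(?U)\\s+");
    static final Pattern CLEAN_REGEX = Pattern.compile("(?U)[^\\p{Alpha}\\s]+");

    static String[] parse(String text) {
        // Step 1 (clean): decompose accents, delete non-letters, lowercase.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String cleaned = CLEAN_REGEX.matcher(decomposed).replaceAll("").toLowerCase(Locale.ROOT);
        // Step 2 (split): trim first so no empty leading token is produced.
        String trimmed = cleaned.strip();
        return trimmed.isEmpty() ? new String[0] : SPLIT_REGEX.split(trimmed);
    }

    public static void main(String[] args) {
        // Prints [wellknown, cant, stop, go]
        System.out.println(Arrays.toString(parse("Well-known: can't stop, 2 go!")));
    }
}
```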

◆ split()

static String[] edu.usfca.cs272.FileStemmer.split ( String text)
static

Splits the supplied text on whitespace.

Parameters
text: the text to split
Returns
an array of String objects
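
A minimal sketch of a whitespace split using the documented `SPLIT_REGEX` follows. Trimming the input and returning an empty array for blank text are assumptions made here to avoid empty tokens; the documented method may handle those edge cases differently.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitSketch {
    // Same pattern as the documented SPLIT_REGEX attribute.
    static final Pattern SPLIT_REGEX = Pattern.compile("(?U)\\s+");

    static String[] split(String text) {
        // Assumption: trim first, because Pattern.split keeps an empty first element
        // when the input starts with a (non-empty) match.
        String trimmed = text.strip();
        return trimmed.isEmpty() ? new String[0] : SPLIT_REGEX.split(trimmed);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("  one  two\tthree "))); // [one, two, three]
        System.out.println(split("   ").length);                          // 0
    }
}
```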

◆ uniqueStems() [1/3]

static TreeSet< String > edu.usfca.cs272.FileStemmer.uniqueStems ( Path input) throws IOException
static

Reads a file line by line and parses each line into a set of unique, sorted, cleaned, and stemmed words using the default stemmer for English.

Parameters
input: the input file to parse and stem
Returns
a sorted set of unique cleaned and stemmed words from the file
Exceptions
IOException: if unable to read or parse the file
See also
SnowballStemmer
ALGORITHM::ENGLISH
StandardCharsets::UTF_8
#uniqueStems(String, Stemmer)

◆ uniqueStems() [2/3]

static TreeSet< String > edu.usfca.cs272.FileStemmer.uniqueStems ( String line)
static

Parses the line into a set of unique, sorted, cleaned, and stemmed words using the default stemmer for English.

Parameters
line: the line of words to parse and stem
Returns
a sorted set of unique cleaned and stemmed words
See also
SnowballStemmer::SnowballStemmer(ALGORITHM)
ALGORITHM::ENGLISH
#uniqueStems(String, Stemmer)

◆ uniqueStems() [3/3]

static TreeSet< String > edu.usfca.cs272.FileStemmer.uniqueStems ( String line,
Stemmer stemmer )
static

Parses the line into a set of unique, sorted, cleaned, and stemmed words.

Parameters
line: the line of words to parse and stem
stemmer: the stemmer to use
Returns
a sorted set of unique cleaned and stemmed words
See also
#parse(String)
Stemmer::stem(CharSequence)
#addStems(String, Stemmer, Collection)

Member Data Documentation

◆ CLEAN_REGEX

final Pattern edu.usfca.cs272.FileStemmer.CLEAN_REGEX = Pattern.compile("(?U)[^\\p{Alpha}\\s]+")
static

Regular expression that matches non-alphabetic characters.

◆ SPLIT_REGEX

final Pattern edu.usfca.cs272.FileStemmer.SPLIT_REGEX = Pattern.compile("(?U)\\s+")
static

Regular expression that matches any whitespace.
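
Both patterns start with the embedded flag `(?U)`, which enables `Pattern.UNICODE_CHARACTER_CLASS`. The sketch below compares each pattern with and without that flag to show what it changes; the helper names `countTokens` and `removeMatches` are hypothetical and exist only for this illustration.

```java
import java.util.regex.Pattern;

class UnicodeFlagSketch {
    /** Hypothetical helper: number of pieces produced by splitting text on the regex. */
    static int countTokens(String regex, String text) {
        return Pattern.compile(regex).split(text).length;
    }

    /** Hypothetical helper: text with every match of the regex deleted. */
    static String removeMatches(String regex, String text) {
        return Pattern.compile(regex).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String nbsp = "a\u00A0b"; // "a", NO-BREAK SPACE, "b"
        System.out.println(countTokens("(?U)\\s+", nbsp)); // 2: Unicode whitespace is matched
        System.out.println(countTokens("\\s+", nbsp));     // 1: ASCII-only \s does not match it

        String cafe = "caf\u00E9"; // ends with a precomposed e-acute
        // With (?U), \p{Alpha} covers Unicode letters, so all 4 characters are kept.
        System.out.println(removeMatches("(?U)[^\\p{Alpha}\\s]+", cafe).length()); // 4
        // Without it, \p{Alpha} is ASCII-only and the accented letter is deleted.
        System.out.println(removeMatches("[^\\p{Alpha}\\s]+", cafe).length());     // 3
    }
}
```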


The documentation for this class was generated from the following file: