Static Public Member Functions
static String	clean (String text)

static String[]	split (String text)

static String[]	parse (String text)

static void	addStems (String line, Stemmer stemmer, Collection<? super String > stems)

static ArrayList< String >	listStems (String line, Stemmer stemmer)

static ArrayList< String >	listStems (String line)

static ArrayList< String >	listStems (Path input) throws IOException

static TreeSet< String >	uniqueStems (String line, Stemmer stemmer)

static TreeSet< String >	uniqueStems (String line)

static TreeSet< String >	uniqueStems (Path input) throws IOException

static ArrayList< TreeSet< String > >	listUniqueStems (Path input) throws IOException

Static Public Attributes
static final Pattern	SPLIT_REGEX = Pattern.compile("(?U)\\s+")

static final Pattern	CLEAN_REGEX = Pattern.compile("(?U)[^\\p{Alpha}\\s]+")

Detailed Description

Utility class for parsing, cleaning, and stemming text and text files into collections of processed words.

Author: CS 272 Software Development (University of San Francisco); Ravneet Singh Bhatia

Version: Spring 2024

Member Function Documentation

◆ addStems()

static void edu.usfca.cs272.FileStemmer.addStems	(	String	line,
		Stemmer	stemmer,
		Collection<? super String >	stems )

static

Parses the line into cleaned and stemmed words and adds them to the provided collection.

Parameters

line	the line of words to clean, split, and stem
stemmer	the stemmer to use
stems	the collection to add stems to

See also: #parse(String); Stemmer::stem(CharSequence); Collection::add(Object)
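The wildcard in Collection<? super String> is what lets one method fill lists, sets, or a collection of a broader element type. The following is a minimal sketch of that idea, not the class's actual source: TOY_STEMMER is a hypothetical stand-in for the Stemmer parameter (it only strips one trailing "s"), and the cleaning step is simplified to a single regex replacement so the example runs without external libraries.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.function.UnaryOperator;

final class AddStemsSketch {
    /**
     * Hypothetical stand-in for the Stemmer parameter: strips one trailing "s".
     * It is NOT the Snowball algorithm; it only keeps the sketch dependency-free.
     */
    static final UnaryOperator<String> TOY_STEMMER =
            word -> word.length() > 3 && word.endsWith("s")
                    ? word.substring(0, word.length() - 1)
                    : word;

    /** Cleans and splits the line (simplified), then adds each stem to the collection. */
    static void addStems(String line, UnaryOperator<String> stemmer,
            Collection<? super String> stems) {
        String cleaned = line.replaceAll("(?U)[^\\p{Alpha}\\s]+", "").toLowerCase().strip();
        if (cleaned.isEmpty()) {
            return; // nothing to add for blank input
        }
        for (String word : cleaned.split("(?U)\\s+")) {
            stems.add(stemmer.apply(word));
        }
    }

    public static void main(String[] args) {
        // "? super String" accepts a List<Object> as well as a List<String>.
        List<Object> stems = new ArrayList<>();
        addStems("Cats chase 2 dogs!", TOY_STEMMER, stems);
        System.out.println(stems); // prints [cat, chase, dog]
    }
}
```

Because the caller supplies the collection, the same loop can serve both the list-returning and set-returning overloads.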

◆ clean()

static String edu.usfca.cs272.FileStemmer.clean ( String text )

static

Cleans the text by removing any non-alphabetic characters (e.g. non-letters like digits, punctuation, symbols, and diacritical marks like the umlaut) and converting the remaining characters to lowercase.

Parameters

text	the text to clean

Returns: cleaned text
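One plausible way to get the behavior described above is sketched here. The Unicode decomposition step (Normalizer with form NFD) is an assumption made for illustration: it separates a letter such as "ö" into a base letter plus a combining mark, so the regex can drop the mark and keep the letter. Only the CLEAN_REGEX pattern itself is taken from this page.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class CleanSketch {
    /** Same pattern as the documented CLEAN_REGEX attribute. */
    static final Pattern CLEAN_REGEX = Pattern.compile("(?U)[^\\p{Alpha}\\s]+");

    static String clean(String text) {
        // Assumed step: decompose accented letters into base letter + combining mark.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Remove every run of characters that is neither alphabetic nor whitespace.
        String lettersOnly = CLEAN_REGEX.matcher(decomposed).replaceAll("");
        return lettersOnly.toLowerCase();
    }

    public static void main(String[] args) {
        // \u00F6 is "o" with an umlaut; whitespace is kept, so the space survives.
        System.out.println(clean("W\u00F6rld-Class, 100%!")); // prints "worldclass "
    }
}
```

Note that whitespace is preserved, so the cleaned text can still be split into words afterwards.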

◆ listStems() [1/3]

static ArrayList< String > edu.usfca.cs272.FileStemmer.listStems ( Path input ) throws IOException

static

Reads a file line by line and parses each line into cleaned and stemmed words using the default stemmer for English.

Parameters

input	the input file to parse and stem

Returns: a list of stems from the file in parsed order

Exceptions

IOException if unable to read or parse file

See also: SnowballStemmer; ALGORITHM::ENGLISH; StandardCharsets::UTF_8; #listStems(String, Stemmer)
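A line-by-line read keeps only one line in memory at a time. The sketch below illustrates that pattern under stated assumptions: the extra stemmer parameter and TOY_STEMMER are hypothetical stand-ins for the default English stemmer, and the cleaning is simplified to one regex replacement. The reader is opened with StandardCharsets.UTF_8, as the references above suggest.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.function.UnaryOperator;

final class ListStemsFileSketch {
    /** Hypothetical stand-in for the English stemmer: strips one trailing "s". */
    static final UnaryOperator<String> TOY_STEMMER =
            word -> word.length() > 3 && word.endsWith("s")
                    ? word.substring(0, word.length() - 1)
                    : word;

    static ArrayList<String> listStems(Path input, UnaryOperator<String> stemmer)
            throws IOException {
        ArrayList<String> stems = new ArrayList<>();
        // try-with-resources closes the reader even if reading fails midway.
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String cleaned = line.replaceAll("(?U)[^\\p{Alpha}\\s]+", "")
                        .toLowerCase().strip();
                if (cleaned.isEmpty()) {
                    continue; // blank lines contribute no stems
                }
                for (String word : cleaned.split("(?U)\\s+")) {
                    stems.add(stemmer.apply(word)); // order of appearance is preserved
                }
            }
        }
        return stems;
    }
}
```

Duplicates are kept, so the result reflects every word occurrence in the order it was read.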

◆ listStems() [2/3]

static ArrayList< String > edu.usfca.cs272.FileStemmer.listStems ( String line )

static

Parses the line into a list of cleaned and stemmed words using the default stemmer for English.

Parameters

line	the line of words to parse and stem

Returns: a list of cleaned and stemmed words in parsed order

See also: SnowballStemmer::SnowballStemmer(ALGORITHM); ALGORITHM::ENGLISH; #listStems(String, Stemmer)

◆ listStems() [3/3]

static ArrayList< String > edu.usfca.cs272.FileStemmer.listStems	(	String	line,
		Stemmer	stemmer )

static

Parses the line into a list of cleaned and stemmed words.

Parameters

line	the line of words to clean, split, and stem
stemmer	the stemmer to use

Returns: a list of cleaned and stemmed words in parsed order

See also: #parse(String); Stemmer::stem(CharSequence); #addStems(String, Stemmer, Collection)

◆ listUniqueStems()

static ArrayList< TreeSet< String > > edu.usfca.cs272.FileStemmer.listUniqueStems ( Path input ) throws IOException

static

Reads a file line by line, parses each line into unique, sorted, cleaned, and stemmed words using the default stemmer for English, and adds one set of unique sorted stems to the list for each line in the file.

Parameters

input	the input file to parse and stem

Returns: a list where each item is the set of unique sorted stems parsed from a single line of the input file

Exceptions

IOException if the path is null or invalid.

See also: SnowballStemmer; ALGORITHM::ENGLISH; StandardCharsets::UTF_8; #uniqueStems(String, Stemmer)
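The per-line structure of the result can be sketched as follows. This is an illustration under assumptions, not the class's source: the stemmer parameter and TOY_STEMMER are hypothetical stand-ins for the default English stemmer, the cleaning is simplified, and the choice to record an empty set for a blank line is a guess based on the phrase "per line" in the description.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

final class ListUniqueStemsSketch {
    /** Hypothetical stand-in for the English stemmer: strips one trailing "s". */
    static final UnaryOperator<String> TOY_STEMMER =
            word -> word.length() > 3 && word.endsWith("s")
                    ? word.substring(0, word.length() - 1)
                    : word;

    static ArrayList<TreeSet<String>> listUniqueStems(Path input,
            UnaryOperator<String> stemmer) throws IOException {
        ArrayList<TreeSet<String>> perLine = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                TreeSet<String> stems = new TreeSet<>(); // fresh set for every line
                String cleaned = line.replaceAll("(?U)[^\\p{Alpha}\\s]+", "")
                        .toLowerCase().strip();
                if (!cleaned.isEmpty()) {
                    for (String word : cleaned.split("(?U)\\s+")) {
                        stems.add(stemmer.apply(word));
                    }
                }
                perLine.add(stems); // assumed: blank lines yield an empty set
            }
        }
        return perLine;
    }
}
```

With this shape, index i of the returned list corresponds to line i of the file, which keeps line positions recoverable.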

◆ parse()

static String[] edu.usfca.cs272.FileStemmer.parse ( String text )

static

Parses the text into an array of clean words.

Parameters

text	the text to clean and split

Returns: an array of String objects

See also: #clean(String); #split(String)
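Parsing can be read as cleaning followed by splitting. The sketch below assumes exactly that composition; the method bodies are illustrative guesses (including the Normalizer step and the blank-input guard), and only the two regex patterns come from this page.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.regex.Pattern;

final class ParseSketch {
    static final Pattern SPLIT_REGEX = Pattern.compile("(?U)\\s+");
    static final Pattern CLEAN_REGEX = Pattern.compile("(?U)[^\\p{Alpha}\\s]+");

    /** Assumed cleaning: decompose, drop non-letters (keeping whitespace), lowercase. */
    static String clean(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return CLEAN_REGEX.matcher(decomposed).replaceAll("").toLowerCase();
    }

    /** Assumed splitting: blank input gives an empty array instead of [""]. */
    static String[] split(String text) {
        return text.isBlank() ? new String[0] : SPLIT_REGEX.split(text.strip());
    }

    /** Clean first, then split, so removed characters never produce tokens. */
    static String[] parse(String text) {
        return split(clean(text));
    }

    public static void main(String[] args) {
        // The apostrophe and "42" vanish during cleaning, before any splitting.
        System.out.println(Arrays.toString(parse("  Don't stop: 42 times!  ")));
        // prints [dont, stop, times]
    }
}
```

Cleaning before splitting matters: a token such as "42" is removed entirely rather than surviving as an empty word.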

◆ split()

static String[] edu.usfca.cs272.FileStemmer.split ( String text )

static

Splits the supplied text by whitespace.

Parameters

text	the text to split

Returns: an array of String objects
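Two details are easy to miss when splitting on whitespace with Pattern.split: leading whitespace produces an empty first element, and an empty input produces a one-element array containing "". The sketch below shows one way to avoid both; the isBlank and strip calls are assumptions for illustration, while the pattern matches the documented SPLIT_REGEX. It also shows the effect of the (?U) flag, which makes \s match Unicode whitespace such as the no-break space.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class SplitSketch {
    /** Same pattern as the documented SPLIT_REGEX attribute. */
    static final Pattern SPLIT_REGEX = Pattern.compile("(?U)\\s+");

    static String[] split(String text) {
        // Assumed guard: blank text returns [] and stripped text has no leading "".
        return text.isBlank() ? new String[0] : SPLIT_REGEX.split(text.strip());
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("  one\ttwo\n three ")));
        // prints [one, two, three]
        System.out.println(split("   ").length); // prints 0

        // U+00A0 (no-break space) only counts as whitespace with the (?U) flag.
        System.out.println(split("a\u00A0b").length); // prints 2
        System.out.println(Pattern.compile("\\s+").split("a\u00A0b").length); // prints 1
    }
}
```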

◆ uniqueStems() [1/3]

static TreeSet< String > edu.usfca.cs272.FileStemmer.uniqueStems ( Path input ) throws IOException

static

Reads a file line by line and collects the cleaned and stemmed words from every line into a single sorted set of unique stems, using the default stemmer for English.

Parameters

input	the input file to parse and stem

Returns: a sorted set of unique cleaned and stemmed words from the file

Exceptions

IOException if unable to read or parse file

See also: SnowballStemmer; ALGORITHM::ENGLISH; StandardCharsets::UTF_8; #uniqueStems(String, Stemmer)

◆ uniqueStems() [2/3]

static TreeSet< String > edu.usfca.cs272.FileStemmer.uniqueStems ( String line )

static

Parses the line into a set of unique, sorted, cleaned, and stemmed words using the default stemmer for English.

Parameters

line	the line of words to parse and stem

Returns: a sorted set of unique cleaned and stemmed words

See also: SnowballStemmer::SnowballStemmer(ALGORITHM); ALGORITHM::ENGLISH; #uniqueStems(String, Stemmer)

◆ uniqueStems() [3/3]

static TreeSet< String > edu.usfca.cs272.FileStemmer.uniqueStems	(	String	line,
		Stemmer	stemmer )

static

Parses the line into a set of unique, sorted, cleaned, and stemmed words.

Parameters

line	the line of words to parse and stem
stemmer	the stemmer to use

Returns: a sorted set of unique cleaned and stemmed words

See also: #parse(String); Stemmer::stem(CharSequence); #addStems(String, Stemmer, Collection)
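The difference between the list and set overloads comes down to the collection handed to addStems. The sketch below illustrates this with the same input; TOY_STEMMER is a hypothetical stand-in for a real Stemmer, and the cleaning inside addStems is simplified, so treat it as an illustration rather than the class's implementation.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

final class UniqueStemsSketch {
    /** Hypothetical stand-in for the Stemmer parameter: strips one trailing "s". */
    static final UnaryOperator<String> TOY_STEMMER =
            word -> word.length() > 3 && word.endsWith("s")
                    ? word.substring(0, word.length() - 1)
                    : word;

    /** Simplified clean-split-stem loop shared by both helpers below. */
    static void addStems(String line, UnaryOperator<String> stemmer,
            Collection<? super String> stems) {
        String cleaned = line.replaceAll("(?U)[^\\p{Alpha}\\s]+", "").toLowerCase().strip();
        if (!cleaned.isEmpty()) {
            for (String word : cleaned.split("(?U)\\s+")) {
                stems.add(stemmer.apply(word));
            }
        }
    }

    /** A TreeSet drops duplicates and keeps stems in natural (alphabetical) order. */
    static TreeSet<String> uniqueStems(String line, UnaryOperator<String> stemmer) {
        TreeSet<String> stems = new TreeSet<>();
        addStems(line, stemmer, stems);
        return stems;
    }

    /** An ArrayList keeps duplicates and the order in which words were parsed. */
    static ArrayList<String> listStems(String line, UnaryOperator<String> stemmer) {
        ArrayList<String> stems = new ArrayList<>();
        addStems(line, stemmer, stems);
        return stems;
    }

    public static void main(String[] args) {
        String line = "dogs cats dogs ants";
        System.out.println(listStems(line, TOY_STEMMER));   // prints [dog, cat, dog, ant]
        System.out.println(uniqueStems(line, TOY_STEMMER)); // prints [ant, cat, dog]
    }
}
```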

Member Data Documentation

◆ CLEAN_REGEX

final Pattern edu.usfca.cs272.FileStemmer.CLEAN_REGEX = Pattern.compile("(?U)[^\\p{Alpha}\\s]+")

static

Regular expression that matches one or more characters that are neither alphabetic nor whitespace.

◆ SPLIT_REGEX

final Pattern edu.usfca.cs272.FileStemmer.SPLIT_REGEX = Pattern.compile("(?U)\\s+")

static

Regular expression that matches one or more whitespace characters.

The documentation for this class was generated from the following file:

C:/USFCS/CS272/projects/project-Ravneetsb/src/main/java/edu/usfca/cs272/FileStemmer.java
