Building a Word Frequency Counter in Java: A Step-by-Step Guide

Introduction

The "Word Frequency Counter" is a Java project that analyzes a given text and reports how often each word occurs in it. It is a good introduction to Java programming fundamentals, including string manipulation, arrays, and basic data structures.

Problem Statement

When working with textual data, understanding word frequency can be crucial for various applications, from text analysis to content optimization. The problem we aim to solve in this project is to create a Java application that takes a text input, analyzes it, and displays the frequency of each unique word present in the text.

Algorithm

The Word Frequency Counter project involves several key steps:

  1. User Input: We prompt the user to type in some text. The user can enter multiple lines, which we concatenate into a single string; typing 'exit' on a line of its own ends the input.

  2. Tokenization: We remove every character that is not a letter or a space (so digits are dropped along with punctuation), convert the text to lowercase, and split it on whitespace into individual words. Both the cleanup and the split use regular expressions.

  3. Word Frequency Count: We count the occurrences of each unique word in the input text. We maintain two arrays, one for unique words and another for word counts.

  4. Sorting: To display the word frequencies in descending order, we implement a basic sorting algorithm. We iterate through the arrays, comparing word counts and swapping elements as needed.

  5. Display: Finally, we display the word frequencies, showing each word and its count in descending order.
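Step 2 is the part most likely to surprise, so here is a minimal sketch of it in isolation. The class name TokenizeSketch and the helper tokenize are hypothetical names chosen for illustration; the regular expressions are the same ones the full program uses, plus a trim() and an empty check so that input with no letters yields no tokens.

```java
import java.util.Arrays;

class TokenizeSketch {
    // Hypothetical helper: strips non-letters, lowercases, and splits on whitespace.
    static String[] tokenize(String text) {
        String cleaned = text.replaceAll("[^a-zA-Z ]", "").toLowerCase().trim();
        if (cleaned.isEmpty()) {
            return new String[0]; // avoid the single empty token that "".split(...) yields
        }
        return cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("Hello, World! Hello again.")));
        // prints [hello, world, hello, again]
    }
}
```

Note that this cleanup also merges the parts of words such as "don't" (which becomes "dont"), because the apostrophe is simply deleted.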

Pseudocode

Initialize an empty string inputText
While user input is not 'exit':
    Read a line of text from the user and append it to inputText

Remove non-letter characters from inputText, convert it to lowercase, and split it into an array of words
Initialize arrays: uniqueWords, wordCounts, uniqueWordCount = 0

For each word in words:
    If word is not in uniqueWords:
        Add word to uniqueWords
        Initialize wordCounts[uniqueWordCount] to 1
        Increment uniqueWordCount
    Else:
        Increment the corresponding word count in wordCounts

Sort uniqueWords and wordCounts in descending order based on wordCounts

Display the sorted word frequencies

Implemented Code

import java.util.Scanner;

public class WordFrequencyCounter {
    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);
        System.out.println("Word Frequency Counter");
        System.out.println("Enter text (or type 'exit' to quit):");

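        // Collect lines in a StringBuilder to avoid repeated string concatenation
        StringBuilder buffer = new StringBuilder();
        // hasNextLine() guards against input that ends without 'exit' (for example, piped input)
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            if (line.equalsIgnoreCase("exit")) {
                break;
            }
            buffer.append(line).append(' ');
        }
        String inputText = buffer.toString();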

        if (inputText.trim().isEmpty()) {
            System.out.println("No input provided. Exiting...");
            return;
        }

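        // Keep only letters and spaces, convert to lowercase, then split on whitespace.
        // trim() prevents an empty first token when the cleaned text starts with a space.
        String cleanedText = inputText.replaceAll("[^a-zA-Z ]", "").toLowerCase().trim();
        if (cleanedText.isEmpty()) {
            // The input contained no letters (for example, only digits or punctuation)
            System.out.println("No words found. Exiting...");
            return;
        }
        String[] words = cleanedText.split("\\s+");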

        // Count word frequencies
        String[] uniqueWords = new String[words.length];
        int[] wordCounts = new int[words.length];
        int uniqueWordCount = 0;

        for (String word : words) {
            boolean isUnique = true;

            for (int i = 0; i < uniqueWordCount; i++) {
                if (uniqueWords[i].equals(word)) {
                    wordCounts[i]++;
                    isUnique = false;
                    break;
                }
            }

            if (isUnique) {
                uniqueWords[uniqueWordCount] = word;
                wordCounts[uniqueWordCount] = 1;
                uniqueWordCount++;
            }
        }

        // Sort word frequencies in descending order
        for (int i = 0; i < uniqueWordCount - 1; i++) {
            for (int j = i + 1; j < uniqueWordCount; j++) {
                if (wordCounts[i] < wordCounts[j]) {
                    // Swap word frequencies
                    int tempCount = wordCounts[i];
                    wordCounts[i] = wordCounts[j];
                    wordCounts[j] = tempCount;

                    // Swap corresponding words
                    String tempWord = uniqueWords[i];
                    uniqueWords[i] = uniqueWords[j];
                    uniqueWords[j] = tempWord;
                }
            }
        }

        // Display word frequencies
        System.out.println("\nFrequency of each word:");
        for (int i = 0; i < uniqueWordCount; i++) {
            System.out.println("- " + uniqueWords[i] + ": " + wordCounts[i]);
        }
    }
}
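With parallel arrays, every word lookup is a linear scan over the unique words found so far. As a point of comparison, and not the approach this guide uses, the same counting can be sketched with java.util.HashMap. The class name MapFrequencySketch, its two helper methods, and the tie-breaking rule (alphabetical order among equal counts) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapFrequencySketch {
    // Counts each word with a HashMap instead of two parallel arrays.
    static Map<String, Integer> count(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum); // store 1, or add 1 to the existing count
        }
        return counts;
    }

    // Returns the entries ordered by count (descending), then alphabetically for ties.
    static List<Map.Entry<String, Integer>> sortedByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        String[] words = {"the", "cat", "and", "the", "hat"};
        for (Map.Entry<String, Integer> entry : sortedByCount(count(words))) {
            System.out.println("- " + entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

The array version remains the better fit for practicing loops and indexing; the map version is mainly worth considering once the input grows large enough for the repeated scans to matter.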

Dry Run

To help you understand how the program works, let's perform a dry run with a sample input:

Input

This is a sample text. It contains several words. This text is used for testing the Word Frequency Counter.

  • Preprocessing: Punctuation is removed and the input text is converted to lowercase.

        this is a sample text it contains several words this text is used for testing the word frequency counter

  • Tokenization: The preprocessed text is split on whitespace into 19 individual words (tokens).

        ['this', 'is', 'a', 'sample', 'text', 'it', 'contains', 'several', 'words', 'this', 'text', 'is', 'used', 'for', 'testing', 'the', 'word', 'frequency', 'counter']

  • Counting Word Frequencies: The program counts the occurrences of each of the 16 unique words, in order of first appearance.

        {'this': 2, 'is': 2, 'a': 1, 'sample': 1, 'text': 2, 'it': 1, 'contains': 1, 'several': 1, 'words': 1, 'used': 1, 'for': 1, 'testing': 1, 'the': 1, 'word': 1, 'frequency': 1, 'counter': 1}

  • Displaying the Results: All 16 words are displayed, highest count first. The swap-based sort does not preserve the original order of words with equal counts: moving 'text' forward swaps it with 'a', so 'sample' ends up ahead of 'a'.

        Frequency of each word:
        - this: 2
        - is: 2
        - text: 2
        - sample: 1
        - a: 1
        - it: 1
        - contains: 1
        - several: 1
        - words: 1
        - used: 1
        - for: 1
        - testing: 1
        - the: 1
        - word: 1
        - frequency: 1
        - counter: 1
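To see how the sorting step treats words with equal counts, the sketch below runs the same swap-based sort on the first five unique words of the dry run. The class name SortOrderSketch and the method sortDescending are hypothetical; the loop body mirrors the sort in the implemented code.

```java
import java.util.Arrays;

class SortOrderSketch {
    // Same swap-based descending sort as the implemented code, applied in place
    // to two parallel arrays.
    static void sortDescending(String[] words, int[] counts) {
        for (int i = 0; i < counts.length - 1; i++) {
            for (int j = i + 1; j < counts.length; j++) {
                if (counts[i] < counts[j]) {
                    int tempCount = counts[i];
                    counts[i] = counts[j];
                    counts[j] = tempCount;

                    String tempWord = words[i];
                    words[i] = words[j];
                    words[j] = tempWord;
                }
            }
        }
    }

    public static void main(String[] args) {
        // First five unique words of the dry run, in order of first appearance
        String[] words = {"this", "is", "a", "sample", "text"};
        int[] counts = {2, 2, 1, 1, 2};
        sortDescending(words, counts);
        System.out.println(Arrays.toString(words));  // prints [this, is, text, sample, a]
        System.out.println(Arrays.toString(counts)); // prints [2, 2, 2, 1, 1]
    }
}
```

The only swap exchanges 'a' with 'text', which leaves 'sample' ahead of 'a'. In other words, this sort is not stable, so words with equal counts are not guaranteed to stay in order of first appearance.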
    

Conclusion

The Word Frequency Counter project demonstrates how to analyze text using Java programming fundamentals: string manipulation, arrays, and a simple hand-written sort. You can enhance the project further by adding features such as excluding common words or handling larger text datasets.
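As one possible starting point for excluding common words, the sketch below filters the token array before counting. The class name StopWordSketch, the method removeStopWords, and the short word list are all assumptions made for illustration; a real application would likely use a longer list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordSketch {
    // Illustrative stop-word list (an assumption, not part of the original project)
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "it", "for", "of", "and"));

    // Returns the tokens that are not stop words, preserving their order.
    static String[] removeStopWords(String[] words) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (!STOP_WORDS.contains(word)) {
                kept.add(word);
            }
        }
        return kept.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] words = {"this", "is", "a", "sample", "text"};
        System.out.println(Arrays.toString(removeStopWords(words)));
        // prints [this, sample, text]
    }
}
```

The filtered array can then be passed to the counting loop unchanged, since that loop only depends on the contents of the words array.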

By following the algorithm and pseudocode, you can adapt and expand this project to suit your own text analysis needs.