String Compression

String compression reduces the size of a string without losing information (lossless). One of the simplest forms of compression is Run-Length Encoding (RLE).

1. Run-Length Encoding (RLE)

RLE replaces consecutive data elements (runs) with a single data value and count.

  • Input: AAABBBCCDAA
  • Output: A3B3C2D1A2

[!NOTE] RLE is effective only when the data contains many consecutive repeating characters. If the data is random (e.g., ABCDE), RLE might actually increase the size (A1B1C1D1E1).

Visualization: Compressing a String

Watch how the RLE algorithm scans the string and collapses runs of characters.

⬇️

2. Problem: String Compression (In-Place)

Given an array of characters chars, compress it using the following algorithm:

  1. Begin with an empty string s.
  2. For each group of consecutive repeating characters in chars:
    • If the group’s length is 1, append the character to s.
    • Otherwise, append the character followed by the group’s length.
  3. The compressed string s should not be returned separately, but instead, be stored in the input character array chars. Note that group lengths that are 10 or longer will split into multiple characters in chars.

Example:

  • Input: chars = ["a","a","b","b","c","c","c"]
  • Output: Return 6, and the first 6 characters of input array should be: ["a","2","b","2","c","3"]

Implementation

public int compress(char[] chars) {
  int i = 0; // Read pointer
  int ans = 0; // Write pointer

  while (i < chars.length) {
    char groupChar = chars[i];
    int count = 0;

    // Count consecutive characters
    while (i < chars.length && chars[i] == groupChar) {
      i++;
      count++;
    }

    // Write character
    chars[ans++] = groupChar;

    // Write count (if > 1)
    if (count > 1) {
      for (char c : Integer.toString(count).toCharArray()) {
        chars[ans++] = c;
      }
    }
  }
  return ans;
}

Complexity Analysis

  • Time Complexity: O(n), where n is the length of chars. We process each character once.
  • Space Complexity: O(1). We modify the array in-place.

3. Other Compression Techniques

Huffman Coding

Huffman coding is a variable-length optimal prefix code. It assigns shorter binary codes to more frequent characters and longer codes to less frequent characters. This is the basis for formats like ZIP and JPEG.

  • Frequency Analysis: Build a frequency map.
  • Priority Queue: Build a tree where frequent characters are near the root (short paths).
  • Encoding: Traverse tree to generate bits.

Practice Problems

  1. String Compression (Medium)
  2. Count and Say (Medium)
  3. Encode and Decode Strings (Medium)

4. Deep Dive Strategy Lab: String Compression

Intuition Through Analogy

Think of this chapter like searching text in a large log stream. The goal is not to memorize a fixed trick, but to repeatedly answer:

  1. What is the bottleneck?
  2. Which constraint dominates (time, memory, latency, correctness)?
  3. Which representation makes the bottleneck easier to eliminate?