Introduction to Strings

[!NOTE] This module explores the core principles of Introduction to Strings, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. The Hook: Strings are not just Arrays

While a string appears to be a simple sequence of characters, they are one of the most complex “basic” types in modern systems.

Imagine you are building a search engine:

  • Simple Search: Works for English text (ASCII).
  • Global Search: Crashes when users input Hindi, Arabic, or Emojis (🐂).

Data Structures like Strings determine how we represent human language in a digital world of 0s and 1s.


2. The 4 Pillars of String Mastery

Pillar Focus Why it matters
1. Intuition Immutability Understanding why s + "a" is expensive.
2. Rigor Rolling Hashes Mathematical guarantees for fast pattern matching.
3. Hardware Encoding (UTF-8) Knowing the difference between a “Byte” and a “Character”.
4. Patterns Tries & Windows Efficient prefix and substring management.

3. What is a String?

A String is a sequence of characters. In modern languages like Java and Go, strings are treated as first-class citizens but have specific behaviors you must understand to write efficient code.

The Immutability Rule

In both Java and Go, strings are immutable. This means once a string object is created, it cannot be changed.

  • Java: String s = "hello" creates an object. If you do s = s + " world", you aren’t modifying the original “hello”. You are creating a brand new string “hello world”.
  • Go: s := "hello" creates a string header. s += " world" allocates a new string.

[!IMPORTANT] Why Immutability?

  1. Security: Strings are used for parameters like database URLs, passwords, and file paths. Immutability ensures these values can’t be changed maliciously or accidentally.
  2. Thread Safety: Immutable objects are inherently thread-safe.
  3. String Pool: Java maintains a “String Pool” to save memory. If strings were mutable, changing one reference would affect all others pointing to the same literal.

4. Memory Layout

Understanding how strings are stored helps explain their performance characteristics.

Java (java.lang.String)

In modern Java (Compact Strings), a String object contains:

  1. Object Header: Metadata for the JVM.
  2. Coder: A byte flag indicating if the string is Latin-1 (1 byte/char) or UTF-16 (2 bytes/char).
  3. Value: A byte array byte[] holding the data.
  4. Hash: A cached hash code (computed lazily).

Go (string)

A Go string is a lightweight struct (header) containing:

  1. Data: A pointer to the underlying byte array.
  2. Len: The length of the string in bytes.

Because Go strings are just a pointer and a length, passing them around is very cheap (constant time), regardless of the string’s size.

5. Interactive Visualizer: Immutability vs. Builder

When you concatenate immutable strings in a loop, you create “garbage” intermediate objects. Using a Builder (Java StringBuilder, Go strings.Builder) uses a mutable buffer that expands only when needed.

Heap Memory
Objects Created: 0
Total Operations: 0
Choose an operation to see memory impact.

6. Basic Operations & Complexity

Operation Java (java.lang.String) Go (string) Time Complexity Note
Length s.length() len(s) O(1) Stored in header
Access s.charAt(i) s[i] O(1) Direct array access
Concatenation s1 + s2 s1 + s2 O(N+M) Creates new string
Substring s.substring(i, j) s[i:j] O(N) vs O(1) Go slices are O(1)

[!WARNING] Go Substring Gotcha: In Go, s[i:j] creates a slice sharing the same underlying array. This is fast O(1), but if you slice a tiny part of a huge string and keep it in memory, the entire huge array stays in memory (memory leak). Use strings.Clone(s[i:j]) or string([]byte(s[i:j])) to force a copy if needed.

7. ASCII vs Unicode vs UTF-8

  • ASCII: 7-bit (0-127). English only.
  • Unicode: The standard assigning a unique number (Code Point) to every character in every language.
  • UTF-8: The encoding. Variable length (1-4 bytes). ASCII compatible (1 byte).
  • Emoji 🐂 might take 4 bytes.
  • Letter ‘A’ takes 1 byte.

[!WARNING] Trap: s.length() might not mean “number of characters” if you have Emojis. In Java char is UTF-16 (2 bytes), so some Emojis take 2 chars (Surrogate Pair).

8. Code Implementation

Here is how to perform common string operations efficiently.

public class StringOps {
  public static void main(String[] args) {
    // 1. Creation
    String s1 = "Hello";

    // 2. Length (O(1))
    int len = s1.length();

    // 3. Concatenation (Inefficient for loops)
    String s2 = s1 + " World";

    // 4. Builder (Efficient O(N))
    StringBuilder sb = new StringBuilder();
    sb.append("Hello");
    sb.append(" ");
    sb.append("World");
    String result = sb.toString();

    // 5. Access
    char c = s1.charAt(1); // 'e'

    // 6. Substring (O(N) copy)
    String sub = s1.substring(1, 4); // "ell"

    // 7. Comparison
    // NEVER use == for content comparison!
    boolean isEqual = s1.equals("Hello"); // true
  }
}
package main

import (
	"fmt"
	"strings"
)

func main() {
	// 1. Creation
	s1 := "Hello"

	// 2. Length (O(1) bytes)
	// Note: len() gives bytes, not chars (runes)
	l := len(s1)
	fmt.Println("Bytes:", l)

	// 3. Concatenation (Inefficient for loops)
	s2 := s1 + " World"
	fmt.Println(s2)

	// 4. Builder (Efficient O(N))
	var sb strings.Builder
	sb.WriteString("Hello")
	sb.WriteString(" ")
	sb.WriteString("World")
	result := sb.String()
	fmt.Println(result)

	// 5. Access
	b := s1[1] // Byte at index 1 ('e' is 101)

	// 6. Substring (O(1) shared memory)
	sub := s1[1:4] // "ell"

	// 7. Comparison
	isEqual := s1 == "Hello" // true (Go allows ==)
	fmt.Println(isEqual)
}

Next Steps

Now that you know how strings are stored, let’s learn how to find patterns within them efficiently.


9. Deep Dive Strategy Lab: Basic String Operations

Intuition Through Analogy

Think of this chapter like searching text in a large log stream. The goal is not to memorize a fixed trick, but to repeatedly answer:

  1. What is the bottleneck?
  2. Which constraint dominates (time, memory, latency, correctness)?
  3. Which representation makes the bottleneck easier to eliminate?