How a Connect Four Solver Works with Meta Glasses

Real-time computer vision, a perfect solver, and some duct tape holding it together on Meta glasses.

[Figure: the pipeline — camera (10 fps) → AprilTag detection (4 corners) → 720×560 warp → CNN over 42 cells → 4/5 stability vote → Rust alpha-beta solver → audio + HUD.]
The full pipeline: camera frames flow through marker detection, perspective correction, cell classification, stability voting, and the game solver before reaching the user as audio and a visual overlay.

This is a Connect Four app that watches you play through the glasses' camera, detects the board state, and tells you the perfect move in real-time. It knows who wins, how many moves until they win, and which column to play next. There's also stubbed support for Meta Ray-Ban glasses, built speculatively against what their display SDK will likely look like based on their existing Wearables SDK and web platform patterns.

I built it because I was curious whether you could stitch together real-time computer vision, a TFLite classifier, and an exhaustive game solver on mobile hardware and have it actually work at interactive speed. This write-up explains how each piece works.


Seeing the Board

The first problem is the most physical one: the app needs to find a Connect Four board in a live camera feed, figure out its exact boundaries, and correct for whatever angle the camera is looking from. A board viewed from the side is foreshortened, rotated, and perspective-distorted. The app needs a clean, top-down view to classify cells reliably.

AprilTag markers

Four small printed markers are placed at the corners of the board, just outside the playable grid. These are AprilTag 36h11 markers, a family of fiducial markers designed for robust detection in noisy, partially occluded, or off-angle conditions.

The "36h11" part matters. It means each marker encodes 36 bits with a minimum Hamming distance of 11 between any two valid IDs. That lets the detector correct up to 5 bit errors per marker. The previous version used ArUco 4x4 markers (Hamming distance 3, corrects 1 error), which fell apart at oblique viewing angles where foreshortening compresses the marker's dynamic range.

Four AprilTag markers placed outside the grid. The board appears foreshortened from the camera's viewing angle.

The detector runs OpenCV's contour-based ArUco pipeline (not the original AprilTag gradient algorithm) with several parameters tuned for oblique-angle robustness.

When a marker ID appears more than once in a frame (board features sometimes false-positive as markers at oblique angles), the detection with the largest area wins. The real physical marker is always the largest candidate.
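
The largest-area rule is simple to sketch. This is a minimal illustration in Python; the `(marker_id, corners)` detection shape is an assumption, not the app's actual data structure:

```python
def quad_area(corners):
    # Shoelace formula for the area of the marker quadrilateral.
    area = 0.0
    for i in range(4):
        x0, y0 = corners[i]
        x1, y1 = corners[(i + 1) % 4]
        area += x0 * y1 - x1 * y0
    return abs(area) / 2.0

def dedup_by_area(detections):
    # When an ID appears more than once, keep the largest detection:
    # the real physical marker is always the largest candidate.
    best = {}
    for marker_id, corners in detections:
        a = quad_area(corners)
        if marker_id not in best or a > best[marker_id][1]:
            best[marker_id] = (corners, a)
    return {mid: c for mid, (c, _) in best.items()}
```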

The 3-marker fallback

Sometimes a hand or arm occludes one of the four markers. If exactly three are visible and the camera's intrinsic matrix is available (extracted from the frame's metadata), the app recovers the camera pose from the three known world-to-image correspondences using solvePnP (SQPNP algorithm), then projects the missing corner into image space. The result is validated for convexity before use.

This only works when intrinsics are available. The iPhone provides them; the stubbed glasses path does not. On glasses, the design degrades gracefully to "wait for all four markers."

Perspective warp

With four marker centers identified in the camera frame, the app computes a perspective transform (homography) that maps those points to known canonical positions in a 720×560 pixel "warp space" whose dimensions come from the physical board proportions.

OpenCV's getPerspectiveTransform() computes the 3×3 homography matrix, and warpPerspective() applies it with bilinear interpolation to produce a clean top-down image. A user-adjustable calibration offset lets you drag the grid corners to fine-tune alignment. These offsets are applied to the warp destinations before computing the homography, stored in cell-fraction units so they survive resolution changes.
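
What warpPerspective does per pixel is easy to state. A sketch of applying a 3×3 homography to one point, assuming a row-major nested-list matrix; the projective divide by w is what makes this a perspective transform rather than an affine one:

```python
def apply_homography(H, x, y):
    # Map a point through a 3x3 homography: multiply by the matrix,
    # then divide by the projective coordinate w.
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)
```

With the identity matrix, points map to themselves; a matrix with a nonzero bottom row bends straight grids into the foreshortened trapezoids the camera actually sees.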

Perspective correction transforms the skewed camera view into a clean 720×560 top-down image aligned to a fixed grid.

Cell extraction

Once the image is warped, extracting cells is just geometry. Each cell is 80×80 pixels, but a 10% inset on each side (8 pixels) avoids the grid lines, leaving a 64×64 crop. The entire warped image is converted from BGR to HSV before extraction. HSV separates color (hue) from lighting (value), which makes the classifier's job easier under varying illumination.

All 42 cells (7 columns × 6 rows) are extracted in column-major order and packed into a single contiguous buffer: 42 × 64 × 64 × 3 channels = about 500KB of HSV data, ready for the classifier.
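
The geometry can be sketched directly. This assumes, for illustration, that the grid's top-left cell starts at the warp-space origin; the real app may offset the grid within the 720×560 frame:

```python
CELL, INSET_FRAC = 80, 0.10

def cell_crop_rect(col, row):
    # (x, y, w, h) of the inset crop for one cell: 10% inset per side
    # turns an 80x80 cell into a 64x64 crop that avoids the grid lines.
    inset = int(CELL * INSET_FRAC)            # 8 px per side
    size = CELL - 2 * inset                   # 64 px
    return (col * CELL + inset, row * CELL + inset, size, size)

def buffer_index(col, row):
    # Column-major packing: all six rows of column 0, then column 1, ...
    return col * 6 + row
```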

Each 80×80 pixel cell is inset by 10% (8 pixels) on each side, producing a 64×64 crop that avoids the grid lines at cell boundaries.

Blur detection

Before classification, the warped image is checked for blur using the Laplacian variance method: compute the Laplacian (second derivative) of the grayscale image, then take the variance of that result. Sharp images have high variance (lots of edge detail), blurry images have low variance (edges smeared). The threshold is 35.0, though this gate is currently disabled because it was starving the stability validator near endgame positions where the board is nearly full and high-frequency detail is naturally lower.
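
A pure-Python sketch of the metric, using the 4-neighbour Laplacian kernel (OpenCV's Laplacian plus a variance call does the same job far faster on real frames):

```python
def laplacian_variance(gray):
    # gray: 2D list of intensities. Convolve the 4-neighbour Laplacian
    # over the interior, then return the variance of the responses.
    # High variance = sharp edges; low variance = blur.
    vals = []
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            vals.append(gray[y - 1][x] + gray[y + 1][x] +
                        gray[y][x - 1] + gray[y][x + 1] - 4 * gray[y][x])
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)
```

A flat image scores exactly zero; a hard-edged checkerboard scores enormously. The 35.0 threshold sits between those extremes.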


Classifying the Cells

With 42 clean 64×64 HSV crops in hand, the next step is figuring out what's in each cell: empty, red, or yellow. This is a small classification problem, but it needs to be fast (sub-millisecond for all 42 cells) and robust across different lighting conditions, board colors, and camera sensors.

The classifier

A compact TensorFlow Lite convolutional neural network takes all 42 cells as a single batched input (shape [42, 64, 64, 3]) and outputs 3 logits per cell (empty, red, yellow). The model is about 100KB. Inference runs on a single thread; at this model size, the overhead of dispatching to multiple threads exceeds the computation itself.

Each channel is normalized by dividing by 255.0. OpenCV's HSV hue range is [0, 180], not [0, 255], but the model was trained with this exact normalization scheme, so it expects the slightly compressed hue range. Matching the training pipeline exactly is more important than "correct" normalization.

42 cell crops are classified in one batched pass, then post-processed through gravity enforcement and multi-frame stability voting.

The output logits are converted to probabilities via numerically stable softmax (subtract the max logit before exponentiating to avoid overflow). Argmax picks the class, with a deliberate tie-breaking order: red is checked first, then yellow, then empty. This biases toward colored classifications when logits are close. Empty cells always produce very high-confidence "empty" logits, so ties only occur between the two colors, where this bias is harmless.
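
A sketch of that post-processing step (the class indices and names here are assumptions for illustration):

```python
import math

CLASSES = ["empty", "red", "yellow"]
TIE_ORDER = [1, 2, 0]          # red first, then yellow, then empty

def classify_cell(logits):
    # Numerically stable softmax: subtract the max logit first.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Argmax with the deliberate tie-break order (red > yellow > empty):
    # a strictly greater probability is required to displace an earlier class.
    best = TIE_ORDER[0]
    for i in TIE_ORDER[1:]:
        if probs[i] > probs[best]:
            best = i
    return CLASSES[best], probs
```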

Gravity enforcement

Connect Four has a physical constraint that pure classification ignores: pieces fall. A red piece can't float above an empty cell. After classification, a post-processing pass walks each column bottom-to-top. The first time it hits an empty cell, every cell above it is forced to empty, regardless of what the classifier said. This catches edge-effect misclassifications where the grid lines or shadows at column edges confuse the model.

def enforce_gravity(grid):
    # grid[col][row] with row 0 at the bottom; values: "empty", "red", "yellow"
    for column in grid:
        found_empty = False
        for row in range(6):               # bottom to top
            if column[row] == "empty":
                found_empty = True
            elif found_empty:
                column[row] = "empty"      # can't float above a gap

Parity validation

Before the grid goes any further, a cheap sanity check: count the red and yellow pieces. In a legal Connect Four game, red always moves first, so either the counts are equal (yellow's turn) or red has exactly one more (red's turn). Any other count means the classification is wrong; reject the frame immediately. This catches gross misclassifications for nearly zero cost and is the first filter in the error-handling pipeline.
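
The check is a one-liner:

```python
def parity_ok(red_count, yellow_count):
    # Red moves first, so a legal position has equal counts (yellow to move)
    # or red ahead by exactly one (red to move). Anything else is a bad read.
    return red_count - yellow_count in (0, 1)
```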

Stability voting

A single frame's classification might be noisy: a shadow passes over the board, a hand partially occludes a cell, the camera auto-adjusts exposure. The stability validator smooths this out by requiring consensus across multiple frames before committing to a board state.

It maintains a circular buffer of the 5 most recent classified grids. For each of the 42 cells, it counts how many of the 5 frames agree on that cell's value. If all 42 cells have a supermajority (4 out of 5 frames agree), the grid is promoted to "stable." Otherwise, it stays "unstable" and the app shows a preview but doesn't solve.
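
A sketch of the per-cell vote, assuming each grid is a flat list of 42 cell labels:

```python
from collections import Counter

def vote(frames, need=4):
    # frames: the 5 most recent classified grids. Every one of the 42 cells
    # must independently reach `need`-of-5 agreement, or the whole grid
    # stays unstable.
    grid = []
    for cell_history in zip(*frames):
        label, count = Counter(cell_history).most_common(1)[0]
        if count < need:
            return None                     # still unstable
        grid.append(label)
    return grid
```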

After consensus, three physical invariants are checked:

  1. Gravity: No pieces floating above empty cells (same check as post-classification, but on the consensus grid).
  2. Monotonicity: Pieces can only be added, never removed. If cells that were occupied in the last stable grid are now empty, something's wrong. Exception: if more than 6 cells disappear, it's a board reset (new game).
  3. Cell delta: At most 3 cells can change between consecutive stable grids. One new piece plus up to 2 self-corrections from the ML model.
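
The monotonicity and cell-delta checks can be sketched as follows. The return labels are hypothetical, and the gravity check (same as the post-classification pass) is omitted:

```python
def check_invariants(prev_stable, new_grid):
    # Grids are flat lists of 42 labels. Returns "ok", "reset", or "reject".
    removed = sum(1 for p, n in zip(prev_stable, new_grid)
                  if p != "empty" and n == "empty")
    if removed > 6:
        return "reset"                      # board cleared: new game
    if removed > 0:
        return "reject"                     # monotonicity: pieces never vanish
    changed = sum(1 for p, n in zip(prev_stable, new_grid) if p != n)
    return "ok" if changed <= 3 else "reject"   # one new piece + 2 corrections
```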

There's one escape hatch: if the consensus grid violates an invariant for 10 consecutive frames (~1.4 seconds at 7fps), the validator overrides the previous stable grid and accepts the new one. This prevents a single misclassification from permanently "poisoning" the reference grid and locking out all future updates.

Top: the 5-frame circular buffer votes per cell; 4/5 agreement required. Bottom: the consensus grid passes through three physical invariants. Failures increment a rejection counter; after 10 consecutive rejections of the same grid, the validator overrides and accepts.

Solving the Game

Connect Four is a solved game. With perfect play from both sides, the first player (red) always wins. The solver in this app doesn't approximate or use heuristics. It exhaustively searches the entire game tree using alpha-beta pruning to find the provably optimal move from any legal position. Every score it reports is mathematically certain.

Bitboard representation

The board is stored as two 64-bit integers: one for red's pieces, one for yellow's. Each column gets 7 bits (6 playable rows plus a sentinel bit that prevents overflow during shift operations). Column-major layout means bit index = column * 7 + row.
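
A sketch of the classic bitboard drop trick this layout enables: adding the column's bottom bit carries up through the occupied bits and lands on the lowest empty cell.

```python
def drop_bit(mask_both, col):
    # mask_both = red | yellow. The add ripples a carry up the column;
    # masking with the 6 playable bits isolates the landing cell.
    bottom = 1 << (col * 7)
    column = 0x3F << (col * 7)      # 6 playable bits; the 7th is the sentinel
    return (mask_both + bottom) & column
```

A result of zero means the column is full, so the same expression doubles as the legal-move test.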

Each column uses 7 bits: 6 playable rows plus a sentinel. Bit index = column × 7 + row. The center column (3) is evaluated first during search.

Win detection in ~10 cycles

Checking if a player has won is the hottest inner-loop operation in the solver. Doing it with loops over the 2D grid would be expensive. With bitboards, it takes about 10 CPU cycles using the shift-and-mask trick.

The idea: if you bitwise-AND a player's mask with itself shifted by some amount, any set bit in the result means "this player has a piece here AND a piece at the shifted position." Two adjacent pieces in some direction. Do it again with double the shift, and you've confirmed four in a row.

fn has_four(mask: u64) -> bool {
    // Horizontal: shift by 7 (next column, same row)
    let h = mask & (mask >> 7);
    if h & (h >> 14) != 0 { return true; }

    // Vertical: shift by 1 (same column, next row)
    let v = mask & (mask >> 1);
    if v & (v >> 2) != 0 { return true; }

    // Diagonal /: shift by 8 (up-right)
    let d1 = mask & (mask >> 8);
    if d1 & (d1 >> 16) != 0 { return true; }

    // Diagonal \: shift by 6 (down-right)
    let d2 = mask & (mask >> 6);
    if d2 & (d2 >> 12) != 0 { return true; }

    false
}

The shift amounts come from the column-major layout. Adjacent columns are 7 bits apart (horizontal). Same column, adjacent rows are 1 bit apart (vertical). The diagonals are 8 (7+1) and 6 (7−1). The sentinel bit in each column prevents false positives from wrapping between columns.
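
A direct Python port of the same check is handy for convincing yourself the shifts are right:

```python
def has_four(mask):
    # One shift per direction: 7 = horizontal, 1 = vertical,
    # 8 = diagonal /, 6 = diagonal \. Two rounds of shift-and-AND
    # confirm four in a row.
    for shift in (7, 1, 8, 6):
        pairs = mask & (mask >> shift)
        if pairs & (pairs >> (2 * shift)):
            return True
    return False
```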

Alpha-beta search with opening book

The solver uses negamax with alpha-beta pruning, a standard minimax variant where the score is always from the perspective of the player about to move. Alpha-beta prunes branches that can't possibly affect the final result, reducing the effective branching factor from ~7 to ~3–4.

Move ordering matters enormously for pruning efficiency. Columns are evaluated center-first: [3, 2, 4, 1, 5, 0, 6]. Center columns are statistically stronger in Connect Four (they participate in more potential four-in-a-row lines), so evaluating them first establishes strong bounds early, letting alpha-beta prune more of the remaining branches.

A transposition table caches the scores of previously analyzed positions, avoiding redundant work when the same position is reachable through different move orders. The table persists across calls within a session, providing a 2–10x speedup during continuous gameplay.
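
A compact sketch of the search loop, with scores simplified to win/draw/loss (+1/0/−1) instead of the real solver's distance-aware scores, and without the transposition table:

```python
FULL = 0
for c in range(7):
    FULL |= 0x3F << (c * 7)             # all 42 playable bits set
ORDER = [3, 2, 4, 1, 5, 0, 6]           # center-first move ordering

def drop(both, col):
    # Lowest empty cell in a column, or 0 if the column is full.
    return (both + (1 << (col * 7))) & (0x3F << (col * 7))

def wins(mask):
    for shift in (7, 1, 8, 6):
        p = mask & (mask >> shift)
        if p & (p >> (2 * shift)):
            return True
    return False

def negamax(own, both, alpha=-1, beta=1):
    # own: bitboard of the side to move; both: all pieces.
    if both == FULL:
        return 0                        # board full: draw
    moves = [m for m in (drop(both, c) for c in ORDER) if m]
    for m in moves:
        if wins(own | m):
            return 1                    # an immediate win is available
    for m in moves:
        # After the move, the opponent (both ^ own) becomes the side to move.
        score = -negamax(both ^ own, both | m, -beta, -alpha)
        if score > alpha:
            alpha = score
        if alpha >= beta:
            break                       # prune: opponent won't allow this line
    return alpha
```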

For the first 14 plies (7 moves per player), the solver doesn't search at all. It looks up the position in Pascal Pons' opening book, a precomputed database of 8.4 million positions (~32MB) that maps every reachable position within 14 ply to its exact game-theoretic value. The book uses a base-3 symmetric encoding (key3) that exploits the board's left-right mirror symmetry to halve the storage requirement.

[Figure: alpha-beta search tree, columns evaluated center-outward (3, 2, 4, 1, 5, 0, 6). Score meaning: +N = current player wins in N moves; −N = current player loses in N moves; 0 = draw with perfect play.]
Alpha-beta search explores center columns first. Strong bounds from the center let it prune outer branches early (dashed lines = pruned).

Score interpretation

The solver returns a score for each playable column: positive means the player to move wins with perfect play, negative means they lose, and zero means the game is a draw.

The raw score is converted to "plies to end," the total number of moves by both players until the game resolves. If the score is +5 and there are 20 pieces on the board, the formula works out how many moves remain and expresses it as a countdown both players can understand.

The FFI boundary

The solver is written in Rust, compiled as a static library, and called from Swift through a C FFI bridge. The interface is three functions:

c4_init()                                    // load opening book
c4_solve_for_both(red_mask, yellow_mask, out) // solve
c4_destroy()                                 // free memory

The Swift side wraps this in an async/await interface backed by a serial dispatch queue. The queue serializes access to the Rust-side mutex, since the solver has mutable internal state (transposition table, position counter) that can't be accessed concurrently. A 15-second timeout prevents the UI from blocking indefinitely on worst-case positions.

Performance

Game phase            | Typical time | Source
----------------------|--------------|--------------------------------
Opening (0–14 ply)    | < 1 ms       | Book lookup
Mid-game (15–30 ply)  | 1–50 ms      | Alpha-beta + transposition table
Late-game (> 30 ply)  | < 5 ms       | Few remaining moves
Worst case            | 2–8 s        | Deep positions near TT boundary

The Pipeline

The detection and solving stages need to be coordinated. Frames arrive continuously, the solver takes variable time, the board might disappear and reappear, and the UI needs to stay responsive through all of it. The glue holding this together is a pure reducer-based state machine.

State, events, effects

The core of the pipeline is a single function:

BoardPipeline.reduce(state, event, now) -> (newState, effects)

It takes the current state and an event, returns a new state and a list of side effects to execute. The reducer itself is pure: no async calls, no I/O, no reference types. All side effects (starting a solve, playing audio, canceling a task) are represented as values in the returned effects list. A thin ViewModel layer interprets those effects and executes them.

This matters because it makes the entire state machine testable without mocks. Every test is: set up a state, send an event, assert on the new state and effects. Deterministic, fast, no flaky timing.
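
A sketch of the reducer's shape, with hypothetical state, event, and effect names; the real reducer handles far more events, and a stale solver result goes to the speculative cache rather than being dropped:

```python
def reduce(state, event, now):
    # Pure function: no I/O, no async. Effects are returned as plain values
    # for a ViewModel layer to execute.
    if event["type"] == "stableGrid":
        grid = event["grid"]
        if state.get("solving") == grid:
            return state, []                       # solve already in flight
        effects = [("cancelSolve",), ("solve", grid)]
        return {**state, "solving": grid}, effects
    if event["type"] == "solveCompleted":
        if event["grid"] != state.get("solving"):
            return state, []                       # stale result
        new = {**state, "solving": None, "result": event["result"]}
        return new, [("announce", event["result"])]
    return state, []
```

A test, as described, is just: set up a state, send an event, assert on the output; no mocks, no timers.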

The pipeline is a pure reducer. Camera frames flow through detection, debouncing, and the reducer. Effects (solve, announce) are returned as values and executed by the ViewModel. Solver results loop back as events.

Asymmetric debouncing

State transitions between "board found" and "board not found" are debounced with deliberately asymmetric thresholds. Board appearance is fast (160ms), because the user wants quick feedback when they point the camera at the board. Board disappearance is slow (500ms), because losing the board and re-acquiring it triggers a full re-solve cycle, and a hand briefly passing over the board shouldn't cause the UI to flash.

Transitions between states of the same category (stable to stable, unstable to unstable) pass through instantly with zero delay; they've already been validated by the stability voter.
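
A sketch of the asymmetric debouncer; the structure is an assumption, and only the 160 ms / 500 ms thresholds come from the app:

```python
class Debouncer:
    APPEAR, DISAPPEAR = 160, 500        # hold times in ms, deliberately uneven

    def __init__(self):
        self.visible = False
        self.pending = None
        self.since = 0.0

    def update(self, raw_visible, now_ms):
        if raw_visible == self.visible:
            self.pending = None          # flicker ended: cancel the pending flip
        elif self.pending != raw_visible:
            self.pending, self.since = raw_visible, now_ms   # start the clock
        elif now_ms - self.since >= (self.APPEAR if raw_visible else self.DISAPPEAR):
            self.visible, self.pending = raw_visible, None   # commit the change
        return self.visible
```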

Asymmetric debouncing: board appearance commits quickly (160ms), but disappearance requires a sustained 500ms gap. A hand briefly passing over the board doesn't trigger a re-solve cycle.

Confidence smoothing

The classifier reports a confidence score (average softmax probability across all 42 cells) for each frame. Raw confidence jitters frame-to-frame. An exponential moving average (EMA) smooths it: 25% weight to the new value, 75% to the running average. This reduces UI flicker while still responding to real changes within a few frames.

If smoothed confidence drops below 60%, the solver isn't dispatched. The grid is shown in a "transitional" state, visible to the user but not trusted enough to solve. A wrong solve is worse than no solve.
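
The smoothing and the gate in miniature:

```python
EMA_ALPHA = 0.25          # 25% weight to the newest confidence sample
SOLVE_GATE = 0.60         # below this, show the grid but don't dispatch a solve

def smooth(prev, new, alpha=EMA_ALPHA):
    # Exponential moving average: new sample blended into the running value.
    return alpha * new + (1 - alpha) * prev
```

Starting from 0.9, a sudden drop to 0.4 takes four frames to pull the smoothed value under the 0.60 gate, which is exactly the flicker-vs-responsiveness trade the EMA buys.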

Speculative caching

Here's a nice optimization. While the grid is still unstable (hasn't achieved 4/5 frame consensus), the pipeline checks: does this grid pass gravity? If so, speculatively solve it in the background and cache the result in a single-slot cache.

When the grid finally stabilizes and matches the cached grid, the result is available instantly, with no waiting for the solver. This eliminates the perceived latency between "board detected" and "move displayed" for the common case where the unstable grid was already correct, just not yet confirmed by the supermajority vote.

The cache is single-slot because Connect Four is monotonic: pieces only get added, never removed. Only the most recent speculative result could possibly match the next stable grid.
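
The whole cache is a few lines:

```python
class SpeculativeCache:
    # Single slot: since pieces are only ever added, only the most recent
    # speculative result can possibly match the next stable grid.
    def __init__(self):
        self.grid, self.result = None, None

    def store(self, grid, result):
        self.grid, self.result = grid, result   # overwrite the previous slot

    def lookup(self, stable_grid):
        return self.result if stable_grid == self.grid else None
```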

Solver deduplication

Stable frames keep arriving while the solver is running. Without deduplication, each stable frame would cancel the in-progress solve and restart it, preventing the solver from ever finishing. The reducer tracks which grid is currently being solved and skips the effect if it's the same grid. Similarly, if the solver completes but the board has already changed, the stale result is stored in the speculative cache (useful if the board changes back) but not applied to the UI.


What You Hear and See

The app was designed with Meta Ray-Ban glasses in mind (the glasses integration is stubbed, anticipating their display SDK). The idea is that you can't look at a screen while playing, so everything the solver knows needs to be communicated through audio. The visual HUD exists for the glasses' display, but audio is the primary channel.

Two audio channels

There are two independent audio streams, overlaid:

  1. Move channel: Repeats the recommended column continuously. "Red four. Red four. Red four." Each clip plays, then a 300ms gap, then it repeats. This is the core tactical information: which column to play next.
  2. Win distance channel: Every 8 seconds, overlaid on top of the move channel: "Red wins in... thirteen." This gives strategic context: how far ahead or behind you are. Plays at 1.5x speed and 50% volume so it sits underneath the move announcements without overpowering them.
Two overlaid audio channels: the move channel repeats the recommended column continuously; the win distance channel overlays the game-theoretic outcome every 8 seconds.

All 61 audio clips (7 columns × 2 colors, plus win announcements, plus numbers 1–42 for ply counts, plus outcome prefixes) are pre-loaded into memory at app launch. No disk I/O during playback. A generation counter prevents stale callbacks: when the board changes and a new announcement starts, the counter increments, and any pending callbacks from the old announcement check the counter and silently exit.
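
A sketch of the generation-counter idea (the names here are hypothetical):

```python
class AnnouncementSession:
    # Each new announcement bumps the generation. Callbacks from an older
    # announcement see a mismatched generation and silently exit.
    def __init__(self):
        self.generation = 0

    def start_announcement(self):
        self.generation += 1
        return self.generation            # captured by the playback callbacks

    def on_clip_finished(self, generation):
        if generation != self.generation:
            return False                  # stale: a newer announcement took over
        return True                       # still current: schedule the next repeat
```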

The HUD

The visual overlay is a compact SwiftUI view pinned to the bottom of the camera feed. It uses a fixed-height slot architecture: two slots (move display at 60pt, board preview at 130pt) that are always present in the view hierarchy, with content toggled via opacity rather than conditional rendering. This prevents SwiftUI from tearing down and rebuilding the view tree on every state change, which would cause layout thrashing at 30fps.

The move display shows both players' recommended columns (R→4 Y→5) and the game outcome (R:13 meaning "red wins in 13 plies") in 36pt bold monospace, readable on small screens and through glasses. Column numbers use smooth numeric transitions (.contentTransition(.numericText())) so the display morphs rather than flashes when the recommendation changes.

The board preview is a miniature 7×6 grid of colored circles: red, yellow, and translucent white for empty cells. It gives the user immediate visual confirmation that the app is seeing the board correctly, and helps diagnose classification errors when things go wrong.

The HUD overlay: two fixed-height slots anchored to the bottom of the camera feed. The move display shows both players' recommended columns and the game outcome. The board preview mirrors the detected grid state.

Recording

For analysis and sharing, the HUD can be recorded as an H.264 video at 30fps. The recorder uses a CADisplayLink synced to the display refresh, pulling the latest rendered frame from a single-slot pixel buffer cache. If the HUD updates faster than 30fps, intermediate frames are dropped. If it updates slower, the last frame is repeated.

The recording variant uses a white background with dark text (instead of the live HUD's translucent overlay on the camera feed), making it legible as a standalone video. The writer is created lazily on the first frame, locking dimensions to exactly what SwiftUI rendered, so no size estimation is needed.

Release Timeline

Meta currently restricts third-party app publishing on their glasses to select partners only, through the Wearables Device Access Toolkit preview. Broader public publishing is expected to open sometime in 2026, but no specific date has been announced. A public release of this app is entirely dependent on when that happens.


I hope you enjoyed this technical write-up and found it useful. Check out my other AI glasses app at saythisapp.com, built to eliminate approach anxiety so you can approach any girl you want in public.