You have ingested ten million vector embeddings into Google Cloud’s Vertex AI Vector Search. Your similarity search works perfectly for broad queries. However, as soon as you apply strict business logic—like filtering for "red shoes" that are "in stock" and "under $100"—your results vanish.
You receive zero results, or worse, irrelevant results, even though you know the items exist in your database.
This is the classic HNSW "broken graph" problem, one of the most common causes of low recall in production vector search systems. When metadata filtering is mishandled, the Approximate Nearest Neighbor (ANN) algorithm cannot traverse the graph to reach the valid nodes.
This guide explains exactly why this happens mechanically and provides production-grade Python code for implementing native filtering correctly in Vertex AI, restoring recall without destroying latency.
The Root Cause: Why Filters Break HNSW Graphs
To fix the problem, you must understand the underlying data structure. Vertex AI Vector Search uses HNSW (Hierarchical Navigable Small World) graphs.
In an HNSW index, data points (vectors) are nodes in a multi-layered graph. The algorithm finds the nearest neighbors by hopping from node to node based on proximity.
The "Disconnected Island" Scenario
When you perform a standard vector search, the algorithm traverses the graph greedily. However, when you apply a filter (e.g., color="red"), the algorithm is effectively forbidden from visiting nodes that do not match that filter.
If the "red" nodes are sparsely distributed or clustered far away from the entry point, the graph traversal hits a dead end. The algorithm cannot "hop" over the non-red nodes to reach the relevant ones. It terminates early, returning empty results or a partial list, resulting in low recall.
Post-Filtering vs. Native Filtering
Many developers attempt to solve this with Post-Filtering:
- Fetch the top 1000 nearest neighbors.
- Filter the list in memory using Python.
This fails at scale. If your top 1000 neighbors are all "blue," and you filter for "red," you are left with zero results.
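To see this failure mode concretely, here is a toy simulation. The data is synthetic (not the Vertex AI API); it just makes the arithmetic of post-filtering visible:

```python
# A minimal sketch of why post-filtering fails: if every one of the
# nearest neighbors carries the "wrong" tag, filtering the fetched list
# in memory yields nothing, even though valid items exist further out.

# 10,000 items: the 1,000 closest to the query are all "blue".
corpus = [
    {"id": i, "distance": float(i), "color": "blue" if i < 1_000 else "red"}
    for i in range(10_000)
]

# Step 1: fetch the top 1,000 nearest neighbors (all "blue" here).
top_k = sorted(corpus, key=lambda item: item["distance"])[:1_000]

# Step 2: filter in memory for "red" -- every candidate is discarded.
post_filtered = [item for item in top_k if item["color"] == "red"]

print(len(post_filtered))  # 0 -- "red" items exist but were never fetched
```

No matter how large you make the fetch window, a sufficiently skewed neighborhood can always empty it; only pushing the filter into the search itself removes this failure mode.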
The solution is Native Pre-Filtering (or coordinate filtering). You must configure the Vertex AI index to understand the metadata during the graph traversal, allowing the algorithm to treat non-matching nodes as non-returnable but still navigable, or to use internal optimized inverted indexes to restrict the search space efficiently.
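The "navigable but non-returnable" idea can be sketched with a toy greedy graph search. This is an illustration of the principle only, not Vertex AI's actual internals:

```python
# Toy illustration: a greedy graph search where non-matching nodes remain
# *navigable* (the traversal may pass through them) but are never
# *returned*. This is the core idea behind native pre-filtering.
def filtered_greedy_search(graph, dists, start, predicate, k):
    """graph: node -> neighbor list; dists: node -> distance to query."""
    visited, frontier, results = set(), [start], []
    while frontier:
        node = min(frontier, key=lambda n: dists[n])  # greedy: closest first
        frontier.remove(node)
        if node in visited:
            continue
        visited.add(node)
        if predicate(node):           # only matching nodes become results
            results.append(node)
        for nbr in graph[node]:       # ALL nodes stay navigable
            if nbr not in visited:
                frontier.append(nbr)
    return sorted(results, key=lambda n: dists[n])[:k]

# Linear chain 0-1-2-3-4; only node 4 matches the filter. A traversal
# that skipped non-matching nodes entirely could never reach it.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
dists = {n: float(n) for n in graph}
print(filtered_greedy_search(graph, dists, 0, lambda n: n == 4, k=1))  # [4]
```

If the predicate instead forbade *visiting* non-matching nodes, the search would dead-end at node 0: exactly the "disconnected island" scenario described above.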
Phase 1: Configuring the Index for Filtering
You cannot filter on metadata if the index was not built to support it. You must define a metadata_config during the index creation process.
This configuration tells Vertex AI which fields in your JSON payload should be indexed for filtering.
from google.cloud import aiplatform

# Initialize the Vertex AI SDK
aiplatform.init(
    project="your-gcp-project-id",
    location="us-central1",
    staging_bucket="gs://your-staging-bucket",
)

# Create the index.
# CRITICAL: contents_delta_uri must point to the GCS folder holding your
# initial JSONL data -- the filterable fields are read from that data.
my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="prod-ecommerce-index",
    description="Product catalog with metadata filtering",
    contents_delta_uri="gs://your-staging-bucket/embeddings/",
    dimensions=768,  # must match your embedding model's output size
    approximate_neighbors_count=150,  # candidate count; trades recall vs. speed
    distance_measure_type="DOT_PRODUCT_DISTANCE",
    leaf_node_embedding_count=500,
    leaf_nodes_to_search_percent=10,
    # The crucial part: filterable attributes are mapped from the
    # restricts / numeric_restricts fields in the ingested JSONL (Phase 2).
    index_update_method="STREAM_UPDATE",
)
Note: For Standard/HNSW indexes, Vertex AI now automatically detects string and numeric fields for filtering if your input data follows the specific structure below. Explicit mapping is often required only for legacy Matching Engine versions, but ensuring your JSON structure matches the schema is vital.
Phase 2: Structuring Data for Ingestion
The structure of your embedding file (JSONL) dictates your filtering capabilities. You cannot just dump a JSON object. You must use the restricts list for tokens (strings) and numeric_restricts for values (integers/floats).
The Correct JSONL Format
Save your embeddings to Google Cloud Storage (GCS) in this format:
{"id": "product_8812", "embedding": [0.01, 0.54, ...], "restricts": [{"namespace": "color", "allow": ["red"]}, {"namespace": "status", "allow": ["in_stock"]}], "numeric_restricts": [{"namespace": "price", "value_float": 99.50}]}
{"id": "product_9914", "embedding": [0.05, 0.11, ...], "restricts": [{"namespace": "color", "allow": ["blue"]}, {"namespace": "status", "allow": ["out_of_stock"]}], "numeric_restricts": [{"namespace": "price", "value_float": 150.00}]}
Key Data Rules:
- namespace: Think of this as the column name (e.g., "color").
- allow: The values associated with that column. This is an array, allowing a single item to belong to multiple categories (e.g., a shoe can be both "sport" and "casual").
- numeric_restricts: Strictly for range filtering (greater than, less than).
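A small serialization helper keeps ingestion files consistent with this schema. The field names (color, status, price) are the examples used throughout this article; adapt them to your catalog:

```python
import json

# Sketch of a JSONL serializer for the restricts / numeric_restricts
# layout shown above. The product dict shape here is hypothetical.
def to_record(product: dict) -> str:
    return json.dumps({
        "id": product["id"],
        "embedding": product["embedding"],
        "restricts": [
            # allow is a list, so one item can carry multiple category tags
            {"namespace": "color", "allow": product["colors"]},
            {"namespace": "status", "allow": [product["status"]]},
        ],
        "numeric_restricts": [
            {"namespace": "price", "value_float": product["price"]},
        ],
    })

products = [
    {"id": "product_8812", "embedding": [0.01, 0.54],
     "colors": ["red"], "status": "in_stock", "price": 99.50},
]

# One JSON object per line; write these lines to a file in GCS.
lines = [to_record(p) for p in products]
print(lines[0])
```

Writing records through one helper like this (rather than hand-building dicts at each call site) is the easiest way to prevent schema drift between ingestion batches.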
Phase 3: Implementing the Query with Filters
Once the index is deployed to an Endpoint, you query it using the find_neighbors method.
Here is the robust, production-ready implementation that handles both categorical (allow list) and numeric filtering.
from typing import List, Optional

from google.cloud import aiplatform
from google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint import (
    Namespace,
    NumericNamespace,
)


def query_vector_index(
    endpoint_id: str,
    deployed_index_id: str,
    query_vector: List[float],
    category_filters: Optional[List[dict]] = None,
    price_max: Optional[float] = None,
):
    """Executes a vector search with hard metadata filtering."""
    # Connect to the existing endpoint
    index_endpoint = aiplatform.MatchingEngineIndexEndpoint(
        index_endpoint_name=endpoint_id
    )

    # 1. Build categorical filters (the 'allow' list).
    # Only nodes matching these tags can be returned.
    allow_list = []
    if category_filters:
        for filter_item in category_filters:
            allow_list.append(
                Namespace(
                    name=filter_item["key"],
                    allow_tokens=filter_item["values"],
                )
            )

    # 2. Build numeric filters.
    # Example: restricting search to items under a certain price.
    numeric_list = []
    if price_max is not None:
        numeric_list.append(
            NumericNamespace(
                name="price",
                value_float=price_max,
                op="LESS",  # Options: LESS, LESS_EQUAL, GREATER, GREATER_EQUAL, EQUAL, NOT_EQUAL
            )
        )

    # 3. Execute the query
    response = index_endpoint.find_neighbors(
        deployed_index_id=deployed_index_id,
        queries=[query_vector],
        num_neighbors=10,
        filter=allow_list,            # apply categorical filters
        numeric_filter=numeric_list,  # apply numeric filters
    )
    return response
# Example usage
# Assume the query vector is a 768-dim list from your embedding model
results = query_vector_index(
    endpoint_id="projects/123/locations/us-central1/indexEndpoints/456",
    deployed_index_id="my_deployed_index",
    query_vector=[0.05, -0.02, ...],
    category_filters=[
        {"key": "color", "values": ["red", "maroon"]},
        {"key": "status", "values": ["in_stock"]},
    ],
    price_max=100.0,
)

for match in results[0]:
    print(f"ID: {match.id}, Score: {match.distance}")
Tuning for Latency: The restrict_num_neighbors Trap
There is a hidden configuration often overlooked in documentation: Result Window Sizing.
When you apply a filter, Vertex AI may need to search deeper into the graph to find k neighbors that satisfy the condition. If the filter is very restrictive (e.g., it matches only 0.1% of the database), latency spikes because the engine discards thousands of candidate nodes along the way.
Vertex AI attempts to balance accuracy and speed automatically. If you still see low recall despite correct code, you may need to increase the number of neighbors requested so the engine inspects a wider candidate window before filtering.
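One pragmatic client-side mitigation, sketched below under the assumption that your application can re-issue queries: if a heavily filtered query returns fewer than k results, retry with a larger candidate window. `search_fn` and `fake_search` are hypothetical stand-ins for a call like `index_endpoint.find_neighbors`:

```python
# Hedged sketch: "widen the window and retry" for heavily filtered queries.
# search_fn(num_neighbors) stands in for a real find_neighbors call that
# returns the post-filter matches for one query.
def search_with_widening(search_fn, k: int, max_widenings: int = 3):
    num_neighbors = k
    for _ in range(max_widenings + 1):
        results = search_fn(num_neighbors)
        if len(results) >= k:
            return results[:k]
        num_neighbors *= 4  # widen the candidate window and retry
    return results  # best effort after the final widening

# Toy stand-in: pretends only ~2% of candidates survive the filter.
def fake_search(num_neighbors):
    return list(range(num_neighbors // 50))

print(len(search_with_widening(fake_search, k=10)))  # 10
```

The trade-off is latency: each retry is a full round trip, so cap the widening factor and log queries that need it, since they usually indicate a filter worth restructuring.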
If the filtered set is extremely small, Vertex AI automatically switches from HNSW traversal to a Brute Force search on the subset of data matching the filter. This guarantees 100% recall but can be slower if the subset is large (e.g., >100k items).
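Because the fallback depends on how selective a filter is, it helps to measure selectivity from your own metadata store before blaming the index. A minimal sketch (in-memory list here; in production this could be a SQL COUNT against your catalog database):

```python
# Hedged sketch: estimate filter selectivity so you can anticipate which
# filters are likely to hit the brute-force path on a small matching set.
def filter_selectivity(items, predicate) -> float:
    matches = sum(1 for item in items if predicate(item))
    return matches / len(items)

# Toy catalog: 1 in every 1,000 items is "red".
catalog = [
    {"color": "red" if i % 1000 == 0 else "blue"} for i in range(10_000)
]

sel = filter_selectivity(catalog, lambda item: item["color"] == "red")
print(f"Selectivity: {sel:.2%}")  # 0.10% of the corpus matches this filter
```

Tracking selectivity per filter combination also tells you which queries to benchmark separately, since their latency profile differs from the unfiltered path.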
Ensuring High Recall on Sparse Filters
If you have a high-cardinality field (e.g., user_id where every user is unique), HNSW performance degrades.
For high-cardinality filtering, verify your shard_size during index creation. Smaller shard sizes generally handle heavy filtering better because the "islands" of valid data are less dispersed, but this increases cost.
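As a sketch, the relevant knob at index-creation time might look like the parameters below. The `SHARD_SIZE_*` values follow the google-cloud-aiplatform SDK naming, so verify them against your installed SDK version before relying on this:

```python
# Hedged sketch: index parameters worth revisiting for heavy filtering.
# All values here are illustrative, not recommendations for your workload.
index_params = {
    "display_name": "prod-ecommerce-index-small-shards",
    "contents_delta_uri": "gs://your-staging-bucket/embeddings/",
    "dimensions": 768,
    "approximate_neighbors_count": 150,
    # Smaller shards tend to tolerate restrictive filters better,
    # at the cost of more machines (and therefore higher spend).
    "shard_size": "SHARD_SIZE_SMALL",
}

# The call itself requires GCP credentials, so it is left commented out:
# my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(**index_params)
print(index_params["shard_size"])
```

Note that shard size is fixed at creation time; changing it means rebuilding the index, so benchmark filtered queries before committing ten million vectors to a layout.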
Summary
The "Zero Results" error in Vector Search is rarely an issue with your embeddings; it is an issue with graph connectivity.
- Stop Post-Filtering: It destroys recall.
- Schema Enforcement: Ensure ingestion JSON uses restricts and numeric_restricts.
- Query Correctly: Use the Namespace and NumericNamespace classes from the SDK.
- Monitor Density: If a filter matches <1% of your data, expect Vertex AI to switch to brute force for that query. This is normal and desirable for accuracy.
By pushing the filtering logic down into the search engine's traversal layer, you maintain the speed of HNSW while respecting the strict logic required by business applications.