Automating Sentinel-2 Cloud Masking with STAC and Rasterio

Building deterministic, audit-ready reflectance composites for greenhouse gas MRV pipelines requires precise atmospheric correction and rigorous cloud exclusion. Automating Sentinel-2 Cloud Masking with STAC and Rasterio establishes a reproducible spatial data foundation that satisfies ISO 14064-3 verification standards and GHG Protocol spatial accounting requirements. The engineering intent is singular: extract the Scene Classification Layer (SCL) from ESA’s Level-2A archive via STAC catalogs, apply deterministic boolean masking to spectral bands, and enforce fallback routing when classification metadata degrades. This workflow eliminates manual QA/QC bottlenecks while preserving pixel-level provenance for carbon stock change verification. For engineering teams scaling Satellite Imagery Processing for Emissions Tracking, deterministic masking functions as the primary compliance gate before temporal aggregation or flux modeling.

flowchart TB A["STAC search<br/>sentinel-2-l2a · AOI · date"] --> B{"SCL asset present?"} B -->|yes| C["Conservative SCL mask<br/>classes 3,8,9,10 + dilation"] B -->|no| D["Fallback:<br/>QA60 or exclude"] C --> E["Windowed masked read<br/>strict CRS check"] D --> E E --> F["Audit record<br/>exclusion % · compliance status"]

1. Deterministic STAC Asset Resolution

The STAC API serves as the deterministic entry point for catalog traversal. Rather than relying on static tile directories or ad-hoc HTTP scrapers, pipelines must query the sentinel-2-l2a collection using pystac-client with explicit spatial-temporal filters and asset resolution constraints. The SCL band must be requested alongside target spectral assets (typically B04, B08, B11 for vegetation and soil carbon proxies). When querying production catalogs, enforce eo:cloud_cover thresholds at the item level, but recognize that STAC-reported cloud cover is a scene-level aggregate that does not guarantee cloud-free conditions within your specific Area of Interest (AOI). Pixel-level masking remains mandatory for MRV compliance.

import pystac_client
from datetime import datetime
from typing import Dict, Any, Tuple

def resolve_s2_l2a_assets(
    stac_url: str,
    aoi_bounds: Tuple[float, float, float, float],
    start_date: str,
    end_date: str,
    max_cloud_pct: float = 30.0
) -> Dict[str, Any]:
    client = pystac_client.Client.open(stac_url)
    search = client.search(
        collections=["sentinel-2-l2a"],
        bbox=aoi_bounds,
        datetime=f"{start_date}/{end_date}",
        query={"eo:cloud_cover": {"lt": max_cloud_pct}},
        sortby=[{"field": "datetime", "direction": "desc"}]
    )
    items = list(search.items())
    if not items:
        raise RuntimeError(f"No S2-L2A items found for bounds {aoi_bounds} and period {start_date}/{end_date}.")

    # Deterministic selection: latest acquisition meeting threshold
    item = items[0]
    return {
        "scl": item.assets["SCL"].href,
        "b04": item.assets["B04"].href,
        "b08": item.assets["B08"].href,
        "b11": item.assets["B11"].href,
        "item_id": item.id,
        "datetime": item.datetime.isoformat(),
        "crs": f"EPSG:{item.properties.get('proj:epsg', 32633)}"
    }

2. SCL Classification Mapping & Conservative Masking

Once assets are resolved, the SCL band encodes discrete classification values that must be mapped to a boolean cloud/shadow mask. ESA’s SCL v2/v3 schema assigns 3 to cloud shadow, 8 to medium-probability cloud, 9 to high-probability cloud, and 10 to thin cirrus. For carbon accounting, conservative masking is required: all four classes must be excluded to prevent spectral contamination of NDVI/NDWI indices and biophysical parameter retrieval. Morphological dilation is applied to capture sub-pixel cloud edges that frequently bleed into adjacent valid pixels during atmospheric scattering.

import numpy as np
from scipy.ndimage import binary_dilation

# ESA SCL v2/v3 conservative exclusion classes
EXCLUSION_CLASSES = {3, 8, 9, 10}

def generate_cloud_mask(scl_array: np.ndarray, dilation_iterations: int = 2) -> np.ndarray:
    raw_mask = np.isin(scl_array, list(EXCLUSION_CLASSES))
    structure = np.ones((3, 3), dtype=bool)
    # Dilation captures adjacency contamination from cloud edges
    conservative_mask = binary_dilation(raw_mask, structure=structure, iterations=dilation_iterations)
    return conservative_mask

3. Rasterio Windowed I/O & Strict CRS Alignment

Rasterio handles the windowed I/O required for memory-efficient masking at scale. Strict CRS alignment is non-negotiable for MRV pipelines; mismatched coordinate reference systems introduce geometric bias that invalidates spatial accounting. The pipeline validates CRS parity before reading, extracts the exact window intersecting the AOI, applies the boolean mask, and writes masked reflectance arrays with explicit nodata flags. This ensures downstream temporal aggregators can safely stack pixels without propagating cloud-contaminated values.

import rasterio
from rasterio.windows import from_bounds

def apply_mask_to_bands(
    band_paths: Dict[str, str],
    scl_path: str,
    aoi_bounds: Tuple[float, float, float, float],
    target_crs: str,
    output_dir: str
) -> Dict[str, str]:
    with rasterio.open(scl_path) as scl_src:
        src_crs = scl_src.crs.to_string()
        if src_crs != target_crs:
            raise ValueError(f"SCL CRS {src_crs} does not match target {target_crs}. Reproject before masking.")
        window = from_bounds(*aoi_bounds, scl_src.transform)
        scl_data = scl_src.read(1, window=window)
        transform = scl_src.window_transform(window)

    cloud_mask = generate_cloud_mask(scl_data)
    masked_outputs = {}

    for band_name, band_path in band_paths.items():
        with rasterio.open(band_path) as band_src:
            band_window = from_bounds(*aoi_bounds, band_src.transform)
            band_data = band_src.read(1, window=band_window).astype(np.float32)
            band_data[cloud_mask] = np.nan  # Enforce NaN for cloud/shadow
            
            out_path = f"{output_dir}/{band_name}_masked.tif"
            with rasterio.open(
                out_path, 'w',
                driver='GTiff',
                height=cloud_mask.shape[0],
                width=cloud_mask.shape[1],
                count=1,
                dtype='float32',
                crs=target_crs,
                transform=transform,
                compress='deflate',
                nodata=np.nan
            ) as dst:
                dst.write(band_data, 1)
            masked_outputs[band_name] = out_path
            
    return masked_outputs

4. Fallback Routing & Degraded Metadata Handling

Production STAC catalogs occasionally contain items with missing SCL assets, corrupted processing baselines, or incomplete metadata. MRV pipelines must implement deterministic fallback routing rather than silent failures. If SCL is absent, the pipeline checks for QA60 (L1C bitmask) as a secondary exclusion layer. If neither exists, the item is flagged for exclusion and routed to a compliance exception queue. This aligns with Sentinel-2 & Landsat Cloud Masking Workflows where multi-sensor parity and graceful degradation are enforced.

import pystac

def evaluate_scl_integrity(item: pystac.Item) -> str:
    if "SCL" not in item.assets:
        if "QA60" in item.assets:
            return "QA60_FALLBACK"
        return "EXCLUDED_NO_MASK"
    
    baseline = item.properties.get("s2:processing_baseline")
    if baseline in ["02.00", "02.01"]:
        return "LEGACY_SCL_REVIEW"
    return "VALID"

5. Audit Trail Generation & Compliance Gating

ISO 14064-3 verification requires explicit documentation of data provenance, exclusion criteria, and uncertainty flags. The pipeline generates a structured JSON audit record for every processed item, capturing masking parameters, exclusion percentages, CRS metadata, and compliance status. This record is ingested by downstream verification systems to satisfy GHG Protocol spatial accounting requirements and enable third-party auditor traceability.

import json
from pathlib import Path

def generate_audit_record(
    item_id: str,
    acquisition_dt: str,
    masked_files: Dict[str, str],
    cloud_pct_excluded: float,
    crs: str,
    compliance_standard: str = "ISO_14064-3"
) -> Path:
    audit = {
        "pipeline_version": "s2_masking_v2.1",
        "item_id": item_id,
        "acquisition_datetime": acquisition_dt,
        "crs": crs,
        "masked_assets": masked_files,
        "exclusion_classes_applied": sorted(list(EXCLUSION_CLASSES)),
        "cloud_pixel_pct_excluded": round(cloud_pct_excluded, 4),
        "compliance_standard": compliance_standard,
        "verification_status": "PASS" if cloud_pct_excluded < 0.35 else "REVIEW_REQUIRED",
        "generated_at": datetime.utcnow().isoformat()
    }
    audit_path = Path(f"audit/{item_id}_masking_audit.json")
    audit_path.parent.mkdir(parents=True, exist_ok=True)
    with open(audit_path, "w") as f:
        json.dump(audit, f, indent=2)
    return audit_path

Pipeline Integration & Scaling

This masking module serves as the foundational preprocessing step for carbon flux estimation, deforestation alert generation, and land-use change temporal aggregation. By enforcing strict CRS alignment, deterministic SCL mapping, and explicit audit trail generation, engineering teams eliminate spectral contamination risks that historically invalidated MRV submissions. For reference on atmospheric correction baselines and SCL algorithmic updates, consult the Sentinel-2 Level-2A Algorithm Theoretical Basis Document. Implementation details for windowed I/O and CRS transformations are documented in the Rasterio API Reference. When deployed alongside async tile processors and multi-sensor fusion routines, this workflow scales to continental MRV operations while maintaining verifiable compliance with international greenhouse gas accounting standards.