Architecting Weather Data Storage

The Fundamental Problem

Weather forecast systems face a unique data challenge: serving millions of point queries from datasets containing billions of values. A global weather model with 4 million grid cells, 168 hourly timestamps, and 35 variables contains nearly 24 billion data points. Users typically request a tiny fraction: the forecast for one location. This massive selectivity ratio (retrieving ~6,000 values from 24 billion) defines our entire storage strategy.
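
To make that ratio concrete, here is the arithmetic as a small Python sketch (the numbers are the ones quoted above, not from any particular model file):

    grid_cells = 4_000_000       # horizontal grid points in the global model
    timestamps = 168             # one week of hourly forecast steps
    variables = 35               # temperature, wind, precipitation, ...

    total_values = grid_cells * timestamps * variables   # 23,520,000,000 (~24 billion)
    per_location = timestamps * variables                 # 5,880 values for one point query

    print(f"selectivity: {per_location / total_values:.1e}")   # 2.5e-07, one four-millionth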

The problem compounds with continuous updates. Weather models run every 3-6 hours, and each run partially overwrites previous predictions with newer, more accurate data. Any storage solution must handle constant writes while maintaining read performance.

Access Patterns Drive Architecture

Start with how data is accessed, not how it's naturally structured. Weather data originates as grid snapshots at discrete timestamps, but APIs serve time-series for specific locations. This mismatch between production and consumption patterns is the root cause of most performance problems.

Consider two query types:

  1. "Show global temperature at 3 PM tomorrow" (spatial query)
  2. "Show this week's hourly forecast for New York" (time-series query)

Most applications need the second type. Optimizing for spatial queries when 99% of requests are time-series queries is architectural malpractice. Storage must align with dominant access patterns.

Why Traditional Approaches Fail

Naive File Storage

Storing each timestamp as a separate gridded file seems logical but requires opening 168 files to build a week's forecast. File operations have fixed overhead regardless of data size. Reading one value from 168 files takes longer than reading 168 values from one file. Multiplication of overhead kills performance.
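
A sketch of how that overhead multiplies (file names and offsets are hypothetical; the point is that the per-open and per-seek cost is paid 168 times in the first version and once in the second):

    # Naive layout: one gridded file per timestamp.
    def read_point_from_hourly_files(paths, byte_offset, value_size=4):
        """168 opens and 168 seeks to assemble one location's weekly forecast."""
        values = []
        for path in paths:                       # e.g. 168 hourly files
            with open(path, "rb") as f:
                f.seek(byte_offset)
                values.append(f.read(value_size))
        return b"".join(values)

    # Time-series layout: the same 168 values sit next to each other in one file.
    def read_point_from_series_file(path, byte_offset, count=168, value_size=4):
        """One open, one seek, one sequential read."""
        with open(path, "rb") as f:
            f.seek(byte_offset)
            return f.read(count * value_size)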

Relational Databases

Tables with rows for each observation appear to solve the problem but introduce different inefficiencies. Every row duplicates metadata (timestamp, coordinates), indices consume massive memory, and bulk updates during model refreshes lock tables for hours. The fundamental issue: relational databases optimize for flexible queries, but weather APIs have completely predictable access patterns. This flexibility has a cost we don't need to pay.

The Solution: Reorganize Around Usage

Instead of storing data as produced (time-then-location), store it as consumed (location-then-time). Transform the three-dimensional dataset from [timestamp][latitude][longitude] to [latitude][longitude][timestamp]. Now each geographic point's complete time-series is contiguous on disk.
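
A minimal sketch of this reorganization with NumPy (the grid is scaled down so the example runs in memory; a production pipeline would stream the transposition rather than hold the full dataset at once):

    import numpy as np

    # Scaled-down grid for illustration; the real model has ~4 million cells.
    n_time, n_lat, n_lon = 168, 100, 200

    # As produced: one grid snapshot per timestamp -> [timestamp][latitude][longitude]
    produced = np.random.rand(n_time, n_lat, n_lon).astype(np.float32)

    # As consumed: one time-series per grid point -> [latitude][longitude][timestamp]
    consumed = np.ascontiguousarray(produced.transpose(1, 2, 0))

    # A single location's full week of hourly values is now one contiguous run,
    # in memory and, once written out, on disk.
    series = consumed[42, 137, :]                # 168 adjacent float32 values
    assert series.flags["C_CONTIGUOUS"]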

This reorganization enables single-read operations for entire forecasts. Disk controllers and operating systems are optimized for sequential reads. Modern SSDs can read sequential data at 3+ GB/second, while small random reads are roughly 100x slower. By making time-series queries sequential, we align with hardware capabilities.

Critical Implementation Details

Update Strategy

New model runs must merge with existing data without disrupting reads. In-place updates work best: maintain fixed-size files where new runs overwrite specific byte ranges. This eliminates file fragmentation and simplifies backup strategies. Each file holds perhaps 10 days of data, with model runs continuously updating the relevant portions.
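
A sketch of the in-place update against a pre-allocated, fixed-size file (file name, shape, and the 240-hour window are illustrative; np.memmap is one convenient way to overwrite a byte range without rewriting the whole file):

    import numpy as np

    N_LOCATIONS, N_STEPS = 10_000, 240           # e.g. 10 days of hourly data per file
    DTYPE = np.float32

    # Fixed-size file created once; every model run writes into it in place.
    store = np.memmap("temperature_2m.dat", dtype=DTYPE, mode="r+",
                      shape=(N_LOCATIONS, N_STEPS))

    def apply_model_run(new_values, first_step):
        """Overwrite the slice of forecast hours covered by the latest model run.

        new_values: (N_LOCATIONS, n_steps) array from the new run
        first_step: index of the first forecast hour this run replaces
        """
        n_steps = new_values.shape[1]
        store[:, first_step:first_step + n_steps] = new_values
        store.flush()                            # write the dirty pages back to disk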

Memory Mapping

Treat files as arrays in memory using mmap(). The operating system handles paging, keeping frequently accessed data in RAM. Popular locations stay cached automatically. This provides database-like convenience with file-based performance.
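
A minimal read path on top of the same hypothetical file, using NumPy's memmap (a wrapper around mmap()); there is no explicit cache code, because the OS page cache keeps popular locations resident in RAM:

    import numpy as np

    N_LOCATIONS, N_STEPS = 10_000, 240
    forecast = np.memmap("temperature_2m.dat", dtype=np.float32, mode="r",
                         shape=(N_LOCATIONS, N_STEPS))

    def point_forecast(location_index):
        # One contiguous slice of the mapping; only the touched pages hit disk,
        # and pages for frequently requested locations stay cached by the OS.
        return np.array(forecast[location_index])    # copy out of the mapping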

Data Type Selection

Not all variables need 32-bit precision. Temperature can use 16-bit floats, saving 50% storage with negligible accuracy loss. Binary flags (precipitation yes/no) need just one bit. Careful data type selection can reduce storage by 60% or more.
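
A sketch of the savings with NumPy dtypes (np.float16 for temperature, np.packbits for one-bit flags; the values are synthetic):

    import numpy as np

    n = 1_000_000
    temp_f32 = np.random.uniform(-40.0, 40.0, n).astype(np.float32)
    temp_f16 = temp_f32.astype(np.float16)                # half the bytes
    print(temp_f32.nbytes, temp_f16.nbytes)               # 4,000,000 vs 2,000,000 bytes
    print(np.abs(temp_f32 - temp_f16.astype(np.float32)).max())   # worst case ~0.02 °C

    rain_flag = np.random.rand(n) < 0.1                   # boolean: precipitation yes/no
    packed = np.packbits(rain_flag)                       # 1 bit per flag
    print(rain_flag.nbytes, packed.nbytes)                 # 1,000,000 vs 125,000 bytes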

Compression Trade-offs

Compression saves storage but adds CPU overhead. For archived data, aggressive compression makes sense. For active forecasts serving thousands of requests per second, decompression latency may exceed storage savings. Measure actual workload patterns before committing to compression.
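
One way to measure before deciding, sketched with zlib as a stand-in codec (the series is synthetic; the numbers to compare are decompression latency per request versus bytes saved):

    import time, zlib
    import numpy as np

    # A smooth, quantized week of hourly temperatures (0.05 °C steps) as one payload.
    hours = np.arange(168)
    temps = 15 + 8 * np.sin(2 * np.pi * hours / 24)
    payload = np.round(temps / 0.05).astype(np.int16).tobytes()

    compressed = zlib.compress(payload, level=6)

    t0 = time.perf_counter()
    for _ in range(10_000):
        zlib.decompress(compressed)
    per_request = (time.perf_counter() - t0) / 10_000

    print(f"{len(payload)} -> {len(compressed)} bytes, "
          f"{per_request * 1e6:.1f} µs decompression per request")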

Scaling Considerations

Horizontal Partitioning

Divide the globe into regions, each stored on different servers. North American queries hit North American servers. This provides natural load balancing and enables regional deployment close to users.
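
A toy routing table for the regional split (boundaries and hostnames are invented for illustration; real ones come from the actual deployment):

    # Hypothetical region boundaries and hostnames.
    REGIONS = {
        "north_america": {"lat": (15, 72), "lon": (-170, -50), "host": "na.example.net"},
        "europe":        {"lat": (35, 72), "lon": (-25, 45),   "host": "eu.example.net"},
    }

    def route(lat, lon, default="global.example.net"):
        """Send a point query to the server holding that region's slice of the grid."""
        for region in REGIONS.values():
            (lat0, lat1), (lon0, lon1) = region["lat"], region["lon"]
            if lat0 <= lat <= lat1 and lon0 <= lon <= lon1:
                return region["host"]
        return default

    # route(40.7, -74.0) -> "na.example.net": New York traffic stays on North American servers.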

Time-Based Tiering

Recent forecasts need millisecond access. Last month's data can tolerate 100ms latency. Last year's might accept seconds. Use SSDs for current data, HDDs for recent history, and object storage for archives. Let access patterns determine storage tiers.
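
A sketch of tier selection driven by data age (the thresholds are illustrative; the real cut-offs should come from measured request frequencies and acceptable latencies):

    from datetime import datetime, timedelta, timezone

    def storage_tier(model_run_time, now=None):
        """Choose a storage tier from the age of the model run (illustrative thresholds)."""
        now = now or datetime.now(timezone.utc)
        age = now - model_run_time
        if age < timedelta(days=14):
            return "ssd"               # active forecasts: millisecond access
        if age < timedelta(days=90):
            return "hdd"               # recent history: ~100 ms is acceptable
        return "object_storage"        # archive: seconds are fine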

Distributed Challenges

Distributed filesystems introduce network latency. A local SSD reads in microseconds; network storage adds milliseconds. Cache aggressively and consider read replicas over shared storage for performance-critical paths.

Practical Validation

Well-architected systems achieve remarkable metrics:

  • Single forecast retrieval: <2ms
  • Cached queries: <0.5ms
  • Storage efficiency: 10-20% of naive approaches
  • Update latency: Minutes, not hours
  • Infrastructure cost: 80-90% reduction

These aren't theoretical limits but actual production measurements from systems serving millions of daily requests.

Beyond Weather: Universal Principles

The core insight transcends meteorology. Any system with these characteristics benefits from similar architecture:

  • Large multi-dimensional datasets
  • Predictable access patterns
  • Frequent partial updates
  • Time-series queries dominating spatial queries

IoT sensor networks, financial tick data, satellite imagery, and monitoring systems all exhibit these patterns. The solution remains consistent: organize storage around consumption patterns, not production patterns.

Key Lessons

  1. Profile before architecting: Measure actual query patterns. Don't assume.
  2. Embrace specialization: General-purpose databases solve general problems generally. Specific problems deserve specific solutions.
  3. Hardware awareness matters: Sequential reads, page sizes, and cache hierarchies aren't implementation details. They're fundamental constraints that shape architecture.
  4. Denormalization is a tool: Storage is cheap. Computation is expensive. Trade space for time when access patterns are predictable.
  5. Update strategies define systems: How data changes is as important as how it's queried. Design for your update pattern, not against it.

Conclusion

Efficient weather data storage isn't about clever algorithms or exotic databases. It's about accepting that data organization must reflect usage patterns. When we store time-series data as time-series, not as grids or relations, performance improvements aren't incremental but transformational. The same principle applies broadly: align storage with access, and complex problems become simple. Fight against this alignment, and simple problems become complex. The choice, and the consequence, is architectural.
