Ideas2IT Built St. Louis University's Geospatial Research Data Platform on AWS, Processing 300TB in Five Days and Cutting Project Costs by 80%

A US research university working with petabytes of geospatial data needed a cost-effective way to ingest, normalize, and query it at scale. Ideas2IT built the AWS-native data warehouse and preprocessing pipeline that made it possible.

Client: St. Louis University
Industry: Education
Service: Data Modernization
Data Processed: 300TB in 5 days
Budget Savings: 80% cost reduction

01 Challenge

St. Louis University's researchers received petabytes of geospatial data in unworkable formats from multiple external sources. No preprocessing layer existed: deduplication, timezone conversion, and partitioning were manual operations. At petabyte scale, running geospatial analysis directly against raw data made every query slow and the compute costs unsustainable.

02 Solution

Ideas2IT built a geo data warehouse on AWS S3 as the structured storage layer, with a preprocessing pipeline handling deduplication, timezone conversion, and geo-partitioning before any analysis workload ran. Apache Sedona on AWS EMR ran geospatial queries against the partitioned warehouse instead of raw data. Spark optimizations across the pipeline reduced both compute time and cost.

03 Outcome

Three hundred terabytes of geospatial data were preprocessed in five days, and project costs dropped 80%. Redshift Spectrum and Apache Sedona running against a partitioned S3 warehouse replaced unoptimized queries against raw data at petabyte scale.

Phase 01

From unworkable raw dumps to a queryable, partitioned data warehouse

Data Ingestion and Geo Warehouse Architecture: building a structured AWS foundation for petabyte-scale geospatial data

The first architectural decision was storage structure. Data arriving from external sources in heterogeneous formats had to be normalized before it could be stored usefully, which meant the preprocessing layer had to be designed before the warehouse.

Ideas2IT built a geo data warehouse on AWS S3 with a preprocessing stage that ran deduplication, timezone conversion, and geospatial partitioning as the ingestion entry point. Every record arriving in the system passed through this normalization pipeline before landing in the warehouse. AWS EMR handled the distributed transformation workload at petabyte scale, partitioning geospatial records to make downstream analysis both faster and cheaper to run.
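The three preprocessing steps can be illustrated with a minimal, single-machine sketch. It is not the production pipeline, which ran as distributed jobs on EMR. The record fields (`device_id`, `ts`, `utc_offset_hours`, `lat`, `lon`), the deduplication key, and the one-degree grid cell used as a partition key are all assumptions made for illustration; the case study does not describe the actual schema or partitioning scheme.

```python
from datetime import datetime, timedelta, timezone
import math


def to_utc(local_ts, utc_offset_hours):
    """Interpret a naive ISO-8601 timestamp at the given UTC offset and return it in UTC."""
    source_tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromisoformat(local_ts).replace(tzinfo=source_tz).astimezone(timezone.utc)


def geo_cell(lat, lon, cell_deg=1.0):
    """Hypothetical partition key: the fixed-size lat/lon grid cell containing the point."""
    return f"{math.floor(lat / cell_deg)}_{math.floor(lon / cell_deg)}"


def preprocess(records):
    """Normalize timestamps, drop duplicates, and group records by (geo cell, UTC date)."""
    seen = set()
    partitions = {}
    for rec in records:
        ts_utc = to_utc(rec["ts"], rec["utc_offset_hours"])
        # Assumed duplicate definition: same device, same instant, same coordinates.
        key = (rec["device_id"], ts_utc, rec["lat"], rec["lon"])
        if key in seen:
            continue
        seen.add(key)
        partition = (geo_cell(rec["lat"], rec["lon"]), ts_utc.date().isoformat())
        partitions.setdefault(partition, []).append({
            "device_id": rec["device_id"],
            "ts_utc": ts_utc.isoformat(),
            "lat": rec["lat"],
            "lon": rec["lon"],
        })
    return partitions


if __name__ == "__main__":
    sample = [
        # The first two rows describe the same instant, reported in different offsets.
        {"device_id": "a", "ts": "2023-03-01T18:30:00", "utc_offset_hours": -6, "lat": 38.627, "lon": -90.199},
        {"device_id": "a", "ts": "2023-03-02T00:30:00", "utc_offset_hours": 0, "lat": 38.627, "lon": -90.199},
        {"device_id": "b", "ts": "2023-03-02T01:00:00", "utc_offset_hours": 0, "lat": 40.7, "lon": -74.0},
    ]
    for partition, rows in preprocess(sample).items():
        print(partition, len(rows))
```

Converting to UTC before deduplicating matters in this sketch: the first two sample rows only compare equal once both timestamps are expressed in the same timezone.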

This Phase Produced

  • Geo data warehouse on AWS S3: structured storage layer with partitioned schema for geospatial records
  • Automated preprocessing pipeline: deduplication, timezone conversion, and partitioning at ingestion
  • AWS EMR transformation cluster: distributed compute for geo-partitioning at petabyte scale
  • Apache Sedona geospatial analysis layer: framework enabling cost-effective spatial queries on S3-partitioned data
  • Redshift Spectrum integration: query layer surfacing partitioned S3 data to analysis workloads
  • Spark optimization configuration: pipeline tuning for compute efficiency and cost reduction

Phase 02

Replacing raw-data queries with a structured, cost-efficient analysis layer

Spark Optimization and Cost Architecture: cutting project costs by 80% across the research workload

With the warehouse and preprocessing pipeline in place, the cost problem was still structural: geospatial queries at petabyte scale are expensive if they run against unoptimized data.

The team configured Apache Sedona running on AWS EMR to query against the S3-partitioned warehouse rather than raw inputs, which eliminated the compute overhead that made large spatial queries unsustainable.

Redshift Spectrum gave research teams SQL access to the partitioned S3 layer without loading data into Redshift itself. Spark optimizations across the full pipeline, covering job configuration, partition sizing, and execution planning, drove the 80% project budget reduction and made 300TB preprocessing achievable within five days.
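To give a sense of scale for the five-day figure and for partition sizing, the arithmetic can be worked through in a few lines. This is a back-of-the-envelope sketch, not the team's sizing method: it assumes decimal units (1 TB = 10^12 bytes) and a 128 MB target partition size, which is Spark's default for `spark.sql.files.maxPartitionBytes`. The case study does not state the values actually used.

```python
import math


def required_throughput_mb_s(total_tb, days):
    """Sustained throughput (MB/s) needed to process total_tb within the given number of days."""
    total_bytes = total_tb * 10**12
    seconds = days * 24 * 3600
    return total_bytes / seconds / 10**6


def partition_count(total_tb, target_partition_mb=128):
    """Number of input partitions if each holds roughly target_partition_mb (assumed value)."""
    total_mb = total_tb * 10**6
    return math.ceil(total_mb / target_partition_mb)


if __name__ == "__main__":
    print(f"{required_throughput_mb_s(300, 5):.1f} MB/s sustained")  # about 694.4 MB/s
    print(f"{partition_count(300):,} partitions at 128 MB each")     # 2,343,750
```

Under these assumptions, 300TB in five days corresponds to roughly 694 MB/s sustained across the cluster, spread over about 2.3 million partition-sized units of work.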

This Phase Produced

  • Apache Sedona on EMR (optimized configuration): geospatial query framework tuned for the partitioned S3 warehouse
  • Redshift Spectrum query access layer: SQL access to S3-partitioned geospatial data without Redshift ingestion
  • Spark job optimization framework: partition sizing, execution planning, and pipeline efficiency configuration
  • Cost reduction architecture: pipeline design producing 80% savings vs. an unoptimized raw-data query approach
  • Geocode and date-range retrieval capability: queryable warehouse enabling combined spatial and temporal data retrieval
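Combined geocode and date-range retrieval is cheap on a partitioned warehouse because a query only has to read the partitions that match its filters. The sketch below shows the idea by listing the S3 prefixes a query would touch. The bucket name, the `geo_cell` and `dt` partition columns, and the Hive-style `key=value` path layout are hypothetical; the case study does not describe the warehouse's actual layout.

```python
from datetime import date, timedelta


def partition_prefixes(cells, start, end, root="s3://example-geo-warehouse/events"):
    """List the partition prefixes covering the given geo cells and inclusive date range."""
    day_count = (end - start).days + 1
    prefixes = []
    for cell in cells:
        for offset in range(day_count):
            day = start + timedelta(days=offset)
            # Assumed Hive-style layout: one directory level per partition column.
            prefixes.append(f"{root}/geo_cell={cell}/dt={day.isoformat()}/")
    return prefixes


if __name__ == "__main__":
    for prefix in partition_prefixes(["38_-91"], date(2023, 3, 1), date(2023, 3, 3)):
        print(prefix)
```

A query for one cell over three days reads three prefixes in this layout, however large the rest of the warehouse is. The same query against unpartitioned raw data would have to scan everything.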

The Outcome

300TB in five days and 80% cost savings: geospatial research analytics made viable at petabyte scale.

| Category | Metric | Description |
|---|---|---|
| Data processing speed | 300TB in 5 days | Geo-partitioning and normalization via AWS EMR replaced sequential manual processing |
| Infrastructure costs | 80% reduction | Apache Sedona querying against partitioned S3 warehouse removed raw-data compute overhead |
| Query efficiency | Significant reduction | Redshift Spectrum and Sedona on partitioned data eliminated full-dataset scan costs |
| Storage architecture | Petabyte-scale | AWS S3 partitioned warehouse structured for combined geocode and date-range retrieval |
| Stack | AWS · Apache Sedona · Spark | Fully cloud-native, optimized for geospatial research workloads |

The 80% cost reduction and five-day preprocessing result were a direct consequence of architectural sequencing: normalizing data at ingestion before any query touched it, then placing an optimized geospatial query layer over a partitioned warehouse instead of raw data. Apache Sedona on AWS EMR running against partitioned S3 is structurally less expensive than unoptimized spatial queries at petabyte scale. The cost followed from the architecture.