Ideas2IT Built St. Louis University's Geospatial Research Data Platform on AWS, Processing 300TB in Five Days and Cutting Project Costs by 80%

A US research university receiving petabytes of geospatial data from external sources needed a cost-effective way to ingest, normalize, and query it at scale. Ideas2IT built the AWS-native data warehouse and preprocessing pipeline that made it possible.

Client: St. Louis University
Industry: Education
Service: Data Modernization
Data Processed: 300TB in 5 days
Budget Savings: 80% cost reduction

01 Challenge

St. Louis University's researchers received petabytes of geospatial data in unworkable formats from multiple external sources. No preprocessing layer existed: deduplication, timezone conversion, and partitioning were manual operations. At petabyte scale, running geospatial analysis directly against raw data made every query slow and the compute costs unsustainable.

02 Solution

Ideas2IT built a geo data warehouse on AWS S3 as the structured storage layer, with a preprocessing pipeline handling deduplication, timezone conversion, and geo-partitioning before any analysis workload ran. Apache Sedona on AWS EMR ran geospatial queries against the partitioned warehouse instead of raw data. Spark optimizations across the pipeline reduced both compute time and cost.

03 Outcome

Three hundred terabytes of geospatial data were preprocessed in five days. Project costs dropped 80%. Redshift Spectrum and Apache Sedona running against a partitioned S3 warehouse replaced unoptimized raw queries at petabyte scale.

Phase 01

From unworkable raw dumps to a queryable, partitioned data warehouse

Data Ingestion and Geo Warehouse Architecture: building a structured AWS foundation for petabyte-scale geospatial data

The first architectural decision was storage structure. Data arriving from external sources in heterogeneous formats had to be normalized before it could be stored usefully, which meant the preprocessing layer had to be designed before the warehouse.

Ideas2IT built a geo data warehouse on AWS S3 with a preprocessing stage that ran deduplication, timezone conversion, and geospatial partitioning as the ingestion entry point. Every record arriving in the system passed through this normalization pipeline before landing in the warehouse. AWS EMR handled the distributed transformation workload at petabyte scale, partitioning geospatial records to make downstream analysis both faster and cheaper to run.
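
The three ingestion steps named above can be illustrated with a minimal, pure-Python sketch. It is not the project's pipeline, which ran as distributed jobs on EMR; the record fields (`device_id`, `ts`, `lat`, `lon`), the one-degree grid cell, and the `cell=/date=` key layout are all assumptions chosen for illustration.

```python
from datetime import datetime, timezone
import math

# Hypothetical raw records: ISO 8601 timestamps carrying their own UTC offsets.
RAW = [
    {"device_id": "a1", "ts": "2023-03-01T08:15:00-06:00", "lat": 38.63, "lon": -90.20},
    # Same event as the first record, reported by another source in UTC.
    {"device_id": "a1", "ts": "2023-03-01T14:15:00+00:00", "lat": 38.63, "lon": -90.20},
    {"device_id": "b2", "ts": "2023-03-01T23:30:00-06:00", "lat": 41.88, "lon": -87.63},
]

def to_utc(ts: str) -> datetime:
    """Parse an offset-aware ISO 8601 string and convert it to UTC."""
    return datetime.fromisoformat(ts).astimezone(timezone.utc)

def grid_cell(lat: float, lon: float, size_deg: float = 1.0) -> str:
    """Assumed partitioning scheme: snap a point to a fixed-size lat/lon grid cell."""
    row = math.floor(lat / size_deg)
    col = math.floor(lon / size_deg)
    return f"{row}_{col}"

def preprocess(records):
    """Normalize timestamps, drop duplicates, and group records by partition key."""
    seen = set()
    partitions = {}
    for rec in records:
        utc = to_utc(rec["ts"])
        # Deduplicate on the normalized record, so the same event reported
        # with two different offsets collapses to one row.
        key = (rec["device_id"], utc, rec["lat"], rec["lon"])
        if key in seen:
            continue
        seen.add(key)
        part = f"cell={grid_cell(rec['lat'], rec['lon'])}/date={utc.date().isoformat()}"
        partitions.setdefault(part, []).append({**rec, "ts": utc.isoformat()})
    return partitions

partitions = preprocess(RAW)
```

In this sketch, deduplication runs after timezone conversion, so the first two records collapse into one. The third record, stamped late on 1 March local time, lands in the 2 March partition once converted to UTC, which is the kind of boundary case that partitioning on raw local timestamps would get wrong.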

This Phase Produced

Geo data warehouse on AWS S3: Structured storage layer with partitioned schema for geospatial records
Automated preprocessing pipeline: Deduplication, timezone conversion, and partitioning at ingestion
AWS EMR transformation cluster: Distributed compute for geo-partitioning at petabyte scale
Apache Sedona geospatial analysis layer: Framework enabling cost-effective spatial queries on S3-partitioned data
Redshift Spectrum integration: Query layer surfacing partitioned S3 data to analysis workloads
Spark optimization configuration: Pipeline tuning for compute efficiency and cost reduction

Phase 02

Replacing raw-data queries with a structured, cost-efficient analysis layer

Spark Optimization and Cost Architecture: reducing project costs by 80% across the research workload

With the warehouse and preprocessing pipeline in place, the cost problem was still structural: geospatial queries at petabyte scale are expensive if they run against unoptimized data.

The team configured Apache Sedona running on AWS EMR to query against the S3-partitioned warehouse rather than raw inputs, which eliminated the compute overhead that made large spatial queries unsustainable.
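
Why querying the partitioned warehouse is cheaper can be shown with a small sketch of partition pruning: a query with a bounding box and a date range only needs the partitions whose keys overlap it. The code below is a pure-Python stand-in, not Sedona's API; the one-degree grid, the `cell=<row>_<col>/date=<YYYY-MM-DD>` key layout, and the function name are hypothetical.

```python
from datetime import date, timedelta
import math

def partitions_for_query(min_lat, min_lon, max_lat, max_lon, start, end, size_deg=1.0):
    """List the (hypothetical) partition keys a bounding-box + date-range query must read."""
    # Grid rows and columns that the bounding box overlaps.
    rows = range(math.floor(min_lat / size_deg), math.floor(max_lat / size_deg) + 1)
    cols = range(math.floor(min_lon / size_deg), math.floor(max_lon / size_deg) + 1)
    # Every calendar day in the inclusive date range.
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    return [
        f"cell={r}_{c}/date={d.isoformat()}"
        for r in rows
        for c in cols
        for d in days
    ]

# A query over the St. Louis area for the first three days of March 2023.
needed = partitions_for_query(38.4, -90.6, 38.9, -89.9, date(2023, 3, 1), date(2023, 3, 3))
```

Here the query touches six partition keys (two grid cells across three days), however large the warehouse as a whole is. A query against raw, unpartitioned input has no such keys to prune on and has to scan everything.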

Redshift Spectrum gave research teams SQL access to the partitioned S3 layer without loading data into Redshift itself. Spark optimizations across the full pipeline, covering job configuration, partition sizing, and execution planning, drove the 80% project budget reduction and made 300TB preprocessing achievable within five days.
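
As a back-of-the-envelope check on what the five-day figure implies, the sketch below computes the sustained throughput needed to process 300TB in five days, and how many partitions that volume splits into at a given target size. Decimal terabytes and the 128 MB target partition size are assumptions for illustration; the values the team actually used are not stated here.

```python
TOTAL_BYTES = 300 * 10**12      # 300 TB, assuming decimal units
WINDOW_SECONDS = 5 * 24 * 3600  # five days

def required_throughput(total_bytes: int, seconds: int) -> float:
    """Sustained bytes per second needed to finish within the window."""
    return total_bytes / seconds

def partition_count(total_bytes: int, target_partition_bytes: int) -> int:
    """Number of partitions at a target size, rounded up."""
    return -(-total_bytes // target_partition_bytes)

throughput_mb_s = required_throughput(TOTAL_BYTES, WINDOW_SECONDS) / 10**6
tb_per_day = TOTAL_BYTES / 10**12 / 5
n_partitions = partition_count(TOTAL_BYTES, 128 * 10**6)  # assumed 128 MB target
```

Under these assumptions the pipeline has to sustain 60 TB per day, or roughly 694 MB/s across the cluster, and the input splits into about 2.3 million partitions of 128 MB each. Numbers of this size are why partition sizing and execution planning have a direct effect on both runtime and cost.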

This Phase Produced

Apache Sedona on EMR, optimized configuration: Geospatial query framework tuned for partitioned S3 warehouse
Redshift Spectrum query access layer: SQL access to S3-partitioned geospatial data without Redshift ingestion
Spark job optimization framework: Partition sizing, execution planning, and pipeline efficiency configuration
Cost reduction architecture: Pipeline design producing 80% savings vs. unoptimized raw-data query approach
Geocode and date-range retrieval capability: Queryable warehouse enabling combined spatial and temporal data retrieval

The Outcome

300TB in five days and 80% cost savings: geospatial research analytics made viable at petabyte scale.

Category	Metric	Description
Data processing speed	300TB in 5 days	Geo-partitioning and normalization via AWS EMR replaced sequential manual processing
Infrastructure costs	80% reduction	Apache Sedona querying against partitioned S3 warehouse removed raw-data compute overhead
Query efficiency	Significant scan-cost reduction	Redshift Spectrum and Sedona on partitioned data eliminated full-dataset scan costs
Storage architecture	Petabyte-scale	AWS S3 partitioned warehouse structured for combined geocode and date-range retrieval
Stack	AWS·Apache Sedona·Spark	Fully cloud-native, optimized for geospatial research workloads

The 80% cost reduction and five-day preprocessing result were a direct consequence of architectural sequencing: normalizing data at ingestion before any query touched it, then placing an optimized geospatial query layer over a partitioned warehouse instead of raw data. Apache Sedona on AWS EMR running against partitioned S3 is structurally less expensive than unoptimized spatial queries at petabyte scale. The cost followed from the architecture.