Ideas2IT Built St. Louis University's Geospatial Research Data Platform on AWS, Processing 300TB in Five Days and Cutting Project Costs by 80%
A US research university generating petabytes of geospatial data needed a cost-effective way to ingest, normalize, and query it at scale. Ideas2IT built the AWS-native data warehouse and preprocessing pipeline that made it possible.


Client
St. Louis University

Industry
Education

Service
Data Modernization

Data Processed
300TB in 5 days

Budget Savings
80% cost reduction
01 Challenge
St. Louis University's researchers received petabytes of geospatial data in unworkable formats from multiple external sources. No preprocessing layer existed: deduplication, timezone conversion, and partitioning were manual operations. At petabyte scale, running geospatial analysis directly against raw data made every query slow and the compute costs unsustainable.
02 Solution
Ideas2IT built a geo data warehouse on AWS S3 as the structured storage layer, with a preprocessing pipeline handling deduplication, timezone conversion, and geo-partitioning before any analysis workload ran. Apache Sedona on AWS EMR ran geospatial queries against the partitioned warehouse instead of raw data. Spark optimizations across the pipeline reduced both compute time and cost.
03 Outcome
Three hundred terabytes of geospatial data were preprocessed in five days, and project costs dropped 80%. Redshift Spectrum and Apache Sedona, running against a partitioned S3 warehouse, replaced unoptimized queries against raw data at petabyte scale.
Phase 01
From unworkable raw dumps to a queryable, partitioned data warehouse
Data Ingestion and Geo Warehouse Architecture: building a structured AWS foundation for petabyte-scale geospatial data
The first architectural decision was storage structure. Data arriving from external sources in heterogeneous formats had to be normalized before it could be stored usefully, which meant the preprocessing layer had to be designed before the warehouse.
Ideas2IT built a geo data warehouse on AWS S3 with a preprocessing stage that ran deduplication, timezone conversion, and geospatial partitioning as the ingestion entry point. Every record arriving in the system passed through this normalization pipeline before landing in the warehouse. AWS EMR handled the distributed transformation workload at petabyte scale, partitioning geospatial records to make downstream analysis both faster and cheaper to run.
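The three preprocessing steps named above can be illustrated with a minimal, single-machine sketch. It is not the project's pipeline, which ran as distributed jobs on AWS EMR. The record fields (`device_id`, `timestamp`, `lat`, `lon`), the duplicate key, and the one-degree grid used as a partition key are all hypothetical choices made for illustration.

```python
from datetime import datetime, timezone
import math


def to_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp that carries a UTC offset and normalize it to UTC."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {timestamp!r}")
    return parsed.astimezone(timezone.utc)


def grid_cell(lat: float, lon: float, cell_deg: float = 1.0) -> str:
    """Map a coordinate to a fixed-size grid cell (assumed partitioning scheme)."""
    row = math.floor((lat + 90.0) / cell_deg)
    col = math.floor((lon + 180.0) / cell_deg)
    return f"r{row}_c{col}"


def preprocess(records):
    """Deduplicate, convert timestamps to UTC, and group records by (cell, UTC date)."""
    seen = set()
    partitions = {}
    for rec in records:
        ts = to_utc(rec["timestamp"])
        # Hypothetical duplicate key: same device, same instant, same position.
        key = (rec["device_id"], ts, rec["lat"], rec["lon"])
        if key in seen:
            continue
        seen.add(key)
        partition = (grid_cell(rec["lat"], rec["lon"]), ts.date().isoformat())
        normalized = {**rec, "timestamp": ts.isoformat()}
        partitions.setdefault(partition, []).append(normalized)
    return partitions


sample = [
    {"device_id": "a", "timestamp": "2023-03-01T18:30:00-06:00", "lat": 38.63, "lon": -90.2},
    # Same instant as the first record, expressed in UTC, so it is dropped as a duplicate.
    {"device_id": "a", "timestamp": "2023-03-02T00:30:00+00:00", "lat": 38.63, "lon": -90.2},
    {"device_id": "b", "timestamp": "2023-03-01T12:00:00+00:00", "lat": 40.7, "lon": -74.0},
]
result = preprocess(sample)
```

Converting to UTC before deduplicating matters in this sketch: the first two sample records only compare as equal once both timestamps are expressed in the same timezone.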
This Phase Produced
- Geo data warehouse on AWS S3: structured storage layer with partitioned schema for geospatial records
- Automated preprocessing pipeline: deduplication, timezone conversion, and partitioning at ingestion
- AWS EMR transformation cluster: distributed compute for geo-partitioning at petabyte scale
- Apache Sedona geospatial analysis layer: framework enabling cost-effective spatial queries on S3-partitioned data
- Redshift Spectrum integration: query layer surfacing partitioned S3 data to analysis workloads
- Spark optimization configuration: pipeline tuning for compute efficiency and cost reduction
Phase 02
Replacing raw-data queries with a structured, cost-efficient analysis layer
Spark Optimization and Cost Architecture: reducing project costs by 80% across the research workload
With the warehouse and preprocessing pipeline in place, the cost problem was still structural: geospatial queries at petabyte scale are expensive if they run against unoptimized data.
The team configured Apache Sedona on AWS EMR to query the partitioned S3 warehouse rather than raw inputs, which removed the compute overhead that had made large spatial queries unsustainable.
Redshift Spectrum gave research teams SQL access to the partitioned S3 layer without loading data into Redshift itself. Spark optimizations across the full pipeline, covering job configuration, partition sizing, and execution planning, drove the 80% reduction in project costs and made preprocessing 300TB achievable within five days.
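To put the five-day figure and the partition-sizing work in perspective, the back-of-the-envelope calculation below derives the sustained throughput implied by 300TB in five days and the number of input partitions a given target size would produce. The 128 MB target is an assumption (a commonly used starting point for Spark input splits), not a value reported by the project, and decimal units are used throughout.

```python
TB = 10**12  # decimal terabyte, in bytes
MB = 10**6   # decimal megabyte, in bytes


def sustained_throughput_mb_s(total_bytes: int, days: float) -> float:
    """Average throughput in MB/s needed to process total_bytes within the given days."""
    return total_bytes / (days * 24 * 3600) / MB


def partition_count(total_bytes: int, target_partition_bytes: int) -> int:
    """Number of partitions at the target size, rounded up (ceiling division)."""
    return -(-total_bytes // target_partition_bytes)


total = 300 * TB
throughput = sustained_throughput_mb_s(total, days=5)   # about 694 MB/s
partitions = partition_count(total, 128 * MB)           # assumed 128 MB target

print(f"{throughput:.0f} MB/s sustained, {partitions:,} partitions")
```

Under these assumptions the pipeline has to average roughly 694 MB/s for five days straight. The partition count scales inversely with the target size, which is why partition sizing is a direct lever on scheduling overhead and cost.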
This Phase Produced
- Apache Sedona on EMR, optimized configuration: geospatial query framework tuned for partitioned S3 warehouse
- Redshift Spectrum query access layer: SQL access to S3-partitioned geospatial data without Redshift ingestion
- Spark job optimization framework: partition sizing, execution planning, and pipeline efficiency configuration
- Cost reduction architecture: pipeline design producing 80% savings vs. unoptimized raw-data query approach
- Geocode and date-range retrieval capability: queryable warehouse enabling combined spatial and temporal data retrieval
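The combined geocode and date-range retrieval listed above is where a partitioned layout pays off: a query only needs to read the partitions that overlap its bounding box and dates. The sketch below shows the idea under an assumed layout. The bucket name, the `cell=.../date=.../` prefix scheme, and the one-degree grid are hypothetical and are not taken from the project.

```python
from datetime import date, timedelta
import math


def cells_in_bbox(min_lat, min_lon, max_lat, max_lon, cell_deg=1.0):
    """List the grid cells (assumed one-degree grid) that overlap a bounding box."""
    rows = range(math.floor((min_lat + 90) / cell_deg),
                 math.floor((max_lat + 90) / cell_deg) + 1)
    cols = range(math.floor((min_lon + 180) / cell_deg),
                 math.floor((max_lon + 180) / cell_deg) + 1)
    return [f"r{r}_c{c}" for r in rows for c in cols]


def dates_in_range(start: date, end: date):
    """List every calendar day from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def partition_prefixes(bbox, start, end, root="s3://example-geo-warehouse/events"):
    """Build the S3 prefixes a query must scan; all other partitions are skipped."""
    return [
        f"{root}/cell={cell}/date={day.isoformat()}/"
        for cell in cells_in_bbox(*bbox)
        for day in dates_in_range(start, end)
    ]


prefixes = partition_prefixes(
    bbox=(38.2, -90.8, 39.1, -90.1),  # min_lat, min_lon, max_lat, max_lon
    start=date(2023, 3, 1),
    end=date(2023, 3, 3),
)
for prefix in prefixes:
    print(prefix)
```

In this example the query touches six prefixes (two cells across three days) instead of the whole dataset, which is the pruning effect that engines such as Spark and Redshift Spectrum can exploit when filters match the partition columns.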
The Outcome
300TB in five days and 80% cost savings: geospatial research analytics made viable at petabyte scale.
The 80% cost reduction and five-day preprocessing result were a direct consequence of architectural sequencing: normalizing data at ingestion before any query touched it, then placing an optimized geospatial query layer over a partitioned warehouse instead of raw data. Apache Sedona on AWS EMR running against partitioned S3 is structurally less expensive than unoptimized spatial queries at petabyte scale. The cost followed from the architecture.