Layering geospatial datasets

Layering geospatial data isn't just about stacking information; it's about discovering hidden insights with the potential to transform your business. Picture a seamless blend of locations, roads, people and vehicle movements. When these diverse datasets are linked reliably, a whole new world of understanding unfolds. By cross-referencing datasets, you're not only adding context but also bringing flat data files to life. Each data point can be contextualised with human mobility trends, demographics, environmental data, POI data and financial data. What were simply dots on a map evolve into valuable location intelligence, highlighting patterns of behaviour and trends over time.

Sydney mobility data overlaid with POI data

In addition to adding context, data layering can improve data efficacy. Cross-referencing may enable blank fields to be filled in or inaccurate fields to be validated. This richer, more reliable data source gives data professionals a stronger foundation for their analysis and more confidence in their results.

Geospatial datasets can be joined directly on common fields such as latitude and longitude (often referred to as 'same-as' fields). Where the two datasets also share a date/timestamp field, the joining key can span both space and time. A spatial join is the method for bringing together these different domains and layers of information. While many implementations are straightforward, several challenges can arise when layering data.
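To make the idea concrete, here is a minimal sketch using geopandas, joining vehicle pings to any site they fall within roughly 50 m of. The coordinates, column names and buffer size are illustrative assumptions, not a prescribed workflow.

    import geopandas as gpd
    import pandas as pd

    # Toy inputs: two plain DataFrames with latitude/longitude columns.
    pings = pd.DataFrame({
        "vehicle_id": ["v1", "v2"],
        "lat": [-33.8688, -33.9500],
        "lon": [151.2093, 151.2500],
        "ts": pd.to_datetime(["2024-05-01 09:02", "2024-05-01 09:15"]),
    })
    sites = pd.DataFrame({"site_id": ["s1"], "lat": [-33.8687], "lon": [151.2094]})

    # Promote both to GeoDataFrames with WGS84 point geometries.
    pings_gdf = gpd.GeoDataFrame(
        pings, geometry=gpd.points_from_xy(pings.lon, pings.lat), crs="EPSG:4326")
    sites_gdf = gpd.GeoDataFrame(
        sites, geometry=gpd.points_from_xy(sites.lon, sites.lat), crs="EPSG:4326")

    # Buffer each site by ~50 m in a metric CRS, then keep pings that fall
    # inside a buffer. EPSG:3857 distances are only approximate at this
    # latitude; a local projected CRS would be more accurate.
    sites_zone = sites_gdf.to_crs(epsg=3857)
    sites_zone["geometry"] = sites_zone.geometry.buffer(50)
    joined = gpd.sjoin(pings_gdf.to_crs(epsg=3857), sites_zone,
                       how="inner", predicate="within")
    print(joined[["vehicle_id", "site_id", "ts"]])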

Challenges in joining geospatial datasets

  1. Nearest - sometimes coordinates may not map directly on top of each other. For example, the latitude and longitude of a delivery van may differ slightly from the delivery address if the driver could not park directly outside. Analysts need to define a 'near-enough' radius that they consider acceptable to count as 'delivered', using functions such as geopandas.sjoin_nearest (see the sketch after this list).
  2. Automation vs heuristics - rules for determining a 'true positive' correlation between two objects in space and time are often deeply rooted in domain-specific knowledge, and there can be many edge cases. There is often a trade-off between automation and scalability on the one hand, and accuracy and sensitivity to the domain on the other.
  3. Not enough correlated data points - analysts often have to consider how many data points per join candidate constitute a reliably accurate interaction.
  4. Too many data points - it can be difficult to match up interactions where several candidates are densely clustered within a small area. There are different approaches to this problem. Where there is more than one potential match, you can use a confidence score and apply additional business rules and logic. Alternatively, take the first or most complete result and discard the rest, tagging it so that it can be filtered out and does not pollute 'good' data.
  5. No ground truth - especially when working with third-party data, it can be impossible to establish ground-truth labels for your spatiotemporal data. This can make traditional supervised machine learning techniques less practical and more costly to pursue.
  6. Time differences - while a delivery may be confirmed, GPS data points may be recorded at a different time. In this case, analysts need to specify an acceptable time window that constitutes a likely delivery (the sketch after this list combines such a window with a nearest join).
  7. Different attributes - where the fields to be joined differ in their attributes between the two datasets, it can be challenging for data professionals to create a 'correlated relationship'. The UK Geospatial Commission offers a suggested model for defining a correlation.
  8. Data scale - moving devices such as vehicle GPS trackers and smartphones that ping multiple times a minute can create billions of records over a year. Joining and querying across these datasets on a regular basis takes time and introduces latency between reality and insights. General System offers a solution for organizations that require geospatial or spatiotemporal reporting at high frequency and/or at scale.
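As a sketch of points 1 and 6 combined, the following uses geopandas.sjoin_nearest with a capped search radius, then filters the candidate matches to an acceptable time window. The toy data, 75 m radius and 10-minute window are illustrative assumptions.

    import geopandas as gpd
    import pandas as pd

    # Toy data in a metric CRS (EPSG:27700); ids and coordinates are made up.
    vans = gpd.GeoDataFrame(
        {"van_id": ["v1", "v2"],
         "ts": pd.to_datetime(["2024-05-01 09:02", "2024-05-01 11:40"])},
        geometry=gpd.points_from_xy([530010, 531500], [180020, 182000]),
        crs="EPSG:27700")
    orders = gpd.GeoDataFrame(
        {"order_id": ["o1"], "ts": pd.to_datetime(["2024-05-01 09:00"])},
        geometry=gpd.points_from_xy([530050], [180040]),
        crs="EPSG:27700")

    MAX_DISTANCE_M = 75                   # 'near-enough' delivery radius
    MAX_TIME_GAP = pd.Timedelta("10min")  # acceptable timestamp difference

    # Nearest join capped at the radius: pings with no order within 75 m
    # simply drop out. The shared "ts" column gets the suffixes below.
    candidates = gpd.sjoin_nearest(
        vans, orders, max_distance=MAX_DISTANCE_M,
        distance_col="dist_m", lsuffix="ping", rsuffix="order")

    # Apply the time window from point 6: space and time must both agree.
    confirmed = candidates[
        (candidates["ts_ping"] - candidates["ts_order"]).abs() <= MAX_TIME_GAP]
    print(confirmed[["van_id", "order_id", "dist_m"]])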

Cross-referencing real-time streaming data with historic batch datasets

By merging live data with historic data, companies can react instantly to operational shifts through real-time alerts powered by streaming data. Comparing current events with historic spatial and temporal behaviour and trends adds context that enables organizations to identify insights such as unproductive dwell time, a surge in activity or fraudulent behaviour as it occurs.
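As a rough illustration of the pattern, a live batch can be checked against a per-site historical baseline; the schema and the three-sigma threshold here are toy assumptions, not a recommendation.

    import pandas as pd

    # Hypothetical schema: historic dwell records plus a live batch,
    # each with a site id and a dwell duration in minutes.
    historic = pd.DataFrame({
        "site": ["a", "a", "a", "b", "b"],
        "dwell_min": [10, 12, 11, 30, 28]})
    live = pd.DataFrame({"site": ["a", "b"], "dwell_min": [35, 29]})

    # Per-site baseline from history: mean and standard deviation.
    baseline = historic.groupby("site")["dwell_min"].agg(["mean", "std"])

    # Flag live dwells more than three standard deviations above baseline.
    checked = live.join(baseline, on="site")
    checked["alert"] = checked["dwell_min"] > checked["mean"] + 3 * checked["std"]
    print(checked)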


Use case 1 - Logistics company

A logistics company with a fleet of vehicles wants to understand field productivity.

Ordinarily, it can be quite a challenge to apportion percentages of the day to working on site, travelling between sites, loading tools at the depot, refuelling, breaks and unproductive idling, which may have been caused by last-minute cancellations.

Each vehicle is fitted with a GPS tracker that transmits ping data of location and time, which may also include speed, driver ID and other fleet metrics. Customer work orders include latitude and longitude alongside the address to help engineers locate exactly where a property is.

We can cross-reference the GPS coordinates with customer data and warehouse locations to accurately calculate time at the depot, time on site and travel time. By overlaying point-of-interest data, such as Overture, we can surface additional insights such as refuelling stops.
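One way to approximate that split, sketched below with made-up geometries and radii, is to buffer the depot and job sites and tag each ping by the zone it falls in; if pings arrive on a fixed cadence, ping counts per bucket approximate time spent.

    import geopandas as gpd
    import pandas as pd
    from shapely.geometry import Point

    # Toy data in a metric CRS; ids, coordinates and radii are illustrative.
    pings_gdf = gpd.GeoDataFrame(
        {"ts": pd.date_range("2024-05-01 08:00", periods=4, freq="30min")},
        geometry=[Point(0, 0), Point(40, 0), Point(500, 500), Point(2000, 2000)],
        crs="EPSG:27700")
    depot_gdf = gpd.GeoDataFrame(geometry=[Point(10, 0)], crs="EPSG:27700")
    jobs_gdf = gpd.GeoDataFrame(geometry=[Point(2010, 2000)], crs="EPSG:27700")

    def tag_pings(pings, places, radius_m, label):
        """Mark pings that fall within radius_m of any place geometry."""
        zones = places[["geometry"]].copy()
        zones["geometry"] = zones.geometry.buffer(radius_m)
        hits = gpd.sjoin(pings, zones, how="inner", predicate="within")
        pings.loc[hits.index.unique(), "activity"] = label

    pings_gdf["activity"] = "travel"          # default bucket
    tag_pings(pings_gdf, depot_gdf, 100, "depot")
    tag_pings(pings_gdf, jobs_gdf, 50, "onsite")

    # Share of pings (and so, approximately, of time) per activity.
    print(pings_gdf["activity"].value_counts(normalize=True))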

Use case 2 - Food delivery company

A food delivery company has a fleet of thousands of couriers, each with a smartphone and a mobile app that records their position. These are typically very noisy datasets. The company wishes to cross-reference customer orders with restaurant or supermarket pickup locations to balance supply and demand and to confirm that orders were delivered on time.
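Before joining, that noise usually needs damping. One common, simple approach, sketched here with made-up courier pings, is a rolling median per courier, which suppresses isolated GPS jumps without shifting the track much.

    import pandas as pd

    # Hypothetical courier pings; the third latitude is a GPS outlier.
    pings = pd.DataFrame({
        "courier": ["c1"] * 5,
        "ts": pd.date_range("2024-05-01 12:00", periods=5, freq="10s"),
        "lat": [51.5074, 51.5075, 51.5300, 51.5076, 51.5077],
        "lon": [-0.1278, -0.1279, -0.1280, -0.1281, -0.1282]})

    smoothed = pings.sort_values(["courier", "ts"]).copy()
    for col in ("lat", "lon"):
        smoothed[col] = smoothed.groupby("courier")[col].transform(
            lambda s: s.rolling(window=3, center=True, min_periods=1).median())
    print(smoothed)  # the 51.53 spike is replaced by its neighbours' median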

The General System platform is very well suited for these types of workloads. 

About the General System platform

At the core of the General System platform is a multi-attribute index and innovative storage features designed for large-scale, multi-dimensional analysis. Datasets, whether batch or live stream, can be stored separately or ingested into a single table, where the multi-attribute index on ID, latitude, longitude and time ordering enables exceptionally fast querying across the datasets, even at massive scale. There is no need to work with separate datasets, partition data, recluster and tune.

The solution is ideal for organizations that have an ongoing need to cross-reference historic or other datasets for context.

If you have a requirement to join and query across geospatial data frequently or at scale, find out more.

Developers website

Demo

Get started free - request an API key

Lisa Hutt

Chief Marketing Officer

