
FAQ
Product
What is Data Flow Index (DFI)?
DFI is a novel technology that combines features of a data store with multi-attribute indexing, to create a solution that enables users to efficiently process massive, real-time datasets. Users can ingest and query the data simultaneously, and so extract more value, faster.
The first iteration of DFI supports spatiotemporal (3D and time) data and is available now. Advanced features include complex spatial queries such as points-in-polygon, proximity and (soon) nearest neighbours, as well as aggregation analysis.
Support for additional, i.e. non-spatial, data types will be available soon in DFI.
Is DFI a database?
Whilst DFI is much like a database, and carries out many of the same functions, we do not call it one at this stage of our development. It is designed to supplement your database, and can be integrated with other components such as a streaming database, a data lake or a data warehouse to deliver end-to-end solutions.
Is DFI a message broker like Kafka?
There are many similarities between DFI and Kafka. For instance, Kafka does not replace the downstream pipeline; specifically, it does not remove the need for a database. The same is currently true for DFI.
However, there are also key differences:
- While Kafka does store data, it is not natively designed to support spatiotemporal queries. DFI supports spatiotemporal queries efficiently.
- Kafka is not designed as a data store for massive volumes, whereas DFI can ingest, index, store and query data at scale.
- DFI isn’t designed to support messages. It cannot guarantee the delivery of a message across a distributed architecture, as it does not aim to solve this problem.
What type of queries can you run in DFI?
- All points in polygon or in bounding box
- Count of points in polygon or in bounding box
- Unique sensors in polygon or in bounding box
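To make the semantics of these three query types concrete, here is a minimal in-memory sketch in Python. It is not the DFI API: the `Record` type, the function names and the sample data are hypothetical, and the planar ray-casting test with a linear scan is an assumption chosen for illustration only (DFI answers these queries from its index and computes on a sphere).

```python
from typing import List, NamedTuple, Sequence, Set, Tuple


class Record(NamedTuple):
    """Hypothetical record shape, used only for this illustration."""
    entity_id: int
    lon: float
    lat: float
    timestamp: float


Polygon = Sequence[Tuple[float, float]]  # (lon, lat) vertices, not repeated at the end


def point_in_polygon(lon: float, lat: float, polygon: Polygon) -> bool:
    """Planar ray-casting test: count the edges crossed by a ray going east."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def points_in_polygon(records: Sequence[Record], polygon: Polygon) -> List[Record]:
    """All points in polygon."""
    return [r for r in records if point_in_polygon(r.lon, r.lat, polygon)]


def count_in_polygon(records: Sequence[Record], polygon: Polygon) -> int:
    """Count of points in polygon."""
    return len(points_in_polygon(records, polygon))


def unique_entities_in_polygon(records: Sequence[Record], polygon: Polygon) -> Set[int]:
    """Unique sensors (entity ids) in polygon."""
    return {r.entity_id for r in points_in_polygon(records, polygon)}


# Made-up sample data: three points inside the square, one outside.
records = [
    Record(1, 2.0, 3.0, 100.0),
    Record(1, 4.0, 5.0, 101.0),
    Record(2, 9.0, 9.0, 102.0),
    Record(3, 20.0, 5.0, 103.0),
]
square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
```

A bounding box can be treated as a four-vertex polygon, as `square` is here, so the same three functions cover both variants of each query.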
What type of data can be stored in DFI?
DFI can hold spatial or spatiotemporal data. However, it really shines, and outperforms other solutions, when working with data that has the following characteristics:
- It is spatiotemporal in nature, in other words, it models entities that move through space over time
- There are high overall volumes of data that need to be indexed and queried, typically 10 billion records or more; or there is a need to ingest at a high throughput rate, i.e. 1 million records or more per second
- The data ingested must be immediately queryable, with results available in real time
- Data needs to be queried with the well-known “points in polygon” framework. This means that, given an area of interest (a building, a geographic feature, a country’s borders, …), users need to identify which entities were in it at some point in time. This query framework enables a large number of use cases.
What does a DFI schema look like?
You can think of a DFI instance as a large table with the following columns:
- Geospatial Point (WGS84, 3-dimensional)
- Entity Id (up to 128 bits)
- Timestamp
- Optional Payload (document-structured data)
Additional non-spatiotemporal schemas, which will allow querying across multiple data types, are on our roadmap.
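As a rough illustration of the four columns above, one row could be modelled as the Python dataclass below. This is a sketch, not DFI's actual data model: the class name, the field names, the use of altitude in metres as the third dimension, and the treatment of the entity id as an unsigned integer are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Optional

# "Up to 128 bits" read here as an unsigned 128-bit integer (assumption).
MAX_ENTITY_ID = 2**128 - 1


@dataclass(frozen=True)
class DfiRecord:
    """Hypothetical in-memory view of one row; not the DFI wire format."""
    lon: float           # WGS84 longitude in degrees
    lat: float           # WGS84 latitude in degrees
    alt: float           # third dimension, assumed to be altitude in metres
    entity_id: int       # identifier of the moving entity
    timestamp: datetime  # time of the observation
    payload: Optional[Dict[str, Any]] = None  # optional document-structured data

    def __post_init__(self) -> None:
        # Basic range checks so malformed rows fail early.
        if not -180.0 <= self.lon <= 180.0:
            raise ValueError(f"longitude out of range: {self.lon}")
        if not -90.0 <= self.lat <= 90.0:
            raise ValueError(f"latitude out of range: {self.lat}")
        if not 0 <= self.entity_id <= MAX_ENTITY_ID:
            raise ValueError("entity_id does not fit in 128 bits")


# Made-up example row.
example = DfiRecord(
    lon=-0.1276,
    lat=51.5072,
    alt=35.0,
    entity_id=42,
    timestamp=datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc),
    payload={"speed_kmh": 48.5},
)
```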
How much data can I store? (What data volumes can a DFI instance support?)
- We have successfully tested datasets with over 100 billion records on a single server
- Data is stored on local disk, so the maximum size of the dataset is constrained solely by the hardware available
- DFI is designed to theoretically handle up to 1 trillion records
Do you offer professional engineering services support with setting up?
Yes, our customer engineers will be available to support DFI installation and setup.
What level of training will be needed to navigate the platform?
The main training needed is to become familiar with the API, which will be documented following the OpenAPI standard (https://www.openapis.org/).
We will build an admin console as a web application for monitoring. The index is self-managed.
Is DFI open source? Will you open source it?
No, we do not have any plans to open source DFI.
How long does it take for the ingested data to be queryable?
Each individual record can be queried immediately after it is ingested.
Out of Order Events
Currently, DFI assumes the data is approximately temporally ordered at ingestion, and is designed with the assumption that some events may arrive late. Modestly out-of-order events, such as some events being delayed, will have little to no performance impact, depending on the size and frequency of the delays. At the limit, purely random temporal ordering will adversely affect the performance of queries with temporal constraints.
Ingestion in purely random temporal order (maximum temporal disorder) does not affect performance of geospatial and entity search.
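If you want a rough idea of how out of order a feed is before ingesting it, a simple check such as the one below may help. It is a sketch under stated assumptions: the function name and the two metrics (fraction of late events, worst-case lateness) are our own illustration, not a DFI feature, and DFI does not publish thresholds for what counts as "modestly" out of order.

```python
from typing import Iterable, Tuple


def temporal_disorder(timestamps: Iterable[float]) -> Tuple[float, float]:
    """Return (fraction of late events, maximum lateness).

    An event is counted as late when its timestamp is earlier than the
    largest timestamp seen so far in arrival order. Lateness is measured
    in the same units as the timestamps.
    """
    running_max = None
    late = 0
    total = 0
    max_lateness = 0.0
    for ts in timestamps:
        total += 1
        if running_max is None or ts >= running_max:
            running_max = ts
        else:
            late += 1
            max_lateness = max(max_lateness, running_max - ts)
    fraction = late / total if total else 0.0
    return fraction, max_lateness


# Made-up arrival sequences: one perfectly ordered, one with a single late event.
ordered = temporal_disorder([1.0, 2.0, 3.0, 4.0])
slightly_late = temporal_disorder([1.0, 3.0, 2.0, 4.0, 5.0])
```

A small late fraction with short lateness corresponds to the "modestly out-of-order" case described above; a late fraction approaching one half on a shuffled feed corresponds to the random-order limit.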
Do you support rollups and summaries?
These can be generated from the underlying raw data, but they are not currently computed automatically. Automatic computation is on our roadmap.
Is DFI fault tolerant?
At the moment, if a failure occurs, the data in DFI needs to be reloaded.
Is there going to be a mechanism to suggest improvements or features?
Yes, we will use a customer-facing feature of ProdPad to let users submit feature ideas. It will be accessible from the web API documentation and the admin console.
Technology
How does DFI fit in my architecture?
Our customers tend to use one of the following consumption models:
- In the first model, the customer has a very large stream of spatiotemporal data. With DFI they can:
  - (1) filter data to efficiently process what is relevant downstream
  - (2) create new real-time apps that can interrogate the data immediately (e.g. logistics customers can query the real-time movement of stock)
  - (3) raise real-time events to trigger new processes
- In the second model, the customer has a large data lake or warehouse of spatiotemporal data, which today they cannot efficiently leverage. With DFI they can:
  - (1) filter data to efficiently process what is relevant downstream
  - (2) build interactive applications or dashboards to discover new insights and value from their data

Do you offer DFI as SaaS?
Yes, we do. We use AWS as our underlying cloud provider.
Is there an on-premise offering? Our core data is not available on a public cloud.
DFI is designed so that it can be deployed on-premise or at the edge in future; both are part of our long-term roadmap.
I am on Azure or GCP, can I use DFI?
We only support AWS at the moment.
What happens when the DFI index is at full capacity?
DFI will enter “read only” mode: ingestion of new data is suspended, but existing data can still be queried.
What spatial reference systems are supported for geospatial queries?
We use the WGS84 coordinate system globally. All intersection and distance operations are computed on a sphere.
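As an illustration of what a distance computed on a sphere looks like, the sketch below uses the haversine formula on WGS84 latitude/longitude pairs. The function name and the mean Earth radius constant are assumptions; this FAQ does not state which formula or radius DFI uses internally.

```python
import math

# Mean Earth radius in metres (assumed value; not specified by DFI).
EARTH_RADIUS_M = 6_371_008.8


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    # Clamp to 1.0 to guard against floating-point rounding near antipodes.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


# A quarter of the way around the equator.
quarter_equator = haversine_m(0.0, 0.0, 0.0, 90.0)
```

A proximity query can then be expressed as "all points whose `haversine_m` distance to a centre is below a given radius", again as an illustration rather than a description of DFI's implementation.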
I don’t need to ingest vast amounts of data (e.g. fewer than 1 billion records), but I still need to be able to query diverse types of data in real time. Can I still do this with you?
Yes, you can. You will still benefit from low-latency queries, from being able to ingest and query data without delay, and from potentially lower running costs.
Is the data persisted? Can I use the DFI as my source of truth?
It is not currently persisted, but enabling persistence is on our roadmap.
Where does the data have to be stored or pulled from to be able to be queried in DFI?
To be queried, the data must be ingested into the index. For optimal performance, the server expects storage to be on local disks.
What is the secret to high performance?
- DFI index-organises data models across multiple attributes using a compressive high-dimensionality data structure that can directly represent complex data types and relationships in the ingest stream.
- DFI contains a storage engine purpose-built for petabyte-scale storage densities, adaptive resharding, and high-dimensionality access methods. These capabilities are not commonly found in open source but are critical for spatiotemporal workloads and the type of indexing used.