Introduction
One of our goals for The Graph is to extend it with additional data services: services that are core to the ecosystem, or that extend what The Graph offers today to cover use cases that subgraphs are not well suited for. The first logical candidates for new data services are:
- Firehose — Needed for efficiently and reliably extracting blockchain data. In combination with the EVM executor that is also being worked on, this allows indexers to run instrumented full nodes instead of archive nodes, which in turn makes the ecosystem more flexible, independent and decentralized. Firehose is useful both for external consumers and indexers. Some indexers might specialize in running and serving Firehose as a service, while other indexers might specialize in indexing and serving subgraphs.
- Substreams — Needed for highly parallelized, fast processing of blockchain data. Substreams can replace and speed up heavy processing currently done in subgraphs and then feed into subgraphs, but they can also be consumed directly for use cases where the subgraph GraphQL API requires workarounds and inefficiencies, like streaming data pipelines and analytics. Some indexers might specialize in processing and serving substreams as a service, other indexers might specialize in subgraphs that consume these substreams.
- “Native” Blockchain APIs — With hundreds of indexers ideally running their own blockchain nodes, it becomes very feasible to also serve standard blockchain APIs via The Graph.
Work Required
In order to support such data services on the network, two things are necessary:
- Protocol Support — This is covered in more detail in GIP-0041, “A World Of Data Services”.
- Like with subgraphs, consumers (developers or end-users) need to be able to discover Firehoses and substreams worth consuming or building on top of. Indexers need to allocate towards Firehoses and substreams to claim rewards on chain for the service that they provide to consumers. Different ways of dealing with inconsistent or bad data (an activity also referred to as arbitration) may be necessary for different types of data services.
- Most of the above exists today, but only for subgraphs. Some of the protocol contracts, especially those related to indexer rewards, are agnostic to the type of data service, but the contracts that enable discoverability and arbitration in particular need to be expanded.
- Offchain Support — Different data services have different characteristics, especially when it comes to the interactions between client and indexer. Subgraphs and native blockchain APIs follow the typical HTTP request/response pattern. Firehose and substreams on the other hand involve long-lived connections and streaming. To account for these differences and the new data service APIs in general, a few components need to be defined, implemented and running offchain:
- Indexers
- Indexers need to be able to announce what Firehoses and substream endpoints they are serving and ideally what substreams / substream modules they have already preprocessed and to what extent. In the network, this is done through a combination of on-chain allocations and the indexers’ off-chain status API. These need to be extended to support new data service types.
- Indexers also need to announce prices for the different data services via their off-chain status API.
- Lastly, indexers currently have a single public, protocol-aware surface layer (indexer-service). This allows subgraph processing and querying (via graph-node) to be almost entirely ignorant of the protocol. This separation of concerns is useful for many reasons. However, unlike subgraph queries, which all involve query fee receipts and therefore need to go through indexer-service, longer-lived connections such as those for Firehose or substreams could be designed differently, for instance with a separate setup procedure that establishes the connection up front.
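As a sketch of what an extended off-chain status API payload might look like, the snippet below models an indexer announcing a data service endpoint and its price. All field names and the pricing unit are illustrative assumptions, not a finalized schema.

```typescript
// Hypothetical shape for entries in an indexer's extended status API.
// Field names here are assumptions for illustration only.
interface DataServiceStatus {
  serviceType: "subgraph" | "firehose" | "substreams";
  serviceApiVersion: string;
  chainID: string;      // e.g. CAIP-2 style, "eip155:1" for Ethereum mainnet
  endpoint: string;     // where consumers connect
  pricePerUnit: string; // price in GRT wei, as a decimal string
  unit: "query" | "byte" | "block";
}

// Example: an indexer announcing a Firehose endpoint for Ethereum mainnet,
// priced per block streamed (the unit choice is itself an assumption).
const status: DataServiceStatus[] = [
  {
    serviceType: "firehose",
    serviceApiVersion: "1.1.0",
    chainID: "eip155:1",
    endpoint: "https://indexer.example.com/firehose",
    pricePerUnit: "400000000000000",
    unit: "block",
  },
];

console.log(status[0].serviceType); // prints "firehose"
```

A structured payload like this would let clients and gateways compare indexers across service types with the same discovery logic they already use for subgraphs.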
- Gateways
- Some data services, especially those that involve many short-lived requests, will require gateways for high quality of service, for facilitating query fee microtransactions to indexers, and for making some types of queries deterministic (a consumer convenience). For these types of data services, the gateway software will need to be extended.
- Other data services, like Firehose and substreams, may not require constant re-evaluation of what the best indexer is to serve a long-lived stream of data.
- Clients
- Subgraphs are typically consumed from web or mobile dApps, via GraphClient or any other GraphQL client. Data services like Firehose and substreams will mostly be consumed from other environments:
- Indexers will connect to them via graph-node to feed the data into subgraphs.
- Consumers will connect to them via command line tools, e.g. to feed into streaming data pipelines.
- Fewer consumers will consume Firehose and substreams directly from web or mobile dApps, but that’s not to say it shouldn’t be possible.
- Either way, for data services like Firehose and substreams the above means that we’ll need to do some work on the client side in order to:
- Identify indexers that can serve a Firehose for a specific chain, a specific substreams endpoint for a specific chain, or that even have a specific substream preprocessed and can continue serving it quickly. With subgraphs, the gateways do this; with Firehose and substreams, this could move into the client, though it doesn’t necessarily have to.
- Connect to an indexer, send query fee receipts its way (potentially at setup time as well as over time in order to top up and keep the stream alive) and consume the data from it.
- Feed the substream data into other data pipelines.
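The client-side steps above can be sketched as follows. The indexer-selection and receipt arithmetic are deliberately simplified stand-ins: a real client would query on-chain allocations and status APIs, and the receipt mechanics here (a fixed number of blocks covered per receipt) are an assumption for illustration.

```typescript
// Sketch of the client-side flow: pick an indexer, then keep a long-lived
// stream alive by topping up fee receipts over time. Names are hypothetical.

interface IndexerInfo {
  url: string;
  chainID: string;
  serviceType: "firehose" | "substreams";
}

// Step 1: pick an indexer serving the requested service for a chain.
// A real client would consult allocations and status APIs; here we filter a list.
function selectIndexer(
  candidates: IndexerInfo[],
  serviceType: "firehose" | "substreams",
  chainID: string
): IndexerInfo | undefined {
  return candidates.find(
    (c) => c.serviceType === serviceType && c.chainID === chainID
  );
}

// Step 2: with long-lived streams, receipts are attached over time rather
// than per request; this computes how many receipts cover a stream so far.
function receiptsNeeded(blocksStreamed: number, blocksPerReceipt: number): number {
  return Math.ceil(blocksStreamed / blocksPerReceipt);
}

const candidates: IndexerInfo[] = [
  { url: "https://a.example", chainID: "eip155:1", serviceType: "substreams" },
  { url: "https://b.example", chainID: "eip155:137", serviceType: "firehose" },
];

const chosen = selectIndexer(candidates, "substreams", "eip155:1");
console.log(chosen?.url);                // "https://a.example"
console.log(receiptsNeeded(2500, 1000)); // 3 receipts cover 2500 blocks
```

The point of the top-up model is that payment follows consumption: the stream stays alive as long as receipts keep arriving, rather than every message carrying its own fee receipt.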
Current Thinking On Firehose/Substreams Support
This is an overview of the design we put together at Como. It touches on both on-chain allocations and off-chain interactions for choosing indexers and facilitating query fee payments. The following focuses on Firehose and substreams specifically, not other potential data services.
V1 (Proposal)
Manifests
- We’re assuming every Firehose is uniquely identified by:
    - A Firehose API version
    - A chain ID that uniquely identifies the chain (e.g. Ethereum mainnet)
    - A schema version for the data for that chain ID
- This can be turned into a manifest, like with subgraphs:

  ```yaml
  serviceType: firehose
  serviceApiVersion: 1.1.0 # example
  chainID: eip155:1 # ethereum mainnet
  chainSchemaVersion: 1.0.0 # another example
  ```

- This manifest in turn can be uploaded to IPFS and hashed into a Firehose ID (`Qm...`)
- The same principle works for substream services (`serviceType`, `serviceApiVersion`, `chainID`; `chainSchemaVersion` may not be necessary)
- Note: The very same principle also works for native blockchain APIs like JSON-RPC.
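The manifest-to-ID step can be sketched as follows. This uses a plain sha256 over a canonically ordered serialization purely to show that the ID is deterministic; the actual Firehose ID in the proposal is the IPFS hash (`Qm...`) of the uploaded manifest file, and the `ServiceManifest` type below mirrors the example manifest rather than any finalized schema.

```typescript
import { createHash } from "crypto";

// Mirrors the example manifest fields; not a finalized schema.
interface ServiceManifest {
  serviceType: string;
  serviceApiVersion: string;
  chainID: string;
  chainSchemaVersion?: string; // may not be necessary, e.g. for substreams
}

// Serialize keys in a fixed order so the same manifest always hashes the
// same regardless of how the object was constructed, then hash it.
function manifestID(m: ServiceManifest): string {
  const canonical = JSON.stringify(m, Object.keys(m).sort());
  return createHash("sha256").update(canonical).digest("hex");
}

const a = manifestID({
  serviceType: "firehose",
  serviceApiVersion: "1.1.0",
  chainID: "eip155:1",
  chainSchemaVersion: "1.0.0",
});
const b = manifestID({
  chainSchemaVersion: "1.0.0",
  chainID: "eip155:1",
  serviceType: "firehose",
  serviceApiVersion: "1.1.0",
});

console.log(a === b); // true: key order does not change the ID
```

Content-addressing the manifest this way gives indexers, gateways, and clients a shared, unforgeable identifier to allocate against and to look services up by, exactly as subgraph deployment IDs work today.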
Discovery & Collecting Query Fees