For those not already aware, @nickwh8te shared a well-written thread on data availability. There's no need to reinvent the wheel or fix what isn't broken, but one thing is missing from his buffet analogy: inspecting the buffet, or in our example, the tomatoes. In short:
Data Availability ≠ Data Storage ≠ Data Validity
🥫 #DataStorage is like canned food.
🍽️ #DataAvailability is an all-you-can-eat buffet.
🕵️‍♀️ #DataValidity is the quality inspector.
In Nick's explanation, "Data storage solves the need to preserve data for future retrieval, similar to canning food. It's sealed and stored for later consumption."
This analogy emphasizes the ease of accessing stored data multiple times without additional charges or data loss. But here's the problem: a can is not transparent.
So, if I tell you that I have put 4 red tomatoes in this can, you have to trust me that it's true.
Data availability solves the problem of making data accessible for anyone on the internet to download.
This is similar to an all-you-can-eat buffet. The idea is to lay out all the food (or data), allowing anyone to come and eat as much as they desire. Data availability sampling, in turn, is akin to having someone ensure that all dishes at the buffet are fit for consumption by sampling a bit from each dish, or, in our case, each tomato 😋
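For the builders following along, here's a minimal sketch of what data availability sampling looks like in code. The `fetchChunk` helper is hypothetical, not any real client API; real systems pair this with erasure coding so that each successful random sample quickly boosts confidence that the full data was actually published.

```ts
// A hypothetical chunk fetcher: returns the chunk's bytes, or null if the
// chunk cannot be retrieved (i.e., the data may have been withheld).
type ChunkFetcher = (index: number) => Promise<Uint8Array | null>;

async function sampleAvailability(
  totalChunks: number,
  samples: number,
  fetchChunk: ChunkFetcher
): Promise<boolean> {
  for (let i = 0; i < samples; i++) {
    // Take a bite from a random dish at the buffet.
    const index = Math.floor(Math.random() * totalChunks);
    const chunk = await fetchChunk(index);
    if (chunk === null) return false; // A missing chunk: data may be withheld.
  }
  // With erasure coding, every successful sample multiplies our confidence
  // that the whole dataset can be reconstructed.
  return true;
}
```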
But, as Nick says, "food at a buffet spoils quickly because it's left out, so the food might not be edible later."
Data availability layers don't guarantee perpetual access to downloadable data; it is only available for a limited duration. Sustaining this data requires running an archival node, since #DPoS or #PoS networks lack incentives for such retention. Consequently, this often leads to centralization, creating a single point of failure along with risks of data inconsistency and corruption.
In practice, these archival nodes are operated by foundations, validators, and third-party providers. While some foundations might provide free access, it's not consistently available. Users usually incur expenses to query data from these nodes, resulting in trust concerns and high costs, plus substantial time investments for syncing if they opt for a local node setup instead.
Hence the significance of storing historical chain data in incentivized environments such as @ArweaveEco, @EthStorage, #BNBGreenfield, or @Filecoin. Plus, employing tools like @LighthouseWeb3 further enhances this storage approach.
This approach addresses two major issues:
1️⃣ Data Availability Sampling, thanks to products like @CelestiaOrg, @eigencloud & @AvailProject.
2️⃣ Data Storage, thanks to the storage layers mentioned above.
But there are other essential questions these solutions don't answer, meaning a major step is still missing:
How do we ensure that we have, for example, exactly four red tomatoes in our can, and not three red and one green? How do we know that these tomatoes, even if red, are not poisoned (you won’t know if it’s bad or not until you’ve eaten it), or if a part of a tomato is missing? This is where @KYVENetwork comes in with Data Validation.
Just like in the food industry, where hygiene and quality inspectors manage these aspects, in our case, the tomatoes represent pieces of data, and the inspectors are validators. They ensure that builders, data scientists, or simple users who build on top of this data are working with a solid foundation, using trustless data that hasn't been tampered with and isn't inconsistent. Even better, data users can access these tomatoes for free.
It doesn't matter if the tomatoes are from field 1, 2, 3, or 4; a red tomato will always be a red tomato. Thus, we have tomatoes from different fields mixed into the same cans (these tomatoes are deterministic).
By having multiple data sources, @KYVENetwork ensures decentralized validation, bringing forward trustless data. If I had only one source of tomatoes, I could claim that my tomatoes are red, but in reality, they might be more greenish, and you would have to trust me.
Trustless data = multiple independent sources returning the same result.
If the majority of validators, after comparing their own sources with the data that has been uploaded, agree that the data is correct, we can use it without any issue, confident that it remains accessible because it is "canned".
We then know that this piece of data is accurate, or that this “can” contains four red tomatoes. If this is not the case, and the uploader has provided incorrect data, then they are penalized and get slashed, similar to how providing toxic or green tomatoes would result in fines.
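In code, the idea looks roughly like this. This is a toy sketch, not KYVE's actual implementation: each validator hashes the copy it fetched from its own independent source, and the majority decides whether the uploader's bundle is valid or gets slashed.

```ts
import { createHash } from "crypto";

// Hash a bundle of data; matching hashes = identical "cans of tomatoes".
const sha256 = (data: Uint8Array): string =>
  createHash("sha256").update(data).digest("hex");

function validateBundle(
  uploadedData: Uint8Array,
  validatorCopies: Uint8Array[] // the same bundle, fetched from independent sources
): { valid: boolean; slashUploader: boolean } {
  const submittedHash = sha256(uploadedData);
  // Each validator "votes" by comparing its own copy to the submission.
  const votesFor = validatorCopies.filter(
    (copy) => sha256(copy) === submittedHash
  ).length;
  // Majority agreement => the can really holds four red tomatoes.
  const valid = votesFor > validatorCopies.length / 2;
  return { valid, slashUploader: !valid };
}
```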
To keep track of the valid pieces of data, we store the hash of the uploaded data on the decentralized ledger, which is the KYVE Blockchain. Just like keeping the serial number of each can of tomatoes that’s been checked. It’d be like having a transparent can, allowing everyone to verify that there are really four red tomatoes inside.
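And the "transparent can" check itself is just a hash comparison. A sketch with illustrative names (how the on-chain hash is looked up is left out):

```ts
import { createHash } from "crypto";

// Once a bundle's hash is recorded on chain, anyone can verify a downloaded
// copy without trusting the storage provider that served it.
function verifyAgainstOnChainHash(
  downloadedBundle: Uint8Array,
  onChainHash: string // the "serial number" looked up on the KYVE chain
): boolean {
  const localHash = createHash("sha256")
    .update(downloadedBundle)
    .digest("hex");
  return localHash === onChainHash; // the transparent-can check
}
```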
Why would we keep whole tomatoes instead of just tomato pieces or tomato sauce, like a data sampling layer does with data?
The answer is simple: with whole tomatoes, you can create tomato pieces, dried tomatoes, tomato sauce, ketchup, soup, etc. Essentially, you can use the tomatoes in any way you want.
It’s the same with storing raw data from the genesis block. You can use this data as you wish and transform it according to your needs. You don’t need to download all the data at once, but rather in bundles (groups of blocks).
In our case, it's like a can containing tomatoes numbered from 4 to 8, or 1500 to 1504, for example. You can use them to cook without needing to buy all the cans from one field just to use one can.
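A rough sketch of that bundle math, with hypothetical helper names (not a real KYVE SDK call): blocks map to fixed-size bundles, so you fetch only the can you need.

```ts
const BUNDLE_SIZE = 5; // e.g. blocks 1500..1504 live together in one bundle

// Which bundle ("can") holds a given block height?
const bundleIdFor = (height: number): number =>
  Math.floor(height / BUNDLE_SIZE);

async function getBlock(
  height: number,
  fetchBundle: (bundleId: number) => Promise<unknown[]> // hypothetical helper
): Promise<unknown> {
  const bundle = await fetchBundle(bundleIdFor(height));
  // Only this one bundle is downloaded, not the chain's entire history.
  return bundle[height % BUNDLE_SIZE];
}
```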
Tomatoes are just one example, but in fact, this method would work with any kind of deterministic information. This means that @KYVENetwork can validate, in a fully decentralized way, any type of deterministic data from both Web3 and Web2.
Don’t trust, verify.