Editor’s Note: This blogpost is a repost of the original content published on 5 March 2024, by Bidhan Roy and Marcos Villagra from Bagel. Founded in 2023 by CEO Bidhan Roy, Bagel is a machine learning and cryptography research lab building a permissionless, privacy-preserving machine learning ecosystem. This blogpost represents the independent view of these authors, who have given their permission for this re-publication.
Trillion-dollar industries are unable to leverage their immensely valuable data for AI training and inference due to privacy concerns. The potential for AI-driven breakthroughs—genomic secrets that could cure diseases, predictive insights to eliminate supply chain waste, and untapped energy sources—remains locked away. Privacy regulations also closely guard this valuable and sensitive information.
To propel human civilization forward in energy, healthcare, and collaboration, it is crucial to enable AI systems that train and run inference on data while maintaining full end-to-end privacy. At Bagel, pioneering this capability is our mission. We believe accessing a fundamental resource like knowledge, for both human-driven and autonomous AI, should not entail a compromise on privacy.
We have applied and experimented with almost all the major privacy-preserving machine learning (PPML) mechanisms. Below, we share our insights, our approach, and some research breakthroughs.
And if you’re in a rush, we have a TLDR at the end.
Privacy-preserving Machine Learning (PPML)
Recent advances in academia and industry have focused on incorporating privacy mechanisms into machine learning models, highlighting a significant move towards privacy-preserving machine learning (PPML). At Bagel, we have experimented with all the major PPML techniques, particularly those developed after differential privacy. Our work, positioned at the intersection of AI and cryptography, draws from the cutting edge in both domains.
First, we will delve into each of these, examining their advantages and drawbacks. In subsequent posts, we will describe Bagel’s approach to data privacy, which addresses and resolves the challenges associated with the existing solutions.
Differential Privacy (DP)
One of the first and most important techniques offering a mathematical guarantee of privacy is differential privacy, or DP (Dwork et al. 2006), which addresses the shortcomings of earlier methods by providing a quantifiable definition of privacy.
DP requires that a randomized algorithm A behaves almost identically on any two datasets D1 and D2 that differ in a single record: the probability of A producing any given outcome changes only marginally between A(D1) and A(D2). This principle implies that minor dataset modifications do not significantly alter outcome probabilities, marking a pivotal advancement in data privacy.
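One standard formalization, the (ε, δ)-definition, makes this precise: A is differentially private if, for all neighboring datasets D1 and D2 and every set S of possible outputs, \(\Pr[A(D_1)\in S]\le e^{\varepsilon}\,\Pr[A(D_2)\in S]+\delta\), where ε and δ are the privacy parameters and smaller values mean stronger privacy.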
The application of DP in machine learning, particularly in neural network training and inference, demonstrates its versatility and effectiveness. Notable implementations include adapting DP for supervised learning algorithms by integrating random noise at various phases: directly onto the data, within the training process, or during inference, as highlighted by Ponomareva et al. (2023) and further references.
The balance between privacy and accuracy in DP is governed by the noise level: greater noise enhances privacy at the cost of accuracy, affecting both inference and training stages. This relationship was explored by Abadi et al. (2016), who introduced Gaussian noise into stochastic gradient descent (DP-SGD) and observed the noise’s impact on accuracy across the MNIST and CIFAR-10 datasets.
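As a rough illustration of the idea (a sketch, not the exact procedure or hyperparameters from the paper), a single DP-SGD step clips each per-example gradient and adds calibrated Gaussian noise before updating the weights; the clip norm, noise multiplier, and learning rate below are illustrative placeholders:

```python
import numpy as np

def dp_sgd_step(weights, per_example_grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.1):
    """One illustrative DP-SGD update (sketch, not the paper's exact recipe).

    per_example_grads has shape (batch_size, num_params); clip_norm and
    noise_multiplier are placeholder privacy parameters.
    """
    # Clip each example's gradient to an L2 norm of at most clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))

    # Sum the clipped gradients and add Gaussian noise scaled to the clip bound.
    noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=weights.shape)
    noisy_mean_grad = (clipped.sum(axis=0) + noise) / len(per_example_grads)

    # Take an ordinary gradient descent step on the noisy gradient.
    return weights - lr * noisy_mean_grad
```

The more noise added at each step (larger noise_multiplier), the stronger the privacy guarantee and the larger the hit to accuracy, which is exactly the trade-off observed on MNIST and CIFAR-10.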
An innovative DP application, Private Aggregation of Teacher Ensembles (PATE) by Papernot et al. (2016), divides a dataset into disjoint subsets and trains a network, termed a teacher, on each subset without privacy. The teachers’ aggregated inferences, subjected to added noise for privacy, are then used to train a student model that emulates the teacher ensemble. This method also underscores the trade-off between privacy enhancement through noise addition and the resultant accuracy reduction.
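The core of PATE is a noisy-argmax vote over the teachers’ predictions; a minimal sketch of that aggregation step might look as follows (the Laplace scale parameter gamma is an illustrative choice, not a value from the paper):

```python
import numpy as np

def pate_aggregate(teacher_labels, num_classes, gamma=0.1):
    """Noisy-argmax aggregation of teacher votes, in the spirit of PATE.

    teacher_labels: one predicted class label per teacher for a single query.
    gamma: inverse noise scale; smaller gamma adds more noise (more privacy).
    """
    # Count the votes each class received from the teacher ensemble.
    votes = np.bincount(teacher_labels, minlength=num_classes).astype(float)

    # Perturb the counts with Laplace noise before picking the winner.
    noisy_votes = votes + np.random.laplace(0.0, 1.0 / gamma, size=num_classes)

    # The noisy winner is the label used to train the student model.
    return int(np.argmax(noisy_votes))
```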
Further studies affirm that while privacy can be secured with little impact on execution times (Li et al. 2015), stringent privacy measures can obscure discernible patterns essential for learning (Abadi et al. 2016). Consequently, a certain level of privacy must be relinquished in DP to facilitate effective machine learning model training, illustrating the nuanced balance between privacy preservation and learning efficiency.
Pros of Differential Privacy
The advantages of using DP are:
Effortless. Easy to integrate into existing algorithms and code.
Algorithm independence. Schemes can be made independent of the training or inference algorithm.
Fast. Some DP mechanisms have been shown to have little impact on the execution times of algorithms.
Tunable privacy. The degree of desired privacy can be chosen by the algorithm designer.
Cons of Differential Privacy
Access to private data is still necessary. Teachers in the PATE scheme must have full access to the private data (Papernot et al. 2016) in order to train a neural network. Also, the stochastic gradient descent algorithm based on DP only adds noise to the weight updates and needs access to private data for training (Abadi et al. 2016).
Privacy-Accuracy-Speed trade-off on data. All implementations must sacrifice some privacy in order to get good results. If there is no discernible pattern in the input, then there is nothing to train on (Feyisetan et al. 2020). The implementation of some noise mechanisms can impact execution times, necessitating a balance between speed and the goals of privacy and accuracy.
Zero-Knowledge Machine Learning (ZKML)
A zero-knowledge proof system (ZKP) is a method allowing a prover P to convince a verifier V about the truth of a statement without disclosing any information apart from the statement’s veracity. To affirm the statement’s truth, P produces a proof π for V to review, enabling V to be convinced of the statement’s truthfulness.
Zero-Knowledge Machine Learning (ZKML) is an approach that combines the principles of zero-knowledge proofs (ZKPs) with machine learning. This integration allows machine learning models to be trained and to infer with verifiability.
For an in-depth examination of ZKML, refer to the work by Xing et al. (2023). Below we provide a brief explanation that focuses on the utilization of ZKPs for neural network training and inference.
ZKML Inference
Consider an unlabeled dataset A and a pretrained neural network N tasked with labeling each record in A. To generate a ZK proof of N‘s computation during labeling, an arithmetic circuit C representing N is required, including circuits for each neuron’s activation function. Assuming such a circuit C exists and is publicly accessible, the network’s weights and a dataset record become the private and public inputs, respectively. For any record a of A, N‘s output is denoted by a pair (l,π), where l is the label and π is a zero-knowledge argument asserting the existence of specific weights that facilitated the labeling.
This model illustrates how ZK proves the accurate execution of a neural network on data, concealing the network’s weights within a ZK proof. Consequently, any verifier can be assured that the executing agent possesses the necessary weights.
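The division of labor can be sketched with a hypothetical prove/verify interface (these function names and signatures are illustrative and do not belong to any particular ZK library): the prover evaluates the circuit with the weights as a private witness, while the verifier checks the proof against only the public record and label.

```python
# Hypothetical interface for illustration only; not a real ZK library.
# The weights are the private witness; the record and label are public inputs.

def zkml_label(record, weights, circuit, prove):
    """Prover side: run the network normally, then produce (label, proof)."""
    label = circuit.evaluate(weights, record)   # ordinary forward pass over C
    proof = prove(circuit, public=(record, label), private=weights)
    return label, proof

def zkml_check(record, label, proof, circuit, verify):
    """Verifier side: accept or reject the label without ever seeing the weights."""
    return verify(circuit, public=(record, label), proof=proof)
```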
ZKML Training
ZKPs are applicable during training to validate N‘s correct execution on a labeled dataset A. Here, A serves as the public input, with an arithmetic circuit C depicting the neural network N. The training process requires an additional arithmetic circuit to implement the optimization function, minimizing the loss function. For each training epoch i, a proof π_i is generated, confirming the algorithm’s accurate execution through epochs 1 to i-1, including the validity of the preceding epoch’s proof. The training culminates with a compressed proof π, proving the correct training over dataset A.
The explanation above illustrates that during training, the network’s weights are concealed to ensure that the training is correctly executed on the given dataset A. Additionally, all internal states of the network remain undisclosed throughout the training process.
Pros of ZKML
The advantages of using ZKPs with neural networks are:
Privacy of model weights. The weights of the neural network are never revealed during training or inference in any way. The weights and the internal states of the network algorithm are private inputs for the ZKP.
Verifiability. The proof certifies the proper execution of training or inference processes and guarantees the accurate computation of weights.
Trustlessness. The proof and its verification properties ensure that the data owner is not required to place trust in the agent operating the neural network. Instead, the data owner can rely on the proof to confirm the accuracy of both the computation and the existence of correct weights.
Cons of ZKML
The disadvantages of using ZKPs with neural networks are:
No data privacy. The agent running the neural network needs access to the data in order to train or do inference. Data is considered a parameter that is publicly known to the data owner and the prover running the neural network (Xing et al. 2023).
No privacy for the model’s algorithm. In order to create a ZK proof, the algorithm of the entire neural network should be publicly known. This includes the activation functions, the loss function, optimization algorithm used, etc (Xing et al. 2023).
Proof generation of an expensive computation. Presently, the process of generating a ZK proof is computationally demanding—see for example this report on the computation times of ZK provers. Creating a proof for each epoch within a training algorithm can exacerbate the computational burden of an already resource-intensive task.
Federated Learning (FL)
In federated learning (FL), the goal is to train a global model on a dataset that is distributed across multiple servers, each holding local data samples, without any server sharing its local data.
In FL there is a global objective function being optimized, which is defined as
\(f(x_1,\dots,x_n)=\frac{1}{n}\sum_{i=1}^n f_i(x_i),\)
where n is the number of servers, each variable \(x_i\) is the set of parameters as viewed by server i, and each \(f_i\) is the local objective function of server i. FL tries to find the set of parameter values that optimizes f.
The figure below shows the general process in FL.
Initialization. An initial global model is created and distributed by a central server to all other servers.
Local training. Each server trains the model using their local data. This ensures data privacy and security.
Model update. After training, each server shares with the central server their local updates like gradients and parameters.
Aggregation. The central server receives all local updates and aggregates them into the global model, for example, using averaging.
Model distribution. The updated model is distributed again to the local servers, and the previous steps are repeated until a desired level of performance is achieved by the global model.
Since local servers never share their local data, FL guarantees privacy over that data. However, the model being constructed is shared among all parties, and hence, its structure and set of parameters are not hidden.
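A minimal sketch of one aggregation round, in the spirit of federated averaging, is shown below; local_train is an assumed placeholder for whatever optimizer each server runs locally, not part of any specific FL framework:

```python
import numpy as np

def federated_round(global_weights, local_datasets, local_train, lr=0.01):
    """One round of FL: local training on each server, then central averaging.

    local_train(global_weights, data, lr) is a placeholder that returns the
    weights produced by one server training on its own local data.
    """
    local_updates = []
    for data in local_datasets:
        # Local training: the raw data never leaves the server; only the
        # resulting model parameters are sent back.
        local_updates.append(local_train(global_weights, data, lr))

    # Aggregation on the central server: a simple (unweighted) average of
    # the local parameter vectors becomes the new global model.
    return np.mean(local_updates, axis=0)
```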
Pros of FL
The advantages of using FL are:
Data privacy. The local data on the local servers is never shared. All computations are done locally, and there is no need for direct communication between the local servers.
Distributed computing. The creation of the global model is distributed among local servers, thereby parallelizing a resource-intensive computation. Thus, FL is considered a distributed machine learning framework (Xu et al. 2021).
Cons of FL
The disadvantages of using FL are:
Model is not private. The global model is shared among each local server in order to do their computations locally. This includes the aggregated weights and gradients at each step of the FL process. Thus, each local server is aware of the entire architecture of the global model (Konečný et al. 2016).
Data leakage. Recent research indicates that data leakage remains a persistent issue, notably through mechanisms such as gradient sharing—see for example Jin et al. (2022). Consequently, FL cannot provide complete assurances of data privacy.
Trust. Since no proofs are generated in FL, every party involved in the process must be trusted to have computed their updates and parameters as expected (Gao et al. 2023).
Fully Homomorphic Encryption (FHE)
At its core, homomorphic encryption permits computations on encrypted data. By “homomorphic,” we refer to the capacity of an encryption scheme to allow specific operations on ciphertexts that, when decrypted, yield the same result as operations performed directly on the plaintexts.
Consider a scenario with a secret key k and a plaintext m. In an encryption scheme (E,D), where E and D represent encryption and decryption algorithms respectively, the condition D(k,E(k,m))=m must hold. A scheme (E,D) is deemed fully homomorphic if for any key k and messages m, the properties E(k,m+m’)=E(k,m)+E(k,m’) and E(k,m*m’)=E(k,m)* E(k,m’) are satisfied, with addition and multiplication defined over a finite field. If only one operation is supported, the scheme is partially homomorphic. This definition implies that operations on encrypted data mirror those on plaintext, crucial for maintaining data privacy during processing.
In plain words, if we have a fully homomorphic encryption scheme, then operating over the encrypted data is equivalent to operating over the plaintext. We will write FHE to refer to a fully homomorphic encryption scheme. The figure below shows how an arbitrary homomorphic operation works over a plaintext and ciphertext.
The homomorphic property of FHE makes it invaluable in situations where data must remain secure while still being used for computations. For instance, if we possess sensitive data and require a third party to perform data analysis on it, we can rely on FHE to encrypt the data. This allows the third party to conduct analysis on the encrypted data without the need for decryption. The mathematical properties of FHE guarantee the accuracy of the analysis results.
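As a toy illustration of this idea, textbook RSA (which is only partially homomorphic, supporting multiplication, and is insecure as written) already shows how an operation on ciphertexts mirrors the same operation on plaintexts:

```python
# Toy illustration only: textbook RSA with tiny primes and no padding is NOT
# secure. It is multiplicatively homomorphic, so multiplying ciphertexts
# corresponds to multiplying the underlying plaintexts.

p, q = 61, 53                  # toy primes
n = p * q                      # modulus (3233)
e, d = 17, 2753                # matching public and private exponents for this n

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

m1, m2 = 7, 12
c1, c2 = encrypt(m1), encrypt(m2)

# Operating on the encrypted values yields the encryption of the result:
assert decrypt((c1 * c2) % n) == (m1 * m2) % n
```

A fully homomorphic scheme extends this property to both addition and multiplication, which is what allows arbitrary arithmetic circuits to be evaluated on ciphertexts.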
FHE Inference
Fully Homomorphic Encryption (FHE) can be used to perform inference in neural networks while preserving data privacy. Let’s consider a scenario where N is a pretrained neural network, A is a dataset, and (E,D) is an asymmetric FHE scheme. The goal is to perform inference on a record a of A without revealing the sensitive information contained in a to the neural network.
The inference process using FHE begins with encryption. The data owner encrypts the record a using the encryption algorithm E with the public key public_key, obtaining the encrypted record a’ = E(public_key, a).
Next, the data owner sends the encrypted record a’ along with public_key to the neural network N. The neural network N must have knowledge of the encryption scheme (E,D) and its parameters to correctly apply homomorphic operations over the encrypted data a’. Any arithmetic operation performed by N can be safely applied to a’ due to the homomorphic properties of the encryption scheme.
One challenge in using FHE for neural network inference is handling non-linear activation functions, such as sigmoid and ReLU, which involve non-arithmetic computations. To compute these functions homomorphically, they need to be approximated by low-degree polynomials. The approximations allow the activation functions to be computed using homomorphic operations on the encrypted data a’.
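As a concrete sketch of this step (the degree and fitting interval below are illustrative choices; production FHE libraries use their own approximation schemes), a sigmoid can be fit by a low-degree polynomial whose additions and multiplications are then evaluated homomorphically:

```python
import numpy as np

# Fit a degree-3 polynomial to the sigmoid on an illustrative interval.
# Inside an FHE pipeline, only the resulting polynomial's additions and
# multiplications would be applied to the encrypted inputs.
x = np.linspace(-6, 6, 1000)
sigmoid = 1.0 / (1.0 + np.exp(-x))
coeffs = np.polyfit(x, sigmoid, deg=3)
poly_sigmoid = np.poly1d(coeffs)

# Compare the approximation against the exact activation at a sample point.
print(poly_sigmoid(0.5), 1.0 / (1.0 + np.exp(-0.5)))
```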
After applying the necessary homomorphic operations and approximated activation functions, the neural network N obtains the inference result. It’s important to note that the inference result is still in encrypted form, as all computations were performed on encrypted data.
Finally, the encrypted inference result is sent back to the data owner, who uses the private key associated with the FHE scheme to decrypt the result using the decryption algorithm D. The decrypted inference result is obtained, which can be interpreted and utilized by the data owner.
By following this inference process, the neural network N can perform computations on the encrypted data a’ without having access to the original sensitive information. The FHE scheme ensures that the data remains encrypted throughout the inference process, and only the data owner with the private key can decrypt the final result.
It’s important to note that the neural network N must be designed and trained to work with the specific FHE scheme and its parameters. Additionally, the approximation of non-linear activation functions by low-degree polynomials may introduce some level of approximation error, which should be considered and evaluated based on the specific application and accuracy requirements.
FHE Training
The process of training a neural network using Fully Homomorphic Encryption (FHE) is conceptually similar to performing inference, but with a few key differences. Let’s dive into the details.
Imagine we have an untrained neural network N and an encrypted dataset A’ = E(public_key, A), where E is the encryption function and public_key is the public key of an asymmetric FHE scheme. Our goal is to train N on the encrypted data A’ while preserving the privacy of the original dataset A.
The training process unfolds as follows. Each operation performed by the network and the training algorithm is executed on each encrypted record a’ of A'. This includes both the forward and backward passes of the network. As with inference, any non-arithmetic operations like activation functions need to be approximated using low-degree polynomials to be compatible with the homomorphic properties of FHE.
A fascinating aspect of this approach is that the weights obtained during training are themselves encrypted. They can only be decrypted using the private key of the FHE scheme, which is held exclusively by the data owner. This means that even the agent executing the neural network training never has access to the actual weight values, only their encrypted counterparts.
Think about the implications of this. The data owner can outsource the computational heavy lifting of training to a third party, like a cloud provider with powerful GPUs, without ever revealing their sensitive data. The training process operates on encrypted data and produces encrypted weights, ensuring end-to-end privacy.
Once training is complete, the neural network sends the collection of encrypted weights w’ back to the data owner. The data owner can then decrypt the weights using their private key, obtaining the final trained model. They are the sole party capable of accessing the unencrypted weights and using the model for inference on plaintext data.
There are a few caveats to keep in mind. FHE operations are computationally expensive, so training a neural network with FHE will generally be slower than training on unencrypted data.
Pros of FHE
The advantages of using FHE are:
Data privacy. Third-party access to encrypted private data is effectively prevented, a security guarantee upheld by the assurances of FHE and lattice-based cryptography (Gentry 2009).
Model privacy. Training and inference processes are carried out on encrypted data, eliminating the need to share or publicize the neural network’s parameters for accurate data analysis.
Effectiveness. Previous studies have demonstrated that neural networks operating on encrypted data using FHE maintain their accuracy—see for example Nandakumar et al. (2019) and Xu et al. (2019). Therefore, we can be assured that employing FHE for training and inference processes will achieve the anticipated outcomes.
Quantum resistance. The security of FHE, unlike other encryption schemes, is grounded in difficult problems derived from Lattice theory. These problems are considered to be hard even for quantum computers (Regev 2005), thus offering enhanced protection against potential quantum threats in the future.
Cons of FHE
The disadvantages of using FHE are:
Verifiability. FHE offers proofs of neither correct encryption nor correct computation. Hence, we must trust that the data intended for encryption is indeed the correct data (Viand et al. 2023).
Speed. Relative to conventional encryption schemes, FHE is still considered slow in its parameter setup, encryption, and decryption algorithms (Gorantala et al. 2023).
Memory requirements. The number of weights that need to be encrypted is proportional to the size of the network. Even for small networks, the RAM requirements are on the order of gigabytes (Chen et al. 2018), (Nandakumar et al. 2019).
We examined the four most widely used privacy-preserving techniques in machine learning, focusing on neural network training and inference. We evaluated these techniques across four dimensions: data privacy, model algorithm privacy, model weights privacy, and verifiability.
Data privacy considers the model owner’s access to private data. Differential privacy (DP) and zero-knowledge machine learning (ZKML) require access to private data for training and proof generation, respectively. Federated learning (FL) enables training and inference without revealing data, while fully homomorphic encryption (FHE) allows computations on encrypted data.
Model algorithm privacy refers to the data owner’s access to the model’s algorithms. DP does not require algorithm disclosure, while ZKML necessitates it for proof generation. FL distributes algorithms among local servers, and FHE operates without accessing the model’s algorithms.
Model weights privacy concerns the data owner’s access to the model’s weights. DP and ZKML keep weights undisclosed or provide proofs of existence without revealing values. FL involves exchanging weights among servers for decentralized learning, contrasting with DP and ZKML’s privacy-preserving mechanisms. FHE enables training and inference on encrypted data, eliminating the need for model owners to know the weights.
Verifiability refers to the inherent capability for verifiable computation. ZKML provides this capability by design, whereas DP, FL, and FHE do not provide similar levels of integrity assurance.
The table below summarizes our findings:
What’s Next 🥯
At Bagel, we recognize that existing privacy-preserving machine learning solutions fall short in providing end-to-end privacy, scalability, and strong trust assumptions. To address these limitations, our team has developed a novel approach based on a modified version of fully homomorphic encryption (FHE).
Our pilot results are extremely promising, indicating that our solution has the potential to revolutionize the field of privacy-preserving machine learning. By leveraging the strengths of homomorphic encryption and optimizing its performance, we aim to deliver a scalable, trustworthy, and truly private machine learning framework.
We believe that our work represents a paradigm shift in the way machine learning is conducted, ensuring that the benefits of AI can be harnessed without compromising user privacy or data security. As we continue to share more about our approach, we invite you to follow our progress by subscribing to the Bagel blog.
For more thought pieces from Bagel, check out their blog here.
To stay updated on the latest Filecoin happenings, follow the @Filecointldr handle.
Disclaimer: This information is for informational purposes only and is not intended to constitute investment, financial, legal, or other advice. This information is not an endorsement, offer, or recommendation to use any particular service, product, or application.
The Filecoin community celebrated the first anniversary of the Filecoin Virtual Machine (FVM) launch on March 14, 2024. The FVM has brought programmability to Filecoin’s verifiable storage and opened up a unique DeFi ecosystem anchored around improving on-chain collateral markets. Liquid Staking, for example, as a subset of Filecoin DeFi, has hit over $500 million in TVL. As the network grows, several critical infrastructures across AMMs, Bridges, Oracles, and Collateral Debt Positions (CDPs) are coming together to propel DeFi expansion in 2024.
In this blog post, let’s take a look at the latest DeFi projects launched on top of FVM and provide a view into future areas of activity.
DeFi Developments on FVM
Automated Market Makers
Automated Market Makers (AMMs) connect Filecoin with other Web3 ecosystems, enabling on-chain swaps, deeper liquidity, and fresh LP opportunities.
Decentralized Exchanges: ✅
Recently, leading Decentralized Exchanges Uniswap v3 (via Oku.trade) and Sushi integrated with Filecoin by deploying on the FVM. Oku Trade’s interface enables Uniswap users to easily exchange assets and provide liquidity on Filecoin. With this, FVM developers can effortlessly access bridged USDC and ETH assets natively on the Filecoin network, broadening Filecoin’s reach. As a foundational DeFi primitive, DEXes also open the floodgates for non-native applications to leverage Filecoin’s robust storage and compute hardware.
Interoperability Networks
Bridges: ✅
Bridges help bring liquidity into DEXs and AMMs on FVM. For developers building on FVM, bridges connect Filecoin’s verifiable data with tokens, users, and applications on any chain, ensuring maximum composability for DeFi protocols. For this purpose, messaging and token bridging solutions by Axelar and Celer were added to the Filecoin network immediately post-FVM launch.
Today, AMMs Uniswap v3 and Sushi along with several other DeFi applications are natively bridged to Filecoin with the help of cross-chain infrastructure enabled by Axelar and Celer.
Liquid Staking
Liquid Staking protocols have been the prime mover within Filecoin DeFi. They’ve played a vital role in growing and improving on-chain collateral markets. Today, nearly 17% (approx. 30 million FIL) of the total collateral locked by storage providers comes from FVM-based protocols: GLIF (52%), SFT Protocol (10%), Repl (9%), and others (29%). These protocols have increased capital access to storage providers while simultaneously enabling better yield access to token holders. Read more to learn how Filecoin staking works.
GLIF Points: 🔜
GLIF, the leading protocol on Filecoin, has a TVL of over $250 million. To put this into context, this surpasses the largest Liquid Staking protocols on L1 chains like Avalanche. As of writing this (March 06, 2024), 32% of all FIL staked in GLIF liquidity pools was deposited shortly after GLIF’s announcement that it would launch GLIF points (on Feb. 28, 2024), a likely precursor to a governance token.
Typically, to participate in the rewards program, GLIF users will have to deposit FIL and mint GLIF’s native token, iFIL. Similarly, the SFT Protocol launched a points program in 2023 based on its governance token to incentivize community participation.
Overall, we look forward to how the gameplay of points, popular among DApps in Web3 ecosystems, will act as a catalyst to decentralize governance and incentivize participation for Filecoin’s DeFi DApps.
New Staking Models: 👀
The influx of protocols experimenting with new models to inject liquidity into the ecosystem hasn’t slowed down. Two projects worth mentioning are Repl and FILLiquid.
Repl.fi introduces the concept of “repledging.” Under repledging, SPs’ pledged FIL is tokenized into pFIL, Repl’s native token, and used for other purposes, including earning rewards. Repledging essentially increases the utility of locked assets, thereby reducing opportunity costs for SPs. In just a few months after launch, Repl’s TVL has soared past $30 million.
FILLiquid, currently on testnet, models the business of FIL lending for SPs on algorithm-based fixed fees instead of traditional interest rates. The separation of payouts from the duration of deposits is expected to nudge long-term pledging and borrowing activities from token holders and SPs respectively, saving costs and increasing efficiency.
Price Oracles
Oracles, services that feed external data to smart contracts, are critical blockchain infrastructure essential for DeFi applications to grow and interact with the real world.
Pyth Network: ✅
Pyth recently launched its Price Feeds on the FVM. The integration allows FVM developers to access more than 400 real-time market data feeds while exploring opportunities to build on top of Filecoin’s storage layer. DeFi apps benefit from Pyth’s low-latency, high-fidelity financial data coming directly from global institutional participants such as exchanges, market makers, and trading firms.
Filecoin is also supported by Tellor, an optimistic oracle that gives FVM-based applications access to price feed data.
Collateralized Debt Positions
As DeFi activity on Filecoin is climbing, Collateralized Debt Positions (CDPs) will add more dimensions for other decentralized applications to build on FVM.
Chymia.Finance: 🔜
Chymia is an upcoming DeFi protocol on FVM. With a growing number of Liquid Staking Tokens (LSTs) on Filecoin, CDPs will extend the utility of locked tokens by generating stablecoins. Through Chymia, holders of LSTs can generate higher yields while using them as collateral for deeper liquidity.
Ajna: 🔜
Ajna is a noncustodial, peer-to-pool, permissionless lending, borrowing, and trading system requiring no governance or external price feed to function. As a result, any ERC20 on the FVM will be able to set up its own borrow or lend pools, making it easier for new developers to build a utility for their protocols.
Payments
Adjacent to the storage offering on Filecoin, the FVM allows developers to bind DeFi payments to real-world primitives on the network. Filecoin’s core economic flows are built so that paid services can settle on-chain. Station and Saturn are two notable Filecoin services to have successfully leveraged FVM for payments.
Filecoin Station: ✅
Station is a downloadable desktop application that uses idle computing resources on Station’s DePIN network to run small jobs. Participants in the network are rewarded with FIL earnings. Currently, Station operates the Spark and the Voyager modules, both aimed at improving retrievability on the network. In February, roughly 1,900 Station operators were rewarded with FIL for their participation.
Filecoin Saturn: ✅
Saturn, a decentralized CDN network built on Filecoin, also leverages FVM for disbursing FIL payments to retrieval nodes on the network. In 2023, Saturn averaged over 2,000 earning nodes (retrieval providers on the network receiving FIL) for their services.
Decentralized Options
With growing liquidity, options are yet another emerging product in DeFi. Options facilitate the buying or selling of assets at a predetermined price on a future date, giving token holders protection against price volatility and an opportunity to speculate on market moves.
Thetanuts: ✅
Currently, Thetanuts Finance, a decentralized on-chain options protocol, supports Filecoin. The platform allows FIL holders to earn yield on their holdings via a covered call strategy. Thetanuts’ FIL covered-call vaults are cash-settled and work on a bi-weekly tenor.
Wallets
To use dApps on the FVM, users are required to hold FIL in an f410 or 0x type wallet address. Over time, many Web3 wallets such as MetaMask, FoxWallet, and Brave have started supporting 0x/f410 addresses. MetaMask also supports Ledger. With this, it is possible to hold funds in a Ledger wallet and interact with FVM directly.
In addition, exchanges like Binance natively supporting the FEVM drastically reduce complexities for FVM builders. To learn more about the most recent wallet upgrades, visit the Filecoin TLDR webpage.
What’s Next?
The obvious near-term impact of various integrations across AMMs, Bridges, and CDPs is a fresh influx of liquidity into the Filecoin ecosystem. Liquidity begets deeper liquidity with an increase in the number and diversity of DeFi protocols on Filecoin. DeFi’s growing economy, combined with more services coming on-chain and utilizing FVM for payments, will increase the revenue and utility of the network overall. We expect this strong DeFi traction to scale Filecoin as an L1 ecosystem, with core services of storage and compute becoming the backbone of the decentralized internet.
To stay updated on the latest Filecoin happenings, follow the @Filecointldr handle.
Many thanks to HQ Han and Jonathan Victor for reviewing and providing valuable insights and to all the ecosystem partners and teams for their timely input.
Disclaimer: This information is for informational purposes only and is not intended to constitute investment, financial, legal, or other advice. This information is not an endorsement, offer, or recommendation to use any particular service, product, or application.
2023 marked significant shifts in technology and adoption for the Filecoin network. From the launch of the Filecoin Virtual Machine to other developments across Retrievals and Compute, 2023 laid the foundation for Filecoin’s continued expansion. This blogpost will provide a summary of the notable milestones the Filecoin ecosystem reached in 2023 and, in the later portion, the growth drivers to watch for 2024.
TL;DR
2023 Retrospective:
Storage: Active deals reached 1,800 PiB, and storage utilization grew to 20%
FVM: FVM launch in March 2023 enabled FIL Lending (Staking) which supplied 11% of total collateral locked by Storage Providers; TVL broke USD 200M
Retrievals: Retrievability of Filecoin data improved, alongside notable releases from Saturn (3,000+ nodes, sub 60ms TTFB) and Station
Compute, AI and DePIN networks: Synergistic growth of Filecoin together with physical resource & compute networks
Web2 Enterprise Data Storage: Led by strengthened offerings by teams such as Banyan, Seal Storage, and Steeldome
Continued DeFi growth: DEXes, Oracles, CDPs, spurred by service revenue coming on-chain
2023 Retrospective
To recap, Filecoin enables open services for data, built on top of IPFS. While Filecoin initially focused on storage, its vision includes the infrastructure to store, distribute, and transform data. The State & Direction of Filecoin, Summarized blog post shared an initial framework for Filecoin’s key components. This framework will serve as an anchor for discussing 2023’s traction.
1) Storage Markets: Active storage deals reached 1,800 PiB with storage utilization of 20%
In 2023, Filecoin’s stored data volume grew dramatically to 1,800 PiB, marking a 3.8x increase from the start of the year. Storage utilization grew to 20% from 3%. Currently, Filecoin represents 99% market share of total data stored across decentralized storage protocols (Filecoin, Storj, Sia, and Arweave).
Growth in Active Storage Deals was driven by two factors:
1) Storing data was easier in 2023. Continued development across on-ramps such as Singularity.Storage, NFT.Storage, and Web3.Storage increased Web3 adoption. Singularity alone onboarded 180 plus clients and 270 PiB of data. This growth was enabled by advances in its S3 compatibility, data preparation, and deal making.
2) Large dataset clients grew exponentially in 2023. Over 1,800 large dataset clients onboarded datasets by the end of 2023, from an initial base of 500 plus clients. 37% of these clients onboarded datasets exceeding 100 TiB in storage size.
2) Retrievals: Greater reliability for Filecoin Retrievals, alongside releases from Saturn & Station
Filecoin’s retrieval capabilities were bolstered by improvements both in its tooling and offerings. Several teams, such as Titan, Banyan, Saturn and Station, are laying the groundwork for new use cases to be anchored into the Filecoin economy, including decentralized CDNs and hot storage.
Saturn: A Decentralized CDN
Saturn is a decentralized CDN network built on Filecoin, that seeks to address the need for application-level retrievals. The Saturn network currently has over 3,000 nodes distributed across the globe, enabling low-latency content regardless of location.
Distribution of nodes: 35% in North America, 34% in Europe, 24% in Asia, 7% rest of world (source: Saturn Explorer, as of January 8, 2024).
Across 2023, Saturn reduced its effective Time-to-First-Byte (median TTFB) to 60 milliseconds. This makes Saturn the fastest dCDN for content-addressable data, with TTFB remaining consistent across all geographies. Saturn was also capable of supporting 400 million retrieval requests on its busiest day of the year.
At the end of 2023, Saturn launched a private beta for customers (clients include Solana-based NFT platform Metaplex).
Station: A Deploy Target for Protocols (Enabling Spark Retrieval Checks)
Station, a desktop app for Filecoin, was launched in July 2023. Station is a deployment target for other protocols allowing DePIN networks, DA layers, and others to run on a distributed network of providers.
Station’s first module, Spark, is a protocol for performing retrieval checks on Storage Providers (SPs). Spark helps establish a reputational base for SP data retrievability, and supports teams looking to provide a hot storage cache for Filecoin. Since launch in Nov 2023, Spark has grown to 21 million daily jobs on 10,000 active nodes as of January 2024.
3) Filecoin Virtual Machine: The FVM launch introduced a new class of use cases for the Filecoin Network. Early DeFi adoption broke $200 million in TVL.
The Filecoin Virtual Machine (FVM) launched in March 2023 with the EVM being the first supported VM deployed on top. FVM brought Ethereum-style smart contracts to Filecoin, broadening the slate of services anchoring into Filecoin’s block space. Two areas of early adoption have been in liquid staking services (led by GLIF and other DeFi protocols) and micropayments via the FVM.
Liquid Staking
One of the core economic loops in the Filecoin economy is the process of pledging, where SPs put up collateral to secure capacity and data on the network. Prior to the FVM, borrowed Filecoin collateral was sourced through managed offerings from operators like Darma Capital, Anchorage, and CoinList. Post-FVM, roughly a dozen staking protocols have launched to grow Filecoin’s on-chain capital markets.
In aggregate, FVM-based protocols supply almost 11% (approx. 19 million FIL) of the total locked collateral on the network, giving yield access to token holders and increasing access to capital for hundreds of Filecoin SPs. From Filecoin’s collateral markets alone, the ecosystem has broken past $200 million in TVL.
Payments
Adjacent to the core storage offering on Filecoin, new services are being built that anchor into Filecoin’s block space. As mentioned in the Retrieval Markets section, two notable services (Station and Saturn) have actually started leveraging FVM for payments in 2023.
To date, Station users have completed 161 million jobs with more than 400 addresses receiving FIL rewards. Saturn averaged over 2,000 earning nodes in 2023 with 448,905 FIL disbursed to date.
4) Compute: Traction for Decentralized Compute Networks
Filecoin’s design enables compute networks to run synergistically on Filecoin’s Storage Providers. Sharing hardware with compute networks is also valuable to the Filecoin network: (1) sharing allows Filecoin to offer the cheapest storage by running side-by-side with compute operations, and (2) it brings additional revenue streams into the Filecoin economy.
Two key developments made running compute jobs on Filecoin nodes more practical:
Sealing-as-a-service: Sealing-as-a-service is the process by which Storage Providers (SPs) can outsource production of sealed sectors to third-party marketplaces. This gives SPs greater flexibility in operations and reduces costs of sector production. One marketplace, Web3mine, has thousands of machines participating in its protocol offering cost savings to SPs of up to 70%. On top of the cost savings, the infrastructure built may also eventually benefit SPs by allowing them to leverage their GPUs for synergistic workloads (e.g. compute jobs)
Reduced Onboarding Costs: Supranational shipped proof optimizations reducing sealing server cost by 90% and overall cost of storage by 40%.
On top of these developments, 2023 saw emerging compute protocols building in the Filecoin ecosystem. Two notable examples:
Distributed compute platform Bacalhau demonstrated real-world utility among Web2 and DeSci clients. Most recently, the U.S. Navy chose Bacalhau to assist them in deploying AI capabilities in undersea operations. Bacalhau is a platform agnostic compute platform intended to run on Web3 and Web2 infrastructure alike. Launched in November 2022, Bacalhau’s public network surpassed 1.5 million jobs and in some cases slashed compute costs by up to 99%
Source: Bacalhau
Up-and-coming compute networks like Io.net allow ML engineers to access a distributed network of GPUs at a fraction of the cost of individual cloud providers. Io.net recently incorporated 1,500 GPUs from Filecoin SPs — positioning Filecoin providers to offer their services to Io.net’s customer base. Io.net has attracted over 7,400 users since its launch in November 2023, serving 15,000 hours of compute to users.
2024 will be a critical growth year for Filecoin as groundwork laid in 2023 comes to fruition. Native improvements to storage markets, greater speed of retrievals, new levels of customizability & scalability brought by FVM and Interplanetary Consensus (IPC), all expand the universe of use cases that Filecoin can address.
In a Web3 climate where there is substantial attention on DePIN (and the tying of real world services with Web3 capabilities) these changes will be critical building blocks for even better services. Here are three themes to look for in 2024:
1) Synergies with Compute, AI and other DePIN networks
In 2024, foundational improvements to the network will substantially improve Filecoin’s ability to compose with other ecosystems.
Fast finality allows better cross-network interactions with app chains in other ecosystems (e.g. Cosmos, Ethereum, Solana).
Customizable subnets allow for novel types of networks to form on top of Filecoin such as general purpose compute subnets (e.g. Fluence) and storage pools (e.g. Seal Storage).
Hot storage allows for broader use case support including serving data assets for physical resource networks (e.g. WeatherXM/Tableland), caching data for compute networks (e.g. Bacalhau), and more.
This is all scratching the surface. As the Web3 space and DePIN category grows, Filecoin is well positioned to support new communities that form given its 9 EiB of network capacity and flexibility. There exists a sizable opportunity within physical resource networks producing high amounts of data, such as Hivemapper (over 100M km mapped), and Helium (1 million hotspots globally). Compute networks are also a likely growth area, given the backdrop of a GPU shortage (particularly for AI purposes) in traditional cloud markets.
Source: Messari
2) Focused Growth in Web2 Enterprise Data Storage
Web2 enterprise storage is a unique challenge for decentralized networks – requirements from these customers are not easily supported by most networks. Typical requirements from enterprise clients can include end-to-end encryption, certification for data centers, fast retrievals, access controls, S3 compatibility, and data provenance/compliance. Crucially, these requirements tend to differ across segments and verticals, which means that a level of adaptability is required. Filecoin’s architecture enables it to layer on support for the features these customers need.
A few teams worth keeping an eye on:
Banyan: Banyan simplifies how enterprise clients integrate with decentralized infrastructure by bundling hot storage, end-to-end encryption, and access controls on top of a pool of high-performing storage providers. With the Filecoin network, Banyan provides content-addressable storage, which it plans to complement with hot storage proofs by utilizing FVM. This implementation makes Banyan compatible not only with enterprise clients but also with DePIN and compute networks.
Seal: Seal has established itself as one of the best storage onramps in the Filecoin ecosystem, and is responsible for onboarding several key clients onto the network, such as UC Berkeley, Starling Labs, the Atlas Experiment, and the Casper Network. The team has been one of the driving forces in enterprise adoption to date, and most recently has achieved SOC 2 Compliance. In 2024, they plan on launching a subnet to enable a market for enterprise deals. On the back of their enterprise deal flow, they are positioned to bring petabytes of data into the network over the coming year via their new market.
Steeldome: Steeldome offers enterprise clients seeking data archival, backup and recovery with an alternative that is cost-competitive, efficiently deployed and scalable. It does so by combining Filecoin in its stack with Web2 technologies, allowing a fuller feature set to complement Filecoin’s cost-effective and secure archival storage. The Steeldome team has succeeded in onboarding clients across insurance, manufacturing, and media. In 2024, they plan to continue that trajectory, while offering a managed service for Storage Providers.
3) Greater On-chain DeFi activity
There is likely to be continued activity in the on-chain economy with an increase in the number and type of DeFi protocols on Filecoin.
The first set of protocols will increase the service revenues (from storage to retrievals and compute) coming on-chain. As previously described, more services are coming online in the Filecoin network and are utilizing FVM for payments (e.g. Saturn, Station).
Key releases in 2023, including SushiSwap going live in Nov 2023 and the Uniswap community’s approval of integrating on FVM, will lead to more diverse DeFi services coming on-chain. This will include CDPs (Collateralized Debt Positions) and Price Oracles (e.g. Pyth), among others.
Final Thoughts
In 2024, the Filecoin network will experience greater adoption, particularly by Compute, AI and DePIN networks, as well as targeted enterprise verticals. This adoption brings on-chain service revenue and supports the growth of DeFi activity beyond collateral markets. Continued improvements on storage markets, retrievability driven by hot storage proofs and CDN networks, as well as releases by FVM and IPC will enable the teams building on Filecoin to drive this next stage of growth.
To stay updated on the latest in DePIN and the Filecoin ecosystem, follow the @Filecointldr handle.
This blogpost is co-authored by Savan Chokkalingam and Nathaniel Kok on behalf of FilecoinTLDR. Many thanks to HQ Han and Jonathan Victor for reviewing and providing valuable insights and to all the ecosystem partners and teams for their timely input.
Disclaimer: This information is for informational purposes only and is not intended to constitute investment, financial, legal, or other advice. This information is not an endorsement, offer, or recommendation to use any particular service, product, or application.
Editor’s Note: This article draws heavily from David Aronchick’s presentation at Filecoin Unleashed Paris 2023. David is the CEO of Expanso and former head of Compute-over-Data at Protocol Labs, the group responsible for the launch of the Bacalhau project. This blog post represents the independent view of the creator of the original content, who has given permission for this re-publication.
The world will store more than 175 zettabytes of data by 2025, according to IDC. That’s a lot of data, precisely 175 trillion 1GB USB sticks. Most of this data will be generated between 2020 and 2025, with an estimated compound annual growth of 61%.
The rapidly growing data sphere broadly poses two major challenges today:
Moving data is slow and expensive. If you attempted to download 175 zettabytes at current bandwidth, it would take you roughly 1.8 billion years.
Compliance is hard. There are hundreds of data-related regulations worldwide, which makes compliance across jurisdictions a near-impossible task.
The combined result of poor network growth and regulatory constraints is that nearly 68% of enterprise data is unused. That’s precisely why moving compute resources to where the data is stored (broadly referred to as compute-over-data) rather than moving data to the place of computation becomes all the more important, something which compute-over-data (CoD) platforms like Bacalhau are working on.
In the upcoming sections, we will briefly cover:
How organizations are handling data today
Alternative solutions based on compute-over-data
There are three main ways in which organizations are navigating the challenges of data processing today — none of which are ideal.
Using Centralized Systems
The most common approach is to lean on centralized systems for large-scale data processing. We often see enterprises use a combination of compute frameworks — Apache Spark, Hadoop, Databricks, Kubernetes, Kafka, Ray, and more — forming a network of clustered systems attached to a centralized API server. However, such systems fall short of effectively addressing network irregularities and other regulatory concerns around data mobility.
This is partly responsible for companies coughing up billions of dollars in governance fines and penalties for data breaches.
Building It Themselves
An alternative approach is for developers to build custom orchestration systems that possess the awareness and robustness the organization needs. This is a novel approach, but such systems are often exposed to risks of failure due to an over-reliance on a few individuals to maintain and run the system.
Doing Nothing
Surprisingly, more often than not, organizations do nothing with their data. A single city, for example, may collect several petabytes of data from CCTV recordings a day and only view them on local machines. The city does not archive or process these recordings because of the enormous costs involved.
Building Truly Decentralized Compute
There are 2 main solutions to the data processing pain points.
Solution 1: Build on top of open-source compute-over-data platforms.
Instead of using a custom orchestration system as specified earlier, developers can use an open-source decentralized data platform for computation. Because it is open source and extensible, companies can build just the components they need. This setup caters to multi-cloud, multi-compute, non-data-center scenarios with the ability to navigate complex regulatory landscapes. Importantly, access to open-source communities makes the system less vulnerable to breakdowns as maintenance is no longer dependent on one or a few developers.
Solution 2: Build on top of decentralized data protocols.
With the help of advanced computational projects like Bacalhau and Lilypad, developers can go a step further and build systems not just on top of open-source data platforms as mentioned in Solution 1, but on truly decentralized data protocols like the Filecoin network.
What this means is that organizations can leverage decentralized protocols that understand how to orchestrate and describe user problems in a much more granular way and thereby unlock a universe of compute right next to where data is generated and stored. This switchover from data centers to decentralized protocols can be carried out ideally with very few changes to the data scientists’ experience.
Decentralization is About Maximizing Choices
By deploying on decentralized protocols like the Filecoin network, the vision is that clients can access hundreds (or thousands) of machines spread across geographies on the same network, following the same protocol rules as the rest. This essentially unlocks a sea of options for data scientists as they can request the network to:
Select a dataset from anywhere in the world
Comply with any governance structures, be it HIPAA, GDPR, or FISMA.
Run at the cheapest rates possible
The concept of maximizing choices brings us to what’s called “Juan’s triangle,” a term coined after Protocol Labs’ founder Juan Benet for his explanation of why different use cases will have (in the future) different decentralized compute networks backing them.
Juan’s triangle explains that compute networks often have to trade off between 3 things: privacy, verifiability, and performance. The traditional one-size-fits-all approach for every use case is hard to apply. Rather, the modular nature of decentralized protocols enables different decentralized networks (or sub-networks) that fulfill different user requirements — be it privacy, verifiability, or performance. Eventually, it is up to us to optimize for what we think is important. Many service providers across the spectrum (shown in boxes within the triangle) fill these gaps and make decentralized compute a reality.
In summary, data processing is a complex problem that calls for out-of-the-box solutions. Utilizing open-source compute-over-data platforms as an alternative to traditional centralized systems is a good first step. Ultimately, deploying on decentralized protocols like the Filecoin network unlocks a universe of compute with the freedom to plug and play computational resources based on individual user requirements, something that is crucial in the age of Big Data and AI.
Follow the CoD working group for all the latest updates on decentralized compute platforms. To learn more about recent developments in the Filecoin ecosystem, tune into our blog and follow us on social media at TL;DR, Bacalhau, Lilypad, Expanso, and COD WG.
Editor’s Note: This blogpost is a repost of the original content published on 7 June 2023, by Luffistotle from Zee Prime Capital. Zee Prime Capital is a VC firm investing in programmable assets and early-stage founders globally. They call themselves a totally supercool and chilled VC (we tend to agree) investing in programmable assets, collaborative intelligence and other buzzwords. Luffistotle is an Investor at Zee Prime Capital. This blogpost represents the independent view of the author, who has given permission for this re-publication.
Table of Contents
History of P2P
Decentralized Storage Network Landscape
FVM
Permanent Storage
Web 3’s First Commercial Market
Consequences of Composability
Storage is a critical part of any computing stack. Without this fundamental element, nothing is possible. Through the continued advancement of computational resources, a great deal of excess and underutilized storage has been created. Distributed Storage Networks (DSNs) offer a way to coordinate and utilize these latent resources and turn them into productive assets. These networks have the potential to bring the first real commerce vertical into Web 3 ecosystems.
History of P2P
Peer-to-peer file sharing first hit the mainstream with the advent of Napster. While there were earlier methods of sharing files on the internet, the mainstream finally joined in with the MP3 sharing that Napster enabled. From this initial starting point, the distributed systems world exploded with activity. The centralization within Napster’s model (for indexing) made it easy to shut down given its legal transgressions; however, it laid the foundation for more robust methods of file sharing.
The Gnutella Protocol followed this trailblazing and had many different front-ends leveraging the network in effective ways. As a more decentralized version of the Napster-esque query network, it was much more robust to censorship, though it still experienced censorship in its day. AOL had acquired Nullsoft, the company developing it, and quickly realized the potential, shutting distribution down almost immediately. However, the protocol had already made it outside and was quickly reverse-engineered. Bearshare, Limewire, and Frostwire are likely the most notable of the front-end applications you may have encountered. Where it ultimately failed was its bandwidth requirements (a deeply limited resource at the time) combined with the lack of liveness and content guarantees.
What came next was Bittorrent. This presented a level-up due to the two-sided nature of the protocol and its ability to maintain Distributed Hash Tables (DHTs). DHTs are important because they serve as a decentralized version of a ledger that stores the locations of files and is available for lookup by other participating nodes in the network.
After the advent of Bitcoin and blockchains, people started thinking big about how this novel coordination mechanism could be used to tie together networks of latent unused resources and commodities. What followed soon after was the development of DSNs.
Something that would perhaps surprise many people is that the history of tokens and P2P networks goes back much farther than Bitcoin and blockchains. What pioneers of these networks realized very quickly were the following points:
Monetizing a useful protocol you have built is difficult as a result of forking. Even if you monetize a front end and serve ads or utilize other forms of monetization, a fork will likely undercut you.
Not all usage is created equal. In the case of Gnutella, 70% of users did not share files and 50% of requests were for files hosted by the top 1% of hosts.
Power laws.
How does one remedy these problems? For BitTorrent it was seeding ratios (the ratio of what a peer uploads to what it downloads); for others, it was the introduction of primitive token systems. Most often called credits or points, these were allocated to incentivize good behavior (behavior that promotes the health of the protocol) and stewardship of the network (like regulating content in the form of trust ratings). For a deeper dive into the broader history of all of this, I highly recommend the (now deleted, available via web archive) articles by John Backus.
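A toy sketch of that incentive logic (thresholds invented for illustration, not any specific network’s rules): peers earn standing by uploading and lose it by only downloading.

```python
class CreditLedger:
    """Toy credit/points system: reward uploading, discourage pure free-riding."""
    def __init__(self, min_ratio: float = 0.5):
        self.uploaded = {}     # bytes a peer has served to others
        self.downloaded = {}   # bytes a peer has taken from others
        self.min_ratio = min_ratio

    def record(self, peer: str, up: int = 0, down: int = 0) -> None:
        self.uploaded[peer] = self.uploaded.get(peer, 0) + up
        self.downloaded[peer] = self.downloaded.get(peer, 0) + down

    def ratio(self, peer: str) -> float:
        down = self.downloaded.get(peer, 0)
        return float("inf") if down == 0 else self.uploaded.get(peer, 0) / down

    def may_download(self, peer: str) -> bool:
        # Peers whose upload/download ratio falls below the threshold get throttled.
        return self.ratio(peer) >= self.min_ratio

ledger = CreditLedger()
ledger.record("free-rider", up=10, down=100)
print(ledger.may_download("free-rider"))  # False: ratio 0.1 is below the 0.5 threshold
```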
Interestingly, a DSN was part of the original vision for Ethereum. The “holy trinity”, as it was called, was meant to provide the necessary suite of tools for the world computer to flourish. Legend has it that it was Gavin Wood who conceived of Swarm as the storage layer for Ethereum, with Whisper as the messaging layer.
Mainstream DSNs followed and the rest is history.
Decentralized Storage Network Landscape
The decentralized storage landscape is most interesting because of the huge disparity in size between the leader (Filecoin) and the other, more nascent storage networks. While many people think of the landscape as two giants, Filecoin and Arweave, it would likely surprise most that Arweave is only the 4th largest by usage, behind Storj and Sia (although Sia seems to be declining in usage). And while we can readily question how much of the data stored on Filecoin is legitimate, even if we handicapped it by, say, 90%, Filecoin usage is still roughly 400x that of Arweave.
What can we infer from this?
There is clear dominance in the market right now, but whether it continues depends on the usefulness of these storage resources. The DSNs all use roughly the same architecture: node operators have a bunch of unused storage assets (hard drives), which they can pledge to the network, mining blocks and earning rewards for storing data. While the approaches to pricing and permanence may differ, what matters most will be how easy and affordable retrieval and computation over the stored data is.
Fig 1. Storage Networks by Capacity and Usage
N.B.:
Arweave’s capacity is not directly measurable; instead, node operators are incentivized to keep a sufficient buffer and to increase supply to meet demand. How big is the buffer? Given its immeasurability, we cannot know.
Swarm’s actual network usage is impossible to tell; we can only look at how much storage has already been paid for. Whether it is actually used is unknown.
While this is the table of live projects, there are other DSNs in the works, including EthStorage, MaidSafe, and others.
FVM
Before going further, it is worth noting that Filecoin recently launched the Filecoin Virtual Machine (FVM) and, on top of it, the Filecoin Ethereum Virtual Machine (FEVM). The FVM is a WASM-based VM that can support many other runtimes via a hypervisor; the recently launched FEVM is an Ethereum Virtual Machine runtime running on top of the FVM/FIL network. The reason this is worth highlighting is that it facilitates an explosion of activity concerning smart contracts (i.e. stuff) on top of FIL. Before the March launch there were 11 active smart contracts on FIL; following the FVM launch, this has exploded. It benefits from composability, leveraging all the work done in Solidity to build out new businesses on top of FIL. This means innovations like the quasi-liquid-staking primitives from teams like GLIF, and the various additional financializations of these markets that can be built on top of such a platform. We believe this will accelerate storage providers because of the increase in capital efficiency (SPs need FIL to actively mine/seal storage deals). This differs from typical LSDs, as there is an element of assessing the credit risk of individual storage providers.
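Because the FEVM is EVM-compatible, existing Ethereum tooling can talk to Filecoin over the familiar JSON-RPC interface. A minimal sketch with web3.py, assuming the public Glif RPC endpoint (an assumption on my part; substitute any Filecoin mainnet provider you trust):

```python
from web3 import Web3

# Connect to a Filecoin mainnet RPC endpoint that exposes the Ethereum JSON-RPC API.
# The Glif endpoint below is an assumption; any FEVM-compatible provider should work.
w3 = Web3(Web3.HTTPProvider("https://api.node.glif.io/rpc/v1"))

print("connected:", w3.is_connected())
print("chain id:", w3.eth.chain_id)          # Filecoin mainnet is expected to report 314
print("latest block:", w3.eth.block_number)  # queried with ordinary EVM tooling
```

From here the usual Solidity workflow (compile, deploy, call) applies, which is exactly why the post-FVM explosion of contracts was possible.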
Permanent Storage
I believe Arweave gets the most airtime on this front; it has a flashy tagline that appeals to the deepest desires of Web 3 participants:
Permanent Storage.
But what does this mean? Permanence is an extremely desirable property, but in reality execution is everything, and execution ultimately comes down to sustainability and cost for end users. Arweave’s model is pay once, store forever: roughly 200 years of storage is paid for upfront, with the assumption that the cost of storage keeps deflating. This kind of pricing model works well when the underlying resource keeps getting cheaper, as there is a constant goodwill accrual (i.e. old deals subsidize new deals); the inverse is true in inflationary environments. History tells us this shouldn’t be an issue, as the cost of computer storage has more or less been down-only since inception, but hard drive cost alone is not the whole picture.
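To see why the deflation assumption carries so much weight, here is a toy endowment calculation (my own simplification, not Arweave’s actual mechanism or parameters): a one-time payment sized at 200 years of today’s storage cost lasts essentially forever if costs keep falling, but runs out well short of 200 years if costs rise.

```python
def years_covered(upfront_years: float = 200, annual_cost_change: float = -0.05,
                  horizon: int = 1000) -> int:
    """Years of storage a pay-once endowment can fund under a constant
    annual change in storage cost (negative = deflation). Toy model only."""
    endowment = upfront_years * 1.0   # endowment sized at 200x today's annual cost (1.0)
    cost, years = 1.0, 0
    while endowment >= cost and years < horizon:
        endowment -= cost             # pay this year's storage bill
        cost *= (1 + annual_cost_change)
        years += 1
    return years

print(years_covered(annual_cost_change=-0.05))  # falling costs: endowment outlasts the 1000-year horizon
print(years_covered(annual_cost_change=+0.02))  # rising costs: endowment depletes after ~81 years
```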
Arweave creates permanent storage via the incentives of the Succinct Proofs of Random Access (SPoRA) algorithm, which incentivizes miners to store all the data by requiring them to prove they can produce a randomly selected historical block. Doing so gives them a higher probability of being selected to create the next block (and earn the corresponding rewards).
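A rough simulation of that incentive (heavily simplified; the real protocol involves hash chains, chunk offsets, and difficulty adjustment) shows why rational miners tend toward storing everything: the share of rounds in which a miner is even eligible to mine scales with the fraction of the dataset it actually holds.

```python
import random

TOTAL_CHUNKS = 10_000  # stand-in for the historical dataset

def eligibility_rate(stored_fraction: float, rounds: int = 50_000) -> float:
    """Fraction of mining rounds in which a miner can answer the random
    recall challenge, given how much of the dataset it stores. Toy model."""
    stored = set(random.sample(range(TOTAL_CHUNKS), int(stored_fraction * TOTAL_CHUNKS)))
    wins = sum(1 for _ in range(rounds) if random.randrange(TOTAL_CHUNKS) in stored)
    return wins / rounds

for frac in (0.25, 0.5, 1.0):
    print(f"storing {frac:.0%} of the data -> eligible in ~{eligibility_rate(frac):.0%} of rounds")
```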
While this model does a good job of getting node runners to want to store all of the data, it does not mean it is guaranteed to happen. Even if you set super high redundancy and use conservative heuristics to decide the parameters of the model, you can never get rid of this underlying risk of loss.
Fundamentally, the only way to truly guarantee permanent storage would be to deterministically force somebody (everybody?) to store the data and throw them in the gulag when they screw up. How do you properly incentivize personal responsibility such that you can achieve this? There is nothing wrong with the heuristic approach, but we need to identify the optimal way to achieve and price permanent storage.
All of this is a long-winded way of asking what level of security we deem acceptable for permanent storage; once that is settled, we can think about pricing it over a given time frame. In reality, consumer preferences will fall along a spectrum of replication (permanence), and thus consumers should be able to decide what this level is and receive the corresponding pricing.
In traditional investing literature and research, it is well known how the benefits of diversification affect the overall risk of a portfolio. While the first few stocks you add bring meaningful risk reduction, the diversification benefit of each additional stock very quickly becomes negligible.
I believe the pricing of storage over and above some default standard of replication on a DSN should follow a similar curve: the cost and security of the storage rise with the amount of replication, but with diminishing returns.
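A back-of-the-envelope calculation makes the diminishing-returns shape obvious: if each replica is lost independently with probability p over some period, the chance of losing every copy is p^k, so each extra replica buys a geometrically smaller improvement. The 5% figure below is an illustrative assumption, not a measurement from any DSN.

```python
# Probability that all k independent replicas are lost, for an assumed
# per-replica loss probability p. Purely illustrative numbers.
p = 0.05
for k in range(1, 9):
    print(f"{k} replica(s) -> P(total loss) ~ {p**k:.2e}")
```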
For the future, I am most excited about what more DSNs with easily accessible smart contracting can bring to the market for permanent storage. I think overall consumers will benefit the most from this as the market opens up this spectrum of permanence.
For instance, in the chart above we can think of the area in green as the area of experimentation. It may be possible to achieve exponential decreases in the cost of that storage with minimal changes to the number of replications and level of permanence.
Additional ways of constructing permanence could come from replication across different storage networks rather than within a single network. These routes are more ambitious but naturally lead to more differentiated levels of permanence. The biggest question here is whether there is some kind of “permanence free lunch” we could achieve by spreading data across DSNs, in the same way we diversify market risk across a portfolio of publicly traded equities.
The answer could be yes, but it depends on node-provider overlap and other complex factors. It could also be constructed via forms of insurance, possibly with node runners subjecting themselves to stricter slashing conditions in exchange for these assurances. Maintaining such a system would also be extremely complex, as multiple codebases and coordination between them are required. Nonetheless, we look forward to this design space expanding significantly and forwarding the general idea of permanent storage for our industry.
Web 3’s First Commercial Market
Matti tweeted recently about the promise of storage as the use case to bring Web 3 some real commerce. I believe this is likely.
I was recently having a conversation with a team from a layer one where I told them that, as stewards of the L1, it is their moral imperative to fill their blockspace, and more than that, to fill it with economic activity. The industry often forgets the second part of its name.
The whole currency part.
Any protocol that launches a token and would not like it to be down-only is implicitly asking for some kind of economic activity to be conducted in that currency. For layer 1s it is their native token: they process payments (execute computation) and charge a gas fee for doing so. The more economic activity happening, the more gas is used, and the more demand for their token. This is the crypto-economic model. For other protocols, it is likely some kind of middleware SaaS service.
What makes this model most interesting is when it is paired with some kind of commercial good; in the case of classical L1s, that good is computation. The problem is that, for something like financial transactions, having variable pricing on execution is horrible UX. The cost of execution should be the least important part of a financial transaction such as a swap.
What becomes difficult is filling this blockspace with economic activity in the face of this bad UX. While scaling solutions are on the way that will help stabilize this (I highly recommend the whitepaper on Interplanetary Consensus; warning: PDF), the flooded market of layer 1s makes it difficult to find enough activity for any given one.
This problem is much more addressable when you pair this computational capacity with some kind of additional commercial good. In the case of DSNs, this is storage. The economic activity of data being stored, along with related elements such as the financing and securitization of storage providers, is an immediate filler.
But this storage also needs to be a functional solution for traditional businesses to use, particularly those that deal with regulations around how their data is stored. These requirements most commonly come in the form of auditing standards, geographical restrictions, and a UX simple enough to actually use.
We’ve discussed Banyan before in our Middleware Thesis part 2, but their product is a fantastic step in the right direction on this front: they work with node operators in a DSN to secure SOC certifications for the storage being provided, while offering a simple UX to facilitate the upload of your files.
But this alone is not enough.
The content stored also needs to be easily accessible with efficient retrieval markets. One thing we are very excited about at Zee Prime is the promise of creating a Content Distribution Network (CDN) on top of a DSN. A CDN is a tool to cache content close to the users and deliver improved latency when retrieving the content.
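The core of the CDN idea is just a cache sitting between users and the slower storage network. A minimal sketch (the fetch function is a stand-in, not a client for any particular DSN):

```python
import time
from functools import lru_cache

def fetch_from_dsn(content_id: str) -> bytes:
    """Stand-in for retrieving content from a storage network (slow, far away)."""
    time.sleep(0.5)  # simulate retrieval latency from a remote storage provider
    return f"content-for-{content_id}".encode()

@lru_cache(maxsize=1024)  # edge cache: recently requested content stays local
def cdn_get(content_id: str) -> bytes:
    return fetch_from_dsn(content_id)

t0 = time.time(); cdn_get("popular-video"); print(f"cold read: {time.time() - t0:.2f}s")
t0 = time.time(); cdn_get("popular-video"); print(f"warm read: {time.time() - t0:.4f}s")
```

Efficient indexing and retrieval markets are the complementary piece: a cache can only serve content quickly if the network can first tell it where to fetch from.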
We believe this is the next critical component to making DSNs widely adopted, as it allows videos to load quickly (think building a decentralized Netflix, YouTube, TikTok, etc.). One proxy to this space is our portfolio company Glitter, which focuses on indexing DSNs. This matters because indexing is a critical piece of infrastructure for improving the efficiency of retrieval markets and facilitating these more exciting use cases.
The potential for these kinds of products excites us as they have demonstrated PMF with high demand in Web 2. Despite this adoption, many face frictions that could benefit from leveraging the permissionless nature of Web 3 solutions.
Consequences of Composability
Interestingly, we think some of the best alpha on DSNs is hiding in plain sight. In these two pieces by Jnthnvctr (on the Filecoin side), he shares some great ideas on how these markets will evolve and the products that will come.
One of the most interesting takeaways is the potential for pairing off-chain computation with storage and on-chain computation. This follows from the natural computational needs of providing storage resources in the first place. Such a pairing can add to the commercial activity flowing through the DSN while opening up new use cases.
The launch of the FEVM makes many of these level-ups possible and will make the storage space much more interesting and competitive. For founders searching for new products to build, there is even a list of products Protocol Labs is requesting people to build, with potential grants available.
In Web 2 we learned that data has a kind of gravitational pull: companies that collected or created a lot of it could reap the rewards and were accordingly incentivized to wall it off to protect that advantage.
If our dreams of user-controlled data solutions become mainstream, we can ask ourselves how the point at which this value accrues changes. Users become the primary beneficiaries, receiving cash flows in exchange for their data, and no doubt the monetization tools that unlock this potential also benefit, but where and how this data is stored and accessed also changes dramatically. Naturally, this kind of data can sit on DSNs, which benefit from its usage via robust query markets. This is a shift from exploitation toward flow.
What comes after could be extremely exciting.
When we think about the future of decentralized storage, it is fun to consider how it might interact with operating systems of the future like Urbit. For those unfamiliar, Urbit is a kind of personal server built with open-source software that allows you to participate in a peer-to-peer network: a true decentralized OS for self-hosting and interacting with the internet in a P2P way.
If the future plays out the way Urbit maximalists might hope, decentralized storage solutions undoubtedly become a critical piece of the individual stack. One can easily imagine hosting all of one’s user-related data, encrypted, on one of the DSNs and coordinating actions via an Urbit OS. Beyond this, we could expect further integrations between Urbit and the rest of Web 3, especially with projects such as Uqbar Network, which brings smart contracting to your Nook environment.
These are the consequences of composability: the slow burn continues to build up exponentially until it delivers something really exciting. What feels like thumb-twiddling becomes a revolution, an alternative path towards existing in a hyper-connected world. While Urbit might not be the end solution on this front (it has its criticisms), what it does show us is how these pieces can come together to open up a new river of exploration.