This research takes a deliberate approach, driven by a specific set of questions, to investigate how ETH is distributed and how that distribution changes. The study is structured in two key parts. First, we examine the current landscape of ETH holdings, providing a snapshot of how ETH is presently allocated across different accounts. Second, a time-based analysis tracks how ETH distribution evolves, taking into account market conditions and the impacts of major events like “the Merge” and the Shanghai upgrade.

Gathering and Preparing Data

To ensure a comprehensive view, this study pulls data from multiple sources to build a complete picture of ETH supply distribution. We primarily rely on the following three sources:

Historical Data (Santiment, Dune)

We gathered past data on ETH supply distribution from Santiment, a platform specializing in cryptocurrency market intelligence. This dataset covers the entire history of the Ethereum blockchain, from July 2015 to December 2024, with records taken daily. It includes wallet information, such as ETH balances and the number of active addresses, along with corresponding price data.

The data is grouped into categories based on ETH balance sizes, starting with very small amounts and increasing tenfold with each category. These range from wallets holding 0-0.001 ETH to those holding billions of ETH. For each balance range, the data shows what percentage of the total ETH supply is held within those wallets. We consolidate the smallest and largest balance categories following a method described by Urquhart (2022) (see Fig. 4).
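As an illustration of the tenfold bucketing described above, the following minimal sketch assigns balances to buckets and computes each bucket's share of total supply. The bucket edges, the function name `supply_share_by_bucket`, and the toy balances are assumptions made for this example; they are not Santiment's schema or data, and the consolidation of the smallest and largest categories is not modeled.

```python
from bisect import bisect_right

# Assumed bucket edges in ETH: [0, 0.001), [0.001, 0.01), ..., each ten times the last.
# The final bucket (index len(EDGES)) is open-ended.
EDGES = [10.0 ** k for k in range(-3, 7)]


def supply_share_by_bucket(balances):
    """Return the share of total supply held in each tenfold balance bucket."""
    totals = [0.0] * (len(EDGES) + 1)
    for balance in balances:
        # bisect_right counts the edges <= balance, so a balance that sits
        # exactly on an edge falls into the higher bucket.
        totals[bisect_right(EDGES, balance)] += balance
    supply = sum(totals)
    return [t / supply for t in totals]


# Toy balances in ETH, not real wallets.
shares = supply_share_by_bucket([0.0005, 0.5, 0.5, 9.0, 90.0])
```

In this toy example almost 90% of the supply sits in the single wallet of the 10 to 100 ETH bucket, which is the kind of pattern the percentage-per-bucket view is meant to expose.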

Furthermore, we extract daily ETH holdings for specific types of wallets: wrapped ETH (WETH) smart contracts, centralized exchanges (CEXs), decentralized exchanges (DEXs), bridge protocols, and lending platforms. Because native ETH does not itself conform to the ERC-20 token standard that many decentralized finance (DeFi) smart contracts expect, it is converted into the ERC-20 format through a process called “wrapping.” This locks the ETH into a WETH contract, making it usable within the DeFi ecosystem. Bridge protocols lock ETH on Ethereum and create equivalent tokens on other blockchains, enabling transfers between different blockchain networks. From Dune Analytics, a blockchain analysis platform, we gather data on the amount of ETH being staked on the Beacon Chain and subsequently withdrawn.

Individual Account Data (Google BigQuery)

For a detailed understanding of ETH ownership, we access datasets on Google BigQuery, which provides access to extensive Ethereum blockchain data. We use this data to calculate the ETH balance for over 98 million unique Ethereum addresses. This high level of detail allows us to analyze ETH distribution among holders and understand overall supply patterns. Our approach involves accessing the “crypto_ethereum” dataset within Google BigQuery, which contains tables detailing transaction traces, blocks, and individual transactions. We use a Structured Query Language (SQL) query to extract and process this information in stages. First, we gather the necessary information from the trace, block, and transaction tables, making sure to capture all relevant transaction details. We then filter the data to include only confirmed transactions. Next, we create a double-entry bookkeeping model to track the flow of ETH into and out of each address. We track all transactions where addresses either send or receive ETH, also accounting for transaction fees by using data from the block and transaction tables. This bookkeeping system allows us to aggregate transaction values and calculate the net balance for each address in monthly intervals, up to February 2024. For all subsequent analyses, we only include wallets with a balance of at least 0.0001 ETH, in order to focus on active wallets. This cutoff is essential because, on the blockchain, wallets can be created but not deleted; including all wallets would skew the data, as many are inactive or empty. This leaves us with a dataset of 92 million wallets, holding approximately 99.99% of all ETH.
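The double-entry idea can be sketched in a few lines. This is an illustrative Python sketch under stated assumptions, not the SQL query used in the study: the tuple layouts, the function names `net_balances` and `active_wallets`, and the toy records are hypothetical, and the whole fee is credited to the block producer, which ignores the base-fee burn introduced later in Ethereum's history.

```python
from collections import defaultdict

WEI_PER_ETH = 10 ** 18
MIN_BALANCE_WEI = 10 ** 14  # 0.0001 ETH, the cutoff stated in the text


def net_balances(transfers, fees):
    """Aggregate debits and credits into a net balance (in wei) per address.

    transfers: iterable of (sender, receiver, value_wei); sender is None for
               newly issued ETH such as block rewards (an assumed convention).
    fees:      iterable of (payer, recipient, fee_wei).
    """
    ledger = defaultdict(int)
    for sender, receiver, value in transfers:
        if sender is not None:
            ledger[sender] -= value      # debit the sending address
        if receiver is not None:
            ledger[receiver] += value    # credit the receiving address
    for payer, recipient, fee in fees:
        ledger[payer] -= fee             # fees leave the payer ...
        ledger[recipient] += fee         # ... and, by assumption, reach the producer in full
    return dict(ledger)


def active_wallets(ledger):
    """Keep only addresses holding at least the minimum balance."""
    return {addr: bal for addr, bal in ledger.items() if bal >= MIN_BALANCE_WEI}


# Toy records, not real chain data.
transfers = [
    (None, "miner", 2 * WEI_PER_ETH),
    ("miner", "alice", WEI_PER_ETH),
    ("alice", "bob", 10 ** 13),
]
fees = [("alice", "miner", 21_000 * 10 ** 9)]
balances = net_balances(transfers, fees)
active = active_wallets(balances)
```

Working in integer wei avoids floating-point rounding when many small entries are summed; in the toy data the address `bob` ends below the cutoff and is dropped.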

Identifying Entities (WalletLabels, Etherscan)

We use labels from WalletLabels.xyz and Etherscan.io to distinguish between different participants on the Ethereum network. These labels allow us to categorize addresses, assigning them types such as “smart contracts” and “exchange wallets,” and identifying specific entities like “Binance” where possible. This provides labels for a total of over 42 million addresses.

Data Integration and Analysis

The next step is to combine the preprocessed data from all of the sources to create a unified dataset.

We incorporate the entity labels (WalletLabels, Etherscan) into the individual account data (Google BigQuery) to add context and depth to our analysis. This allows us to differentiate between different kinds of addresses and understand their relationships with entities like exchanges. Of our 92 million addresses, 10.88 million are assigned a label. Table 1 shows the frequency of the top 15 label types.

Table 1 Frequency of Label Types. This table shows the percentage of wallets with each label type.

Analysis Framework

Status Quo Assessment

When analyzing the statistical distribution of ETH, we excluded wallets with labels. As shown in Table 1, labeled wallets typically represent entities or mechanisms (like smart contracts, exchanges, or liquidity pools) rather than individual users. Focusing on unlabeled wallets gives a more accurate picture of ETH distribution among ordinary users. Because blockchain systems are pseudonymous, we cannot definitively link wallets to specific individuals. Therefore, all our distribution analyses describe how ETH is distributed across wallets, not necessarily across users. However, the true distribution of ETH ownership is likely more decentralized than our findings suggest. Smaller retail investors often use centralized services and exchanges, where their funds are pooled into large deposit wallets, so many large wallets may represent the combined holdings of multiple individuals rather than a single entity. Wealthier users, by contrast, tend to prioritize security and use self-hosted or hardware wallets (Nadler and Schär 2020).

We conducted a statistical analysis on Ethereum wallet balances, dividing the wallets into two groups: the bottom 99% and the top 1%. Separating wallets into these two segments is a common practice in studies of wealth distribution, allowing us to account for the distinct statistical properties of the main body and the extreme tail of the data (Clementi and Gallegati 2005). Wealth distributions often have “heavy tails” described by the Pareto principle, making it important to evaluate the bulk and tail of the distribution separately.

The dataset was sorted from lowest to highest based on wallet balance, and then divided into the top 1% (the wealthiest wallets) and the bottom 99%.

We applied all analyses independently to both groups. First, we fitted three statistical distributions – Pareto, log-normal, and Weibull minimum – to the data. These distributions were chosen because they are frequently used to model income and wealth data (Hlasny 2021).

The parameters for each distribution were estimated using the maximum likelihood estimation (MLE) method, and the quality of fit was assessed using log likelihood, the Akaike information criterion (AIC), and the Kolmogorov–Smirnov (KS) test (Goldstein et al. 2004). We also visually inspected the best-fitting distribution for each segment, plotting the empirical cumulative distribution function (CDF) alongside the CDFs of the fitted distributions and quantile-quantile (QQ) plots, which compare the quantiles of the data to the theoretical quantiles of the fitted distributions.
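To make the fitting and comparison steps concrete, the sketch below fits two of the three candidate distributions by MLE and compares them by log likelihood, AIC, and the KS statistic. It is a simplified illustration under stated assumptions: the Pareto scale is fixed at the sample minimum, only the KS statistic (not its p value) is computed, the Weibull fit is omitted because its MLE requires numerical optimization, and the sample is synthetic rather than wallet data. All function names are hypothetical.

```python
import math
import random


def fit_pareto(xs):
    """Closed-form MLE for a Pareto law; the scale is fixed at the sample minimum."""
    n, x_min = len(xs), min(xs)
    alpha = n / sum(math.log(x / x_min) for x in xs)
    loglik = (n * math.log(alpha) + n * alpha * math.log(x_min)
              - (alpha + 1) * sum(math.log(x) for x in xs))

    def cdf(x):
        return 1.0 - (x_min / x) ** alpha if x >= x_min else 0.0

    return {"params": (alpha, x_min), "loglik": loglik, "cdf": cdf}


def fit_lognormal(xs):
    """Closed-form MLE for a log-normal law (mean and std of log balances)."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    loglik = -sum(logs) - n * math.log(sigma) - 0.5 * n * math.log(2 * math.pi) - 0.5 * n

    def cdf(x):
        return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2))))

    return {"params": (mu, sigma), "loglik": loglik, "cdf": cdf}


def aic(fit):
    """AIC = 2k - 2 ln L, with k the number of estimated parameters."""
    return 2 * len(fit["params"]) - 2 * fit["loglik"]


def ks_statistic(xs, cdf):
    """Largest gap between the empirical CDF and a fitted CDF."""
    ordered = sorted(xs)
    n = len(ordered)
    return max(max(cdf(x) - i / n, (i + 1) / n - cdf(x)) for i, x in enumerate(ordered))


# Synthetic heavy-tailed sample standing in for wallet balances.
random.seed(0)
sample = [random.paretovariate(1.5) for _ in range(2000)]
pareto, lognormal = fit_pareto(sample), fit_lognormal(sample)
```

On this synthetic sample the Pareto fit should show the lower AIC and the smaller KS statistic, which is the pattern one looks for when ranking candidate distributions for a given segment.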

These plots allowed us to visually evaluate how well the distributions matched the characteristics of the data. Visual inspection is often recommended as it can reveal issues that statistical metrics might miss, especially since empirical data rarely perfectly conform to parametric distributions (Hlasny 2021).

Tracking Changes Over Time

This section investigates the changes in ETH distributions over time. It includes a visual analysis of how wallets cluster together or diverge during bull and bear markets, and examines how staking and the withdrawals enabled by the Shanghai upgrade influence these patterns.

We performed exploratory data analysis to examine raw blockchain data and identify underlying trends, anomalies, and relationships. This included visualizing trends over time, identifying significant events, and summarizing the data to establish a visual and quantitative foundation for further analysis. We used various graphical and statistical tools to reveal how ETH holdings and dynamics vary across different wallet types.

Several metrics are used to measure concentration. The Gini coefficient is a common measure of inequality, but its sensitivity to changes at the lower end of the distribution makes it less suitable for assessing risks in blockchain ecosystems. In traditional wealth distribution studies, the Gini coefficient is useful because it considers the entire spectrum of holdings, including both poverty and extreme wealth. However, in blockchain ecosystems, the primary concern is the concentration of funds at the top. The Gini coefficient’s focus on minor holdings reduces its relevance in this context. This problem is compounded by the continuous growth in the number of blockchain addresses, as new addresses can be created freely. Many of these addresses hold negligible amounts or are simply used as intermediaries for transactions. This artificially skews the Gini coefficient and overemphasizes inequality at the lower end of the distribution.

Moreover, there are other issues associated with relying solely on the Gini coefficient, which is why more comprehensive measures are needed to address inconsistencies in inequality distribution measurement (Blesch et al. 2022; Shen and Dai 2024). To maintain comparability with other studies, we continue to use the Gini index where appropriate, but we also use the Herfindahl–Hirschman index (HHI). Unlike the Gini coefficient, the HHI sums the squared shares of all entities, so the largest holders dominate its value and top-heavy distributions are emphasized. This makes it less sensitive to the number of negligible addresses and better reflects the risks associated with large concentrations of funds in a small number of wallets. The HHI has a long history of use in economics for measuring market concentration and is endorsed by antitrust authorities because it can assess the potential for monopolistic or oligopolistic control (Carlton 2010).

To track how the concentration of ETH holdings changes over time, we calculated the HHI for all unlabeled Ethereum wallets holding more than 0.0001 ETH. This threshold helped filter out empty or near-empty addresses that do not significantly affect concentration. In the context of decentralization and wealth distribution, the HHI highlights the risk of excessive centralization, which is a critical issue in Proof-of-Stake (PoS) systems where a small number of large stakeholders can exert disproportionate influence or threaten network security. We define the HHI as follows:

$${\text{HHI}}\,=\mathop{\sum }\limits_{i=1}^{N}{s}_{i}^{2},$$

(1)

where \({s}_{i}\) is the share of the i-th wallet relative to the total ETH holdings. In blockchain networks, a higher HHI indicates that a small number of wallets hold a large portion of the tokens, which can compromise the security and fairness of the PoS mechanism. Conversely, a lower HHI indicates a more even distribution.
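A minimal sketch of Eq. (1) on toy balances illustrates why the index reacts to large holders and barely to near-empty addresses. The function name `hhi` and all balances are hypothetical.

```python
def hhi(balances):
    """Herfindahl-Hirschman index: sum of squared shares of total holdings."""
    total = sum(balances)
    return sum((b / total) ** 2 for b in balances)


# Toy balances in ETH, not real wallets.
equal = hhi([10.0] * 4)                                       # four equal holders
concentrated = hhi([97.0, 1.0, 1.0, 1.0])                     # one dominant holder
dusty = hhi([97.0, 1.0, 1.0, 1.0] + [0.0001] * 1000)          # same, plus 1000 dust wallets
```

With four equal holders the index is 1/4, the minimum for four wallets. Adding a thousand dust wallets to the concentrated case changes the value only marginally, in line with the argument that the HHI is robust to freely created addresses.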

Correctly accounting for specific addresses is crucial for the analysis. For example, the Beacon Chain deposit contract functions as a one-way bridge. ETH deposited into the contract remains there, but when withdrawals occur, the withdrawn ETH is credited directly to the user’s account, without an outgoing transaction from the deposit contract. Therefore, simply measuring the total amount of ETH in the contract would be misleading; it is essential to subtract withdrawn ETH from the contract’s balance to avoid misinterpreting inequality trends.

In addition to these longitudinal measures of concentration, we analyzed market phases and distributions, recognizing that market conditions significantly impact wealth dynamics (Chiarella et al. 2006).

We also used causal inference analysis via the Peter and Clark Momentary Conditional Independence (PCMCI) method to examine ETH flows between different wallet groups. PCMCI helps us to identify temporal dependencies and directional relationships, providing insights into how ETH moves between DEXs, CEXs, staked ETH, and other wallet categories. Specifically, we analyzed changes in ETH holdings around significant protocol events, such as the transition to PoS and the Shanghai upgrade.

Finally, we examined concentration at the consensus layer, which poses the greatest risk to Ethereum, as a single entity controlling a critical threshold of stake could compromise the network. To this end, we measured the Gini coefficient over time for staked ETH for each major staking entity. Unlike the wallet-level analysis, the consensus layer does not suffer from address inflation, as the staking entities are well-defined and limited. We fully accounted for each entity’s holdings, making the Gini coefficient an appropriate measure for assessing inequality in this context.

We computed the Gini index G (Dorfman 1979) as follows:

$$G=\frac{\mathop{\sum }\nolimits_{i = 1}^{n}\mathop{\sum }\nolimits_{j = 1}^{n}| {x}_{i}-{x}_{j}| }{2{n}^{2}\mu }$$

(2)

where n is the number of entities; \({x}_{i}\) and \({x}_{j}\) are the balances of the i-th and j-th entities, respectively; and μ is the mean balance.

The Gini index ranges from 0 (perfect equality, where each staking entity has the same amount of ETH) to 1 (perfect inequality, where one entity owns all the ETH). Tracking this measure over time highlights how staking power is concentrated among dominant entities.
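Eq. (2) can be evaluated directly for a small set of staking entities. The sketch below is a literal, quadratic-time transcription intended for illustration only; the function name `gini` and the stake values are hypothetical.

```python
def gini(balances):
    """Gini index as the mean absolute difference over all pairs, scaled by 2 * mean."""
    n = len(balances)
    mu = sum(balances) / n
    abs_diff = sum(abs(xi - xj) for xi in balances for xj in balances)
    return abs_diff / (2 * n * n * mu)


# Toy staked balances, not real entities.
equal_stake = gini([5.0, 5.0, 5.0, 5.0])
single_staker = gini([0.0, 0.0, 0.0, 100.0])
```

Note that with Eq. (2) as written, a single entity holding everything among n entities yields (n - 1)/n rather than exactly 1 (0.75 in the four-entity example), so the upper bound of 1 is approached only as n grows.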

Finally, we interpreted these statistical results within the context of Ethereum’s overall ecosystem. Beyond analyzing wallet cluster balances during different market cycles and key events, we incorporated insights from developments in the Ethereum protocol, relevant news articles, and opinions from experienced DeFi participants. This provided a comprehensive view of the patterns and dynamics within Ethereum.

Causal Inference Analysis using PCMCI

To analyze how ETH moves among key wallet groups in Ethereum, we conducted a causal inference analysis using a flow matrix and the PCMCI algorithm to discover causal relationships. This method is well-suited for complex time-series data and identifies causal links by testing conditional independencies (Runge et al. 2019).

We used the mean flow matrix to examine the directional flows of ETH between DeFi (Lending Protocols, DEXs, Bridges, and WETH), staked ETH, CEXs, smaller wallets (<100 ETH), and larger wallets (>100 ETH). This method quantified ETH movement dynamics, highlighting which categories dominated flows and how these dynamics evolved over time: (1) before the Beacon Chain launch, (2) after the Beacon Chain launch but before the Shanghai upgrade, and (3) after the Shanghai upgrade.

The ETH balance time-series data were divided into three periods based on the Beacon Chain launch (December 1, 2020) and the Shanghai upgrade (April 12, 2023).

The weekly net changes in balances for each category were computed as follows:

$$\Delta {B}_{i}={B}_{i}^{t}-{B}_{i}^{t-1},$$

(3)

where \({B}_{i}^{t}\) is the balance of category i at time t. These net changes were used to reconstruct the flow matrix for each week by solving the optimization problem below.

Let us minimize the total sum of flows as follows:

$${\text{Minimize}}\,\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j=1}^{n}{F}_{ij},$$

(4)

where \({F}_{ij}\) is the flow from category i to category j, and n is the total number of categories.

The inflows and outflows for each category must satisfy the observed net change as follows:

$$\mathop{\sum }\limits_{j=1}^{n}{F}_{ji}-\mathop{\sum }\limits_{j=1}^{n}{F}_{ij}=\Delta {B}_{i},\quad \forall i\in \{1,\ldots ,n\}.$$

(5)

The flows are constrained to be nonnegative as follows:

$${F}_{ij}\ge 0,\quad \forall i,j.$$

(6)

We used linear programming to solve the optimization problem.
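The structure of the problem in Eqs. (3) to (6) can be illustrated without a solver. The sketch below is not the linear-programming implementation used in the study: it swaps in a greedy matching of shrinking categories to growing ones, which for this particular objective (every flow has unit cost) reaches the same minimum total flow, namely the sum of the positive net changes. It assumes that the net changes sum to zero across categories, since otherwise Eq. (5) has no solution. Function names and balances are hypothetical.

```python
def net_changes(prev, curr):
    """Eq. (3): change in balance per category between two consecutive weeks."""
    return [c - p for p, c in zip(prev, curr)]


def reconstruct_flows(delta, tol=1e-9):
    """Build a nonnegative flow matrix whose inflow minus outflow equals delta."""
    if abs(sum(delta)) > tol:
        raise ValueError("net changes must sum to zero for Eq. (5) to be feasible")
    n = len(delta)
    flows = [[0.0] * n for _ in range(n)]
    sources = [[i, -d] for i, d in enumerate(delta) if d < -tol]  # categories losing ETH
    sinks = [[j, d] for j, d in enumerate(delta) if d > tol]      # categories gaining ETH
    s = k = 0
    while s < len(sources) and k < len(sinks):
        moved = min(sources[s][1], sinks[k][1])
        flows[sources[s][0]][sinks[k][0]] += moved
        sources[s][1] -= moved
        sinks[k][1] -= moved
        if sources[s][1] <= tol:
            s += 1  # this source is exhausted
        if sinks[k][1] <= tol:
            k += 1  # this sink is filled
    return flows


# Toy weekly balances for four categories, not real data.
delta = net_changes([100, 50, 30, 20], [90, 45, 40, 25])
flows = reconstruct_flows(delta)
total_flow = sum(sum(row) for row in flows)
```

The minimizer is generally not unique: several matrices can satisfy Eqs. (5) and (6) with the same total flow, so a linear-programming solver may return a different matrix than this greedy matching.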

The reconstructed flow matrices for each week were averaged across each period to compute the mean flow matrix as follows:

$${\bar{F}}_{ij}=\frac{1}{T}\mathop{\sum }\limits_{t=1}^{T}{F}_{ij}^{t},$$

(7)

where T is the total number of weeks in the period. The resulting mean flow matrix provided a concise representation of ETH movement dynamics for each phase.
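Eq. (7) amounts to an element-wise average over the weekly matrices of a period, as the following minimal sketch shows. The function name `mean_flow_matrix` and the two toy matrices are hypothetical.

```python
def mean_flow_matrix(weekly_flows):
    """Eq. (7): element-wise average of T weekly n-by-n flow matrices."""
    T = len(weekly_flows)
    n = len(weekly_flows[0])
    return [[sum(F[i][j] for F in weekly_flows) / T for j in range(n)]
            for i in range(n)]


# Two toy weeks with two categories: 10 ETH moves 0 -> 1, then 4 ETH moves 1 -> 0.
week1 = [[0.0, 10.0], [0.0, 0.0]]
week2 = [[0.0, 0.0], [4.0, 0.0]]
mean_flows = mean_flow_matrix([week1, week2])
```

Because each weekly flow is nonnegative, opposing movements in different weeks do not cancel in the average: the toy result keeps a mean flow of 5 from category 0 to 1 and of 2 in the reverse direction.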

The PCMCI analysis used the same time series data as the flow matrix analysis. We made the data stationary by differencing it. An augmented Dickey–Fuller (ADF) test confirmed stationarity. Since we believed the Shanghai upgrade was a significant event, we tested the data for structural breaks at that date using the Chow test. We then split the dataset into two periods – before and after the Shanghai upgrade on April 12, 2023 – to assess changes in causal structures.
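The differencing step and the logic of the Chow test can be sketched as follows. This is an illustration under assumptions, not the study's code: the regression inside the Chow test is taken to be a simple linear trend (the regressors actually used are not stated), only the F statistic is computed without a p value, the ADF test is omitted because it is normally taken from a statistics library, and the series is synthetic. All function names are hypothetical.

```python
import random


def difference(series):
    """First differences: x_t - x_{t-1}."""
    return [b - a for a, b in zip(series, series[1:])]


def rss_linear(ys):
    """Residual sum of squares of an OLS fit y = a + b * t, with t = 0, 1, ..."""
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return sum((y - intercept - slope * t) ** 2 for t, y in enumerate(ys))


def chow_f(ys, break_index, k=2):
    """Chow F statistic for a break at break_index; k is the number of regression parameters."""
    rss_pooled = rss_linear(ys)
    rss_split = rss_linear(ys[:break_index]) + rss_linear(ys[break_index:])
    dof = len(ys) - 2 * k
    return ((rss_pooled - rss_split) / k) / (rss_split / dof)


# Synthetic balance series whose weekly change shifts from about 0 to about 5 at week 60.
random.seed(1)
steps = [random.gauss(0, 1) for _ in range(60)] + [random.gauss(5, 1) for _ in range(60)]
levels = [1000.0]
for step in steps:
    levels.append(levels[-1] + step)
changes = difference(levels)
f_break = chow_f(changes, 60)
```

A large F statistic indicates that fitting the two sub-periods separately reduces the residual error far more than chance would explain, which supports splitting the sample at the candidate date.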

PCMCI operates in two main steps:

  1. The PC algorithm identifies potential causal relationships by testing conditional independencies among variables with lagged dependencies up to a defined maximum lag.

  2. The momentary conditional independence (MCI) algorithm refines these relationships by estimating partial correlations and p values, providing measures of causal strength and significance.
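As a hedged restatement in the notation of Runge et al. (2019), and not a description of implementation details specific to this study, the MCI step tests for each candidate link from variable i at lag τ to variable j whether

$$X^{i}_{t-\tau}\;\perp\!\!\!\perp\;X^{j}_{t}\;\mid\;\hat{\mathcal{P}}(X^{j}_{t})\setminus \{X^{i}_{t-\tau}\},\;\hat{\mathcal{P}}(X^{i}_{t-\tau}),$$

where \(\hat{\mathcal{P}}(\cdot )\) denotes the set of lagged parents selected for a variable in the PC step. A link is retained when this conditional independence is rejected at the chosen significance level; conditioning on the parents of both the source and the target is what distinguishes the MCI step from the PC step.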

We used the Gaussian process distance correlation (GPDC) test for conditional independence, which is particularly effective for handling nonlinear dependencies (Runge 2018). We set the maximum lag to 7 for the PCMCI algorithm because the correlation diminished beyond that point. The analysis was conducted with an alpha level of 0.01.

The causal relationships identified by PCMCI were visualized as directed graphs. Each node represents a variable, and edges indicate significant causal links. Edge weights show the strength of the causal effect, as measured by the MCI algorithm.

Contextual Interpretation

We interpreted the findings within the broader Ethereum ecosystem, taking into account market trends, participant behaviors, and protocol-level changes like the transition to PoS and the Shanghai upgrade. This contextual layer integrates external insights, including relevant research and expert opinions, to provide a comprehensive understanding of ETH distribution and dynamics.
