This is the source code used in the research paper:

An Empirical Analysis of Anonymity in Zcash. George Kappos, Haaroon Yousaf, Mary Maller, Sarah Meiklejohn. 27th USENIX Security Symposium, 2018. https://arxiv.org/abs/1805.03180

All authors are supported by the EU H2020 TITANIUM project under grant agreement number 740558.

Please read this README.md from start to finish before attempting the analysis.
## Requirements

- Docker
- At least 3x the storage space of the current blockchain
## Setup

- Clone this repository
- `cd` into the root of this repository, `zcash-empirical-analysis`

The container directories of the data-store volumes must match the values stored in the
`research/config.py` file. This is paramount, as this config is used during the analysis.

If you would like to increase the block height that this analysis is performed upon,
you can do so by changing the integer value of `blockheight` in the
`research/docker/config.py` file.
- Create three directories, one for each of the containers, and add them to the
  `docker-compose.yml` file:
  - `.zcash`: `zcash` container, stores raw blockchain data downloaded by the zcash node
  - `pgdata`: `postgres` container, stores parsed Zcash blockchain data
  - `research`: `research` and `postgres` containers, stores created parquet files and analysis data
- The directories are used as volumes; they are mounted on their respective docker containers as
  `<local_directory>:<container_directory>`, e.g. `/data/data1/zcash/.zcash:/root/.zcash`, where
  `/data/data1/zcash/.zcash` is the `<local_directory>` (a folder on the host machine) and
  `/root/.zcash` is the `<container_directory>` (the same folder mounted inside the docker container).
- Do not change the container directory, simply change the `<local_directory>` as required.
- Copy the `zcash-client/docker/zcash.conf.backup` file to the `.zcash` folder created earlier as
  `.zcash/zcash.conf`
- Configure the `zcash.conf` file by setting the `rpcuser` and `rpcpassword` variables
- Set the same values for the `RPC_USER` and `RPC_PASSWORD` variables in
  `research/docker/config.py` and `zcashpostgres/docker/config.py`
- Set the variable `rpcclienttimeout` to `120`, e.g. `rpcclienttimeout=120`
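For reference, the relevant values in `research/docker/config.py` (and the matching ones in
`zcashpostgres/docker/config.py`) end up looking roughly like the sketch below. The variable names
are the ones referenced in this README; the concrete values are placeholders and the layout of the
real file may differ.

```python
# research/docker/config.py -- illustrative sketch only; the real file may differ.
# RPC_USER / RPC_PASSWORD must match rpcuser / rpcpassword in .zcash/zcash.conf.
RPC_USER = "zcashrpc"        # placeholder value
RPC_PASSWORD = "changeme"    # placeholder value

# Highest block to include in the analysis; increase this integer to analyse
# a longer portion of the chain (see the note near the top of this README).
blockheight = 200000         # placeholder value
```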
Sections (Heuristics and Clustering) within the analysis files require user addresses and tags in csv
format. Due to the sensitivity of this data, we do not provide such addresses. Instead, the following
files must be created in `research/docker/addresses/` and, if desired, filled with addresses. The
analysis can still be run without the files being filled in, but not if they are missing altogether.

All csv files expect `;` as the delimiter. All addresses are expected to be tAddresses (transparent
addresses used in Zcash).
- `pool_addresses.csv`
  - The addresses and tags in this file should be related to mining pools, which can be found on mining pool websites
  - Double-column csv file with header `address;tag`, where each address has a single string as a tag (e.g., `tAddressA;poolX`)
  - Tags can be repeated but tAddresses must be unique
- `address_tags.csv`
  - The addresses and tags in this file can be related to any entity, and could be collected manually or scraped online
  - Double-column csv file with header `address;tag`, where each address has a single string as a tag (e.g., `tAddressA;exchangeX`)
  - Tags can be repeated but tAddresses must be unique
- `founders_addresses.csv`
  - These are the addresses of the Zcash founders, which can be found in the source code and whitepaper
  - Single-column csv file with header `address`, where each address must be on a separate line
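If you do not have address data of your own, a small helper like the one below (a convenience
sketch, not part of the repository) creates the three files with only their headers, which is enough
for the analysis to run:

```python
# create_placeholder_address_files.py -- convenience sketch, not part of the repository.
# Run from the repository root. Creates the three expected csv files with only their
# headers (';' delimiter), so the analysis can run without any real address data.
import os

ADDR_DIR = "research/docker/addresses"
os.makedirs(ADDR_DIR, exist_ok=True)

headers = {
    "pool_addresses.csv": "address;tag",
    "address_tags.csv": "address;tag",
    "founders_addresses.csv": "address",
}

for filename, header in headers.items():
    path = os.path.join(ADDR_DIR, filename)
    if not os.path.exists(path):  # do not clobber files you have already filled in
        with open(path, "w") as f:
            f.write(header + "\n")
```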
- Build and run the three containers by executing the following commands in the
  `zcash-empirical-analysis` folder:

  ```
  docker-compose build
  docker-compose up -d
  ```

- This creates and runs the `zcash`, `zcashpostgres` and `research` containers with a network
  between them so they can interact.
## Container Interaction
- To connect to a docker container, run the command below:

  ```
  docker exec -it <containername> bash
  ```

- You must wait for the Zcash node to sync to an appropriate height of the blockchain. You can check
  this by executing a command via the script:

  ```
  ./zcash-client/docker/cli.sh <command> <arguments>
  ```

  which will execute the command on the zcash-cli interface and return the results (e.g.
  `./zcash-client/docker/cli.sh getblockcount` returns the node's current block height).

  Note: If you get a `permission denied` error, then `chmod +x` the script file.
- Once the Zcash node is synced, the postgres database can be populated with data.
- Log in to the `zcashpostgres` container using the docker command above and run the following:
  - To set up the database and instantiate the tables (Note: this will erase all previous data):

    ```
    cd $SCRIPTS
    python setup.py
    ```

  - To parse the Zcash node data into postgres:

    ```
    cd $SCRIPTS
    python zcash_extraction.py
    ```

    Note: This command will take about 1 hour per 1,500 blocks. It will parse all the available data
    on the node. If re-run, it will start from the last block committed in postgres. If you get a
    `connection refused` error, please check that `rpcallowip` in the `zcash.conf` has been correctly
    set to the range used in the docker network.
- Once the above steps are complete, you may continue.
- First ensure the research container is running. This can be done by executing the command below;
  you should see a container called `research` with a running or up status:

  ```
  docker-compose ps
  ```

  Note: The research container may fail to run if the Apache Spark download fails. If this happens,
  check whether the Spark download link in the `research/Dockerfile` is still active; if it isn't,
  please replace it with a url from the Apache Spark mirror.
- To do the analysis, the container requires the Zcash blockchain to be parsed into Apache Spark
  parquet files. On docker we used Method A; the other methods (described at the end of this README)
  would use less storage space, but due to time constraints we could not resolve the Spark issues
  that prevent them from running inside docker.
- Method A: slower and requires less RAM, but takes up more disk space.
  - First run the following in the `zcashpostgres` container. These commands create `csv` files from
    postgres and store them in `/root/research`:

    ```
    psql -U postgres -d zcashdb -c "Copy (Select * From transactions) To STDOUT With CSV HEADER DELIMITER ';';" > /root/research/public.transaction.csv
    psql -U postgres -d zcashdb -c "Copy (Select * From vin) To STDOUT With CSV HEADER DELIMITER ';';" > /root/research/public.vin.csv
    psql -U postgres -d zcashdb -c "Copy (Select * From vout) To STDOUT With CSV HEADER DELIMITER ';';" > /root/research/public.vout.csv
    psql -U postgres -d zcashdb -c "Copy (Select * From vjoinsplit) To STDOUT With CSV HEADER DELIMITER ';';" > /root/research/public.vjoinsplit.csv
    psql -U postgres -d zcashdb -c "Copy (Select * From coingen) To STDOUT With CSV HEADER DELIMITER ';';" > /root/research/public.coingen.csv
    psql -U postgres -d zcashdb -c "Copy (Select * From blocks) To STDOUT With CSV HEADER DELIMITER ';';" > /root/research/public.block.csv
    ```

  - Next run the following in the `research` container. This creates `parquet` files from the `csv`
    files (a sketch of what this conversion looks like is shown after this list):

    ```
    cd $SCRIPTS
    python createParquetFromCSV.py
    ```

  - Delete the `csv` files when the above command is complete:

    ```
    rm -rf $RESEARCH/public.transaction.csv
    rm -rf $RESEARCH/public.vin.csv
    rm -rf $RESEARCH/public.vout.csv
    rm -rf $RESEARCH/public.vjoinsplit.csv
    rm -rf $RESEARCH/public.coingen.csv
    rm -rf $RESEARCH/public.block.csv
    ```
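For reference, the csv-to-parquet conversion performed by `createParquetFromCSV.py` corresponds
roughly to the sketch below. This is not the script itself: the Spark session setup and the output
parquet paths are assumptions; only the input csv names and the `;` delimiter come from the commands
above.

```python
# Rough sketch of the csv -> parquet conversion; createParquetFromCSV.py may differ in detail.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

tables = ["transaction", "vin", "vout", "vjoinsplit", "coingen", "block"]
for table in tables:
    df = spark.read.csv(
        "/root/research/public.%s.csv" % table,
        sep=";",           # the psql exports above use ';' as the delimiter
        header=True,
        inferSchema=True,
    )
    # Output path is illustrative; the real script chooses its own parquet locations.
    df.write.mode("overwrite").parquet("/root/research/%s.parquet" % table)
```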
## Analysis

All research commands must be run in the `research` container.
All results are saved to the `/root/research` mounted folder.

Running `initialAnalysis.py` generates statistics containing the total number of blocks, the total
number of transactions and the transaction types (shielded, deshielded, transparent, mixed, private).
The result is saved to the text file `/root/research/initial_analysis.txt`.

```
cd $SCRIPTS
python initialAnalysis.py
```
Running `addressStatistics.py` generates address-based statistics such as coins sent, coins received,
current coins, transactions sent and received, blocks mined, and the estimated amount in the pool.

```
cd $SCRIPTS
python addressStatistics.py
```

These are saved in the file `addresses_values.csv`.
The results are also saved as a list of rows in a pickled file, `address_values.pkl` (a sketch of how
to inspect it is shown at the end of this section).

Note: This script must be run before the address clustering, as the clustering depends on the
calculated address values.

The format of the rows in the list is as follows:

```
Row(
    "address": "addressA",
    "pool_recv": 0.0,
    "pool_sent": 0.0,
    "coingens_recv": 0,
    "vouts_count": 0,
    "vins_count": 0,
    "txs_recv": 0,
    "txs_sent": 0,
    "recv": 0.0,
    "sent": 0.0,
    "no_txs_total": 0
)
```

It also generates separate rich-list files for the top 10 addresses by:

- coins sent: `rich_list_top_10_sent.csv`
- coins received: `rich_list_top_10_recv.csv`
- current balance: `rich_list_top_10_value.csv`
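To inspect the pickled address statistics, something along the lines of the sketch below should
work. It assumes `address_values.pkl` is written to the `/root/research` results folder and that each
row exposes the fields listed above by name (as pyspark `Row` objects do); it is not part of the
repository.

```python
# inspect_address_values.py -- sketch for inspecting address_values.pkl.
import pickle

with open("/root/research/address_values.pkl", "rb") as f:
    rows = pickle.load(f)

print("number of addresses:", len(rows))

# Ten addresses that sent the most coins (mirrors rich_list_top_10_sent.csv).
top_senders = sorted(rows, key=lambda r: r["sent"], reverse=True)[:10]
for r in top_senders:
    print(r["address"], r["sent"])
```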
The analysis consists of the files `heuristicsGraphs.py` and `plotGraphs.py`, which respectively
create the data for our results and plot the graphs of the paper.
Our analysis covers graphs 2, 4, 5, 6, 8a, 8b, 8c and 9 as shown in the paper, as well as
heuristics 3, 4 and 5.

To run the analysis, log in to the `research` container and run the following:

```
cd $SCRIPTS
python heuristicsGraphs.py 2 4 5 6 8a 8b 8c 9 h3 h4 h5
```

This will plot and save the following graphs in the folder `$RESEARCH/Graphs`:
- Graph 2: `$RESEARCH/Graphs/TransactionTypes.pdf`
- Graph 4: `$RESEARCH/Graphs/TotalValueOverTime.pdf`
- Graph 5: `$RESEARCH/Graphs/Deposits-Withdrawals.pdf`
- Graph 6: `$RESEARCH/Graphs/DepositsPerIdentity.pdf`
- Graph 8a: `$RESEARCH/Graphs/WithdrawalsPerIdentityNoHeuristic.pdf`
- Graph 8b: `$RESEARCH/Graphs/WithdrawalsPerIdentityHeuristicF.pdf`
- Graph 8c: `$RESEARCH/Graphs/WithdrawalsPerIdentityHeuristicFM.pdf`
- Graph 9: `$RESEARCH/Graphs/FounderCorrelation.pdf`
The user can choose which graphs to produce and which heuristics to run.
This is done by specifying command-line arguments to the `heuristicsGraphs.py` script.
The valid arguments are: `2 4 5 6 8a 8b 8c 9 h3 h4 h5`.
For example, to produce graphs 4 and 5 and run heuristic h5, the user would run:

```
cd $SCRIPTS
python heuristicsGraphs.py 4 5 h5
```
The results of heuristics 3, 4 and 5 are stored in the files `founders_heuristic_addresses.csv`,
`miners_heuristic_addresses.csv` and `heuristic5.txt`:

- `founders_heuristic_addresses.csv`
  - single-column csv file with the header `address`
  - each address is on a separate line
  - these addresses are the founder addresses identified by the founders heuristic
- `miners_heuristic_addresses.csv`
  - single-column csv file with the header `address`
  - each address is on a separate line
  - these addresses are the miner addresses identified by the mining heuristic
- `heuristic5.txt`
  - details about the results of heuristic 5
- `miners_addresses.csv`
  - single-column csv file with the header `address`
  - each address is on a separate line and associated with a miner
Note: This task depends on the following files.

These files must be provided by the user; their details can be found above:

- Pool addresses: `pool_addresses.csv`
- Founders addresses: `founders_addresses.csv`
- Address tags: `address_tags.csv`

These files are automatically generated by the following tasks:

- Address statistics: `address_values.pkl`, generated in the Address Statistics task
- Miners addresses: `miners_addresses.csv`, generated in the Heuristic task
- Founders heuristic: `founders_heuristic_addresses.csv`, generated in the Heuristic task
- Miners heuristic: `miners_heuristic_addresses.csv`, generated in the Heuristic task
The clustering process builds, tags and produces statistics for the groups of addresses that have
been used together as inputs to transactions. This is done by creating addresses as nodes in a graph,
and joining them with edges if they have been used as inputs in the same transaction. This is done
from the first transaction to the last (depending on the latest block height). Once complete, the
code extracts the clusters (connected components) of the graph and computes statistics based on the
addresses contained within each cluster. Each address is associated with only one cluster. You may
find that there are many clusters containing only one address, which means that this address has only
ever been used as the sole input in a transaction. A minimal sketch of this clustering idea is shown
below.
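The following is a minimal, self-contained sketch of that multi-input clustering idea. It is not the
repository's `heuristic1Clustering.py` (which works over the Spark parquet data); it simply merges
addresses that appear as inputs to the same transaction using a union-find structure, where each
resulting root corresponds to one cluster.

```python
# Minimal union-find sketch of multi-input clustering; illustrative only.
from collections import defaultdict

parent = {}

def find(addr):
    # Path-compressing find: returns the representative of addr's cluster.
    parent.setdefault(addr, addr)
    while parent[addr] != addr:
        parent[addr] = parent[parent[addr]]
        addr = parent[addr]
    return addr

def union(a, b):
    # Merge the clusters containing a and b.
    parent[find(a)] = find(b)

# tx_inputs: for each transaction, the list of tAddresses spent as inputs.
# In the real analysis these come from the parsed vin/transaction data.
tx_inputs = [
    ["tAddressA", "tAddressB"],   # A and B co-spent -> same cluster
    ["tAddressB", "tAddressC"],   # C joins the A/B cluster transitively
    ["tAddressD"],                # single-input transaction -> singleton cluster
]

for inputs in tx_inputs:
    find(inputs[0])               # register the address even if it is the only input
    for addr in inputs[1:]:
        union(inputs[0], addr)

clusters = defaultdict(set)
for addr in parent:
    clusters[find(addr)].add(addr)

for rank, members in enumerate(sorted(clusters.values(), key=len, reverse=True)):
    print("cluster %d (size %d): %s" % (rank, len(members), sorted(members)))
```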
To run the address clustering, execute the following in the `research` container:

```
cd $SCRIPTS
python heuristic1Clustering.py
```

The statistics produced are: cluster rank (starting from 0), the clustered addresses, tags for pools,
miners, founders, the miners heuristic and the founders heuristic, cluster size, coingens sent and
received, pool sent and received, total amount of coins sent and received, transactions sent and
received, and the number of distinct transactions within the cluster.
This script outputs the results in the following files:

- `README-heuristic1.md`: statistics about the clusters and details about the addresses
- `cluster_stats.pkl`: a python pickled dictionary containing the clusters ranked in order by size (largest starting at 0) together with the above statistics
- `clusters_graph.pkl`: the graph used to generate the clusters
- `clusters_stats.csv`: the same statistics in `csv` format
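As with the address statistics, the pickled cluster statistics can be inspected directly. A minimal
sketch, assuming `cluster_stats.pkl` ends up in the `/root/research` results folder and is keyed by
cluster rank as described above:

```python
# Sketch for peeking at cluster_stats.pkl; the layout inside each entry may differ.
import pickle

with open("/root/research/cluster_stats.pkl", "rb") as f:
    cluster_stats = pickle.load(f)

print("number of clusters:", len(cluster_stats))
print("largest cluster (rank 0):")
print(cluster_stats[0])   # the dictionary is ranked by size, largest cluster first
```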
You can run all of the scripts within the analysis by executing the following in the `research`
container:

```
cd $SCRIPTS
./runAll.sh
```
- In the future you may want to re-run the experiment using a more up-to-date blockchain.
- To do this, do the following:
  - Ensure the `zcash` container is synced to the latest block height
  - Re-run the configuration steps, editing the `research/docker/config.py` file with a higher block height
  - Re-run the analysis; you may keep the manual csv files you previously created
Below are two alternative methods that can be used to generate the parquet files.
These require much less hard drive space, but use much more RAM.
There are issues with the connections between the Spark workers in docker that prevent the files
from being created; these methods therefore run fine outside of docker, but have issues when run
within docker.

- Method B: faster, but requires much more RAM on the machine. Due to issues with Spark in docker
  and time constraints, we were unable to get this to run on large block sizes; it runs fine if
  Spark is run outside of docker.
  - Run the following in the `research` container. This will create Apache Spark `parquet` files
    directly from the postgres database and store them in `/root/research`, which is mapped to the
    folder above:

    ```
    cd $SCRIPTS
    python createParquetDirectlyFromPostgres.py
    ```

    Note: This command can fail if there is not enough memory for the Spark instance.
- Method C: faster, as it parses data from the node directly into memory and then into Spark files.
  We found issues running this on large block sizes, but it runs well outside of a docker container.
  - Run the following in the `research` container. This will create Apache Spark `parquet` files
    directly from the Zcash node and store them in `/root/research`, which is mapped to the folder
    above:

    ```
    cd $SCRIPTS
    python parseFromNode.py
    ```

    Note: This command can fail if there is not enough memory for the Spark instance.
- Method D: a much faster and more space-efficient method to parse the blockchain would be to read
  the raw block data (`blk0*.dat` files) directly into parquet. This has been left for future work.