In my previous article Exporting and Analyzing Ethereum Blockchain I introduced the Ethereum ETL tool and provided step-by-step instructions for exporting Ethereum data to CSV files and analyzing it in Amazon Athena and QuickSight (also read Ethereum Blockchain on Google BigQuery).

In this article I will show you how to use AWS Data Pipeline and AWS Auto Scaling to parallelize the export process across tens or hundreds of instances and reduce the export time to hours or even minutes, while keeping costs low thanks to EC2 Spot Instances.

The whole process is divided into 4 steps:

Increasing AWS limits

Preparing a custom AMI

Creating a Data Pipeline

Creating an Auto Scaling Group

Increasing AWS Limits

AWS maintains service limits for each account to help guarantee the availability of AWS resources, as well as to minimize billing risks for new customers. Some service limits are raised automatically over time as you use AWS, though most AWS services require that you request limit increases manually. This process may take a few hours or days.

For this task you will need to increase:

Objects per Pipeline Limit in Data Pipeline: 1000. You only need to increase it if you want to export more than 4 million blocks in a single pipeline.

EC2 Spot Instances: 100. If your account is new, they will only increase it to 20 or lower at first.

You can find detailed instructions on how to increase the limits here https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html

Preparing a custom AMI

The requirements for a custom AMI for Data Pipeline are listed here https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-custom-ami.html:

If you’re running Ubuntu, you will need to create an account named ec2-user by following these instructions https://aws.amazon.com/premiumsupport/knowledge-center/new-user-accounts-linux-instance/.

You will also need to add ec2-user to the sudoers file. Open a terminal window and type:

sudo visudo

At the bottom of the file, add the following line:

ec2-user ALL=(ALL) NOPASSWD: ALL

A few Ethereum ETL related checks:

Follow the instructions here to install Geth https://geth.ethereum.org/docs/install-and-build/installing-geth

Make sure Python 3.5 or above is installed on the system and the python3 binary is in the PATH.

Clone Ethereum ETL to /home/ec2-user/ethereum-etl:



> git clone https://github.com/medvedev1088/ethereum-etl

> cd ethereum-etl

> pip3 install -e .

Make sure geth has downloaded the blocks you want to export:

> geth attach

> eth.syncing

{
  currentBlock: 5583296,
  highestBlock: 5583377,
  knownStates: 65750401,
  pulledStates: 65729512,
  startingBlock: 5268399
}
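As a quick sanity check, the eth.syncing figures tell you how far the node is behind the tip. A small sketch, using the example numbers from the output above:

```python
# Figures copied from the example eth.syncing output above.
starting_block = 5268399
current_block = 5583296
highest_block = 5583377

blocks_behind = highest_block - current_block
progress = (current_block - starting_block) / (highest_block - starting_block)

print(f"{blocks_behind} blocks behind, {progress:.2%} of this sync run complete")
# 81 blocks behind, 99.97% of this sync run complete
```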

Make sure geth is launched on startup. The simplest way is to add it to crontab:

> echo "nohup geth --cache=1024 &" > ~/geth/start.sh && chmod +x ~/geth/start.sh

> crontab -e

@reboot /home/ec2-user/geth/start.sh >>/home/ec2-user/geth/crontab.log 2>&1

Download and configure DataPipeline TaskRunner. The instructions can be found here https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-how-task-runner-user-managed.html

After you’ve downloaded the jar file, create a ~/task-runner/start.sh file with the following content:

nohup java -jar /home/ec2-user/task-runner/TaskRunner-1.0.jar --config /home/ec2-user/task-runner/credentials.json --workerGroup=ethereum-etl --region=us-east-1 --logUri=s3://<your_bucket>/task-runner/logs --tasks 1 &

The credentials.json file should contain the access and secret key for the account that has access to the S3 bucket:

{ "access-id":"MyAccessKeyID", "private-key": "MySecretAccessKey" }

Add it to crontab:

> chmod +x ~/task-runner/start.sh

> crontab -e

@reboot /home/ec2-user/task-runner/start.sh >>/home/ec2-user/task-runner/crontab.log 2>&1
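Before baking the AMI, it’s worth sanity-checking that credentials.json parses and contains the two keys Task Runner expects. A minimal sketch — check_credentials is just an illustrative helper, not part of Task Runner:

```python
import json

REQUIRED_KEYS = {"access-id", "private-key"}

def check_credentials(text):
    """Parse a Task Runner credentials document and verify the expected keys are present."""
    creds = json.loads(text)
    missing = REQUIRED_KEYS - creds.keys()
    if missing:
        raise ValueError(f"credentials.json is missing keys: {sorted(missing)}")
    return creds

# Example with the placeholder values from above:
creds = check_credentials('{ "access-id": "MyAccessKeyID", "private-key": "MySecretAccessKey" }')
print(creds["access-id"])  # MyAccessKeyID
```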

Create a new AMI and remember its ID. It will be used in the next step.

Creating a Data Pipeline

For creating the pipeline I used Troposphere https://github.com/cloudtools/troposphere, which is a Python library to create AWS CloudFormation descriptions.

Clone Ethereum Export Pipeline:



> git clone https://github.com/medvedev1088/ethereum-export-pipeline

> cd ethereum-export-pipeline

Edit the file config.py and modify the block ranges that you want to export. By default the first 5 million blocks will be exported.
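The exact contents of config.py may differ between versions of the pipeline, but conceptually it partitions the blocks into fixed-size ranges, one per export activity. A hypothetical sketch of that partitioning — block_ranges and the 100k batch size are illustrative assumptions, not the pipeline’s actual code:

```python
def block_ranges(start, end, batch_size):
    """Split the inclusive range [start, end] into (start, end) pairs of at most batch_size blocks."""
    for batch_start in range(start, end + 1, batch_size):
        yield batch_start, min(batch_start + batch_size - 1, end)

# First 5 million blocks in 100k-block batches:
ranges = list(block_ranges(0, 4_999_999, 100_000))
print(len(ranges))   # 50
print(ranges[0])     # (0, 99999)
print(ranges[-1])    # (4900000, 4999999)
```

Each of these ranges becomes an independent unit of work, which is what lets the export fan out across many instances.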

Generate CloudFormation template file:

> python3 generate_export_pipeline_template.py --output export_pipeline.template

Log in to the CloudFormation console in the N. Virginia region https://console.aws.amazon.com/cloudformation/home?region=us-east-1

Create a new stack by specifying the generated export_pipeline.template file. You will need to change the bucket name where CSV files will be uploaded. Optionally you can customize the Command field, e.g. you can remove parts of the script if you only need to export blocks, transactions or ERC20 transfers.

Log in to the Data Pipeline console in N. Virginia https://console.aws.amazon.com/datapipeline/home and make sure the pipeline has been created. You can see it in the Architect View.

After you’ve created your pipeline it will be waiting for workers to start running the activities.

Creating an Auto Scaling Group

Sign in to AutoScaling console in your region https://console.aws.amazon.com/ec2/autoscaling/home

Create a Launch Configuration:

Choose the AMI that you created on the previous step.

Choose the t2.medium instance type. (As AusIV noted here https://www.reddit.com/r/ethdev/comments/8oyjz8/how_to_export_the_entire_ethereum_blockchain_to/, c5.large may be a better choice; let me know if you tried it.)

On the Configuration Details page, check the box Request Spot Instances and specify the maximum price you’re willing to pay. The wizard also shows the current spot price for the selected instance type; the Spot price was 3 times lower than the On-Demand price at the time I created my ASG.

Choose the Security Group and proceed with the creation wizard.

Create an Auto Scaling Group:

Choose the Launch Configuration you created on the previous step.

Specify the group size: the number of instances you want exporting the CSVs in parallel.

Proceed with the wizard.

After the ASG is created you can see it launch new instances in the Activity History tab:

You can watch your Data Pipeline start running the activities on the instances on the Execution Details page in the Data Pipeline console, which provides all the details and logs:

Each instance will run 10 activities at a time; unfortunately, this number is not customizable.
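Since the 10-activities-per-instance figure is fixed, the group size you need follows directly from the number of export ranges in your pipeline. A back-of-the-envelope sketch — the 50-range figure assumes 100k-block batches over 5 million blocks, which may not match your configuration:

```python
import math

ACTIVITIES_PER_INSTANCE = 10  # fixed by Data Pipeline / Task Runner, per the note above

def instances_needed(num_ranges):
    """Number of ASG instances required to run all export activities in parallel."""
    return math.ceil(num_ranges / ACTIVITIES_PER_INSTANCE)

print(instances_needed(50))  # 5
print(instances_needed(55))  # 6
```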

The CSV files will be in the S3 bucket you specified when creating the pipeline.

You will need to manually remove the Auto Scaling group and the Data Pipeline Stack after the process is finished.

You may want to convert the CSVs to Parquet to optimize query performance. You can find the instructions here: Converting Ethereum ETL files to Parquet
