What Do You Need for Exporting Ethereum History to S3 Buckets?
Any guide on exporting Ethereum history into S3 buckets must begin with a plan for the export. First of all, you need a clear specification of goals and requirements: establish why you want to export the Ethereum history data. In the next step of planning, consider the effectiveness of exporting data by using BigQuery public datasets. Subsequently, you must identify the best practices for efficient and cost-effective data export from the BigQuery public datasets.
The process for exporting full Ethereum history into S3 buckets could also rely on the naïve approach, which fetches Ethereum history data directly from a node. In that case, you must also weigh the time required for complete synchronization and the cost of hosting the resulting dataset. Another important concern in exporting Ethereum to S3 involves serving token balances without latency issues. Users have to consider how to serve token balances and how to manage the uint256 type with Athena. Furthermore, the planning phase should also cover measures for incorporating continuous Ethereum updates through real-time collection of recent blocks. Finally, you should create a diagram visualizing the resulting state of the export architecture.
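To make the uint256 concern concrete: Athena's widest integer type is a 64-bit BIGINT, and its DECIMAL type tops out at 38 digits, while a uint256 token balance can need up to 78 decimal digits. A common workaround, sketched below with hypothetical helper names, is to store such values as zero-padded fixed-width strings so that lexicographic order matches numeric order, which keeps ORDER BY and comparisons working in SQL.

```python
# Sketch (assumed approach, not the article's exact implementation):
# encode uint256 values as zero-padded decimal strings for Athena.

UINT256_MAX = 2**256 - 1
WIDTH = len(str(UINT256_MAX))  # 78 decimal digits covers the full range


def encode_uint256(value: int) -> str:
    """Encode a uint256 as a fixed-width, zero-padded decimal string."""
    if not 0 <= value <= UINT256_MAX:
        raise ValueError("value out of uint256 range")
    return str(value).zfill(WIDTH)


def decode_uint256(text: str) -> int:
    """Recover the integer from its padded string form."""
    return int(text, 10)
```

Because every encoded value has the same width, string comparison in Athena (`WHERE balance > encode(...)`) agrees with numeric comparison, at the cost of extra storage per value.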
Excited to learn the basic and advanced concepts of ethereum technology? Enroll Now in The Complete Ethereum Technology Course
Reasons to Export Full Ethereum History
Before you export the full Ethereum history, you need to understand the reasons for doing so. Take the example of CoinStats.app, a sophisticated crypto portfolio manager application. It offers general features such as transaction listing and balance tracking, along with options for discovering new tokens to invest in. The app relies on tracking token balances as its core functionality and used to depend on third-party services for this. However, the third-party services led to many setbacks, such as inaccurate or incomplete data. In addition, the data could lag significantly behind the most recent block. Furthermore, the third-party services did not support retrieving the balances of all tokens in a wallet through a single request.
All of these concerns call for exporting Ethereum to S3 with a clear set of requirements. The solution must offer balance tracking with 100% accuracy along with the minimum possible latency relative to the blockchain. You must also emphasize the need to return the full wallet portfolio with a single request. On top of that, the solution must include an SQL interface over blockchain data to enable extensions, such as analytics-based features. Another notable requirement for the export solution is avoiding running your own Ethereum node. Teams that struggle with node maintenance can opt for node providers instead.
You can narrow down the goals of the solution to download Ethereum blockchain data to S3 buckets with the following pointers.
- Exporting full history of Ethereum blockchain transactions and related receipts to AWS S3, a low-cost storage solution.
- Integration of an SQL Engine, i.e. AWS Athena, with the solution.
- Utilize the solution for real-time applications such as tracking balances.
Curious to know about the basics of AWS, AWS services, and AWS Blockchain? Enroll Now in Getting Started With AWS Blockchain As A Service (BaaS) Course!
Popular Solutions for Exporting Ethereum History to S3
The search for existing solutions to export the contents of the Ethereum blockchain database to S3 is an important first step. One of the most popular exporting solutions is Ethereum ETL, an open-source toolset for exporting blockchain data, primarily from Ethereum. The “ethereum-etl” repository is one of the core elements of the broader Blockchain ETL project. What is Blockchain ETL? It is a collection of diverse solutions tailored to export blockchain data to multiple destinations, such as PubSub+Dataflow, Postgres, and BigQuery. In addition, you can also leverage a dedicated repository that adapts the different export scripts into Airflow DAGs.
You should also note that Google hosts BigQuery public datasets featuring the full Ethereum blockchain history, populated with the help of the Ethereum ETL project. At the same time, you should be careful about the process of dumping the full Ethereum history to S3 with Ethereum ETL: querying the publicly available datasets directly can cost a lot.
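Before running anything against the public dataset, it is worth estimating what a query would cost. The sketch below builds a `bq query --dry_run` invocation, which reports the bytes a query would scan without executing (or billing) it, and converts that figure into an approximate charge; the $5-per-TB on-demand price is an assumption for illustration, so check current BigQuery pricing.

```python
# Sketch: estimate BigQuery query cost before running it.
# The price constant is an assumed figure; verify against current pricing.

PRICE_PER_TB_USD = 5.0  # assumed on-demand query price per TB scanned


def dry_run_command(sql: str) -> list:
    """Build a `bq` CLI invocation that only estimates scanned bytes."""
    return ["bq", "query", "--use_legacy_sql=false", "--dry_run", sql]


def estimate_query_cost_usd(bytes_scanned: float) -> float:
    """Convert a dry-run byte estimate into an approximate USD cost."""
    return bytes_scanned / 1e12 * PRICE_PER_TB_USD


sql = (
    "SELECT `hash`, block_number, value "
    "FROM `bigquery-public-data.crypto_ethereum.transactions` "
    "WHERE block_number < 1000000"
)
cmd = dry_run_command(sql)
```

A dry run that reports, say, 2 TB scanned would translate to roughly $10 at the assumed rate, which illustrates why exporting the table once is cheaper than querying it repeatedly in place.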
Disadvantages of Ethereum ETL
Ethereum ETL may look like a clear solution for exporting the Ethereum blockchain database to other destinations. However, it also has some prominent setbacks, such as:
- Ethereum ETL depends heavily on Google Cloud. While you can find AWS support in the repositories, it lacks the same standard of maintenance. This is a drawback for teams that prefer AWS for data-based projects.
- The next prominent setback with Ethereum ETL is that parts of it are outdated. For example, it relies on an old Airflow version. In addition, the data schemas, particularly for AWS Athena, do not match the actual export formats.
- Another problem with using Ethereum ETL to export the full Ethereum history is that it does not preserve the raw data format. Ethereum ETL applies various conversions during data ingestion. As an Extract-Transform-Load solution, it is outdated compared to the modern Extract-Load-Transform (ELT) approach, which keeps raw data and transforms it later.
Steps for Exporting Ethereum History to S3
Irrespective of its flaws, Ethereum ETL has established a productive foundation for a new solution to export Ethereum blockchain history. The naïve approach of fetching raw data by requesting the JSON-RPC API of a public node could take over a week to complete. Therefore, BigQuery is a favorable choice to export Ethereum to S3, as it can fill up the S3 bucket initially. The solution starts with exporting the BigQuery table in gzipped Parquet format to Google Cloud Storage. Subsequently, you can use “gsutil rsync” to copy the exported files to S3. The final step involves ensuring that the table data is suitable for querying in Athena. Here is an outline of the steps with a more granular description.
Identifying the Ethereum Dataset in BigQuery
The first step of exporting Ethereum history into S3 starts with the discovery of the public Ethereum dataset in BigQuery. Begin in the Google Cloud Platform, where you can open the BigQuery console. Find the dataset search field and enter inputs such as “bigquery-public-data” or “crypto_ethereum”. Now, you can select the “Broaden search to all” option. Remember that GCP bills you for querying public datasets, so review the billing details before proceeding.
Exporting BigQuery Table to Google Cloud Storage
In the second step, you need to select a table. Now, you can select the “Export” option visible at the top right corner to export the full table, and click on the “Export to GCS” option. It is also important to note that you can export the results of a specific query rather than the full table. Each query creates a new temporary table, visible in the job details section of the “Personal history” tab. After execution, select the temporary table name from the job details to export it like a regular table. With this practice, you can exclude redundant data from massive tables. You should also remember to check the “Allow large results” option in the query settings.
Select the GCS location for exporting the full Ethereum history. You can create a new bucket with default settings and delete it after dumping the data into S3. Most important of all, ensure that the region in the GCS configuration matches that of the S3 bucket, which helps secure optimal transfer costs and export speed. In addition, use the combination “Export format: Parquet, Compression: GZIP” to achieve the optimal compression ratio and faster data transfer from GCS to S3.
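The same export can also be triggered from the command line instead of the console. The sketch below assembles a `bq extract` invocation with the Parquet/GZIP settings described above; the bucket name and path are placeholders, and the wildcard lets BigQuery shard a large table into multiple output files.

```python
# Sketch: the BigQuery-to-GCS export step as a CLI invocation.
# Bucket name and prefix are placeholders for illustration.

def bq_extract_command(table: str, gcs_uri: str) -> list:
    """Build a `bq extract` invocation exporting a table as gzipped Parquet."""
    return [
        "bq", "extract",
        "--destination_format=PARQUET",
        "--compression=GZIP",
        table,
        gcs_uri,
    ]


cmd = bq_extract_command(
    "bigquery-public-data:crypto_ethereum.transactions",
    "gs://my-eth-export/transactions/*.parquet",  # placeholder bucket
)
```

Scripting the export this way also makes it repeatable if you later re-export a refreshed snapshot of the dataset.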
Start learning about Ethereum, the second-most-popular blockchain network, with the world’s first Ethereum Skill Path featuring quality resources tailored by industry experts Now!
After finishing the BigQuery export, you can focus on the steps to download Ethereum blockchain data to S3 from GCS. You can carry out the export process by using ‘gsutil’, an easy-to-use CLI utility. Here are the steps you can follow to set up the CLI utility.
- Develop an EC2 instance with considerations for throughput limits in the EC2 network upon finalizing instance size.
- Use the official instructions for installing the ‘gsutil’ utility.
- Configure the GCS credentials by running the command “gsutil config”.
- Enter AWS credentials into the “~/.boto” configuration file by setting appropriate values for “aws_secret_access_key” and “aws_access_key_id”. On the AWS side, the S3 list-bucket and multipart-upload permissions are sufficient. For simplicity, you can use personal AWS keys.
- Develop the S3 bucket and remember to set it up in the same region where the GCS bucket is configured.
- Run “gsutil -m rsync -r gs://&lt;gcs-bucket&gt; s3://&lt;s3-bucket&gt;” to copy the files; the “-m” flag parallelizes the transfer job by executing it in multithreaded mode.
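The steps above end in a single transfer command, which the sketch below assembles in one place so the flags are easy to see (bucket names are placeholders). The top-level `-m` option runs the copy in parallel, and `-r` recurses through the bucket prefix.

```python
# Sketch: build the GCS-to-S3 transfer command used in the final step.
# Bucket names are placeholders.

def rsync_command(gcs_bucket: str, s3_bucket: str) -> list:
    """Build a multithreaded, recursive `gsutil rsync` invocation."""
    return [
        "gsutil", "-m",          # -m parallelizes the transfer
        "rsync", "-r",           # -r recurses into the source prefix
        f"gs://{gcs_bucket}",
        f"s3://{s3_bucket}",
    ]


cmd = rsync_command("my-eth-export", "my-eth-s3-bucket")
```

Because `rsync` only copies missing or changed objects, the same command can be re-run safely if the transfer is interrupted.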
For this guide, to dump the full Ethereum history to S3, you can rely on one “m5a.xlarge” EC2 instance for the data transfer. However, EC2 imposes specific bandwidth limits and cannot sustain bursts of network throughput. You might consider the AWS DataSync service instead, but it also relies on EC2 virtual machines, so you could find its performance similar to the ‘gsutil rsync’ command on the same instance. If you go for a larger instance, you can expect some viable improvements in performance.
The process to export Ethereum to S3 would accompany some notable costs with GCP as well as AWS. Here is an outline of the costs you have to incur for exporting Ethereum blockchain data to S3 from GCS.
- Google Cloud Storage network egress charges.
- S3 storage amounting to less than $20 every month for compressed data sets occupying less than 1TB of data.
- Cost of S3 PUT operations, determined on the grounds of objects in the exported transaction dataset.
- The Google Cloud Storage data retrieval operations could cost about $0.01.
- In addition, you have to pay for the hours of using the EC2 instance in the data transfer process. On top of it, the exporting process also involves the costs of temporary data storage on GCS.
Want to learn the basic and advanced concepts of Ethereum? Enroll in our Ethereum Development Fundamentals Course right away!
Ensuring that Data is Suitable for SQL Querying with Athena
The process of exporting the Ethereum blockchain database to S3 does not end with the transfer from GCS. You should also ensure that the data in the S3 bucket can be queried with the AWS SQL engine, i.e. Athena. In this step, you set up an SQL engine over the data in S3 by using Athena. First of all, create a non-partitioned table, as the exported data does not have any partitions on S3, and make sure that it points to the exported data. Since AWS Athena cannot handle more than 100 partitions at once, daily partitioning would be an effort-intensive process. Therefore, monthly partitioning is a credible solution that you can implement with a simple query. In the case of Athena, you pay for the amount of data scanned. Subsequently, you can run SQL queries over the exported data.
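The two Athena steps can be sketched as SQL, held here as Python strings so the shapes are easy to compare (database, table, column, and bucket names are placeholders, and the schema is abbreviated). The first statement defines an external table over the exported Parquet files; the second fills a monthly-partitioned table, which stays within Athena's 100-partitions-per-statement limit since Ethereum history spans only on the order of a hundred months.

```python
# Sketch: Athena DDL and partitioning query as strings.
# Names and schema are placeholders; the partitioned target table is
# assumed to exist with `part_month` declared as its partition column.

RAW_TABLE_DDL = """
CREATE EXTERNAL TABLE eth.transactions_raw (
  `hash` STRING,
  block_number BIGINT,
  block_timestamp TIMESTAMP,
  from_address STRING,
  to_address STRING,
  value STRING  -- uint256 kept as a string to avoid integer overflow
)
STORED AS PARQUET
LOCATION 's3://my-eth-export/transactions/'
"""

MONTHLY_PARTITION_INSERT = """
INSERT INTO eth.transactions
SELECT *, date_trunc('month', block_timestamp) AS part_month
FROM eth.transactions_raw
"""
```

Because Athena bills per byte scanned, the monthly partitions also cut query costs: a query filtered to one month reads only that partition's files.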
Exporting Data from Ethereum Node
The alternative method to export Ethereum blockchain history into S3 focuses on fetching data directly from Ethereum nodes. In such cases, you fetch the data exactly as the node returns it, which offers a significant advantage over Ethereum ETL. On top of that, you can store the Ethereum blockchain data in raw format and use it without any limits. The raw data could also help you mimic the responses of an Ethereum node offline. On the other hand, this method takes a significant amount of time: even in a multithreaded mode with batch requests, it could take up to 10 days. Furthermore, you could also encounter overheads due to Airflow.
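To show what "batch requests" means in practice, the sketch below builds a JSON-RPC batch payload: a single HTTP POST carrying an array of `eth_getBlockByNumber` calls, which works the same against a self-hosted node or a node provider (the node URL in the comment is a placeholder).

```python
# Sketch: assemble a JSON-RPC batch payload for fetching raw blocks.
# The node URL is a placeholder; sending is left as a comment.
import json


def batch_payload(start_block: int, count: int) -> list:
    """One JSON-RPC request object per block, sharing a single HTTP call."""
    return [
        {
            "jsonrpc": "2.0",
            "id": n,
            "method": "eth_getBlockByNumber",
            # params: hex-encoded block number; True = full transaction objects
            "params": [hex(n), True],
        }
        for n in range(start_block, start_block + count)
    ]


payload = batch_payload(15_000_000, 100)
body = json.dumps(payload)
# To send: POST `body` with Content-Type: application/json to https://<node-url>
```

Batching amortizes the per-request overhead, but even so the full chain spans millions of blocks, which is why this route can take days rather than hours.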
Excited to know about how to become an Ethereum developer? Check the quick presentation Now on: How To Become an Ethereum Developer?
The methods for exporting Ethereum history into S3, such as Ethereum ETL, BigQuery public datasets, and fetching directly from Ethereum nodes, have distinct value propositions. Ethereum ETL serves as the established approach for exporting Ethereum blockchain data, albeit with problems in data conversion. At the same time, fetching data directly from Ethereum nodes imposes a burden of both cost and time.
Therefore, the balanced approach to export Ethereum to S3 would utilize BigQuery public datasets. You can retrieve Ethereum blockchain data through the BigQuery console on the Google Cloud Platform and send it to Google Cloud Storage. From there, you can export the data to S3 buckets, followed by preparing the export data for SQL querying. Dive deeper into the technicalities of the Ethereum blockchain with a complete Ethereum technology course.
*Disclaimer: The article should not be taken as, and is not intended to provide any investment advice. Claims made in this article do not constitute investment advice and should not be taken as such. 101 Blockchains shall not be responsible for any loss sustained by any person who relies on this article. Do your own research!