Data Pipeline
Abstract
One of the core values of the Aralia Data Pipeline is its ability to continuously and consistently deliver high-value data to users across the ecosystem.
To keep this data current, Data Planet administrators must regularly convert external data sources and upload them to the platform.
This process follows the typical ETL (Extract, Transform, Load) pattern:
- Extract: Capture data from target sources (APIs, databases, files, web crawlers, etc.)
- Transform: Clean, organize, and convert the data into a uniform format (often handled by Python programs)
- Load: Load the results into Data Planet and make them accessible to users in the ecosystem.
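Sketched in plain Python, the three stages might look like the following. This is a minimal, hypothetical example: the source URL, the `id`/`date` fields, and the output path are placeholders, not real Aralia endpoints or schemas.

```python
import pandas as pd
import requests


def extract() -> pd.DataFrame:
    # Extract: capture records from a target source (placeholder URL)
    resp = requests.get("https://example.com/api/records", timeout=30)
    resp.raise_for_status()
    return pd.DataFrame(resp.json())


def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Transform: clean and convert to a uniform format
    # (the "id" and "date" columns are illustrative)
    df = df.dropna(subset=["id"]).drop_duplicates(subset=["id"])
    df["date"] = pd.to_datetime(df["date"]).dt.strftime("%Y-%m-%d")
    return df


def load(df: pd.DataFrame) -> None:
    # Load: write the result for upload to Data Planet
    # (see the Load Example below for the upload itself)
    df.to_csv("records_clean.csv", index=False)


load(transform(extract()))
```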
To automate and sustain this process, Data Planet administrators can use Apache Airflow to manage and execute these ETL workflows.
Why Airflow
Apache Airflow is a workflow scheduling and orchestration platform widely used in industry for data engineering and analytics.
Airflow provides the following key capabilities in Aralia's data pipeline:
- Customizable Scheduling
  - Define daily, hourly, or any custom-scheduled data processing flow through DAG (Directed Acyclic Graph) design.
  - Retries and dependency control are supported to ensure stable process execution.
- Data Source Integration and Transformation
  - Administrators can read data sources automatically through the PythonOperator or other built-in operators.
  - Cleansing and transformation are performed directly in the pipeline, producing output in the destination format that Data Planet expects.
- Process Observability and Error Tracking
  - A web UI shows the status of each task, making troubleshooting and tracing straightforward.
  - Logging and alerting notify administrators of errors in a timely manner.
- Version Control and Reproducibility
  - Each DAG and task definition is managed as code and can be version-controlled (e.g., with Git).
  - This ensures the same pipeline can be reproduced or audited to meet data governance requirements.
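A minimal DAG sketch tying these capabilities together, using the Airflow 2 Python API. The scheduling, retry, and dependency features shown are real Airflow functionality; the DAG id and the empty task callables are illustrative placeholders.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: pull raw data from the external source
    ...


def transform():
    # Placeholder: clean and normalize the data
    ...


def load():
    # Placeholder: upload results to Data Planet
    ...


with DAG(
    dag_id="aralia_etl_example",      # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                # customizable scheduling
    default_args={
        "retries": 3,                           # automatic retries
        "retry_delay": timedelta(minutes=5),
    },
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Dependency control: stages run strictly in order
    t_extract >> t_transform >> t_load
```

Each run of this DAG appears in the Airflow web UI, where the status and logs of every task can be inspected.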
📌 Simply put: Airflow's role at Aralia is to act as an automated scheduler and process coordinator, keeping each Data Planet's ETL jobs on track and ensuring that ecosystem users always have access to up-to-date, consistent data.
Transform Example (Python)
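A representative transform step, sketched with pandas. The input file, column names, and output schema below are hypothetical, chosen only to illustrate typical cleansing and conversion work.

```python
import pandas as pd

# Read the raw extract (path and columns are illustrative)
df = pd.read_csv("raw_sales.csv")

# Drop rows missing the required key and remove duplicates
df = df.dropna(subset=["order_id"]).drop_duplicates(subset=["order_id"])

# Normalize types and formats to the destination schema
df["order_date"] = pd.to_datetime(df["order_date"]).dt.strftime("%Y-%m-%d")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce").fillna(0)

# Rename columns to a uniform naming convention
df = df.rename(columns={"order_id": "id", "order_date": "date"})

df.to_csv("clean_sales.csv", index=False)
```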
Load Example
Aralia Data Planet provides an API for uploading data. The available upload methods are documented here:
Aralia Data Planet Open API
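As an illustrative sketch only: the base URL, endpoint path, auth header, and payload below are placeholders, not the actual Open API contract. Refer to the Aralia Data Planet Open API documentation linked above for the real upload methods.

```python
import requests

# Placeholder values; consult the Aralia Data Planet Open API
# documentation for the actual base URL, endpoint, and auth scheme.
BASE_URL = "https://data-planet.example.com/api/v1"
API_TOKEN = "YOUR_API_TOKEN"


def upload_file(path: str) -> None:
    # Upload a transformed file to Data Planet (illustrative endpoint)
    with open(path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/datasets/upload",  # placeholder path
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            files={"file": f},
        )
    resp.raise_for_status()


upload_file("clean_sales.csv")
```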