The Problem with Cron Jobs: You Find Out at 2 AM
Many teams end up in exactly this situation. They needed a pipeline: pull data with SQL, clean it with Python, load it into a final table. The solution? Set up a cron job, schedule the scripts. Done.
But in production this approach breaks silently. A script fails, and nobody knows. A dependency breaks, and the data sits half-processed. The system restarts, and the cron job never comes back up. You walk into the office in the morning and the dashboard is empty.
A cron job is a scheduler, not an orchestrator. All it does is run a script at a fixed time. It has no idea whether the previous script completed or failed. It will not retry. It will not send an alert. It will not track dependencies.
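Here is roughly what that naive setup looks like in a crontab. The script paths, times, and log file are hypothetical:

# Edit with `crontab -e`. Each stage fires at a fixed time,
# blindly hoping the previous stage finished in time.
0 2 * * *   /usr/bin/python3 /opt/etl/extract.py  >> /var/log/etl.log 2>&1
30 2 * * *  /usr/bin/python3 /opt/etl/clean.py    >> /var/log/etl.log 2>&1
0 3 * * *   /usr/bin/python3 /opt/etl/load.py     >> /var/log/etl.log 2>&1

If extract.py runs slow or dies, clean.py still fires at 2:30 on top of incomplete data. Nothing in cron connects these three lines to each other.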
Here is everything cron cannot do for you:
- ✗ A script fails, and no alert goes out
- ✗ No automatic retries
- ✗ No tracking of task dependencies
- ✗ Logs are never in one place
- ✗ Partial failures pass silently
- ✗ Scheduling conflicts and race conditions
A pipeline orchestrator flips every one of these:
- ✓ Immediate alert on failure: email, Slack, PagerDuty
- ✓ Configurable retries with backoff
- ✓ Task B runs only after Task A completes
- ✓ Centralized logs and a visual DAG dashboard
- ✓ On a partial failure, only the failed task is retried
- ✓ Backfill support: reprocess data for past dates
What Is a Data Pipeline?
A data pipeline is an automated workflow that picks data up from a source, processes it, and loads it into a defined destination. Each step does one specific job, and everything happens in a defined order. A typical pipeline has five steps:
1. Extract: pull raw data from a database, an API, an S3 bucket, a Kafka topic, or any other source. This step defines where the data comes from and how to connect to it.
2. Clean: handle null values, remove duplicate rows, fix invalid formats. Skip this step and garbage data corrupts every analysis downstream.
3. Transform: reshape raw data into a meaningful format. Aggregations, joins, calculations: daily revenue by city, or monthly active users per product category.
4. Validate: check expected row counts, validate the schema, run anomaly detection. If validation fails, stop the pipeline and send an alert. Never silently load corrupt data.
5. Load: push the processed data into a data warehouse (BigQuery, Snowflake, Redshift), a database, or a dashboard, ready for analysis and reporting.
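Here is a minimal sketch of those five steps as plain Python with pandas. The file names and the city/revenue columns are assumptions for illustration, not a fixed schema:

import pandas as pd

def extract(path: str) -> pd.DataFrame:
    # Extract: pull raw data (a CSV here; a DB query or API call in real life)
    return pd.read_csv(path)

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Clean: drop duplicates, handle nulls, fix invalid formats
    df = df.drop_duplicates().copy()
    df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
    return df.dropna(subset=["city", "revenue"])

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Transform: aggregate daily revenue by city
    return df.groupby("city", as_index=False)["revenue"].sum()

def validate(df: pd.DataFrame) -> pd.DataFrame:
    # Validate: stop the pipeline rather than load garbage
    if df.empty:
        raise ValueError("validation failed: 0 rows after transform")
    if (df["revenue"] < 0).any():
        raise ValueError("validation failed: negative revenue values")
    return df

def load(df: pd.DataFrame, path: str) -> None:
    # Load: write to the destination (a file here; a warehouse in production)
    df.to_csv(path, index=False)

if __name__ == "__main__":
    load(validate(transform(clean(extract("raw_sales.csv")))), "daily_revenue.csv")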
DAG: What Is a Directed Acyclic Graph?
When a pipeline has multiple steps, some of which can run in parallel while others are sequential, we use a DAG to define which task runs when.
Directed: every edge has a direction. Task A flows into Task B, never backward.
Acyclic: no cycles. Task A → Task B → Task A cannot happen, so there is no infinite loop.
Graph: tasks are the nodes, dependencies are the edges.
┌─────────────────────┐
│   START (Trigger)   │
│   Daily @ 2:00 AM   │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  extract_raw_sales  │ ← Task A: pull raw CSVs from S3
└──────────┬──────────┘
           │
       ┌───┴────────────┬────────────────┐
       ▼                ▼                ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ clean_nulls  │ │ remove_dupes │ │ fix_formats  │ ← Parallel tasks
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
       └────────────────┼────────────────┘
                        │
                        ▼
             ┌─────────────────────┐
             │  transform_revenue  │ ← City-wise, product-wise aggregation
             └──────────┬──────────┘
                        │
                        ▼
             ┌─────────────────────┐
             │   validate_output   │ ← Row count check, schema validation
             └──────────┬──────────┘
                        │
                        ▼
             ┌─────────────────────┐
             │  load_to_warehouse  │ ← Load into BigQuery / Redshift
             └──────────┬──────────┘
                        │
                        ▼
             ┌─────────────────────┐
             │  notify_team_slack  │ ← "Daily report ready" message
             └─────────────────────┘
In this DAG the cleaning steps run in parallel, all three simultaneously. But transform_revenue will not start until all three cleaning tasks have completed. The pipeline orchestrator handles this dependency tracking automatically.
If validate_output fails, load_to_warehouse simply never runs. Corrupt data never reaches the warehouse, and the team gets a Slack alert that validation failed.
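You can watch this dependency math work without installing anything. Here is a toy sketch using Python's standard-library graphlib; this is not how orchestrators are implemented, it only shows which tasks become runnable when:

from graphlib import TopologicalSorter

# The diagram above, written as "task -> set of tasks it depends on"
dag = {
    "extract_raw_sales":  set(),
    "clean_nulls":        {"extract_raw_sales"},
    "remove_dupes":       {"extract_raw_sales"},
    "fix_formats":        {"extract_raw_sales"},
    "transform_revenue":  {"clean_nulls", "remove_dupes", "fix_formats"},
    "validate_output":    {"transform_revenue"},
    "load_to_warehouse":  {"validate_output"},
    "notify_team_slack":  {"load_to_warehouse"},
}

ts = TopologicalSorter(dag)
ts.prepare()                      # raises CycleError if a cycle sneaks in
while ts.is_active():
    batch = list(ts.get_ready())  # every task whose dependencies are all done
    print("can run in parallel:", batch)
    ts.done(*batch)

The first batch is extract_raw_sales alone; the second batch is all three cleaning tasks at once; after that the chain runs one task at a time, exactly like the diagram.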
Apache Airflow: The Industry-Standard Orchestrator
Apache Airflow was built at Airbnb in 2014 for their internal data pipelines, open-sourced in 2015, and joined the Apache Software Foundation in 2016. Today it is the most widely used pipeline orchestration tool in data engineering: Uber, Twitter, Lyft, Flipkart all use it.
In Airflow you write your DAG as Python code. It gives you a web UI where you can monitor your pipelines, trigger them manually, or retry failed tasks individually.
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

# Placeholder callables -- in a real pipeline these contain your actual logic
def extract_from_s3():
    ...  # pull raw CSVs from S3

def aggregate_by_city():
    ...  # aggregate revenue city-wise

def load_to_bigquery():
    ...  # write the final table to BigQuery

# Default settings, applied to every task
default_args = {
    'owner': 'data-team',
    'retries': 3,                          # retry 3 times on failure
    'retry_delay': timedelta(minutes=5),   # wait 5 minutes between retries
    'email_on_failure': True,              # email the team when a task fails
    'email': ['data-team@company.com'],
}

# Define the DAG
with DAG(
    dag_id='daily_sales_pipeline',
    default_args=default_args,
    schedule_interval='0 2 * * *',         # every night at 2:00 AM
    start_date=datetime(2025, 1, 1),
    catchup=False,                         # ignore past missed runs
) as dag:

    # Task 1: Extract
    extract_task = PythonOperator(
        task_id='extract_raw_sales',
        python_callable=extract_from_s3,   # your Python function, defined above
    )

    # Task 2: Transform
    transform_task = PythonOperator(
        task_id='transform_revenue',
        python_callable=aggregate_by_city,
    )

    # Task 3: Load
    load_task = PythonOperator(
        task_id='load_to_warehouse',
        python_callable=load_to_bigquery,
    )

    # Define the dependencies -- this is what makes it a DAG
    extract_task >> transform_task >> load_task
    # Extract completes -> Transform starts -> Load starts
Airflow is powerful, but the setup is heavy. You need a dedicated server, a Postgres or MySQL database, and a separate scheduler process. For large teams that is worthwhile; for small projects it can be over-engineering.
Prefect: A Modern, Python-Native Alternative
Prefect is a newer orchestration tool built to address Airflow's complexity. Prefect's main pitch: "It's just Python." There is no separate DAG syntax to learn. You use the Python you already know and add decorators.
from prefect import flow, task
from prefect.tasks import task_input_hash

# Placeholder helpers -- swap in your real S3 / BigQuery logic
def fetch_data_from_s3(path):
    ...  # read raw sales data

def write_to_bigquery(df, table):
    ...  # write the aggregated table

# The @task decorator turns a plain function into a pipeline task
@task(retries=3, retry_delay_seconds=60, cache_key_fn=task_input_hash)
def extract_raw_sales():
    # pull data from S3 or a database
    return fetch_data_from_s3("s3://bucket/sales/")

@task
def transform_revenue(raw_data):
    # aggregate revenue city-wise
    return raw_data.groupby("city")["revenue"].sum()

@task
def load_to_warehouse(transformed_data):
    # load into BigQuery
    write_to_bigquery(transformed_data, table="sales.daily_revenue")

# The @flow decorator marks the pipeline's entry point
@flow(name="Daily Sales Pipeline")
def daily_sales_pipeline():
    raw = extract_raw_sales()
    aggregated = transform_revenue(raw)
    load_to_warehouse(aggregated)

if __name__ == "__main__":
    daily_sales_pipeline()  # test locally by running this file directly
In Prefect there is no separate configuration file to write. You write Python functions, add decorators, and the pipeline is ready. It runs the same way locally and in production, and testing is far simpler because these are plain Python functions.
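Once it works locally, recent Prefect releases (2.10+ and 3.x) can put the same flow on a schedule with flow.serve; the deployment name below is made up:

if __name__ == "__main__":
    # Instead of the one-off call above, register a nightly schedule with Prefect
    daily_sales_pipeline.serve(
        name="nightly-sales",  # hypothetical deployment name
        cron="0 2 * * *",      # the same 2 AM schedule, now with retries and alerts
    )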
How the main options compare:
- Airflow: DAG-based orchestration. Heavy setup, rich UI, massive ecosystem. Best for large teams.
- Prefect: Pure Python decorators. Easy local testing, fast setup, built-in cloud UI. Growing adoption.
- Luigi: Built by Spotify. Good for simple dependency-based pipelines; not as feature-rich as Airflow.
- Managed cloud orchestrators: No infrastructure to manage. Tight integration with your cloud provider. Pay-per-use pricing.
Manual Cron Scripts vs a Data Pipeline: The Difference in Production
Both approaches work in a simple scenario. The difference shows up when the pipeline gets complex, data volume grows, or something unexpected fails at 2 AM.
CRON JOB APPROACH:
─────────────────────────────────────────────────────
step_1.py ✓ step_2.py ✓ step_3.py ✗ [SILENT]
│
└── Script exits with error code 1
No one knows.
step_4, step_5, step_6 — did they run?
Was data partially loaded?
Dashboard empty in the morning.
Manual investigation starts at 9 AM.
AIRFLOW / PREFECT PIPELINE:
─────────────────────────────────────────────────────
Task 1 ✓ Task 2 ✓ Task 3 ✗
│
├── Immediate Slack / Email alert: "Task 3 failed"
├── Retry attempt 1 (after 5 min) ✗
├── Retry attempt 2 (after 5 min) ✗
├── Retry attempt 3 (after 5 min) ✓ ← Fixed itself
│
├── Task 4, 5, 6 resume automatically
└── Full logs available in UI — exact error line
- A data pipeline is an automated workflow that picks data up from a source, cleans it, transforms it, validates it, and loads it into a final destination, all in a defined order.
- A DAG (Directed Acyclic Graph) defines the pipeline's structure. Each step is a task. Task B runs only after Task A completes. No cycles, so no infinite loops.
- Cron jobs are schedulers, not orchestrators. They cannot track dependencies, retry tasks, or send alerts. In production, they are the main cause of silent failures.
- Apache Airflow is the industry standard: you write DAGs in Python and get a web UI for monitoring pipelines. It is the best choice for large teams and complex workflows.
- Prefect is the modern alternative: pure Python decorators, easy local testing, fast setup. For new projects and smaller teams, Prefect is simpler than Airflow.
- The real data engineering skill is not just transforming data; it is automating it, monitoring it, handling retries, and making it production-ready. That is the skill data pipelines teach.
Want the Full Pipeline Guide?
Save this reel so you can revise it later.
Comment PIPELINE, AIRFLOW, ETL, or DATAFLOW on Instagram;
whichever you want, the full detailed guide will arrive in your DMs.