Python for Data Engineering - My Experience with Pandas and FastAPI
12 Jun 2024

While Java and Spring Boot are my primary tools for building microservices, Python has become an invaluable part of my toolkit for data engineering tasks. Here’s what I’ve learned using Python, Pandas, and FastAPI in production.
The Use Case
At Infosys, I was tasked with building a data validation and transformation framework that would:
- Validate data across multiple heterogeneous databases
- Transform data between different formats and schemas
- Provide REST APIs for data aggregation and quality checks
- Generate reports and analytics on data consistency
Why Python?
For this use case, Python was the perfect choice:
Pandas for Data Manipulation
Pandas made it incredibly easy to:
- Load data from multiple database sources
- Perform complex transformations and aggregations
- Handle missing or inconsistent data
- Generate statistical summaries
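To give a feel for this, here is a minimal sketch of that kind of cleanup and aggregation. The file name and column names are purely illustrative, not from the actual project:

import pandas as pd

# Hypothetical customer extract with missing values and mixed formats
df = pd.read_csv("customers.csv", parse_dates=["created_at"])

# Handle missing or inconsistent data
df["country"] = df["country"].str.upper().fillna("UNKNOWN")
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce").fillna(0)

# Aggregate and summarize per country
summary = df.groupby("country").agg(
    customers=("customer_id", "count"),
    total_revenue=("revenue", "sum"),
)
print(summary.sort_values("total_revenue", ascending=False))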
NumPy for Numerical Operations
NumPy provided efficient array operations for:
- Mathematical transformations
- Statistical calculations
- Performance-critical computations
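A small illustrative example of the kind of vectorized calculation involved (the values are made up):

import numpy as np

# Vectorized statistics over a hypothetical column of transaction amounts
amounts = np.array([120.5, 99.0, 305.2, 87.4, 410.0])

mean = amounts.mean()
std = amounts.std()

# Flag outliers more than two standard deviations from the mean
z_scores = (amounts - mean) / std
outliers = amounts[np.abs(z_scores) > 2]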
FastAPI for Building APIs
FastAPI became my go-to framework for building REST APIs because:
- Automatic API documentation with Swagger/OpenAPI
- Built-in data validation using Pydantic models
- Async support for high-performance endpoints
- Type hints for better code quality and IDE support
Real-World Example: Data Validation Suite
Here’s how I structured the Data Validation Suite:
1. Data Extraction Layer
# Extract data from Oracle, PostgreSQL, SQL Server
import pandas as pd
from sqlalchemy import create_engine

def extract_data(connection_string, query):
    engine = create_engine(connection_string)
    df = pd.read_sql(query, engine)
    return df
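Because SQLAlchemy abstracts the dialect, the same helper can pull from every source. A sketch of what calling it across heterogeneous databases might look like (the connection strings below are placeholders, not real ones):

# Illustrative connection strings; real credentials came from configuration
sources = {
    "oracle":   "oracle+cx_oracle://user:pass@host:1521/?service_name=ORCL",
    "postgres": "postgresql+psycopg2://user:pass@host:5432/reporting",
    "mssql":    "mssql+pyodbc://user:pass@host:1433/sales?driver=ODBC+Driver+17+for+SQL+Server",
}

frames = {
    name: extract_data(conn, "SELECT * FROM customers")
    for name, conn in sources.items()
}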
2. Validation Rules Engine
# Define customizable validation rules
class ValidationRule:
    def __init__(self, name, condition, severity):
        self.name = name
        self.condition = condition
        self.severity = severity

    def validate(self, df):
        # Keep only the rows that fail the rule's row-level condition
        violations = df[~df.apply(self.condition, axis=1)]
        return violations
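A rule can then be defined with a simple lambda condition. The rule names and columns below are illustrative, reusing the extract_data helper and sources sketch from above:

# Example rules: every order needs a customer id and a positive amount
not_null_customer = ValidationRule(
    name="customer_id_present",
    condition=lambda row: pd.notnull(row["customer_id"]),
    severity="ERROR",
)
positive_amount = ValidationRule(
    name="amount_positive",
    condition=lambda row: row["amount"] > 0,
    severity="WARNING",
)

orders_df = extract_data(sources["postgres"], "SELECT * FROM orders")
for rule in (not_null_customer, positive_amount):
    violations = rule.validate(orders_df)
    print(rule.name, rule.severity, len(violations), "violations")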
3. FastAPI Endpoints
from typing import List

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class ValidationRequest(BaseModel):
    source: str
    rules: List[str]

@app.post("/validate")
async def validate_data(request: ValidationRequest):
    # Extract, validate, and return results
    results = run_validation(request.source, request.rules)
    return {"status": "success", "violations": results}
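The run_validation function referenced in the endpoint isn’t shown above. A hypothetical stub, wiring together the rules and sources from the earlier sketches, could look like this:

# Hypothetical wiring of the pieces above; RULES and SOURCES would be
# loaded from configuration in the real framework.
RULES = {"customer_id_present": not_null_customer, "amount_positive": positive_amount}
SOURCES = sources  # connection strings keyed by logical source name

def run_validation(source: str, rule_names: List[str]) -> List[dict]:
    if source not in SOURCES:
        raise HTTPException(status_code=404, detail=f"Unknown source: {source}")
    df = extract_data(SOURCES[source], "SELECT * FROM orders")
    results = []
    for name in rule_names:
        rule = RULES[name]
        violations = rule.validate(df)
        results.append({
            "rule": name,
            "severity": rule.severity,
            "violation_count": len(violations),
        })
    return results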
4. Deployment
Containerized with Docker and deployed on Kubernetes alongside our Java microservices.
Key Learnings
1. Pandas is Powerful but Memory-Intensive
For large datasets (millions of rows), I learned to:
- Use chunking to process data in batches
- Optimize data types (e.g., use the category dtype for string columns with few unique values)
- Use dask for datasets that don’t fit in memory
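To illustrate the chunking point, here is a minimal sketch that validates a large table in batches and downcasts a low-cardinality column, reusing the helpers and rules from the earlier examples (the table and column names are illustrative):

# Process a large table in batches instead of loading it all at once
engine = create_engine(sources["postgres"])

violation_counts = []
for chunk in pd.read_sql("SELECT * FROM orders", engine, chunksize=100_000):
    # Low-cardinality string columns are much cheaper as 'category'
    chunk["status"] = chunk["status"].astype("category")
    violation_counts.append(len(positive_amount.validate(chunk)))

total_violations = sum(violation_counts)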
2. FastAPI’s Async Capabilities are Game-Changing
Async endpoints allowed me to:
- Handle concurrent requests efficiently
- Perform non-blocking database operations
- Improve overall API throughput
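A hedged sketch of what such an endpoint might look like, fanning the blocking validation work out to a thread pool so the event loop stays free to serve other requests (the route name is illustrative):

import asyncio

from fastapi.concurrency import run_in_threadpool

@app.post("/validate-all")
async def validate_all(request: ValidationRequest):
    # Run the blocking validations concurrently without blocking the event loop
    tasks = [
        run_in_threadpool(run_validation, source, request.rules)
        for source in SOURCES
    ]
    results = await asyncio.gather(*tasks)
    return dict(zip(SOURCES, results))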
3. Type Hints Improve Code Quality
Using Pydantic models and type hints:
- Caught errors at development time
- Provided automatic request/response validation
- Made the code self-documenting
4. Testing is Essential
I wrote comprehensive tests using pytest:
- Unit tests for validation rules
- Integration tests for database operations
- API tests using FastAPI’s TestClient
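For example, an API test with FastAPI’s TestClient, following the request and response shape sketched earlier:

from fastapi.testclient import TestClient

client = TestClient(app)

def test_validate_returns_violation_counts():
    response = client.post(
        "/validate",
        json={"source": "postgres", "rules": ["amount_positive"]},
    )
    assert response.status_code == 200
    body = response.json()
    assert body["status"] == "success"
    assert all("violation_count" in item for item in body["violations"])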
Performance Optimization
For production workloads, I implemented:
Caching
Used Redis to cache frequently accessed data and validation results.
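A minimal sketch of that caching pattern with the redis-py client (the key format and TTL are illustrative, not the production values):

import json
import redis

cache = redis.Redis(host="redis", port=6379, decode_responses=True)

def cached_validation(source, rules, ttl_seconds=300):
    key = f"validation:{source}:{','.join(sorted(rules))}"
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)        # cache hit: reuse earlier result
    results = run_validation(source, rules)
    cache.setex(key, ttl_seconds, json.dumps(results))
    return results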
Database Connection Pooling
Configured SQLAlchemy with connection pools to avoid connection overhead.
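Pool settings can be passed straight to create_engine; the exact numbers below are illustrative rather than the values we ran in production:

from sqlalchemy import create_engine

engine = create_engine(
    sources["postgres"],
    pool_size=10,        # persistent connections kept open
    max_overflow=5,      # extra connections allowed under burst load
    pool_pre_ping=True,  # detect stale connections before using them
    pool_recycle=1800,   # recycle connections every 30 minutes
)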
Parallel Processing
Used Python’s concurrent.futures for parallel validation across multiple data sources.
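A sketch of that fan-out with ThreadPoolExecutor, assuming the run_validation helper sketched earlier (worker count and rule name are illustrative):

from concurrent.futures import ThreadPoolExecutor

# Validate several sources at once instead of one after another
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {
        source: pool.submit(run_validation, source, ["amount_positive"])
        for source in SOURCES
    }
    results = {source: future.result() for source, future in futures.items()}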
Profiling
Used cProfile and memory_profiler to identify and optimize bottlenecks.
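For instance, a single validation run can be profiled like this (a minimal cProfile sketch, not the exact harness we used):

import cProfile
import pstats

# Profile one validation run and show the ten most expensive calls
with cProfile.Profile() as profiler:
    run_validation("postgres", ["amount_positive"])

stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(10)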
Integration with Java Microservices
The Python services integrated seamlessly with our Java ecosystem:
- REST APIs for synchronous communication
- Kafka for async event streaming
- Shared PostgreSQL database for persistence
- Common monitoring and logging infrastructure
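For the Kafka piece, a producer publishing data-quality events might look like the sketch below. The post doesn’t say which client was used; this assumes kafka-python, and the topic name and payload are illustrative:

import json
from kafka import KafkaProducer  # kafka-python client; other clients work too

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a data-quality event that Java consumers can react to
producer.send("data-quality-events", {
    "source": "postgres",
    "rule": "amount_positive",
    "violation_count": 42,
})
producer.flush()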
Impact
The Data Validation Suite:
- Improved data accuracy by catching inconsistencies early
- Reduced manual validation effort by 80%
- Provided real-time visibility into data quality
- Became a critical tool for data governance
Conclusion
Python, Pandas, and FastAPI are excellent tools for data engineering tasks. They complement Java/Spring Boot beautifully, allowing you to use the right tool for the right job.
If you’re building data-intensive applications, I highly recommend adding Python to your toolkit. The ecosystem is mature, the libraries are powerful, and the developer experience is fantastic.