Implement streaming pipeline endpoint for batch processing

- Added `/run/streaming-pipeline` endpoint to process data in batches from either a database or CSV file.
- Introduced `BatchResponse` model for structured responses.
- Updated README with new endpoint details, including parameters and example usage.
- Enhanced error handling and logging during batch processing.
- Ensured data preprocessing and NaN handling in analysis functions.
2025-05-02 14:25:31 +01:00
parent 5767f55686
commit 9c429caa56
10 changed files with 246 additions and 11 deletions
@@ -119,6 +119,32 @@ uvicorn salary_analytics.api:app --reload
6. **Pipeline**
   - `POST /run/pipeline`: Run complete pipeline
   - `POST /run/streaming-pipeline`: Run pipeline in batches
     - Parameters:
       - `source`: Data source ('db' or 'csv')
       - `file`: CSV file (required if source is 'csv')
       - `batch_size`: Number of rows to process in each batch (default: 10000)
     - Example:
       ```bash
       # Run streaming pipeline from database
       curl -X POST "http://localhost:8000/run/streaming-pipeline?source=db&batch_size=5000"
       # Run streaming pipeline from CSV
       curl -X POST "http://localhost:8000/run/streaming-pipeline?source=csv&batch_size=5000" -F "file=@path/to/your/file.csv"
       ```
     - Response:
       ```json
       [
         {
           "batch_number": 1,
           "total_batches": 10,
           "processed_rows": 5000,
           "results_path": "/app/output/csv/batch_results_20240315_123456/batch_1_results.csv",
           "message": "Successfully processed batch 1 of 10"
         },
         // ... more batch responses ...
       ]
       ```
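
For callers that prefer Python over curl, the documented query parameters and response shape can be sketched roughly as follows. This is a minimal standard-library sketch, not the project's client: the `BatchResult` dataclass, `build_streaming_url`, and `parse_batches` are hypothetical names that mirror the documented fields, and the real `BatchResponse` model may differ.

```python
import json
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass
class BatchResult:
    """Hypothetical client-side mirror of one documented batch response."""
    batch_number: int
    total_batches: int
    processed_rows: int
    results_path: str
    message: str


def build_streaming_url(base_url: str, source: str, batch_size: int = 10000) -> str:
    """Build the URL for POST /run/streaming-pipeline from the documented parameters."""
    if source not in ("db", "csv"):
        raise ValueError("source must be 'db' or 'csv'")
    query = urlencode({"source": source, "batch_size": batch_size})
    return f"{base_url}/run/streaming-pipeline?{query}"


def parse_batches(payload: str) -> list[BatchResult]:
    """Parse the JSON array of batch responses.

    Assumes each item carries exactly the five documented fields;
    an extra or missing field raises TypeError.
    """
    return [BatchResult(**item) for item in json.loads(payload)]


# Sample payload copied from the documented response (first batch only).
sample = """[
  {
    "batch_number": 1,
    "total_batches": 10,
    "processed_rows": 5000,
    "results_path": "/app/output/csv/batch_results_20240315_123456/batch_1_results.csv",
    "message": "Successfully processed batch 1 of 10"
  }
]"""

batches = parse_batches(sample)
print(build_streaming_url("http://localhost:8000", "db", 5000))
print(sum(b.processed_rows for b in batches), "rows processed so far")
```

Keeping URL construction and response parsing as separate pure functions lets them be checked without a running server; the HTTP call itself can be made with whichever client the caller already uses.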
### Workflow
@@ -127,6 +153,12 @@ uvicorn salary_analytics.api:app --reload
3. Run any of the analysis endpoints
4. Generate and download reports as needed
For large datasets, use the streaming pipeline endpoint:
1. Start the API server
2. Run the streaming pipeline with appropriate batch size
3. Monitor batch processing progress
4. Access results in the batch results directory
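
When picking a batch size for step 2, it can help to see how row count and batch size relate. The sketch below assumes the server splits the input into fixed-size chunks with a smaller final chunk; the README does not state this, so treat it as an assumption. `expected_batches` and `iter_batches` are hypothetical helpers, not part of the project.

```python
from itertools import islice
from typing import Iterable, Iterator


def expected_batches(total_rows: int, batch_size: int) -> int:
    """Number of batches needed, rounding up for a partial final batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return (total_rows + batch_size - 1) // batch_size


def iter_batches(rows: Iterable, batch_size: int) -> Iterator[list]:
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 50,000 rows at batch_size=5000 matches the "total_batches": 10 example.
print(expected_batches(50_000, 5_000))               # 10
print([len(b) for b in iter_batches(range(12), 5)])  # [5, 5, 2]
```

Under this assumption, a smaller `batch_size` lowers peak memory per batch but produces more result files in the batch results directory.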
Note: All analysis endpoints require data to be loaded first. If you try to run any analysis without loading data, you'll receive a 400 error with a message to load data first.
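
A client can surface that 400 error with a readable message. The following is a rough sketch using only the standard library. It assumes the error body is JSON with a `detail` field (FastAPI's default for `HTTPException`), which the README does not confirm; `describe_error` and `post` are hypothetical helpers, and the URL passed to `post` is whichever analysis endpoint the caller targets.

```python
import json
import urllib.error
import urllib.request


def describe_error(status: int, body: bytes) -> str:
    """Turn an error response into a readable message."""
    try:
        # Assumed shape: {"detail": "..."}; fall back to raw text otherwise.
        detail = json.loads(body).get("detail", "")
    except (ValueError, AttributeError):
        detail = body.decode("utf-8", errors="replace")
    if status == 400:
        return f"Bad request (is data loaded?): {detail}"
    return f"HTTP {status}: {detail}"


def post(url: str) -> str:
    """POST to an endpoint and return the response body, raising on HTTP errors."""
    request = urllib.request.Request(url, data=b"", method="POST")
    try:
        with urllib.request.urlopen(request) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        raise RuntimeError(describe_error(err.code, err.read())) from err


# Example of the message produced for a "load data first" response.
print(describe_error(400, b'{"detail": "Please load data first"}'))
```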
## Docker Deployment