Implement streaming pipeline endpoint for batch processing
- Added `/run/streaming-pipeline` endpoint to process data in batches from either a database or CSV file.
- Introduced `BatchResponse` model for structured responses.
- Updated README with new endpoint details, including parameters and example usage.
- Enhanced error handling and logging during batch processing.
- Ensured data preprocessing and NaN handling in analysis functions.
@@ -119,6 +119,32 @@ uvicorn salary_analytics.api:app --reload

6. **Pipeline**
   - `POST /run/pipeline`: Run complete pipeline
   - `POST /run/streaming-pipeline`: Run pipeline in batches
     - Parameters:
       - `source`: Data source ('db' or 'csv')
       - `file`: CSV file (required if source is 'csv')
       - `batch_size`: Number of rows to process in each batch (default: 10000)
     - Example:
       ```bash
       # Run streaming pipeline from database
       curl -X POST "http://localhost:8000/run/streaming-pipeline?source=db&batch_size=5000"

       # Run streaming pipeline from CSV
       curl -X POST "http://localhost:8000/run/streaming-pipeline?source=csv&batch_size=5000" -F "file=@path/to/your/file.csv"
       ```
     - Response:
       ```json
       [
         {
           "batch_number": 1,
           "total_batches": 10,
           "processed_rows": 5000,
           "results_path": "/app/output/csv/batch_results_20240315_123456/batch_1_results.csv",
           "message": "Successfully processed batch 1 of 10"
         },
         // ... more batch responses ...
       ]
       ```

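The request and response shown above can also be handled from Python. The following is a minimal sketch using only the standard library, not code from this repository: `build_streaming_url` and `summarize_batches` are hypothetical helper names, and only the endpoint path, the `source` and `batch_size` query parameters, and the response field names are taken from the example above.

```python
import json
from urllib.parse import urlencode

# Hypothetical helpers for illustration. Only the endpoint path, the query
# parameters, and the response field names come from the README example.
VALID_SOURCES = ("db", "csv")


def build_streaming_url(base_url: str, source: str, batch_size: int = 10000) -> str:
    """Build the request URL for POST /run/streaming-pipeline."""
    if source not in VALID_SOURCES:
        raise ValueError(f"source must be one of {VALID_SOURCES}, got {source!r}")
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    query = urlencode({"source": source, "batch_size": batch_size})
    return f"{base_url.rstrip('/')}/run/streaming-pipeline?{query}"


def summarize_batches(body: str) -> dict:
    """Summarize a JSON list of batch responses shaped like the example."""
    batches = json.loads(body)
    return {
        "batches_received": len(batches),
        "total_rows": sum(b["processed_rows"] for b in batches),
        "result_paths": [b["results_path"] for b in batches],
    }


if __name__ == "__main__":
    print(build_streaming_url("http://localhost:8000", "db", batch_size=5000))
```

Validating `source` and `batch_size` on the client side is a design choice of this sketch; how the server itself validates them is not shown in this diff.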
### Workflow

@@ -127,6 +153,12 @@ uvicorn salary_analytics.api:app --reload
3. Run any of the analysis endpoints
4. Generate and download reports as needed

For large datasets, use the streaming pipeline endpoint:
1. Start the API server
2. Run the streaming pipeline with an appropriate `batch_size`
3. Monitor batch processing progress
4. Access results in the batch results directory (each batch response reports its file in `results_path`)

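To make the choice of batch size in step 2 concrete, here is a rough sketch of how a row count and a `batch_size` could map onto the `batch_number`, `total_batches`, and `processed_rows` fields of the response. It is an illustration under assumptions, not the project's implementation: `iter_batches` is a hypothetical name, and the sketch assumes the final batch simply holds the remaining rows.

```python
import math
from typing import Iterator, Sequence


def iter_batches(rows: Sequence, batch_size: int = 10000) -> Iterator[dict]:
    """Yield one progress record per batch (hypothetical helper, not the project's code)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    # Assumption: a partial final batch still counts as a batch.
    total_batches = math.ceil(len(rows) / batch_size)
    for index in range(total_batches):
        chunk = rows[index * batch_size:(index + 1) * batch_size]
        batch_number = index + 1
        yield {
            "batch_number": batch_number,
            "total_batches": total_batches,
            "processed_rows": len(chunk),
            "message": f"Successfully processed batch {batch_number} of {total_batches}",
        }


if __name__ == "__main__":
    for record in iter_batches(list(range(25000)), batch_size=10000):
        print(record["message"], "-", record["processed_rows"], "rows")
```

Under these assumptions, a smaller `batch_size` lowers the number of rows held per batch at the cost of more batches and more result files.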
Note: All analysis endpoints require data to be loaded first. If you try to run any analysis without loading data, you'll receive a 400 error with a message to load data first.

## Docker Deployment