Implement streaming pipeline endpoint for batch processing

- Added `/run/streaming-pipeline` endpoint to process data in batches from either a database or CSV file.
- Introduced `BatchResponse` model for structured responses.
- Updated README with new endpoint details, including parameters and example usage.
- Enhanced error handling and logging during batch processing.
- Ensured data preprocessing and NaN handling in analysis functions.
2025-05-02 14:25:31 +01:00
parent 5767f55686
commit 9c429caa56
10 changed files with 246 additions and 11 deletions
@@ -119,6 +119,32 @@ uvicorn salary_analytics.api:app --reload

 6. **Pipeline**
   - `POST /run/pipeline`: Run complete pipeline
+   - `POST /run/streaming-pipeline`: Run pipeline in batches
+     - Parameters:
+       - `source`: Data source ('db' or 'csv')
+       - `file`: CSV file (required if source is 'csv')
+       - `batch_size`: Number of rows to process in each batch (default: 10000)
+     - Example:
+       ```bash
+       # Run streaming pipeline from database
+       curl -X POST "http://localhost:8000/run/streaming-pipeline?source=db&batch_size=5000"
+       
+       # Run streaming pipeline from CSV
+       curl -X POST "http://localhost:8000/run/streaming-pipeline?source=csv&batch_size=5000" -F "file=@path/to/your/file.csv"
+       ```
+     - Response:
+       ```json
+       [
+         {
+           "batch_number": 1,
+           "total_batches": 10,
+           "processed_rows": 5000,
+           "results_path": "/app/output/csv/batch_results_20240315_123456/batch_1_results.csv",
+           "message": "Successfully processed batch 1 of 10"
+         },
+         // ... more batch responses ...
+       ]
+       ```
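
The response fields above can be read as one record per batch. The sketch below is an illustration only, not the project's code. It assumes `total_batches` is the row count divided by `batch_size`, rounded up, and it uses a plain dataclass as a stand-in for the `BatchResponse` model, whose definition is not shown in this diff. `plan_batches` is a hypothetical helper, and the file name pattern simply follows the example response.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class BatchResponse:
    """Stand-in mirroring the JSON fields above (not the project's actual model)."""
    batch_number: int
    total_batches: int
    processed_rows: int
    results_path: str
    message: str


def plan_batches(total_rows: int, batch_size: int = 10000) -> List[BatchResponse]:
    """Hypothetical helper: describe each batch for a dataset of total_rows rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Assumption: the batch count is the row count divided by batch_size, rounded up.
    total_batches = math.ceil(total_rows / batch_size)
    responses = []
    for number in range(1, total_batches + 1):
        # The final batch may hold fewer rows than batch_size.
        rows = min(batch_size, total_rows - (number - 1) * batch_size)
        responses.append(
            BatchResponse(
                batch_number=number,
                total_batches=total_batches,
                processed_rows=rows,
                results_path=f"batch_{number}_results.csv",  # assumed naming pattern
                message=f"Successfully processed batch {number} of {total_batches}",
            )
        )
    return responses
```

Under these assumptions, a dataset whose size is not a multiple of `batch_size` yields a shorter final batch, so `processed_rows` in the last response can be smaller than in the others.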

 ### Workflow

@@ -127,6 +153,12 @@ uvicorn salary_analytics.api:app --reload
 3. Run any of the analysis endpoints
 4. Generate and download reports as needed

+For large datasets, use the streaming pipeline endpoint:
+1. Start the API server
+2. Run the streaming pipeline with an appropriate `batch_size`
+3. Monitor batch processing progress
+4. Access results in the batch results directory (see `results_path` in each batch response)
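
How the server splits its input into batches is not shown in this diff. One common approach for a CSV source, sketched here under that assumption using only the standard library, is to read rows lazily and yield fixed-size chunks so the whole file never has to sit in memory. `iter_csv_batches` is a hypothetical name, not a function from this project.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, List, TextIO


def iter_csv_batches(handle: TextIO, batch_size: int = 10000) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most batch_size rows from an open CSV file (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    reader = csv.DictReader(handle)
    while True:
        # islice pulls the next batch_size rows without reading the rest of the file.
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# Usage with an in-memory CSV; a real call would pass open(path, newline="").
sample = io.StringIO("name,salary\na,1\nb,2\nc,3\n")
for number, batch in enumerate(iter_csv_batches(sample, batch_size=2), start=1):
    print(number, len(batch))
```

A smaller `batch_size` lowers peak memory use at the cost of more batches, which is the trade-off to weigh when choosing the value in step 2.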
+
 Note: All analysis endpoints require data to be loaded first. If you try to run any analysis without loading data, you'll receive a 400 error with a message to load data first.

 ## Docker Deployment