# Salary Analytics

A salary analytics system that analyzes transaction data to identify salary earners, predict future salaries, and generate detailed reports.

## Features

- **Transaction Analysis**
  - Keyword-based salary transaction identification
  - Consistent amount transaction analysis
  - Transaction type analysis
  - Hypothesis overlap visualization

- **Salary Earner Classification**
  - Verified salary earner identification
  - Likely salary earner identification
  - High earner detection
  - Salary pattern analysis

- **Machine Learning**
  - Salary prediction models
  - Separate models for consistent and inconsistent earners
  - Feature engineering
  - Model evaluation metrics
  - Model persistence (saved in `output/models`)

- **Reporting**
  - CSV report generation
  - Visualization plots
  - High earner details
  - Salary earner statistics

## Architecture

The project is organized into the following modules:

```
salary_analytics/
├── __init__.py
├── config.py                      # Configuration settings
├── data_loader.py                 # Database connection and data loading
├── keyword_analyzer.py            # Keyword-based analysis
├── consistent_amount_analyzer.py  # Consistent amount analysis
├── transaction_type_analyzer.py   # Transaction type analysis
├── salary_earner_analyzer.py      # Salary earner analysis
├── salary_predictor.py            # Machine learning models
├── main.py                        # Main pipeline
└── api.py                         # FastAPI endpoints
```

## Configuration

The system is configured through environment variables in a `.env` file:

1. Copy the example environment file:
   ```bash
   cp .env.example .env
   ```

2. Edit the `.env` file with your database credentials:
   ```bash
   DB_USER=your_username
   DB_PASSWORD=your_password
   DB_NAME=your_database
   DB_PORT=your_port
   DB_HOST=your_host
   ```

## Usage

### Using the API

1. Start the API server:
   ```bash
   uvicorn salary_analytics.api:app --reload
   ```

2. Access the API documentation:
   - Swagger UI: http://localhost:8000/docs
   - ReDoc: http://localhost:8000/redoc

### API Endpoints

1. **Basic Endpoints**
   - `GET /`: Welcome message
   - `GET /health`: Health check

2. **Data Loading**
   - `POST /load-data`: Load transaction data
     - Parameters:
       - `source`: Data source (`'db'` or `'csv'`)
       - `file`: CSV file (required if `source` is `'csv'`)
     - Example:
       ```bash
       # Load from database
       curl -X POST "http://localhost:8000/load-data?source=db"

       # Load from CSV
       curl -X POST "http://localhost:8000/load-data?source=csv" -F "file=@path/to/your/file.csv"
       ```

3. **Analysis Endpoints**
   - `POST /analyze/keyword`: Run keyword analysis
   - `POST /analyze/consistent-amount`: Run consistent amount analysis
   - `POST /analyze/transaction-type`: Run transaction type analysis

4. **Report Generation**
   - `POST /generate/reports`: Generate all reports
   - `GET /download/{report_type}`: Download a specific report
     - Available types:
       - `high_earners`: High earner details
       - `likely_earners`: Likely salary earners
       - `final_table`: Final analysis table
       - `consistent_plot`: Consistent earners plot
       - `inconsistent_plot`: Inconsistent earners plot
       - `hypothesis_plot`: Hypothesis overlap plot

5. **Model Training**
   - `POST /train/models`: Train prediction models

6. **Pipeline**
   - `POST /run/pipeline`: Run the complete pipeline
   - `POST /run/streaming-pipeline`: Run the pipeline in batches
     - Parameters:
       - `source`: Data source (`'db'` or `'csv'`)
       - `file`: CSV file (required if `source` is `'csv'`)
       - `batch_size`: Number of rows to process in each batch (default: 10000)
     - Example:
       ```bash
       # Run streaming pipeline from database
       curl -X POST "http://localhost:8000/run/streaming-pipeline?source=db&batch_size=5000"

       # Run streaming pipeline from CSV
       curl -X POST "http://localhost:8000/run/streaming-pipeline?source=csv&batch_size=5000" -F "file=@path/to/your/file.csv"
       ```
     - Response:
       ```json
       [
         {
           "batch_number": 1,
           "total_batches": 10,
           "processed_rows": 5000,
           "results_path": "/app/output/csv/batch_results_20240315_123456/batch_1_results.csv",
           "message": "Successfully processed batch 1 of 10"
         },
         // ... more batch responses ...
       ]
       ```

### Workflow

1. Start the API server
2. Load data using the `/load-data` endpoint
3. Run any of the analysis endpoints
4. Generate and download reports as needed

For large datasets, use the streaming pipeline endpoint:

1. Start the API server
2. Run the streaming pipeline with an appropriate batch size
3. Monitor batch processing progress
4. Access results in the batch results directory

Note: All analysis endpoints require data to be loaded first. Running an analysis before loading data returns a 400 error with a message asking you to load data first.

## Docker Deployment

1. Build the Docker image:
   ```bash
   docker-compose build
   ```

2. Run the container with environment variables:
   ```bash
   docker run -p 8000:8000 \
     -v $(pwd)/output:/app/output \
     -e DB_USER=your_username \
     -e DB_PASSWORD=your_password \
     -e DB_NAME=your_database \
     -e DB_PORT=your_port \
     -e DB_HOST=your_host \
     salary-analytics
   ```

With port 8000 published (this assumes the server inside the container listens on 8000), the API will be available at http://localhost:8000.

## Output Structure

```
output/
├── csv/
│   ├── high_earner_details.csv
│   ├── likely_salary_earner.csv
│   └── final_table.csv
├── plots/
│   ├── consistent_earners_predictions.png
│   ├── inconsistent_earners_predictions.png
│   └── hypothesis_overlap.png
└── models/
    ├── consistent_model.joblib
    ├── inconsistent_model.joblib
    ├── consistent_scaler.joblib
    └── inconsistent_scaler.joblib
```
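
After a pipeline run it can be handy to confirm that every file in the tree above was actually written. The following is a minimal sketch, not part of the project: the `missing_outputs` helper and the `EXPECTED_OUTPUTS` table are hypothetical names, and the file list is copied from the output structure shown above.

```python
from pathlib import Path

# File names copied from the output tree in this README.
# EXPECTED_OUTPUTS and missing_outputs are illustrative, not project code.
EXPECTED_OUTPUTS = {
    "csv": [
        "high_earner_details.csv",
        "likely_salary_earner.csv",
        "final_table.csv",
    ],
    "plots": [
        "consistent_earners_predictions.png",
        "inconsistent_earners_predictions.png",
        "hypothesis_overlap.png",
    ],
    "models": [
        "consistent_model.joblib",
        "inconsistent_model.joblib",
        "consistent_scaler.joblib",
        "inconsistent_scaler.joblib",
    ],
}


def missing_outputs(output_dir="output"):
    """Return the expected files (relative paths) not found under output_dir."""
    root = Path(output_dir)
    missing = []
    for subdir, names in EXPECTED_OUTPUTS.items():
        for name in names:
            path = root / subdir / name
            if not path.is_file():
                missing.append(path.relative_to(root).as_posix())
    return missing


if __name__ == "__main__":
    gaps = missing_outputs()
    if gaps:
        print("Missing output files:")
        for rel_path in gaps:
            print(f"  {rel_path}")
    else:
        print("All expected output files are present.")
```

When using Docker, run the check against the host directory mounted at `/app/output`.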
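
To connect the Workflow steps with these output files, here is a rough client sketch that uses only the Python standard library. It assumes the server is running at `http://localhost:8000`, that the `POST` endpoints accept a request without a body when `source=db` (as the `curl` examples suggest), and that the `high_earners` report is a CSV file. The helper names and the local file name `high_earners_download.csv` are hypothetical.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # address used throughout this README

# Report types listed under "Report Generation".
REPORT_TYPES = (
    "high_earners",
    "likely_earners",
    "final_table",
    "consistent_plot",
    "inconsistent_plot",
    "hypothesis_plot",
)


def endpoint_url(path, **params):
    """Build a full URL for an API path, appending query parameters if given."""
    url = BASE_URL + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return url


def post(path, **params):
    """Send a body-less POST and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, for
    example the 400 returned when no data has been loaded yet.
    """
    request = urllib.request.Request(endpoint_url(path, **params), method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status


def download(report_type, destination):
    """Fetch one report and write the raw bytes to a local file."""
    if report_type not in REPORT_TYPES:
        raise ValueError(f"unknown report type: {report_type}")
    with urllib.request.urlopen(endpoint_url(f"/download/{report_type}")) as response:
        data = response.read()
    with open(destination, "wb") as handle:
        handle.write(data)


def run_workflow():
    """Load data, run one analysis, generate reports, download one of them."""
    post("/load-data", source="db")
    post("/analyze/keyword")
    post("/generate/reports")
    download("high_earners", "high_earners_download.csv")


if __name__ == "__main__":
    run_workflow()
```

Uploading a CSV requires a multipart request, which is easier with `curl -F` as shown in the endpoint examples.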