# Salary Analytics

A comprehensive salary analytics system that analyzes transaction data to identify salary earners, predict future salaries, and generate detailed reports.

## Features

- **Transaction Analysis**
  - Keyword-based salary transaction identification
  - Consistent amount transaction analysis
  - Transaction type analysis
  - Hypothesis overlap visualization

- **Salary Earner Classification**
  - Verified salary earners identification
  - Likely salary earners identification
  - High earner detection
  - Salary pattern analysis

- **Machine Learning**
  - Salary prediction models
  - Separate models for consistent and inconsistent earners
  - Feature engineering
  - Model evaluation metrics
  - Model persistence (saved in `output/models`)

- **Reporting**
  - CSV reports generation
  - Visualization plots
  - High earner details
  - Salary earner statistics

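
To make the keyword and consistent-amount hypotheses more concrete, here is a minimal sketch of how they might be combined for a single account. The keyword list, the thresholds, and the rule that maps hits to "verified" or "likely" are all assumptions for illustration; the project's own analyzers may work differently.

```python
from statistics import mean, pstdev

# Hypothetical keyword list; the project's real keywords are not shown here.
SALARY_KEYWORDS = ("salary", "payroll", "wages")


def has_salary_keyword(description: str) -> bool:
    """Keyword hypothesis: the description mentions a salary-like term."""
    text = description.lower()
    return any(keyword in text for keyword in SALARY_KEYWORDS)


def is_consistent_amount(amounts: list[float], max_cv: float = 0.05,
                         min_count: int = 3) -> bool:
    """Consistent-amount hypothesis: repeated credits of nearly equal size.

    Uses the coefficient of variation (std / mean). Both thresholds are
    assumed values, not taken from the project.
    """
    if len(amounts) < min_count:
        return False
    average = mean(amounts)
    if average <= 0:
        return False
    return pstdev(amounts) / average <= max_cv


def classify(transactions: list[dict]) -> str:
    """Assumed rule: both hypotheses agree -> verified, one -> likely."""
    keyword_hit = any(has_salary_keyword(t["description"]) for t in transactions)
    consistent = is_consistent_amount([t["amount"] for t in transactions])
    if keyword_hit and consistent:
        return "verified"
    if keyword_hit or consistent:
        return "likely"
    return "unclassified"
```

Each transaction is assumed to be a dict with `description` and `amount` keys; real column names depend on the data source.
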
## Architecture

The project is organized into the following modules:

```
salary_analytics/
├── __init__.py
├── config.py                      # Configuration settings
├── data_loader.py                 # Database connection and data loading
├── keyword_analyzer.py            # Keyword-based analysis
├── consistent_amount_analyzer.py  # Consistent amount analysis
├── transaction_type_analyzer.py   # Transaction type analysis
├── salary_earner_analyzer.py      # Salary earner analysis
├── salary_predictor.py            # Machine learning models
├── main.py                        # Main pipeline
└── api.py                         # FastAPI endpoints
```

## Configuration

The system is configured through environment variables, which can be set in a `.env` file:

1. Copy the example environment file:
   ```bash
   cp .env.example .env
   ```

2. Edit the `.env` file with your database credentials:
   ```bash
   DB_USER=your_username
   DB_PASSWORD=your_password
   DB_NAME=your_database
   DB_PORT=your_port
   DB_HOST=your_host
   ```

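
As an illustration of how these variables could be consumed, the sketch below reads the five `DB_*` values into a small settings object and fails early if any are missing. The `DatabaseSettings` class and `load_database_settings` function are hypothetical names, not the contents of `config.py`, and the sketch assumes the `.env` values have already been exported into the process environment.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseSettings:
    """Hypothetical container for the DB_* variables listed above."""
    user: str
    password: str
    name: str
    host: str
    port: int


def load_database_settings(env=None) -> DatabaseSettings:
    """Read the DB_* variables from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    required = ("DB_USER", "DB_PASSWORD", "DB_NAME", "DB_HOST", "DB_PORT")
    # Report every missing or empty variable at once.
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return DatabaseSettings(
        user=env["DB_USER"],
        password=env["DB_PASSWORD"],
        name=env["DB_NAME"],
        host=env["DB_HOST"],
        port=int(env["DB_PORT"]),  # ports arrive as strings
    )
```

Passing a plain dict instead of relying on `os.environ` keeps the function easy to test.
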
## Usage

### Using the API

1. Start the API server:
   ```bash
   uvicorn salary_analytics.api:app --reload
   ```

2. Access the API documentation:
   - Swagger UI: http://localhost:8000/docs
   - ReDoc: http://localhost:8000/redoc

### API Endpoints

1. **Basic Endpoints**
   - `GET /`: Welcome message
   - `GET /health`: Health check

2. **Data Loading**
   - `POST /load-data`: Load transaction data
   - Parameters:
     - `source`: Data source ('db' or 'csv')
     - `file`: CSV file (required if source is 'csv')
   - Example:
     ```bash
     # Load from database
     curl -X POST "http://localhost:8000/load-data?source=db"

     # Load from CSV
     curl -X POST "http://localhost:8000/load-data?source=csv" -F "file=@path/to/your/file.csv"
     ```

3. **Analysis Endpoints**
   - `POST /analyze/keyword`: Run keyword analysis
   - `POST /analyze/consistent-amount`: Run consistent amount analysis
   - `POST /analyze/transaction-type`: Run transaction type analysis

4. **Report Generation**
   - `POST /generate/reports`: Generate all reports
   - `GET /download/{report_type}`: Download specific reports
   - Available types:
     - `high_earners`: High earner details
     - `likely_earners`: Likely salary earners
     - `final_table`: Final analysis table
     - `consistent_plot`: Consistent earners plot
     - `inconsistent_plot`: Inconsistent earners plot
     - `hypothesis_plot`: Hypothesis overlap plot

5. **Model Training**
   - `POST /train/models`: Train prediction models

6. **Pipeline**
   - `POST /run/pipeline`: Run complete pipeline
   - `POST /run/streaming-pipeline`: Run pipeline in batches
   - Parameters:
     - `source`: Data source ('db' or 'csv')
     - `file`: CSV file (required if source is 'csv')
     - `batch_size`: Number of rows to process in each batch (default: 10000)
   - Example:
     ```bash
     # Run streaming pipeline from database
     curl -X POST "http://localhost:8000/run/streaming-pipeline?source=db&batch_size=5000"

     # Run streaming pipeline from CSV
     curl -X POST "http://localhost:8000/run/streaming-pipeline?source=csv&batch_size=5000" -F "file=@path/to/your/file.csv"
     ```
   - Response:
     ```json
     [
       {
         "batch_number": 1,
         "total_batches": 10,
         "processed_rows": 5000,
         "results_path": "/app/output/csv/batch_results_20240315_123456/batch_1_results.csv",
         "message": "Successfully processed batch 1 of 10"
       },
       // ... more batch responses ...
     ]
     ```

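
The batch fields in the streaming response can be understood with a small sketch of how a row count is split by `batch_size`. This is not the project's implementation: `plan_batches` is a hypothetical helper, `processed_rows` is assumed to count the rows in that batch only, and `results_path` is left out because its naming depends on the server.

```python
import math
from typing import Iterator


def plan_batches(total_rows: int, batch_size: int = 10000) -> Iterator[dict]:
    """Yield one summary per batch, mirroring the response fields above."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    total_batches = math.ceil(total_rows / batch_size)
    for batch_number in range(1, total_batches + 1):
        start = (batch_number - 1) * batch_size
        # The last batch may be smaller than batch_size.
        processed_rows = min(batch_size, total_rows - start)
        yield {
            "batch_number": batch_number,
            "total_batches": total_batches,
            "processed_rows": processed_rows,
            "message": f"Successfully processed batch {batch_number} of {total_batches}",
        }
```

Under these assumptions, 50,000 rows with `batch_size=5000` give ten batches, which matches the `total_batches` value in the sample response.
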
### Workflow

1. Start the API server
2. Load data using the `/load-data` endpoint
3. Run any of the analysis endpoints
4. Generate and download reports as needed

For large datasets, use the streaming pipeline endpoint:

1. Start the API server
2. Run the streaming pipeline with an appropriate batch size
3. Monitor batch processing progress
4. Access results in the batch results directory

Note: All analysis endpoints require data to be loaded first. Calling an analysis endpoint before loading data returns a 400 error with a message asking you to load data first.

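
The load-before-analyze rule can be pictured as a guard that every analysis step calls first. The sketch below uses plain Python rather than the project's FastAPI code; `AnalysisSession`, `DataNotLoadedError`, and the error text are hypothetical, and the analysis body is a placeholder.

```python
class DataNotLoadedError(Exception):
    """Signals the condition the API reports as a 400 error."""
    status_code = 400

    def __init__(self) -> None:
        super().__init__("No data loaded. Call POST /load-data first.")


class AnalysisSession:
    """Holds the loaded transactions shared by the analysis steps."""

    def __init__(self) -> None:
        self._transactions = None  # None means nothing has been loaded yet

    def load(self, transactions) -> None:
        self._transactions = list(transactions)

    def require_data(self) -> list:
        """Return the loaded rows, or raise if load() was never called."""
        if self._transactions is None:
            raise DataNotLoadedError()
        return self._transactions

    def run_keyword_analysis(self) -> int:
        rows = self.require_data()
        # Placeholder result: a real analysis would inspect each row.
        return len(rows)
```

In a web handler, the same check would typically be translated into an HTTP 400 response carrying the error message.
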
## Docker Deployment

1. Build the Docker image:
   ```bash
   docker-compose build
   ```

2. Run the container with environment variables:
   ```bash
   docker run -v "$(pwd)/output:/app/output" \
     -e DB_USER=your_username \
     -e DB_PASSWORD=your_password \
     -e DB_NAME=your_database \
     -e DB_PORT=your_port \
     -e DB_HOST=your_host \
     salary-analytics
   ```

The API will be available at http://localhost:8000 once the container's port is published to the host (for example, by adding `-p 8000:8000` to `docker run`, assuming the server listens on port 8000 inside the container).

## Output Structure

```
output/
├── csv/
│   ├── high_earner_details.csv
│   ├── likely_salary_earner.csv
│   └── final_table.csv
├── plots/
│   ├── consistent_earners_predictions.png
│   ├── inconsistent_earners_predictions.png
│   └── hypothesis_overlap.png
└── models/
    ├── consistent_model.joblib
    ├── inconsistent_model.joblib
    ├── consistent_scaler.joblib
    └── inconsistent_scaler.joblib
```
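
The report types accepted by `GET /download/{report_type}` appear to correspond to the CSV and plot files in this tree. The sketch below shows one way such a lookup could work; the mapping is inferred from matching names and is an assumption, as is the `resolve_report` helper.

```python
from pathlib import Path

# Assumed mapping from report type to a file under output/.
REPORT_FILES = {
    "high_earners": "csv/high_earner_details.csv",
    "likely_earners": "csv/likely_salary_earner.csv",
    "final_table": "csv/final_table.csv",
    "consistent_plot": "plots/consistent_earners_predictions.png",
    "inconsistent_plot": "plots/inconsistent_earners_predictions.png",
    "hypothesis_plot": "plots/hypothesis_overlap.png",
}


def resolve_report(report_type: str, output_dir: Path = Path("output")) -> Path:
    """Return the path of a generated report, or raise a descriptive error."""
    try:
        relative = REPORT_FILES[report_type]
    except KeyError:
        raise ValueError(f"Unknown report type: {report_type}") from None
    path = output_dir / relative
    if not path.is_file():
        # The file only exists after the reports have been generated.
        raise FileNotFoundError(f"Report not generated yet: {path}")
    return path
```

Separating the unknown-type case from the not-yet-generated case lets a caller tell a typo apart from a missing `POST /generate/reports` step.
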