Salary Analytics
A comprehensive salary analytics system that analyzes transaction data to identify salary earners, predict future salaries, and generate detailed reports.
Features
- Transaction Analysis
  - Keyword-based salary transaction identification
  - Consistent amount transaction analysis
  - Transaction type analysis
  - Hypothesis overlap visualization
- Salary Earner Classification
  - Verified salary earners identification
  - Likely salary earners identification
  - High earner detection
  - Salary pattern analysis
- Machine Learning
  - Salary prediction models
  - Separate models for consistent and inconsistent earners
  - Feature engineering
  - Model evaluation metrics
  - Model persistence (saved in output/models)
- Reporting
  - CSV reports generation
  - Visualization plots
  - High earner details
  - Salary earner statistics
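The first two transaction-analysis hypotheses listed above can be illustrated with a minimal sketch. The keyword list, the 5% tolerance, the minimum of three transactions, and the record layout (`account`, `narration`, `amount`) are all assumptions made for illustration; they are not taken from keyword_analyzer.py or consistent_amount_analyzer.py.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical keyword list; the project's real list is not shown in this README.
SALARY_KEYWORDS = ("salary", "payroll", "wages")


def keyword_accounts(transactions):
    """Return accounts with at least one narration containing a salary keyword."""
    return {
        t["account"]
        for t in transactions
        if any(k in t["narration"].lower() for k in SALARY_KEYWORDS)
    }


def consistent_amount_accounts(transactions, tolerance=0.05, min_count=3):
    """Return accounts whose amounts all stay within `tolerance` of their mean.

    Assumes positive credit amounts; both thresholds are illustrative.
    """
    by_account = defaultdict(list)
    for t in transactions:
        by_account[t["account"]].append(t["amount"])
    consistent = set()
    for account, amounts in by_account.items():
        if len(amounts) < min_count:
            continue
        avg = mean(amounts)
        if all(abs(a - avg) <= tolerance * avg for a in amounts):
            consistent.add(account)
    return consistent


transactions = [
    {"account": "A", "narration": "MARCH SALARY", "amount": 1000.0},
    {"account": "A", "narration": "APRIL SALARY", "amount": 1000.0},
    {"account": "A", "narration": "MAY SALARY", "amount": 1010.0},
    {"account": "B", "narration": "transfer", "amount": 200.0},
    {"account": "B", "narration": "transfer", "amount": 900.0},
    {"account": "B", "narration": "transfer", "amount": 50.0},
]

by_keyword = keyword_accounts(transactions)
by_amount = consistent_amount_accounts(transactions)
# Accounts flagged by both hypotheses form the overlap.
overlap = by_keyword & by_amount
```

Treating each hypothesis as a set of accounts keeps the overlap a plain set intersection, which is one possible basis for the hypothesis overlap visualization.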
Architecture
The project is organized into the following modules:
salary_analytics/
├── __init__.py
├── config.py # Configuration settings
├── data_loader.py # Database connection and data loading
├── keyword_analyzer.py # Keyword-based analysis
├── consistent_amount_analyzer.py # Consistent amount analysis
├── transaction_type_analyzer.py # Transaction type analysis
├── salary_earner_analyzer.py # Salary earner analysis
├── salary_predictor.py # Machine learning models
├── main.py # Main pipeline
└── api.py # FastAPI endpoints
Configuration
The system can be configured through environment variables using a .env file:
- Copy the example environment file:
cp .env.example .env
- Edit the .env file with your database credentials:
DB_USER=your_username
DB_PASSWORD=your_password
DB_NAME=your_database
DB_PORT=your_port
DB_HOST=your_host
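As a rough sketch of how these variables might be consumed, the snippet below reads the five DB_* values into a small settings object and fails early when one is missing. The `DbSettings` class and `load_db_settings` function are hypothetical names, not the contents of config.py, and the sketch assumes the values from .env have already been exported into the process environment.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DbSettings:
    """Hypothetical container for the DB_* values listed above."""
    user: str
    password: str
    name: str
    host: str
    port: int


def load_db_settings(env=None):
    """Build DbSettings from DB_* variables, failing early if any is missing."""
    env = os.environ if env is None else env
    required = ("DB_USER", "DB_PASSWORD", "DB_NAME", "DB_PORT", "DB_HOST")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return DbSettings(
        user=env["DB_USER"],
        password=env["DB_PASSWORD"],
        name=env["DB_NAME"],
        host=env["DB_HOST"],
        port=int(env["DB_PORT"]),  # the port arrives as a string
    )
```

Passing a plain dict as `env` makes the function easy to exercise without touching the real environment.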
Usage
Using the API
- Start the API server:
uvicorn salary_analytics.api:app --reload
- Access the API documentation:
  - Swagger UI: http://localhost:8000/docs
  - ReDoc: http://localhost:8000/redoc
API Endpoints
- Basic Endpoints
  - GET /: Welcome message
  - GET /health: Health check
- Data Loading
  - POST /load-data: Load transaction data
    - Parameters:
      - source: Data source ('db' or 'csv')
      - file: CSV file (required if source is 'csv')
    - Example:
      # Load from database
      curl -X POST "http://localhost:8000/load-data?source=db"
      # Load from CSV
      curl -X POST "http://localhost:8000/load-data?source=csv" -F "file=@path/to/your/file.csv"
- Analysis Endpoints
  - POST /analyze/keyword: Run keyword analysis
  - POST /analyze/consistent-amount: Run consistent amount analysis
  - POST /analyze/transaction-type: Run transaction type analysis
- Report Generation
  - POST /generate/reports: Generate all reports
  - GET /download/{report_type}: Download specific reports
    - Available types:
      - high_earners: High earner details
      - likely_earners: Likely salary earners
      - final_table: Final analysis table
      - consistent_plot: Consistent earners plot
      - inconsistent_plot: Inconsistent earners plot
      - hypothesis_plot: Hypothesis overlap plot
- Model Training
  - POST /train/models: Train prediction models
- Pipeline
  - POST /run/pipeline: Run complete pipeline
  - POST /run/streaming-pipeline: Run pipeline in batches
    - Parameters:
      - source: Data source ('db' or 'csv')
      - file: CSV file (required if source is 'csv')
      - batch_size: Number of rows to process in each batch (default: 10000)
    - Example:
      # Run streaming pipeline from database
      curl -X POST "http://localhost:8000/run/streaming-pipeline?source=db&batch_size=5000"
      # Run streaming pipeline from CSV
      curl -X POST "http://localhost:8000/run/streaming-pipeline?source=csv&batch_size=5000" -F "file=@path/to/your/file.csv"
    - Response:
      [
        {
          "batch_number": 1,
          "total_batches": 10,
          "processed_rows": 5000,
          "results_path": "/app/output/csv/batch_results_20240315_123456/batch_1_results.csv",
          "message": "Successfully processed batch 1 of 10"
        },
        // ... more batch responses ...
      ]
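To make the batch_size parameter and the batch_number, total_batches, and processed_rows response fields concrete, here is a minimal sketch of slicing rows into fixed-size batches. The `iter_batches` and `summarize` helpers are hypothetical and only mirror the shape of the streaming pipeline response; they are not the server's implementation, and results_path is deliberately left out.

```python
import math


def iter_batches(rows, batch_size=10000):
    """Yield (batch_number, total_batches, batch) tuples over a list of rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    total_batches = math.ceil(len(rows) / batch_size)
    for index in range(total_batches):
        start = index * batch_size
        yield index + 1, total_batches, rows[start:start + batch_size]


def summarize(rows, batch_size=10000):
    """Build one response-like dict per batch (results_path is omitted)."""
    return [
        {
            "batch_number": number,
            "total_batches": total,
            "processed_rows": len(batch),
            "message": f"Successfully processed batch {number} of {total}",
        }
        for number, total, batch in iter_batches(rows, batch_size)
    ]


# 12000 rows with batch_size=5000 give two full batches and one partial batch.
summaries = summarize(list(range(12000)), batch_size=5000)
```

Under this scheme the last batch may be smaller than batch_size, so processed_rows is not necessarily the same for every batch.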
Workflow
- Start the API server
- Load data using the /load-data endpoint
- Run any of the analysis endpoints
- Generate and download reports as needed
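The steps above could be scripted from Python roughly as follows. This is a sketch that only builds the requests in workflow order, using the endpoint paths documented in this README; the base URL, the choice of the keyword analysis, and the `workflow_requests` helper are assumptions for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumed local development server


def workflow_requests(base_url=BASE_URL, report_type="final_table"):
    """Return the requests for the basic workflow, in the order they should run."""
    load_url = f"{base_url}/load-data?{urlencode({'source': 'db'})}"
    return [
        Request(load_url, data=b"", method="POST"),
        Request(f"{base_url}/analyze/keyword", data=b"", method="POST"),
        Request(f"{base_url}/generate/reports", data=b"", method="POST"),
        Request(f"{base_url}/download/{report_type}", method="GET"),
    ]


steps = workflow_requests()
for step in steps:
    print(step.get_method(), step.full_url)
# To send them for real (the server must be running), pass each request
# to urllib.request.urlopen in this order.
```

Loading from CSV is not covered here because it needs a multipart file upload, which the curl examples handle with -F.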
For large datasets, use the streaming pipeline endpoint:
- Start the API server
- Run the streaming pipeline with appropriate batch size
- Monitor batch processing progress
- Access results in the batch results directory
Note: All analysis endpoints require data to be loaded first. If you try to run any analysis without loading data, you'll receive a 400 error with a message to load data first.
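A client can treat that 400 response as a signal to load data and retry. The sketch below shows one way to do so with the standard library; the `run_analysis` helper, its `opener` parameter, and the returned message text are hypothetical, and the exact error body returned by the API is not reproduced here.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def run_analysis(name, base_url="http://localhost:8000", opener=urlopen):
    """POST to /analyze/<name> and return (ok, message).

    `opener` defaults to urlopen and can be swapped for a stub in tests.
    """
    request = Request(f"{base_url}/analyze/{name}", data=b"", method="POST")
    try:
        with opener(request) as response:
            return True, response.read().decode()
    except HTTPError as error:
        if error.code == 400:
            # Per the note above, a 400 means data has not been loaded yet.
            return False, "Data not loaded yet; call /load-data first."
        raise  # any other HTTP error is unexpected and is re-raised
```

For example, `run_analysis("keyword")` would return a `False` flag with the hint if the server answers 400.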
Docker Deployment
- Build the Docker image:
docker-compose build
- Run the container with environment variables:
docker run -v $(pwd)/output:/app/output \
-e DB_USER=your_username \
-e DB_PASSWORD=your_password \
-e DB_NAME=your_database \
-e DB_PORT=your_port \
-e DB_HOST=your_host \
salary-analytics
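The API will be available at http://localhost:8000, provided the container's port 8000 is published to the host (for example, by adding -p 8000:8000 to the docker run command above).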
Output Structure
output/
├── csv/
│ ├── high_earner_details.csv
│ ├── likely_salary_earner.csv
│ └── final_table.csv
├── plots/
│ ├── consistent_earners_predictions.png
│ ├── inconsistent_earners_predictions.png
│ └── hypothesis_overlap.png
└── models/
├── consistent_model.joblib
├── inconsistent_model.joblib
├── consistent_scaler.joblib
└── inconsistent_scaler.joblib
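After a pipeline run, a quick check that every expected artifact exists can catch a partially failed run. The sketch below uses the file names from the tree above; the `missing_outputs` helper is hypothetical, and the default "output" directory is assumed to be relative to the current working directory.

```python
from pathlib import Path

# Relative paths taken from the output tree above.
EXPECTED_OUTPUTS = (
    "csv/high_earner_details.csv",
    "csv/likely_salary_earner.csv",
    "csv/final_table.csv",
    "plots/consistent_earners_predictions.png",
    "plots/inconsistent_earners_predictions.png",
    "plots/hypothesis_overlap.png",
    "models/consistent_model.joblib",
    "models/inconsistent_model.joblib",
    "models/consistent_scaler.joblib",
    "models/inconsistent_scaler.joblib",
)


def missing_outputs(output_dir="output"):
    """Return the expected files that are absent under output_dir."""
    root = Path(output_dir)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]
```

An empty list from `missing_outputs()` means all ten files listed above are present.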