salakojoshua1234_gmail.com 5998db9c68 added RAC check implementation

2025-07-08 16:50:42 +01:00

Enhance XLS upload functionality and update requirements. Added Flask, Flask-SQLAlchemy, and Alembic to requirements. Modified database schema in upload_xls.py for improved data handling and added SQLAlchemy configuration in config.py.

2025-06-09 15:34:18 +01:00

migrations

2025-06-09 15:34:18 +01:00

salary_analytics

added RAC check implementation

2025-07-08 16:50:42 +01:00

.dockerignore

Update configuration and ignore files; added openpyxl to requirements

2025-06-09 12:45:54 +01:00

.env.example

Doker fix

2025-05-17 03:55:36 -04:00

.gitignore

Update configuration and ignore files; added openpyxl to requirements

2025-06-09 12:45:54 +01:00

docker-compose.yml

Doker fix

2025-05-17 03:55:36 -04:00

Dockerfile

first commit

2025-05-17 03:52:41 -04:00

PROJECT.md

Add autonomous salary detection feature to API

2025-07-05 19:27:53 +01:00

README.md

first commit

2025-05-17 03:52:41 -04:00

requirements.txt

2025-06-09 15:34:18 +01:00

run.py

2025-06-09 15:34:18 +01:00

Salary Analytics

A comprehensive salary analytics system that analyzes transaction data to identify salary earners, predict future salaries, and generate detailed reports.

Features

Transaction Analysis
- Keyword-based salary transaction identification
- Consistent amount transaction analysis
- Transaction type analysis
- Hypothesis overlap visualization

Salary Earner Classification
- Identification of verified salary earners
- Identification of likely salary earners
- High earner detection
- Salary pattern analysis

Machine Learning
- Salary prediction models
- Separate models for consistent and inconsistent earners
- Feature engineering
- Model evaluation metrics
- Model persistence (saved in output/models)

Reporting
- CSV report generation
- Visualization plots
- High earner details
- Salary earner statistics
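
To make the first two transaction-analysis ideas concrete, here is a minimal sketch of keyword matching combined with a consistent-amount check. The keyword list, the thresholds, and the `(account, amount, narration)` tuple layout are all assumptions for illustration; the project's actual analyzers may work differently.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical keyword list; the project's real keywords are not shown in this README.
SALARY_KEYWORDS = ("salary", "payroll", "wages")


def is_keyword_match(narration):
    """Return True if the narration mentions any salary keyword (case-insensitive)."""
    text = narration.lower()
    return any(keyword in text for keyword in SALARY_KEYWORDS)


def consistent_accounts(transactions, max_cv=0.05, min_count=3):
    """Return accounts whose amounts have a low coefficient of variation.

    max_cv and min_count are illustrative thresholds, not project settings.
    """
    amounts = defaultdict(list)
    for account, amount, _narration in transactions:
        amounts[account].append(amount)

    result = set()
    for account, values in amounts.items():
        if len(values) < min_count:
            continue  # too few transactions to judge consistency
        average = mean(values)
        if average > 0 and pstdev(values) / average <= max_cv:
            result.add(account)
    return result


# Made-up sample rows: (account, amount, narration)
transactions = [
    ("A1", 1000.0, "SALARY JAN"),
    ("A1", 1000.0, "SALARY FEB"),
    ("A1", 1010.0, "SALARY MAR"),
    ("B2", 50.0, "transfer"),
    ("B2", 900.0, "refund"),
    ("B2", 20.0, "gift"),
]
keyword_accounts = {acc for acc, _amt, text in transactions if is_keyword_match(text)}
overlap = keyword_accounts & consistent_accounts(transactions)
print(overlap)
```

Intersecting the two sets mirrors the "hypothesis overlap" idea: an account flagged by more than one heuristic is a stronger salary-earner candidate than one flagged by a single heuristic.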

Architecture

The project is organized into the following modules:

salary_analytics/
├── __init__.py
├── config.py           # Configuration settings
├── data_loader.py      # Database connection and data loading
├── keyword_analyzer.py # Keyword-based analysis
├── consistent_amount_analyzer.py # Consistent amount analysis
├── transaction_type_analyzer.py  # Transaction type analysis
├── salary_earner_analyzer.py     # Salary earner analysis
├── salary_predictor.py # Machine learning models
├── main.py            # Main pipeline
└── api.py             # FastAPI endpoints

Configuration

The system is configured through environment variables, which can be set in a .env file:

Copy the example environment file:

cp .env.example .env

Edit the .env file with your database credentials:

DB_USER=your_username
DB_PASSWORD=your_password
DB_NAME=your_database
DB_PORT=your_port
DB_HOST=your_host
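
As a rough illustration of how these variables could be consumed, the sketch below assembles a connection URL from them. The function name `build_database_url` and the `postgresql` scheme are assumptions (this README does not name the database engine), and the sketch expects the variables to already be present in the environment rather than parsing the .env file itself.

```python
import os
from urllib.parse import quote_plus


def build_database_url(env=os.environ, scheme="postgresql"):
    """Assemble a connection URL from the DB_* variables.

    The scheme is an assumption; swap in whatever your database driver expects.
    """
    required = ("DB_USER", "DB_PASSWORD", "DB_HOST", "DB_PORT", "DB_NAME")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")

    # Escape credentials so characters such as '@' do not break the URL.
    user = quote_plus(env["DB_USER"])
    password = quote_plus(env["DB_PASSWORD"])
    return f"{scheme}://{user}:{password}@{env['DB_HOST']}:{env['DB_PORT']}/{env['DB_NAME']}"
```

Failing early on a missing variable tends to give a clearer error than a connection failure later in the pipeline.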

Usage

Using the API

Start the API server:

uvicorn salary_analytics.api:app --reload

Access the API documentation:

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

API Endpoints

Basic Endpoints
- GET /: Welcome message
- GET /health: Health check

Data Loading

POST /load-data: Load transaction data

Parameters:
- source: Data source ('db' or 'csv')
- file: CSV file (required if source is 'csv')

Example:

# Load from database
curl -X POST "http://localhost:8000/load-data?source=db"

# Load from CSV
curl -X POST "http://localhost:8000/load-data?source=csv" -F "file=@path/to/your/file.csv"
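
The database variant of the call can also be prepared from Python using only the standard library. This is a sketch under the assumption that the server runs at the local address used above; `build_db_load_request` is a hypothetical helper name, and the CSV variant (which needs a multipart body) is not covered.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # same local address as the curl examples


def build_db_load_request(base_url=BASE_URL):
    """Build (but do not send) the POST request for /load-data with source=db."""
    query = urlencode({"source": "db"})
    return Request(f"{base_url}/load-data?{query}", method="POST")


request = build_db_load_request()
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it once the API server is running.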

Analysis Endpoints
- POST /analyze/keyword: Run keyword analysis
- POST /analyze/consistent-amount: Run consistent amount analysis
- POST /analyze/transaction-type: Run transaction type analysis

Report Generation
- POST /generate/reports: Generate all reports
- GET /download/{report_type}: Download specific reports
  - Available types:
    - high_earners: High earner details
    - likely_earners: Likely salary earners
    - final_table: Final analysis table
    - consistent_plot: Consistent earners plot
    - inconsistent_plot: Inconsistent earners plot
    - hypothesis_plot: Hypothesis overlap plot

Model Training
- POST /train/models: Train prediction models

Pipeline

POST /run/pipeline: Run complete pipeline

POST /run/streaming-pipeline: Run pipeline in batches

Parameters:
- source: Data source ('db' or 'csv')
- file: CSV file (required if source is 'csv')
- batch_size: Number of rows to process in each batch (default: 10000)

Example:

# Run streaming pipeline from database
curl -X POST "http://localhost:8000/run/streaming-pipeline?source=db&batch_size=5000"

# Run streaming pipeline from CSV
curl -X POST "http://localhost:8000/run/streaming-pipeline?source=csv&batch_size=5000" -F "file=@path/to/your/file.csv"
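
The effect of batch_size can be pictured with a small sketch that splits rows into fixed-size batches and numbers them. The helper `iter_batches` is hypothetical and works on an in-memory list; the real pipeline presumably reads batches from the database or CSV file instead.

```python
import math


def iter_batches(rows, batch_size=10000):
    """Yield (batch_number, total_batches, batch) for fixed-size slices of rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    total_batches = math.ceil(len(rows) / batch_size)
    for index in range(total_batches):
        start = index * batch_size
        # The final batch may be shorter than batch_size.
        yield index + 1, total_batches, rows[start:start + batch_size]


batches = list(iter_batches(list(range(12)), batch_size=5))
for number, total, batch in batches:
    print(f"batch {number} of {total}: {len(batch)} rows")
```

A smaller batch size lowers peak memory use at the cost of more batches to process.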

Response:

[
  {
    "batch_number": 1,
    "total_batches": 10,
    "processed_rows": 5000,
    "results_path": "/app/output/csv/batch_results_20240315_123456/batch_1_results.csv",
    "message": "Successfully processed batch 1 of 10"
  },
  // ... more batch responses ...
]
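
A client could condense a response of this shape into a short progress summary. The sketch below assumes the body is a plain JSON array with the fields shown above (an actual response would not contain the `// ...` placeholder line, which is not valid JSON); `summarize_batches` is a hypothetical helper.

```python
import json


def summarize_batches(payload):
    """Summarize a streaming-pipeline response given as a JSON string."""
    batches = json.loads(payload)
    return {
        "completed": len(batches),
        "expected": batches[0]["total_batches"] if batches else 0,
        "rows": sum(batch["processed_rows"] for batch in batches),
    }


# Made-up payload following the field names in the sample response.
sample_payload = json.dumps([
    {"batch_number": 1, "total_batches": 2, "processed_rows": 5000},
    {"batch_number": 2, "total_batches": 2, "processed_rows": 1200},
])
summary = summarize_batches(sample_payload)
print(summary)
```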

Workflow

1. Start the API server
2. Load data using the /load-data endpoint
3. Run any of the analysis endpoints
4. Generate and download reports as needed

For large datasets, use the streaming pipeline endpoint:

1. Start the API server
2. Run the streaming pipeline with an appropriate batch size
3. Monitor batch processing progress
4. Access results in the batch results directory

Note: All analysis endpoints require data to be loaded first. Calling an analysis endpoint before loading data returns a 400 error with a message asking you to load data first.
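
The load-before-analyze rule can be sketched without any web framework as a guard that every analysis step calls first. The class and exception names here are hypothetical, and the error text is illustrative rather than the API's actual message.

```python
class DataNotLoadedError(Exception):
    """Raised when analysis is requested before data is loaded."""

    status_code = 400  # mirrors the HTTP status described in the note above


class AnalysisState:
    """Holds the loaded transactions shared by the analysis steps."""

    def __init__(self):
        self.data = None

    def load(self, rows):
        self.data = list(rows)

    def require_data(self):
        # Every analysis step calls this guard before doing any work.
        if self.data is None:
            raise DataNotLoadedError("No data loaded. Load data first.")
        return self.data


state = AnalysisState()
try:
    state.require_data()
except DataNotLoadedError as error:
    print(error.status_code, error)

state.load([("A1", 1000.0)])
print(len(state.require_data()))
```

In a web framework, the exception would typically be translated into the 400 response at the endpoint boundary.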

Docker Deployment

Build the Docker image:

docker-compose build

Run the container, publishing port 8000 and passing the database settings as environment variables:

docker run -p 8000:8000 \
           -v "$(pwd)/output:/app/output" \
           -e DB_USER=your_username \
           -e DB_PASSWORD=your_password \
           -e DB_NAME=your_database \
           -e DB_PORT=your_port \
           -e DB_HOST=your_host \
           salary-analytics

The API will be available at http://localhost:8000

Output Structure

output/
├── csv/
│   ├── high_earner_details.csv
│   ├── likely_salary_earner.csv
│   └── final_table.csv
├── plots/
│   ├── consistent_earners_predictions.png
│   ├── inconsistent_earners_predictions.png
│   └── hypothesis_overlap.png
└── models/
    ├── consistent_model.joblib
    ├── inconsistent_model.joblib
    ├── consistent_scaler.joblib
    └── inconsistent_scaler.joblib
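
The report types accepted by GET /download/{report_type} appear to correspond to files in this tree. The mapping below is inferred from the names alone and is an assumption, as is the helper `resolve_report`; the API may resolve paths differently.

```python
from pathlib import Path

# Inferred mapping from report type to file under output/ (an assumption).
REPORT_FILES = {
    "high_earners": "csv/high_earner_details.csv",
    "likely_earners": "csv/likely_salary_earner.csv",
    "final_table": "csv/final_table.csv",
    "consistent_plot": "plots/consistent_earners_predictions.png",
    "inconsistent_plot": "plots/inconsistent_earners_predictions.png",
    "hypothesis_plot": "plots/hypothesis_overlap.png",
}


def resolve_report(report_type, output_dir="output"):
    """Return the expected path for a report type, or raise ValueError."""
    try:
        relative = REPORT_FILES[report_type]
    except KeyError:
        raise ValueError(f"Unknown report type: {report_type}") from None
    return Path(output_dir) / relative


print(resolve_report("final_table").as_posix())
```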