
Salary Analytics

A salary analytics system that analyzes transaction data to identify salary earners, predict future salaries, and generate CSV reports and plots.

Features

  • Transaction Analysis

    • Keyword-based salary transaction identification
    • Consistent amount transaction analysis
    • Transaction type analysis
    • Hypothesis overlap visualization
  • Salary Earner Classification

    • Verified salary earner identification
    • Likely salary earner identification
    • High earner detection
    • Salary pattern analysis
  • Machine Learning

    • Salary prediction models
    • Separate models for consistent and inconsistent earners
    • Feature engineering
    • Model evaluation metrics
    • Model persistence (saved in output/models)
  • Reporting

    • CSV report generation
    • Visualization plots
    • High earner details
    • Salary earner statistics
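
As a rough illustration of how the keyword and consistent-amount hypotheses listed above might work, the sketch below flags accounts whose transaction descriptions contain salary-like keywords, and accounts that repeatedly receive nearly the same amount. The keyword list, the field names (`account`, `description`, `amount`), and the thresholds are assumptions made for this example, not the project's actual implementation.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical keyword list; the real one would presumably live in config.py.
SALARY_KEYWORDS = ("salary", "payroll", "wages")


def keyword_accounts(transactions):
    """Return accounts with at least one description containing a salary keyword."""
    return {
        t["account"]
        for t in transactions
        if any(k in t["description"].lower() for k in SALARY_KEYWORDS)
    }


def consistent_amount_accounts(transactions, min_count=3, max_cv=0.05):
    """Return accounts whose amounts have a low coefficient of variation."""
    amounts = defaultdict(list)
    for t in transactions:
        amounts[t["account"]].append(t["amount"])
    result = set()
    for account, values in amounts.items():
        if len(values) < min_count:
            continue  # too few transactions to judge consistency
        avg = mean(values)
        if avg > 0 and pstdev(values) / avg <= max_cv:
            result.add(account)
    return result


transactions = [
    {"account": "A", "description": "ACME SALARY MAY", "amount": 3000.0},
    {"account": "A", "description": "ACME SALARY JUN", "amount": 3000.0},
    {"account": "A", "description": "ACME SALARY JUL", "amount": 3050.0},
    {"account": "B", "description": "Transfer", "amount": 1200.0},
    {"account": "B", "description": "Transfer", "amount": 1200.0},
    {"account": "B", "description": "Transfer", "amount": 1200.0},
    {"account": "C", "description": "Refund", "amount": 40.0},
]

by_keyword = keyword_accounts(transactions)
by_amount = consistent_amount_accounts(transactions)
overlap = by_keyword & by_amount  # accounts supported by both hypotheses
```

One plausible reading of the feature list is that accounts supported by several hypotheses are stronger candidates than accounts supported by only one; the set intersection here is a stand-in for that idea.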

Architecture

The project is organized into the following modules:

salary_analytics/
├── __init__.py
├── config.py           # Configuration settings
├── data_loader.py      # Database connection and data loading
├── keyword_analyzer.py # Keyword-based analysis
├── consistent_amount_analyzer.py # Consistent amount analysis
├── transaction_type_analyzer.py  # Transaction type analysis
├── salary_earner_analyzer.py     # Salary earner analysis
├── salary_predictor.py # Machine learning models
├── main.py            # Main pipeline
└── api.py             # FastAPI endpoints

Configuration

The system can be configured through environment variables using a .env file:

  1. Copy the example environment file:
cp .env.example .env
  2. Edit the .env file with your database credentials:
DB_USER=your_username
DB_PASSWORD=your_password
DB_NAME=your_database
DB_PORT=your_port
DB_HOST=your_host
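
For illustration, the variables above could be read in Python roughly as follows. This is a minimal sketch: the `DBSettings` class and `load_db_settings` function are hypothetical names, and it assumes the .env file has already been loaded into the process environment (for example by a tool such as python-dotenv).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DBSettings:
    user: str
    password: str
    name: str
    port: int
    host: str


def load_db_settings(env=os.environ):
    """Read the DB_* variables, failing early if any is missing or empty."""
    required = ("DB_USER", "DB_PASSWORD", "DB_NAME", "DB_PORT", "DB_HOST")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return DBSettings(
        user=env["DB_USER"],
        password=env["DB_PASSWORD"],
        name=env["DB_NAME"],
        port=int(env["DB_PORT"]),  # ports arrive as strings in the environment
        host=env["DB_HOST"],
    )


# Example with an explicit mapping instead of the real environment.
settings = load_db_settings(
    {
        "DB_USER": "analyst",
        "DB_PASSWORD": "secret",
        "DB_NAME": "transactions",
        "DB_PORT": "5432",
        "DB_HOST": "localhost",
    }
)
```

Failing at startup with the full list of missing variables tends to be easier to debug than a connection error later in the pipeline.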

Usage

Using the API

  1. Start the API server:
uvicorn salary_analytics.api:app --reload
  2. Access the API documentation:

API Endpoints

  1. Basic Endpoints

    • GET /: Welcome message
    • GET /health: Health check
  2. Data Loading

    • POST /load-data: Load transaction data
      • Parameters:
        • source: Data source ('db' or 'csv')
        • file: CSV file (required if source is 'csv')
      • Example:
        # Load from database
        curl -X POST "http://localhost:8000/load-data?source=db"
        
        # Load from CSV
        curl -X POST "http://localhost:8000/load-data?source=csv" -F "file=@path/to/your/file.csv"
        
  3. Analysis Endpoints

    • POST /analyze/keyword: Run keyword analysis
    • POST /analyze/consistent-amount: Run consistent amount analysis
    • POST /analyze/transaction-type: Run transaction type analysis
  4. Report Generation

    • POST /generate/reports: Generate all reports
    • GET /download/{report_type}: Download specific reports
      • Available types:
        • high_earners: High earner details
        • likely_earners: Likely salary earners
        • final_table: Final analysis table
        • consistent_plot: Consistent earners plot
        • inconsistent_plot: Inconsistent earners plot
        • hypothesis_plot: Hypothesis overlap plot
  5. Model Training

    • POST /train/models: Train prediction models
  6. Pipeline

    • POST /run/pipeline: Run complete pipeline
    • POST /run/streaming-pipeline: Run pipeline in batches
      • Parameters:
        • source: Data source ('db' or 'csv')
        • file: CSV file (required if source is 'csv')
        • batch_size: Number of rows to process in each batch (default: 10000)
      • Example:
        # Run streaming pipeline from database
        curl -X POST "http://localhost:8000/run/streaming-pipeline?source=db&batch_size=5000"
        
        # Run streaming pipeline from CSV
        curl -X POST "http://localhost:8000/run/streaming-pipeline?source=csv&batch_size=5000" -F "file=@path/to/your/file.csv"
        
      • Response:
        [
          {
            "batch_number": 1,
            "total_batches": 10,
            "processed_rows": 5000,
            "results_path": "/app/output/csv/batch_results_20240315_123456/batch_1_results.csv",
            "message": "Successfully processed batch 1 of 10"
          },
          // ... more batch responses ...
        ]
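
The batch response shown above could be produced by logic along these lines. This is only a sketch of fixed-size batching: `iter_batches`, `batch_status`, and the results directory name are hypothetical, and the real pipeline may split and report batches differently.

```python
import math


def iter_batches(rows, batch_size=10000):
    """Yield (batch_number, total_batches, batch) for fixed-size slices of rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    total_batches = math.ceil(len(rows) / batch_size)
    for index in range(total_batches):
        start = index * batch_size
        yield index + 1, total_batches, rows[start:start + batch_size]


def batch_status(batch_number, total_batches, batch, results_dir):
    """Build a status record shaped like the response entries above."""
    return {
        "batch_number": batch_number,
        "total_batches": total_batches,
        "processed_rows": len(batch),
        "results_path": f"{results_dir}/batch_{batch_number}_results.csv",
        "message": f"Successfully processed batch {batch_number} of {total_batches}",
    }


rows = list(range(12))  # stand-in for transaction rows
statuses = [
    batch_status(number, total, batch, "output/csv/batch_results_example")
    for number, total, batch in iter_batches(rows, batch_size=5)
]
```

With 12 rows and a batch size of 5, the last batch is smaller than the others, so `processed_rows` is reported per batch rather than assumed equal to `batch_size`.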
        

Workflow

  1. Start the API server
  2. Load data using the /load-data endpoint
  3. Run any of the analysis endpoints
  4. Generate and download reports as needed
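
The steps above can be sketched as a small Python client. The example only builds the requests and prints them; nothing is sent. `build_request` is a hypothetical helper, and the base URL assumes the server is running locally on port 8000 as in the rest of this README.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumed local server address


def build_request(path, method="POST", **params):
    """Create (but do not send) a request for one of the endpoints listed above."""
    query = f"?{urlencode(params)}" if params else ""
    return Request(f"{BASE_URL}{path}{query}", method=method)


# Steps 2 to 4 of the workflow, using the database as the data source.
workflow = [
    build_request("/load-data", source="db"),
    build_request("/analyze/keyword"),
    build_request("/generate/reports"),
    build_request("/download/high_earners", method="GET"),
]

for request in workflow:
    print(request.get_method(), request.full_url)
```

To actually call the API, each request can be passed to `urllib.request.urlopen`, which raises `urllib.error.HTTPError` for error responses such as a 400.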

For large datasets, use the streaming pipeline endpoint:

  1. Start the API server
  2. Run the streaming pipeline with appropriate batch size
  3. Monitor batch processing progress
  4. Access results in the batch results directory

Note: All analysis endpoints require data to be loaded first. If you try to run any analysis without loading data, you'll receive a 400 error with a message to load data first.
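
The "load first" rule can be pictured as a guard that every analysis step calls before doing any work. The sketch below uses plain Python so it runs without a server; `DataNotLoadedError` and `AnalysisSession` are hypothetical names, and in a FastAPI application the same check would typically raise `HTTPException(status_code=400, ...)` instead.

```python
class DataNotLoadedError(Exception):
    """Hypothetical error that an API layer could translate into HTTP 400."""

    status_code = 400


class AnalysisSession:
    """Holds loaded transactions and refuses to analyze before loading."""

    def __init__(self):
        self.transactions = None

    def load(self, transactions):
        self.transactions = list(transactions)

    def require_data(self):
        if self.transactions is None:
            raise DataNotLoadedError("No data loaded. Call /load-data first.")
        return self.transactions

    def count_transactions(self):
        # Stand-in for a real analysis step.
        return len(self.require_data())


session = AnalysisSession()
try:
    session.count_transactions()  # fails: nothing has been loaded yet
except DataNotLoadedError as exc:
    error = (exc.status_code, str(exc))

session.load([{"amount": 100.0}])
count = session.count_transactions()  # succeeds after loading
```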

Docker Deployment

  1. Build the Docker image:
docker-compose build
  2. Run the container with environment variables:
docker run -p 8000:8000 -v "$(pwd)/output:/app/output" \
           -e DB_USER=your_username \
           -e DB_PASSWORD=your_password \
           -e DB_NAME=your_database \
           -e DB_PORT=your_port \
           -e DB_HOST=your_host \
           salary-analytics

The API will be available at http://localhost:8000

Output Structure

output/
├── csv/
│   ├── high_earner_details.csv
│   ├── likely_salary_earner.csv
│   └── final_table.csv
├── plots/
│   ├── consistent_earners_predictions.png
│   ├── inconsistent_earners_predictions.png
│   └── hypothesis_overlap.png
└── models/
    ├── consistent_model.joblib
    ├── inconsistent_model.joblib
    ├── consistent_scaler.joblib
    └── inconsistent_scaler.joblib
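
The report types accepted by GET /download/{report_type} appear to correspond to the CSV and plot files in this tree. The mapping below is an assumption inferred from the file names, not taken from the project's code; `resolve_report` and `missing_reports` are hypothetical helpers.

```python
from pathlib import Path

OUTPUT_DIR = Path("output")

# Assumed mapping from /download/{report_type} names to the files listed above.
REPORT_FILES = {
    "high_earners": OUTPUT_DIR / "csv" / "high_earner_details.csv",
    "likely_earners": OUTPUT_DIR / "csv" / "likely_salary_earner.csv",
    "final_table": OUTPUT_DIR / "csv" / "final_table.csv",
    "consistent_plot": OUTPUT_DIR / "plots" / "consistent_earners_predictions.png",
    "inconsistent_plot": OUTPUT_DIR / "plots" / "inconsistent_earners_predictions.png",
    "hypothesis_plot": OUTPUT_DIR / "plots" / "hypothesis_overlap.png",
}


def resolve_report(report_type):
    """Return the path for a report type, or raise KeyError for unknown types."""
    try:
        return REPORT_FILES[report_type]
    except KeyError:
        raise KeyError(f"Unknown report type: {report_type!r}") from None


def missing_reports():
    """List report types whose files have not been generated yet."""
    return sorted(name for name, path in REPORT_FILES.items() if not path.exists())
```

Calling `missing_reports()` after POST /generate/reports is one way to check which outputs were actually written before attempting a download.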