
Salary Analytics

A comprehensive salary analytics system that analyzes transaction data to identify salary earners, predict future salaries, and generate detailed reports.

Features

  • Transaction Analysis

    • Keyword-based salary transaction identification
    • Consistent amount transaction analysis
    • Transaction type analysis
    • Hypothesis overlap visualization
  • Salary Earner Classification

    • Verified salary earner identification
    • Likely salary earner identification
    • High earner detection
    • Salary pattern analysis
  • Machine Learning

    • Salary prediction models
    • Separate models for consistent and inconsistent earners
    • Feature engineering
    • Model evaluation metrics
    • Model persistence (saved in output/models)
  • Reporting

    • CSV report generation
    • Visualization plots
    • High earner details
    • Salary earner statistics
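
The first two hypotheses in the feature list, keyword matching and consistent amounts, can be illustrated with a small sketch. This is not the project's implementation: the keyword list, the 5% tolerance, and the minimum of three credits are assumptions chosen purely for illustration.

```python
from statistics import median

# Hypothetical keyword list; the project's real keywords are not documented here.
SALARY_KEYWORDS = ("salary", "sal", "payroll", "wages")


def has_salary_keyword(narration: str) -> bool:
    """Return True if any assumed keyword appears as a whole word in the narration."""
    words = narration.lower().replace("/", " ").split()
    return any(keyword in words for keyword in SALARY_KEYWORDS)


def has_consistent_amounts(
    amounts: list[float], tolerance: float = 0.05, min_count: int = 3
) -> bool:
    """Return True if at least `min_count` credits all lie within `tolerance` of their median."""
    if len(amounts) < min_count:
        return False
    mid = median(amounts)
    return all(abs(amount - mid) <= tolerance * abs(mid) for amount in amounts)


if __name__ == "__main__":
    print(has_salary_keyword("MAY SALARY PAYMENT"))
    print(has_consistent_amounts([250000.0, 252000.0, 249500.0]))
```

An account could then be treated as a likely salary earner when either check passes, and as a stronger candidate when both do, which is roughly what a hypothesis overlap plot would show.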

Architecture

The project is organized into the following modules:

salary_analytics/
├── __init__.py
├── config.py           # Configuration settings
├── data_loader.py      # Database connection and data loading
├── keyword_analyzer.py # Keyword-based analysis
├── consistent_amount_analyzer.py # Consistent amount analysis
├── transaction_type_analyzer.py  # Transaction type analysis
├── salary_earner_analyzer.py     # Salary earner analysis
├── salary_predictor.py # Machine learning models
├── main.py            # Main pipeline
└── api.py             # FastAPI endpoints

Configuration

The system can be configured through environment variables using a .env file:

  1. Copy the example environment file:
cp .env.example .env
  2. Edit the .env file with your database credentials:
DB_USER=your_username
DB_PASSWORD=your_password
DB_NAME=your_database
DB_PORT=your_port
DB_HOST=your_host
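
As a rough illustration of how these variables might be consumed, the sketch below reads them from the process environment and fails early if any are missing. The `DBSettings` class and its `from_env` method are hypothetical names, not taken from config.py; it also assumes the .env file has already been loaded into the environment (for example with python-dotenv's `load_dotenv()`, if the project uses it).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DBSettings:
    """Hypothetical container for the five DB_* variables listed above."""

    user: str
    password: str
    name: str
    host: str
    port: int

    @classmethod
    def from_env(cls, env=None) -> "DBSettings":
        """Build settings from a mapping (defaults to os.environ), failing on gaps."""
        env = os.environ if env is None else env
        required = ("DB_USER", "DB_PASSWORD", "DB_NAME", "DB_HOST", "DB_PORT")
        missing = [key for key in required if not env.get(key)]
        if missing:
            raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
        return cls(
            user=env["DB_USER"],
            password=env["DB_PASSWORD"],
            name=env["DB_NAME"],
            host=env["DB_HOST"],
            port=int(env["DB_PORT"]),  # DB_PORT must be numeric
        )
```

Validating at startup like this surfaces a misconfigured .env file before the first database query rather than in the middle of a pipeline run.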

Usage

Using the API

  1. Start the API server:
uvicorn salary_analytics.api:app --reload
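  2. Access the API documentation. FastAPI serves interactive docs at /docs by default (http://localhost:8000/docs with the command above), unless the app disables them.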

API Endpoints

  1. Basic Endpoints

    • GET /: Welcome message
    • GET /health: Health check
  2. Data Loading

    • POST /load-data: Load transaction data
      • Parameters:
        • source: Data source ('db' or 'csv')
        • file: CSV file (required if source is 'csv')
      • Example:
        # Load from database
        curl -X POST "http://localhost:8000/load-data?source=db"
        
        # Load from CSV
        curl -X POST "http://localhost:8000/load-data?source=csv" -F "file=@path/to/your/file.csv"
        
  3. Analysis Endpoints

    • POST /analyze/keyword: Run keyword analysis
    • POST /analyze/consistent-amount: Run consistent amount analysis
    • POST /analyze/transaction-type: Run transaction type analysis
  4. Report Generation

    • POST /generate/reports: Generate all reports
    • GET /download/{report_type}: Download specific reports
      • Available types:
        • high_earners: High earner details
        • likely_earners: Likely salary earners
        • final_table: Final analysis table
        • consistent_plot: Consistent earners plot
        • inconsistent_plot: Inconsistent earners plot
        • hypothesis_plot: Hypothesis overlap plot
  5. Model Training

    • POST /train/models: Train prediction models
  6. Pipeline

    • POST /run/pipeline: Run complete pipeline
    • POST /run/streaming-pipeline: Run pipeline in batches
      • Parameters:
        • source: Data source ('db' or 'csv')
        • file: CSV file (required if source is 'csv')
        • batch_size: Number of rows to process in each batch (default: 10000)
      • Example:
        # Run streaming pipeline from database
        curl -X POST "http://localhost:8000/run/streaming-pipeline?source=db&batch_size=5000"
        
        # Run streaming pipeline from CSV
        curl -X POST "http://localhost:8000/run/streaming-pipeline?source=csv&batch_size=5000" -F "file=@path/to/your/file.csv"
        
      • Response:
        [
          {
            "batch_number": 1,
            "total_batches": 10,
            "processed_rows": 5000,
            "results_path": "/app/output/csv/batch_results_20240315_123456/batch_1_results.csv",
            "message": "Successfully processed batch 1 of 10"
          },
          // ... more batch responses ...
        ]
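
To make the batch fields in the response easier to read, here is a minimal sketch of how a row set could be split into batches and described with the same keys. It is an illustration only: `iter_batches` is a hypothetical helper, it assumes `processed_rows` counts the rows in each individual batch, and it omits `results_path` because the real path layout is decided by the server.

```python
import math
from typing import Iterator, Sequence


def iter_batches(rows: Sequence, batch_size: int = 10000) -> Iterator[dict]:
    """Yield one response-like record per batch of `batch_size` rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # The last batch may be smaller, so round the batch count up.
    total_batches = math.ceil(len(rows) / batch_size)
    for index in range(total_batches):
        batch = rows[index * batch_size:(index + 1) * batch_size]
        number = index + 1
        yield {
            "batch_number": number,
            "total_batches": total_batches,
            "processed_rows": len(batch),
            "message": f"Successfully processed batch {number} of {total_batches}",
        }


if __name__ == "__main__":
    for record in iter_batches(list(range(12)), batch_size=5):
        print(record)
```

Under these assumptions, 50,000 rows with batch_size=5000 would produce ten records, matching the "batch 1 of 10" shape in the example response.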
        

Workflow

  1. Start the API server
  2. Load data using the /load-data endpoint
  3. Run any of the analysis endpoints
  4. Generate and download reports as needed

For large datasets, use the streaming pipeline endpoint:

  1. Start the API server
  2. Run the streaming pipeline with appropriate batch size
  3. Monitor batch processing progress
  4. Access results in the batch results directory
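
Steps 3 and 4 can be sketched on the client side by summarizing the list of batch records that the streaming endpoint returns. The helper name `summarize_batches` is hypothetical, and the sketch assumes the response body is valid JSON with the fields shown in the example response.

```python
import json


def summarize_batches(response_text: str) -> dict:
    """Summarize a streaming-pipeline response (a JSON list of batch records)."""
    batches = json.loads(response_text)
    completed = max((b["batch_number"] for b in batches), default=0)
    total = max((b["total_batches"] for b in batches), default=0)
    return {
        "completed": completed,
        "total": total,
        "done": total > 0 and completed == total,
        # Paths are reported by the server; they refer to its filesystem.
        "results_paths": [b["results_path"] for b in batches],
    }


if __name__ == "__main__":
    sample = json.dumps([
        {"batch_number": 1, "total_batches": 2, "processed_rows": 5000,
         "results_path": "batch_1_results.csv", "message": "ok"},
    ])
    print(summarize_batches(sample))
```

Because the reported paths live inside the server (or container), reading the files from the host generally requires the output directory to be mounted, as in the Docker section.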

Note: All analysis endpoints require data to be loaded first. If you try to run any analysis without loading data, you'll receive a 400 error with a message to load data first.
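
A client can handle this 400 response by loading data and retrying once. The sketch below uses only the standard library; `BASE_URL`, `post`, and `analyze_with_retry` are hypothetical names, and it assumes that a 400 from an analysis endpoint means data has not been loaded, which may not hold for every error the API can return.

```python
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local server address


def post(path: str) -> tuple[int, str]:
    """POST an empty body to the API and return (status_code, response_text)."""
    request = urllib.request.Request(BASE_URL + path, data=b"", method="POST")
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as error:
        # Non-2xx responses, such as the 400 described above, arrive as HTTPError.
        return error.code, error.read().decode("utf-8")


def analyze_with_retry(analysis_path: str, post_fn=post) -> tuple[int, str]:
    """Run one analysis; on a 400, load data from the database and retry once."""
    status, body = post_fn(analysis_path)
    if status == 400:
        load_status, load_body = post_fn("/load-data?source=db")
        if not 200 <= load_status < 300:
            return load_status, load_body  # loading failed; report that instead
        status, body = post_fn(analysis_path)
    return status, body
```

For example, `analyze_with_retry("/analyze/keyword")` would call the keyword analysis and fall back to `/load-data?source=db` only when needed. The `post_fn` parameter exists so the logic can be exercised without a running server.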

Docker Deployment

  1. Build the Docker image:
docker-compose build
  2. Run the container with environment variables:
docker run -v $(pwd)/output:/app/output \
           -e DB_USER=your_username \
           -e DB_PASSWORD=your_password \
           -e DB_NAME=your_database \
           -e DB_PORT=your_port \
           -e DB_HOST=your_host \
           salary-analytics

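The API will be available at http://localhost:8000 once the container's port is published to the host (for example by adding -p 8000:8000 to docker run, assuming the server listens on port 8000 inside the container).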

Output Structure

output/
├── csv/
│   ├── high_earner_details.csv
│   ├── likely_salary_earner.csv
│   └── final_table.csv
├── plots/
│   ├── consistent_earners_predictions.png
│   ├── inconsistent_earners_predictions.png
│   └── hypothesis_overlap.png
└── models/
    ├── consistent_model.joblib
    ├── inconsistent_model.joblib
    ├── consistent_scaler.joblib
    └── inconsistent_scaler.joblib
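
After a pipeline run, a quick check for which report files exist can be sketched as follows. The pairing of download report types with files in the tree above is inferred from their names and is an assumption, not something the API is confirmed to guarantee; `REPORT_FILES` and `missing_reports` are hypothetical names.

```python
from pathlib import Path

# Assumed mapping from /download/{report_type} values to files under output/.
REPORT_FILES = {
    "high_earners": "csv/high_earner_details.csv",
    "likely_earners": "csv/likely_salary_earner.csv",
    "final_table": "csv/final_table.csv",
    "consistent_plot": "plots/consistent_earners_predictions.png",
    "inconsistent_plot": "plots/inconsistent_earners_predictions.png",
    "hypothesis_plot": "plots/hypothesis_overlap.png",
}


def missing_reports(output_dir) -> list[str]:
    """Return the report types whose expected file is absent under `output_dir`."""
    root = Path(output_dir)
    return sorted(
        name for name, relative in REPORT_FILES.items()
        if not (root / relative).is_file()
    )


if __name__ == "__main__":
    print(missing_reports("output"))
```

An empty list would suggest that all six reports were generated; any names returned point to a step that has not run yet or that failed.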