
# Salary Analytics

A comprehensive salary analytics system that analyzes transaction data to identify salary earners, predict future salaries, and generate detailed reports.

## Features

- **Transaction Analysis**
  - Keyword-based salary transaction identification
  - Consistent amount transaction analysis
  - Transaction type analysis
  - Hypothesis overlap visualization

- **Salary Earner Classification**
  - Identification of verified salary earners
  - Identification of likely salary earners
  - High earner detection
  - Salary pattern analysis

- **Machine Learning**
  - Salary prediction models
  - Separate models for consistent and inconsistent earners
  - Feature engineering
  - Model evaluation metrics
  - Model persistence (saved in output/models)

- **Reporting**
  - CSV report generation
  - Visualization plots
  - High earner details
  - Salary earner statistics

## Architecture

The project is organized into the following modules:

```
salary_analytics/
├── __init__.py
├── config.py           # Configuration settings
├── data_loader.py      # Database connection and data loading
├── keyword_analyzer.py # Keyword-based analysis
├── consistent_amount_analyzer.py # Consistent amount analysis
├── transaction_type_analyzer.py  # Transaction type analysis
├── salary_earner_analyzer.py     # Salary earner analysis
├── salary_predictor.py # Machine learning models
├── main.py            # Main pipeline
└── api.py             # FastAPI endpoints
```

## Configuration

The system can be configured through environment variables using a `.env` file:

1. Copy the example environment file:
```bash
cp .env.example .env
```

2. Edit the `.env` file with your database credentials:
```bash
DB_USER=your_username
DB_PASSWORD=your_password
DB_NAME=your_database
DB_PORT=your_port
DB_HOST=your_host
```
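
As an illustration of how these variables might be consumed, here is a minimal sketch that reads the five `DB_*` settings and fails early if any is missing. The helper name `load_db_settings` is hypothetical and is not taken from `config.py`. The sketch also assumes the values from `.env` have already been exported into the process environment (for example by the shell or a dotenv loader).

```python
import os

# The five variables listed in the .env example above.
REQUIRED_VARS = ("DB_USER", "DB_PASSWORD", "DB_NAME", "DB_PORT", "DB_HOST")


def load_db_settings(environ=None):
    """Return the DB_* settings as a dict, raising if any are missing.

    Hypothetical helper for illustration; the real config.py may differ.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    settings = {name: environ[name] for name in REQUIRED_VARS}
    # Convert the port to an integer so a typo fails here, not at connect time.
    settings["DB_PORT"] = int(settings["DB_PORT"])
    return settings
```

Passing a plain dict instead of relying on `os.environ` makes the check easy to exercise in tests.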

## Usage

### Using the API

1. Start the API server:
```bash
uvicorn salary_analytics.api:app --reload
```

2. Access the API documentation:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc

### API Endpoints

1. **Basic Endpoints**
   - `GET /`: Welcome message
   - `GET /health`: Health check

2. **Data Loading**
   - `POST /load-data`: Load transaction data
     - Parameters:
       - `source`: Data source ('db' or 'csv')
       - `file`: CSV file (required if source is 'csv')
     - Example:
       ```bash
       # Load from database
       curl -X POST "http://localhost:8000/load-data?source=db"

       # Load from CSV
       curl -X POST "http://localhost:8000/load-data?source=csv" -F "file=@path/to/your/file.csv"
       ```

3. **Analysis Endpoints**
   - `POST /analyze/keyword`: Run keyword analysis
   - `POST /analyze/consistent-amount`: Run consistent amount analysis
   - `POST /analyze/transaction-type`: Run transaction type analysis

4. **Report Generation**
   - `POST /generate/reports`: Generate all reports
   - `GET /download/{report_type}`: Download specific reports
     - Available types:
       - `high_earners`: High earner details
       - `likely_earners`: Likely salary earners
       - `final_table`: Final analysis table
       - `consistent_plot`: Consistent earners plot
       - `inconsistent_plot`: Inconsistent earners plot
       - `hypothesis_plot`: Hypothesis overlap plot

5. **Model Training**
   - `POST /train/models`: Train prediction models

6. **Pipeline**
   - `POST /run/pipeline`: Run complete pipeline
   - `POST /run/streaming-pipeline`: Run pipeline in batches
     - Parameters:
       - `source`: Data source ('db' or 'csv')
       - `file`: CSV file (required if source is 'csv')
       - `batch_size`: Number of rows to process in each batch (default: 10000)
     - Example:
       ```bash
       # Run streaming pipeline from database
       curl -X POST "http://localhost:8000/run/streaming-pipeline?source=db&batch_size=5000"

       # Run streaming pipeline from CSV
       curl -X POST "http://localhost:8000/run/streaming-pipeline?source=csv&batch_size=5000" -F "file=@path/to/your/file.csv"
       ```
     - Response (one object per batch; only the first is shown here):
       ```json
       [
         {
           "batch_number": 1,
           "total_batches": 10,
           "processed_rows": 5000,
           "results_path": "/app/output/csv/batch_results_20240315_123456/batch_1_results.csv",
           "message": "Successfully processed batch 1 of 10"
         }
       ]
       ```
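
The streaming response can be condensed into a progress summary on the client side. The sketch below uses only the field names from the example response above. The function name `summarize_batches` is hypothetical, and the code assumes the endpoint returns the whole JSON array in a single response body.

```python
import json


def summarize_batches(response_text):
    """Summarize a streaming-pipeline response (a JSON array of batch objects)."""
    batches = json.loads(response_text)
    return {
        "batches_received": len(batches),
        # Every batch object repeats the total, so read it from the first one.
        "total_batches": batches[0]["total_batches"] if batches else 0,
        "rows_processed": sum(b["processed_rows"] for b in batches),
        "result_files": [b["results_path"] for b in batches],
    }
```

Comparing `batches_received` with `total_batches` is one simple way to check whether every batch was reported.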

### Workflow

1. Start the API server
2. Load data using the `/load-data` endpoint
3. Run any of the analysis endpoints
4. Generate and download reports as needed

For large datasets, use the streaming pipeline endpoint:
1. Start the API server
2. Run the streaming pipeline with appropriate batch size
3. Monitor batch processing progress
4. Access results in the batch results directory

Note: All analysis endpoints require data to be loaded first. Calling one before loading data returns a 400 error with a message asking you to load data.
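
The basic workflow can also be scripted. The following is a minimal sketch using only the Python standard library and the endpoints listed in this README. The helper names (`build_request`, `call`, `run_basic_workflow`) are hypothetical, and the base URL assumes the server is running locally on port 8000.

```python
import urllib.error
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local address of the API server


def build_request(path, method="POST", **params):
    """Build a request for an endpoint, encoding params as a query string."""
    url = BASE_URL + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(url, method=method)


def call(request):
    """Send a request and return (status, body); HTTP errors are returned, not raised."""
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # A 400 from an analysis endpoint means data has not been loaded yet.
        return err.code, err.read().decode("utf-8")


def run_basic_workflow():
    """Load from the database, run keyword analysis, then generate reports."""
    steps = [
        build_request("/load-data", source="db"),
        build_request("/analyze/keyword"),
        build_request("/generate/reports"),
    ]
    for request in steps:
        status, body = call(request)
        print(request.get_method(), request.full_url, "->", status)
        if status >= 400:
            print("Stopping:", body)
            break
```

Stopping at the first failed step keeps a missing `/load-data` call from producing a chain of 400 responses. Connection failures (server not running) are not handled here and surface as `urllib.error.URLError`.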

## Docker Deployment

1. Build the Docker image:
```bash
docker-compose build
```

2. Run the container, publishing port 8000 and passing the database settings as environment variables:
```bash
docker run -p 8000:8000 \
           -v "$(pwd)/output:/app/output" \
           -e DB_USER=your_username \
           -e DB_PASSWORD=your_password \
           -e DB_NAME=your_database \
           -e DB_PORT=your_port \
           -e DB_HOST=your_host \
           salary-analytics
```

The API will be available at http://localhost:8000
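
A container can take a few seconds to start accepting requests. As a rough sketch, the script below polls the `GET /health` endpoint until it answers with HTTP 200. The function name `wait_for_api` and its parameters are hypothetical, and the check relies only on the status code because the body of the health response is not documented here.

```python
import time
import urllib.error
import urllib.request


def wait_for_api(url="http://localhost:8000/health", attempts=10, delay=1.0,
                 probe=None, sleep=time.sleep):
    """Poll the health endpoint until it returns HTTP 200 or attempts run out.

    Returns the number of tries it took, or None if the API never responded.
    """
    def default_probe(target):
        try:
            with urllib.request.urlopen(target, timeout=2) as response:
                return response.status == 200
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or an HTTP error status.
            return False

    probe = probe or default_probe
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        sleep(delay)
    return None
```

The `probe` and `sleep` arguments exist only so the loop can be exercised without a running server.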

## Output Structure

```
output/
├── csv/
│   ├── high_earner_details.csv
│   ├── likely_salary_earner.csv
│   └── final_table.csv
├── plots/
│   ├── consistent_earners_predictions.png
│   ├── inconsistent_earners_predictions.png
│   └── hypothesis_overlap.png
└── models/
    ├── consistent_model.joblib
    ├── inconsistent_model.joblib
    ├── consistent_scaler.joblib
    └── inconsistent_scaler.joblib
```
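
After a pipeline run, the tree above can be checked programmatically. This is a small sketch rather than part of the project. The relative paths are copied from the listing, while the function name `missing_outputs` and the default `output` directory are assumptions.

```python
from pathlib import Path

# Relative paths copied from the output tree above.
EXPECTED_OUTPUTS = [
    "csv/high_earner_details.csv",
    "csv/likely_salary_earner.csv",
    "csv/final_table.csv",
    "plots/consistent_earners_predictions.png",
    "plots/inconsistent_earners_predictions.png",
    "plots/hypothesis_overlap.png",
    "models/consistent_model.joblib",
    "models/inconsistent_model.joblib",
    "models/consistent_scaler.joblib",
    "models/inconsistent_scaler.joblib",
]


def missing_outputs(output_dir="output"):
    """Return the expected files that are not present under output_dir."""
    root = Path(output_dir)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]
```

An empty list means every report, plot, and model file listed above was written.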