Salary Prediction from Transaction Data¶

1. Introduction¶

This report details the process of identifying salary earners and predicting their monthly salary using transaction data from a database. The approach combines rule-based heuristics (based on financial domain knowledge) and machine learning to achieve this goal.

2. Data Source¶

The analysis uses transaction data from the customer_account_transaction_hx table in a PostgreSQL database. The table contains information about customer transactions, including transaction type, subtype, amount, description, and timestamps.

3. Methodology¶

3.1. Identifying Salary Earners¶

Four hypotheses are used to identify potential salary earners:

  1. Keyword-Based: Transactions with descriptions containing salary-related keywords (e.g., "salary," "payroll") and initiated by 'C' (credit) are flagged as potential salary transactions.

  2. Regular Intervals: Transactions occurring at regular monthly intervals (with some tolerance) and initiated by 'C' (credit) are considered indicative of salary payments.

  3. Consistent Amounts: Accounts with low variance in transaction amounts for credit transactions are flagged as potentially receiving consistent salary payments.

  4. Transaction Type/Subtype Combination: Transactions matching specific combinations of transaction type, subtype (e.g., 'T' for Transfer, 'BI' for Bank Initiated), and initiated by 'C' (credit) are considered potential salary transactions.

These hypotheses are evaluated individually and in combination using Venn diagrams to identify accounts with the strongest evidence of being salary earners. Accounts flagged by all three hypotheses are considered "verified salary earners," while those flagged by a subset of hypotheses are considered "likely salary earners."

3.2. Salary Prediction¶

For verified and likely salary earners, a machine learning model is trained to predict their monthly salary. This involves:

  1. Feature Engineering:

    • Extracting the month of the transaction.
    • Creating a month sequence for each account.
    • One-hot encoding the transaction type (trx_type).
    • Calculating rolling sum and average of transaction amounts over a 3-month window.
  2. Data Preparation:

    • Filtering transactions for the selected accounts.
    • Dropping unnecessary columns.
    • Aggregating data monthly using the engineered features.
    • Filtering accounts with at least 12 months of data.
    • Creating training and testing sets using a sliding window approach.
  3. Model Training and Evaluation:

    • A Random Forest Regressor is trained on the scaled training data.
    • Model performance is evaluated on the testing data using metrics like MAE, RMSE, and R-squared.

Import Libraries¶

InĀ [1]:
!pip install sqlalchemy psycopg2-binary --quiet
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.0/3.0 MB 27.3 MB/s eta 0:00:00
InĀ [52]:
from sqlalchemy import create_engine, text
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
import calendar
from datetime import datetime
import re
import seaborn as sns
from matplotlib_venn import venn3, venn2
from wordcloud import WordCloud
from datetime import datetime, timedelta
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder

Import Data from Database¶

  • Connect to postgres engine
  • Load table customer_account_transaction_hx from database
InĀ [3]:
DB_USER = "salaryloan"
DB_PASSWORD = "salaryloan"
DB_NAME = "salaryloan"
DB_PORT = "10532"
DB_HOST = "dev-data.simbrellang.net"
DATABASE_URL = f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}"

engine = create_engine(DATABASE_URL)

try:
    with engine.connect() as conn:
        result = conn.execute(text("SELECT version();"))
        print("Connected successfully!")
except Exception as e:
    print("Error connecting to database:", e)
Connected successfully!
InĀ [4]:
table_name = "customer_account_transaction_hx"

df = pd.read_sql(f"SELECT * FROM {table_name}", engine)
df.head(10)
Out[4]:
id accountid trx_start_date trx_end_date amount d1 d2 d3 description d4
0 511 1659404015 2024-03-24 2024-03-24 204.0 T CI D ATM PUR @ FCMBGOMBE NEWMKT ATM3 GOMBE ... 1659404
1 512 1659404015 2024-03-22 2024-03-22 107.0 T CI D ATM PUR @ FCMBGOMBE NEWMKT ATM3 GOMBE ... 1659404
2 513 1659404015 2024-03-18 2024-03-18 210.0 T CI C ATM PUR RR @ FCMB NEW MKT RD, GOMBE II ... 1659404
3 514 1659404015 2024-03-21 2024-03-21 324.0 T CI D ATM PUR @ FCMB NEW MKT RD, GOMBE II ... 1659404
4 515 1659404015 2024-03-18 2024-03-18 212.0 T CI D ATM PUR @ FCMB NEW MKT RD, GOMBE II ... 1659404
5 516 1659404015 2024-03-17 2024-03-17 206.0 T CI D ATM PUR @ FCMB NEW MKT RD, GOMBE II ... 1659404
6 517 1659404015 2024-03-17 2024-03-17 202.0 T CI D ATM PUR @ FCMB NEW MKT RD, GOMBE II ... 1659404
7 518 1659404015 2024-03-22 2024-03-22 315.0 T CI D ATM PUR @ FCMB NEW MKT RD, GOMBE II ... 1659404
8 519 1659404015 2024-03-21 2024-03-21 216.0 T CI C ATM PUR RR @ FCMB NEW MKT RD, GOMBE II ... 1659404
9 520 2014836018 2024-03-22 2024-03-22 10200.0 T CI C NIP FRM ESIEN ESESIEN EDET-NIP FROM ESIEN ES 2014836

Data Overiew and EDA¶

InĀ [5]:
df.shape
Out[5]:
(5354307, 10)
InĀ [6]:
df.dtypes
Out[6]:
0
id int64
accountid object
trx_start_date datetime64[ns]
trx_end_date datetime64[ns]
amount float64
d1 object
d2 object
d3 object
description object
d4 object

InĀ [7]:
df.nunique()
Out[7]:
0
id 5354307
accountid 57496
trx_start_date 182
trx_end_date 182
amount 82941
d1 3
d2 10
d3 2
description 218834
d4 57440

There are 57440 unique custormers

InĀ [8]:
for i in ['d1', 'd2', 'd3']:
    unique_values = df[i].unique()
    print(i, ':', unique_values)
d1 : ['T' 'C' 'L']
d2 : ['CI' 'BI' 'SI' 'NP' 'NR' 'I' 'BS' 'O' 'IP' 'SC']
d3 : ['D' 'C']

T: Transfer. This might indicate a transfer of funds between accounts or within the same account.

C: Credit. This would represent funds being added to an account, like a deposit, salary payment, or refund.

L: Could stand for Loan. This would represent funds being disbursed as part of a loan agreement.

InĀ [9]:
# Check for missing values
df.isna().sum()
Out[9]:
0
id 0
accountid 0
trx_start_date 0
trx_end_date 0
amount 0
d1 0
d2 0
d3 0
description 0
d4 0

InĀ [10]:
# Change to date column to datetime
df['trx_start_date'] = pd.to_datetime(df['trx_start_date'])
df['trx_end_date'] = pd.to_datetime(df['trx_end_date'])
InĀ [11]:
# Check for days difference
inconsistent_dates = df[df['trx_start_date'] != df['trx_end_date']]
print(inconsistent_dates.head(10))
Empty DataFrame
Columns: [id, accountid, trx_start_date, trx_end_date, amount, d1, d2, d3, description, d4]
Index: []

Apparently all transactions started and ended on the same day

InĀ [12]:
# Rename columns
df = df.rename(columns={'d1': 'trx_type', 'd2': 'trx_subtype',
                        'd3': 'initiated_by', 'd4': 'customer_id'})
print(df.columns)
Index(['id', 'accountid', 'trx_start_date', 'trx_end_date', 'amount',
       'trx_type', 'trx_subtype', 'initiated_by', 'description',
       'customer_id'],
      dtype='object')

Monthly Salary/Income Detection¶

Hypothesis 1: Keywords in Descriptions for a Credit Transaction¶

Criteria: Search for common salary/income-related keywords in the description column.

  • Rationale: Employers often label salary payments with descriptive terms like "salary," "wages," "payroll", "income" or company-specific identifiers.
  • Implementation: Use string matching or regular expressions to detect relevant keywords in the description field. Plus it must be a credit.
InĀ [13]:
keywords = [
    "salary", "payroll", "income", "wage", "wages",
    "earnings", "earning", "monthly pay", "net pay", "gross pay", "compensation",
    "monthlypay", "netpay", "grosspay",
    "remuneration", "stipend", "allowance", "bonus", "commission",
    "pension", "retirement", "dividend", "benefits", "reimbursement",
    "overtime", "incentive", "paycheck", "paycheque", "salary advance",
    "monthly income", "income tax refund", "employer deposit",
    "payroll deposit", "salary credit", "income credit", "salary transfer",
    "income transfer", "salary received", "income received", "hr deposit",
    "company deposit", "employer payment", "employee payment"
]


def identify_salary_transactions(df, keywords):
    """
    Identifies potential salary-related transactions based on keywords
    and month-year patterns in the 'description' column.

    Args:
        df (pd.DataFrame): The input DataFrame containing transaction data.
        keywords (list): A list of salary/income-related keywords to search for.

    Returns:
        pd.DataFrame: The input DataFrame with an added 'is_salary_related' column
                      indicating potential salary transactions.
    """
    month_year_patterns = [
        r"\b(?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s?\d{2,4}\b",
        r"\b(?:JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER)\s?\d{2,4}\b"
    ]

    escaped_keywords = [re.escape(keyword.lower()) for keyword in keywords]
    combined_pattern = (
        r'\b(?:' + '|'.join(escaped_keywords) + r')\b|' +
        '|'.join(month_year_patterns)
        )

    df['is_salary_related'] = df['description'].str.lower().str.contains(
        combined_pattern,
        na=False,
        regex=True
    )

    return df
InĀ [14]:
desc_df = identify_salary_transactions(df, keywords)
desc_data = desc_df[(desc_df['is_salary_related'] == True) & (desc_df['initiated_by'] == 'C')]
desc_data.head()
Out[14]:
id accountid trx_start_date trx_end_date amount trx_type trx_subtype initiated_by description customer_id is_salary_related
21 532 2089119018 2024-03-25 2024-03-25 62208.0000 T BI C March 2018 Allowance 2089119 True
63 574 2099839010 2024-03-24 2024-03-24 45396.0000 T BI C March 2018 Allowance 2099839 True
779 1034 1776036014 2024-03-25 2024-03-25 366.8219 T BS C SALARY FOR MAR 2018 1776036 True
1208 1460 2274196017 2024-03-21 2024-03-21 24401.5200 T CI C wages/ARIBIFO BABATUNDE 2274196 True
1214 1466 2175103017 2024-03-23 2024-03-23 34300.0000 T BI C March 2018 Allowance 2175103 True
InĀ [15]:
desc_data.shape
Out[15]:
(10808, 11)
InĀ [16]:
desc_data['accountid'].nunique()
Out[16]:
756
InĀ [17]:
desc_data['customer_id'].nunique()
Out[17]:
756
InĀ [19]:
flagged_keywords = desc_data[desc_data['is_salary_related']]['description'].str.lower().str.cat(sep=' ')
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(flagged_keywords)
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Keywords in Flagged Transactions')
plt.show()
No description has been provided for this image
InĀ [20]:
def plot_trx_type_and_subtype(data, trx_type_col='trx_type', trx_subtype_col='trx_subtype', figsize=(12, 6)):
    """
    Creates count plots for `trx_type` and `trx_subtype` in the provided DataFrame.

    Args:
        data (pd.DataFrame): Input DataFrame containing transaction data.
        trx_type_col (str): Column name for transaction type (default: 'trx_type').
        trx_subtype_col (str): Column name for transaction subtype (default: 'trx_subtype').
        figsize (tuple): Figure size for the plots (default: (12, 6)).
    """
    # Validate input columns
    if trx_type_col not in data.columns or trx_subtype_col not in data.columns:
        raise ValueError(f"Columns '{trx_type_col}' and/or '{trx_subtype_col}' not found in the DataFrame.")

    # Create subplots
    fig, axes = plt.subplots(1, 2, figsize=figsize)

    # Plot trx_type
    sns.countplot(
        data=data,
        x=trx_type_col,
        ax=axes[0],
        palette='viridis',
        order=data[trx_type_col].value_counts().index
    )
    axes[0].set_title(f'Count of {trx_type_col}')
    axes[0].set_xlabel(trx_type_col)
    axes[0].set_ylabel('Count')
    axes[0].tick_params(axis='x', rotation=45)

    # Plot trx_subtype
    sns.countplot(
        data=data,
        x=trx_subtype_col,
        ax=axes[1],
        palette='coolwarm',
        order=data[trx_subtype_col].value_counts().index
    )
    axes[1].set_title(f'Count of {trx_subtype_col}')
    axes[1].set_xlabel(trx_subtype_col)
    axes[1].set_ylabel('Count')
    axes[1].tick_params(axis='x', rotation=45)

    # Adjust layout
    plt.tight_layout()
    plt.show()
InĀ [21]:
plot_trx_type_and_subtype(desc_data)
No description has been provided for this image

Hypothesis 2: Regular Monthly Intervals¶

Goal: To identify potential salary earners by checking if they receive consistent payments at regular intervals, similar to how salaries are typically paid.

Logic:

  1. Focus on Credits: We only look at transactions where money is being added to an account (credits), as these are more likely to represent salary payments.
  2. Calculate Time Gaps: We measure the time between these credit transactions for each individual account.
  3. Check for Consistency: If the time gaps are fairly consistent (around 30 days, with some flexibility for weekends or holidays), it suggests a regular payment like a salary.
  4. Flag Regular Payments: We mark transactions that fall within this consistent pattern.
  5. Enough Regular Payments? For an account to be considered a likely salary earner, a significant portion of their credit transactions should be flagged as regular. We have set a minimum number of transactions and a threshold for the percentage of regular payments to be considered.

Example:

If a person consistently receives a payment around the 30th of every month for several months, this hypothesis would identify them as a potential salary earner.

InĀ [22]:
def identify_regular_intervals(df, tolerance=0.15):
    """
    Identifies transactions occurring at regular intervals based on 'C' initiated transactions.

    Args:
        df (pd.DataFrame): Transaction data with columns ['accountid', 'trx_start_date', 'initiated_by'].
        tolerance (float): Relative tolerance for interval deviation (default: 15%).

    Returns:
        pd.DataFrame: Input DataFrame with added 'is_regular_interval' column.
    """

    # Sort the DataFrame
    df = df.sort_values(by=['accountid', 'trx_start_date'])

    # Compute median interval based on 'C' transactions
    c_df = df[df['initiated_by'] == 'C'].copy()
    c_df['days_since_last_C_trx'] = c_df.groupby('accountid')['trx_start_date'].diff().dt.days
    median_intervals = c_df.groupby('accountid')['days_since_last_C_trx'].median().reset_index(name='median_interval')

    # Merge median_interval into the original DataFrame
    df = df.merge(median_intervals, on='accountid', how='left')

    # Track last 'C' transaction date for each account
    # Change this line:
    df['last_C_date'] = df.groupby('accountid')['trx_start_date'].transform(lambda x: x.where(df.loc[x.index, 'initiated_by'] == 'C').ffill())

    # Compute days_since_last_C_trx
    df['days_since_last_C_trx'] = (df['trx_start_date'] - df['last_C_date']).dt.days
    df['days_since_last_C_trx'] = df['days_since_last_C_trx'].where(df['last_C_date'].notna())

    # Determine if the interval is regular
    df['is_regular_interval'] = False
    valid_mask = df['days_since_last_C_trx'].notna() & df['median_interval'].notna()
    df.loc[valid_mask, 'is_regular_interval'] = df[valid_mask].apply(
        lambda row: abs(row['days_since_last_C_trx'] - row['median_interval']) <=
                    (tolerance * row['median_interval']),
        axis=1
    )

    # Clean up temporary columns
    df.drop(columns=['last_C_date', 'median_interval'], inplace=True, errors='ignore')

    return df

def is_salary_earner(group, min_transactions=3, threshold=0.7):
    """
    Determines if an account likely belongs to a salary earner.

    Args:
        group (pd.DataFrame): Transactions for a single account.
        min_transactions (int): Minimum number of transactions required.
        threshold (float): Minimum proportion of regular intervals required.

    Returns:
        bool: True if the account is likely a salary earner, False otherwise.
    """
    if len(group) < min_transactions:
        return False
    valid_ratio = group['is_regular_interval'].mean()
    return valid_ratio >= threshold
InĀ [23]:
interval_df = identify_regular_intervals(df)
salary_earners = interval_df.groupby('accountid').apply(is_salary_earner).reset_index(name='is_salary_earner')

interval_data = interval_df[
    (interval_df['is_regular_interval'] == True) &
    (interval_df['initiated_by'] == 'C')
]

display(interval_data.head())
id accountid trx_start_date trx_end_date amount trx_type trx_subtype initiated_by description customer_id is_salary_related days_since_last_C_trx is_regular_interval
299180 102834 0979170020 2024-03-14 2024-03-14 29680.0 T CI C NIP FRM IBRAHIM AUDI-INTBNKRET MASUR 0979170 False 0.0 True
299181 102835 0979170020 2024-03-15 2024-03-15 5050.0 T CI C NIP FRM ZAYYANU MUHAMMED-UnionMobile Trf from 0979170 False 0.0 True
299182 102828 0979170020 2024-03-16 2024-03-16 5350.0 T CI C NIP FRM KABIRU AMINU DANTANI-USSD KABIRU AMINU 0979170 False 0.0 True
299187 102833 0979170020 2024-03-17 2024-03-17 16350.0 T CI C NIP FRM 2348066269231-PWC USSD TF 78871581473 0979170 False 0.0 True
299188 102836 0979170020 2024-03-17 2024-03-17 10900.0 T CI C UNU,T,162583398982,ALIYU ABDULRAHMANLaNG 0979170 False 0.0 True
InĀ [24]:
interval_data.shape
Out[24]:
(25928, 13)
InĀ [25]:
interval_data['accountid'].nunique()
Out[25]:
78
InĀ [26]:
interval_data['customer_id'].nunique()
Out[26]:
78
InĀ [27]:
plot_trx_type_and_subtype(interval_data)
No description has been provided for this image

Hypothesis 3: Consistent Transaction Amounts¶

Goal: Identify potential salary earners by checking if they receive transactions with consistent amounts, similar to how salaries are typically paid.

Rationale: Salaries are often fixed or have minor fluctuations (taxes, deductions, etc.). We can use the coefficient of variation (CV) to measure the variability of transaction amounts for each account. A low CV suggests a stable income source, like a salary.

Implementation:

  • Calculate CV: For each account, we calculate the coefficient of variation (CV) of their credit ('C') transactions. The CV is the ratio of the standard deviation to the mean of the amounts.
  • Flag Consistency: If an account's CV is below a certain threshold (0%), we flag it as having consistent transaction amounts.
  • Identify Salary Earners: Accounts flagged as having consistent amounts are considered potential salary earners.
InĀ [28]:
def calculate_coefficient_of_variation(group):
    """
    Calculates the coefficient of variation (CV) for a group of transactions
    where 'initiated_by' is 'C'.

    Args:
        group (pd.DataFrame): Transactions for a single account (grouped by 'accountid').

    Returns:
        float: Coefficient of variation (std / mean). Returns NaN if mean is zero.
    """
    amounts = group[group['initiated_by'] == 'C']['amount']
    mean = amounts.mean()
    std = amounts.std(ddof=0)

    if mean == 0:
        return float('nan')
    return std / mean

def flag_consistent_amounts(group, cv_threshold=0.10):
    """
    Flags accounts with low variance in transaction amounts where 'initiated_by' is 'C'.

    Args:
        group (pd.DataFrame): Transactions for a single account.
        cv_threshold (float): Maximum allowed CV to flag as consistent (default: 0.10).

    Returns:
        pd.Series: Boolean series indicating if the transaction belongs to a consistent account.
    """
    # Filter for transactions initiated by 'C' before calculating CV
    filtered_group = group[group['initiated_by'] == 'C']
    cv = calculate_coefficient_of_variation(filtered_group)  # Pass filtered group
    is_consistent = cv <= cv_threshold if not pd.isna(cv) else False

    return pd.Series(
        [is_consistent] * len(group),
        index=group.index,
        name='is_consistent_amount'
    )

def identify_consistent_amount_accounts(df, cv_threshold=0.10):
    """
    Identifies accounts with consistent transaction amounts where 'initiated_by' is 'C'.

    Args:
        df (pd.DataFrame): Transaction data with columns ['accountid', 'amount', 'initiated_by'].
        cv_threshold (float): Maximum allowed CV (default: 0.10).

    Returns:
        pd.DataFrame: Input DataFrame with an added 'is_consistent_amount' column.
    """
    df = df.groupby('accountid').apply(
        lambda group: flag_consistent_amounts(group, cv_threshold)
    ).reset_index(level=0, drop=True)

    return df
InĀ [29]:
const_df = identify_consistent_amount_accounts(df, cv_threshold=0.10)
const_df = df.merge(const_df, left_index=True, right_index=True)
df['is_consistent_amount'] = const_df['is_consistent_amount']
const_data = const_df[
    (const_df['is_consistent_amount']) &
    (const_df['initiated_by']=='C')
]

display(const_data.head())
id accountid trx_start_date trx_end_date amount trx_type trx_subtype initiated_by description customer_id is_salary_related is_consistent_amount
2 513 1659404015 2024-03-18 2024-03-18 210.0 T CI C ATM PUR RR @ FCMB NEW MKT RD, GOMBE II ... 1659404 False True
8 519 1659404015 2024-03-21 2024-03-21 216.0 T CI C ATM PUR RR @ FCMB NEW MKT RD, GOMBE II ... 1659404 False True
9 520 2014836018 2024-03-22 2024-03-22 10200.0 T CI C NIP FRM ESIEN ESESIEN EDET-NIP FROM ESIEN ES 2014836 False True
11 522 2014836018 2024-03-17 2024-03-17 13200.0 T CI C NIP FRM ROSEMARY OKECHUKWU-FT Queen ESIEN Q 2014836 False True
54 565 1174712011 2024-03-21 2024-03-21 15450.0 T CI C NIP FRM NDALIMAN AISHATU SHABA-FBNMOBILE SAEED 1174712 False True
InĀ [30]:
const_data.shape
Out[30]:
(329979, 12)
InĀ [31]:
const_data['accountid'].nunique()
Out[31]:
21888
InĀ [32]:
const_data['customer_id'].nunique()
Out[32]:
21879
InĀ [33]:
plot_trx_type_and_subtype(const_data)
No description has been provided for this image

Hypothesis 4: Transaction Type/Subtype/Initiator Combination¶

This hypothesis identifies salary earners based on specific combinations of transaction attributes (trx_type, trx_subtype, initiated_by).

Criteria:¶

  1. Transaction Type (trx_type):

    • Includes T (Transfer) or C (Cash), capturing both electronic transfers and cash-based income.
  2. Transaction Subtype (trx_subtype):

    • Includes BI (Bank Initiated), I (Inbound), or BS (Basic Salary):
      • BI: Bank-initiated transactions, often used for automated credits like interest or fees but also applicable to structured payments.
      • I: Inbound payments, representing incoming funds from external sources (e.g., employer transfers).
      • BS: Explicitly labeled as "Basic Salary," directly indicating wage or salary payments.
      • CI: "Cash In", salaries pain in cash to a bank.
  3. Initiated By (initiated_by):

    • Restricted to C (Creditor/Bank/System), ensuring the transaction is initiated externally (e.g., by an employer or banking system) rather than by the account holder.
  4. Amount:

    • Requires a positive value (amount > 0), reflecting a credit to the account.

Rationale:¶

  • Salaries are typically credited into accounts via structured mechanisms such as bank transfers or direct deposits.
InĀ [34]:
def flag_salary_type_transactions(df):
    """
    Flags transactions that match the salary criteria based on type, subtype, and initiator.

    Args:
        df (pd.DataFrame): Transaction data with columns ['trx_type', 'trx_subtype', 'initiated_by', 'amount'].

    Returns:
        pd.DataFrame: Input DataFrame with an added 'is_salary_type' column.
    """
    df['is_salary_type'] = (
        ((df['trx_type'] == 'T') | (df['trx_type'] == 'C')) &
        ((df['trx_subtype'] == 'BI') |  (df['trx_subtype'] == 'I') |  (df['trx_subtype'] == 'BS') |  (df['trx_subtype'] == 'CI')) &
        (df['initiated_by'] == 'C') &
        (df['amount'] > 0)
    )
    return df


def is_salary_earner_by_type(group, min_transactions=3, threshold=0.7):
    """
    Determines if an account likely belongs to a salary earner based on transaction type criteria.

    Args:
        group (pd.DataFrame): Transactions for a single account.
        min_transactions (int): Minimum transactions required to qualify (default: 3).
        threshold (float): Minimum proportion of salary-type transactions (default: 0.7).

    Returns:
        bool: True if the account meets the criteria, False otherwise.
    """
    if len(group) < min_transactions:
        return False
    valid_ratio = group['is_salary_type'].mean()
    return valid_ratio >= threshold
InĀ [35]:
trx_df = flag_salary_type_transactions(df)
trx_data = trx_df[trx_df['is_salary_type']]
trx_data.head()
Out[35]:
id accountid trx_start_date trx_end_date amount trx_type trx_subtype initiated_by description customer_id is_salary_related is_consistent_amount is_salary_type
2 513 1659404015 2024-03-18 2024-03-18 210.0 T CI C ATM PUR RR @ FCMB NEW MKT RD, GOMBE II ... 1659404 False True True
8 519 1659404015 2024-03-21 2024-03-21 216.0 T CI C ATM PUR RR @ FCMB NEW MKT RD, GOMBE II ... 1659404 False True True
9 520 2014836018 2024-03-22 2024-03-22 10200.0 T CI C NIP FRM ESIEN ESESIEN EDET-NIP FROM ESIEN ES 2014836 False True True
11 522 2014836018 2024-03-17 2024-03-17 13200.0 T CI C NIP FRM ROSEMARY OKECHUKWU-FT Queen ESIEN Q 2014836 False True True
17 528 2089119018 2024-03-22 2024-03-22 35000.0 T CI C web:ann 2089119 False False True
InĀ [36]:
trx_data.shape
Out[36]:
(991655, 13)
InĀ [37]:
trx_data['accountid'].nunique()
Out[37]:
35075
InĀ [38]:
trx_data['customer_id'].nunique()
Out[38]:
35051
InĀ [39]:
plot_trx_type_and_subtype(trx_data)
No description has been provided for this image

Investigating all Hypothesis¶

InĀ [137]:
def plot_hypothesis_overlap(hypothesis1_df, hypothesis2_df,
                            hypothesis3_df, hypothesis4_df,
                            account_col='accountid'):
    """
    Plots a Venn diagram showing overlap between the hypotheses.

    Args:
        hypothesis1_df (pd.DataFrame): DataFrame with Hypothesis 1 results
        hypothesis2_df (pd.DataFrame): DataFrame with Hypothesis 2 results
        hypothesis3_df (pd.DataFrame): DataFrame with Hypothesis 3 results
        hypothesis4_df (pd.DataFrame): DataFrame with Hypothesis 4 results
        account_col (str): Account identifier column.
    """
    set1 = set(hypothesis2_df[account_col][hypothesis2_df['is_regular_interval']])
    set2 = set(hypothesis3_df[account_col][hypothesis3_df['is_consistent_amount']])
    set3 = set(hypothesis1_df[account_col][hypothesis1_df['is_salary_related']])
    set4 = set(hypothesis4_df[account_col][hypothesis4_df['is_salary_type']])


    plt.figure(figsize=(10, 10))
    venn3([set2, set3, set4], set_labels=('Consistent Amount',
                                                'Salary Description', 'Transaction Type'))
    plt.title('Overlap Between Hypotheses')
    plt.show()
InĀ [138]:
plot_hypothesis_overlap(desc_data, interval_data, const_data, trx_data)
No description has been provided for this image
InĀ [139]:
def filter_venn_section(df, **kwargs):
    """
    Filters accounts based on specified combinations of hypothesis flags.

    Args:
        df (pd.DataFrame): DataFrame with columns ['is_salary_related', 'is_consistent_amount', 'is_salary_type'].
        **kwargs: Key-value pairs specifying the desired state of each hypothesis flag.
                  For example: {'is_salary_related': True, 'is_consistent_amount': False}.

    Returns:
        pd.DataFrame: Filtered accounts matching the specified Venn section.
    """
    valid_columns = {'is_salary_related', 'is_consistent_amount', 'is_salary_type'}
    df1 = df[df['initiated_by']=='C']
    invalid_keys = set(kwargs.keys()) - valid_columns
    if invalid_keys:
        raise ValueError(f"Invalid keys: {invalid_keys}. Valid keys are {valid_columns}.")

    condition = pd.Series([True] * len(df1), index=df1.index)
    for key, value in kwargs.items():
        condition &= (df1[key] == value)

    return df1[condition]
InĀ [140]:
green_section = filter_venn_section(
    df,
    is_salary_related=True,
    is_consistent_amount=False,
    is_salary_type=True
)

display(green_section.head(10))
plot_trx_type_and_subtype(green_section)
id accountid trx_start_date trx_end_date amount trx_type trx_subtype initiated_by description customer_id is_salary_related is_consistent_amount is_salary_type
21 532 2089119018 2024-03-25 2024-03-25 62208.00 T BI C March 2018 Allowance 2089119 True False True
63 574 2099839010 2024-03-24 2024-03-24 45396.00 T BI C March 2018 Allowance 2099839 True False True
1208 1460 2274196017 2024-03-21 2024-03-21 24401.52 T CI C wages/ARIBIFO BABATUNDE 2274196 True False True
1214 1466 2175103017 2024-03-23 2024-03-23 34300.00 T BI C March 2018 Allowance 2175103 True False True
1652 1903 2037208018 2024-03-17 2024-03-17 49830.00 T CI C Online : March 2018 half salary 2037208 True False True
1813 2066 2071298017 2024-03-21 2024-03-21 61056.00 T BI C March 2018 Allowance 2071298 True False True
5184 5423 1741580016 2024-03-20 2024-03-20 30282.00 T BI C March 2018 Allowance 1741580 True False True
5349 5587 1606101019 2024-03-22 2024-03-22 33000.00 T CI C web:January and February salary Abosi 1606101 True False True
6161 6396 1986459014 2024-03-19 2024-03-19 70200.00 T BI C SALARY PAYMENTS 1986459 True False True
6863 7097 1881876011 2024-03-22 2024-03-22 63360.00 T BI C March 2018 Allowance 1881876 True False True
No description has been provided for this image

Majority of these are Allowwances and Wages.

They have Salary descriptions but are not consistent

InĀ [141]:
yellow_section = filter_venn_section(
    df,
    is_salary_related=False,
    is_consistent_amount=True,
    is_salary_type=True
)

display(yellow_section.head(10))
plot_trx_type_and_subtype(yellow_section)
id accountid trx_start_date trx_end_date amount trx_type trx_subtype initiated_by description customer_id is_salary_related is_consistent_amount is_salary_type
2 513 1659404015 2024-03-18 2024-03-18 210.0000 T CI C ATM PUR RR @ FCMB NEW MKT RD, GOMBE II ... 1659404 False True True
8 519 1659404015 2024-03-21 2024-03-21 216.0000 T CI C ATM PUR RR @ FCMB NEW MKT RD, GOMBE II ... 1659404 False True True
9 520 2014836018 2024-03-22 2024-03-22 10200.0000 T CI C NIP FRM ESIEN ESESIEN EDET-NIP FROM ESIEN ES 2014836 False True True
11 522 2014836018 2024-03-17 2024-03-17 13200.0000 T CI C NIP FRM ROSEMARY OKECHUKWU-FT Queen ESIEN Q 2014836 False True True
54 565 1174712011 2024-03-21 2024-03-21 15450.0000 T CI C NIP FRM NDALIMAN AISHATU SHABA-FBNMOBILE SAEED 1174712 False True True
60 571 1999974016 2024-03-20 2024-03-20 180.7032 T BI C REVERSAL ON ISSUER FEES CHARGE 1999974 False True True
70 581 1353446016 2024-03-22 2024-03-22 20330.0000 T CI C Online : mezie 1353446 False True True
121 632 1726415010 2024-03-16 2024-03-16 10833.2000 T CI C CSH DEPOSIT BY : OJO RONKE 1726415 False True True
122 633 1510438010 2024-03-15 2024-03-15 84240.0000 T CI C CSH DEPOSIT BY : NWAIWU HAPPINESS 1510438 False True True
123 634 1732503011 2024-03-23 2024-03-23 21200.0000 T CI C NIP FRM VINCENT CHUKWUDI MACFRIDO-Inflow from 1732503 False True True
No description has been provided for this image

These are potential Salary/ Monthly income earners.

Majority of which are cash deposit, indicating cash handling

InĀ [142]:
# Get accounts flagged by all three hypotheses
all_three_hypotheses = filter_venn_section(
    df,
    is_salary_related=True,
    is_consistent_amount=True,
    is_salary_type=True
)

display(all_three_hypotheses.head(10))
plot_trx_type_and_subtype(all_three_hypotheses)
id accountid trx_start_date trx_end_date amount trx_type trx_subtype initiated_by description customer_id is_salary_related is_consistent_amount is_salary_type
779 1034 1776036014 2024-03-25 2024-03-25 366.8219 T BS C SALARY FOR MAR 2018 1776036 True True True
2050 2301 2182865010 2024-03-18 2024-03-18 39375.0000 T BI C GenieNG STAFF DEC BONUS 2182865 True True True
5324 5562 0853687015 2024-03-22 2024-03-22 16451.8120 T CI C N-10022744651/Sigma _Monthly pension from UPCL 0853687 True True True
6126 6361 1533377013 2024-03-23 2024-03-23 22470.0000 T CI C MPCS DIVIDEND 1533377 True True True
7080 7313 1757605017 2024-03-18 2024-03-18 206034.8288 T CI C COP08DPSS959770;54468994;NSUK FEB 2018 SALARY-NSU 1757605 True True True
7865 8095 1893248017 2024-03-21 2024-03-21 7518.1284 T I C MARCH 18 PENSION PAYMENT=FPCNL 1893248 True True True
8242 8471 1298420010 2024-03-21 2024-03-21 186329.6344 T CI C 137/MAR2018 Salary/3152861/ LANG 1298420 True True True
9362 9586 1071766010 2024-03-18 2024-03-18 11698.6400 T CI C N-10022744037/Sigma _Monthly pension from UPCL 1071766 True True True
9369 9593 0966021014 2024-03-19 2024-03-19 42230.0000 T CI C MPCS DIVIDEND 0966021 True True True
9469 9693 1144064018 2024-03-18 2024-03-18 35999.0703 T I C FEB 2018 ALLOWANCE=NIGER STATE 1144064 True True True
No description has been provided for this image

Selecting Monthly Income Earners¶

InĀ [143]:
def generate_salary_earners_table(all_three_hypotheses):

    results = []
    for accountid, group in all_three_hypotheses.groupby('accountid'):
        # Calculate required metrics
        num_months = len(group)
        last_6_months = group[group['trx_start_date'] >= (datetime.now() - timedelta(days=180))]
        least_inflow = last_6_months['amount'].min()
        avg_salary = group['amount'].mean()

        # Estimated next salary
        # Calculate days_since_last_trx within the loop
        group['days_since_last_trx'] = group['trx_start_date'].diff().dt.days
        median_interval = group['days_since_last_trx'].median()

        last_date = group['trx_start_date'].max()
        next_date = last_date + timedelta(days=median_interval)
        next_amount = avg_salary

        # Boolean flags
        days_since_last = (datetime.now() - last_date).days
        has_45d = days_since_last <= 45
        has_2m = len(group[group['trx_start_date'] >= (datetime.now() - timedelta(days=60))]) >= 2

        results.append({
            'accountid': accountid,
            'num_months': num_months,
            'least_inflow_6m': least_inflow,
            'avg_monthly_salary': avg_salary,
            'estimated_next_amount': next_amount,
            'estimated_next_date': next_date,
            '45daysalary': has_45d,
            '2monthssalary': has_2m
        })

    final_df = pd.DataFrame(results)
    final_df = final_df.dropna()
    return final_df
InĀ [144]:
final_table = generate_salary_earners_table(all_three_hypotheses)

# Display results
print(f"Found {final_table['accountid'].nunique()} verified salary earners")
display(final_table.head(20))
Found 383 verified salary earners
accountid num_months least_inflow_6m avg_monthly_salary estimated_next_amount estimated_next_date 45daysalary 2monthssalary
0 0171115023 14 4.251146e+03 4.402972e+03 4.402972e+03 2025-04-12 True False
1 0194052022 14 2.658945e+04 2.738234e+04 2.738234e+04 2025-04-14 True False
2 0532452020 14 8.000000e+04 8.371429e+04 8.371429e+04 2025-04-15 True False
3 0623122023 14 4.100378e+03 4.297567e+03 4.297567e+03 2025-04-16 True False
4 0676572026 14 4.232361e+04 4.459095e+04 4.459095e+04 2025-04-15 True False
5 0761118025 14 1.020951e+04 1.072728e+04 1.072728e+04 2025-04-09 True False
6 0820472015 14 2.023256e+05 2.105433e+05 2.105433e+05 2025-04-18 True True
7 0835181025 14 4.288626e+04 4.376991e+04 4.376991e+04 2025-04-10 True False
8 0835200018 14 4.085767e+04 4.211659e+04 4.211659e+04 2025-04-15 True False
9 0846827013 14 3.794400e+02 3.882086e+02 3.882086e+02 2025-04-11 True False
10 0852762012 14 2.064837e+04 2.138581e+04 2.138581e+04 2025-04-12 True False
11 0853195013 14 9.297402e+03 9.476199e+03 9.476199e+03 2025-04-16 True False
12 0853687015 14 1.597724e+04 1.646311e+04 1.646311e+04 2025-04-17 True False
13 0856304027 14 7.754812e+04 8.092700e+04 8.092700e+04 2025-04-16 True False
14 0875566019 14 5.871242e+03 6.048038e+03 6.048038e+03 2025-04-15 True False
15 0889240019 14 1.150220e+06 1.189266e+06 1.189266e+06 2025-04-22 True False
16 0890507013 14 2.233007e+04 2.354227e+04 2.354227e+04 2025-04-14 True False
17 0897678011 14 5.514127e+04 5.663260e+04 5.663260e+04 2025-04-19 True False
18 0903403013 14 6.701709e+04 6.924467e+04 6.924467e+04 2025-04-13 True False
19 0903407011 14 4.643482e+04 4.958575e+04 4.958575e+04 2025-04-13 True False
InĀ [145]:
likely_salary_earner = pd.concat([yellow_section, green_section])
likely_salary_earner = likely_salary_earner.drop_duplicates(subset=['id'])
likely_salary_earner = generate_salary_earners_table(likely_salary_earner)

# Display results
print(f"Found {likely_salary_earner['accountid'].nunique()} likely salary earners")
display(likely_salary_earner.head(20))
Found 20797 likely salary earners
accountid num_months least_inflow_6m avg_monthly_salary estimated_next_amount estimated_next_date 45daysalary 2monthssalary
0 0022324028 14 40800.00 4.162857e+04 4.162857e+04 2025-04-17 True False
1 0022848049 14 1395360.00 1.450080e+06 1.450080e+06 2025-04-14 True False
2 0025339021 14 553480.00 5.789229e+05 5.789229e+05 2025-04-18 True False
3 0027643012 14 106000.00 1.051429e+05 1.051429e+05 2025-04-19 True False
4 0050282020 14 15000.00 1.571786e+04 1.571786e+04 2025-04-19 True False
5 0051390023 14 153000.00 1.588929e+05 1.588929e+05 2025-04-15 True False
6 0060968040 14 202.00 2.081429e+02 2.081429e+02 2025-04-18 True False
7 0103261028 14 92000.00 9.614000e+04 9.614000e+04 2025-04-16 True False
8 0104926023 14 40000.00 4.222857e+04 4.222857e+04 2025-04-18 True False
9 0108574013 14 1010.00 1.055000e+03 1.055000e+03 2025-04-19 True False
10 0109992036 14 4040.00 4.177143e+03 4.177143e+03 2025-04-17 True False
11 0110206027 28 10100.00 1.049286e+04 1.049286e+04 2025-03-27 True True
12 0112600038 14 5100.00 5.275000e+03 5.275000e+03 2025-04-09 True False
13 0160518039 14 40000.00 4.191429e+04 4.191429e+04 2025-04-18 True True
14 0162780032 14 5000.00 5.175000e+03 5.175000e+03 2025-04-21 True False
15 0163308020 14 2525.00 2.625000e+03 2.625000e+03 2025-04-11 True False
16 0163704024 14 323200.00 3.344000e+05 3.344000e+05 2025-04-19 True False
17 0164479033 14 10200.00 1.051429e+04 1.051429e+04 2025-04-16 True False
18 0170195039 14 3923.16 4.127725e+03 4.127725e+03 2025-04-14 True False
19 0171813037 14 10000.00 1.040000e+04 1.040000e+04 2025-04-10 True False
InĀ [146]:
def analyze_salary_earners(final_df):
    """
    Analyzes salary earners and identifies high earners (>=10k predicted salary).

    Args:
        final_df (pd.DataFrame): DataFrame containing salary earner information.

    Returns:
        pd.DataFrame: DataFrame with high earner statistics, including count and minimum inflows.
    """
    high_earners = final_df[final_df['estimated_next_amount'] >= 10000]
    high_earners['least_inflow_6m'] = high_earners['least_inflow_6m']
    count_high = len(high_earners)

    high_earner_details = high_earners[['accountid', 'least_inflow_6m']].reset_index(drop=True)

    return high_earner_details, count_high
InĀ [147]:
high_earner_details_df, total_high_earners = analyze_salary_earners(final_table)

print(f"\nTotal High Earners: {total_high_earners}")
display(high_earner_details_df)
Total High Earners: 307
accountid least_inflow_6m
0 0194052022 26589.4500
1 0532452020 80000.0000
2 0676572026 42323.6100
3 0761118025 10209.5100
4 0820472015 202325.6088
... ... ...
302 2253348011 25721.5286
303 2253502017 26780.0000
304 2253954018 25391.8343
305 2267310019 21000.0000
306 2271021019 52800.0000

307 rows Ɨ 2 columns

InĀ [148]:
high_earner_details_df.to_csv('high_earner_details.csv', index=False)
likely_salary_earner.to_csv('likely_salary_earner.csv', index=False)
final_table.to_csv('final_table.csv', index=False)

ML Prediction¶

InĀ [164]:
def add_feature_engineering(df):
    """
    Engineers new features to the input DataFrame for salary prediction.

    Adds features like month, month sequence, one-hot encoded transaction type,
    3-month rolling sum, and 3-month rolling average of transaction amounts.

    Args:
        df (pd.DataFrame): The input DataFrame containing transaction data.

    Returns:
        pd.DataFrame: The DataFrame with engineered features added.
    """

    df['month'] = df['trx_start_date'].dt.month
    df['month_seq'] = df.groupby(['accountid', 'month']).ngroup() + 1

    # Categorical encoding: one-hot encode trx_type
    encoder = OneHotEncoder(sparse_output=False)
    encoded_trx_type = encoder.fit_transform(df[['trx_type']])
    encoded_df = pd.DataFrame(encoded_trx_type, columns=encoder.get_feature_names_out(['trx_type']))
    df = pd.concat([df, encoded_df], axis=1)

    # Rolling statistics: sort by account and date
    df = df.sort_values(['accountid', 'trx_start_date'])
    df['rolling_sum_3m'] = df.groupby('accountid')['amount'].rolling(window=3,
                                                                     min_periods=1).sum().reset_index(0, drop=True)
    df['rolling_avg_3m'] = df.groupby('accountid')['amount'].rolling(window=3,
                                                                     min_periods=1).mean().reset_index(0, drop=True)

    return df

def prepare_data(df_transactions, accounts):
    """
    Prepares transaction data for training and testing a salary prediction model.

    Filters transactions for specified accounts, performs feature engineering,
    aggregates data monthly, filters for accounts with sufficient data, and
    creates training and testing sets using a sliding window approach.

    Args:
        df_transactions (pd.DataFrame): The input DataFrame containing all transaction data.
        accounts (list): A list of account IDs to include in the data preparation.

    Returns:
        tuple: A tuple containing the training and testing data as NumPy arrays:
               (X_train, y_train, X_test, y_test).
    """

    df_filtered = df_transactions[df_transactions['accountid'].isin(accounts)].copy()
    print(f"Filtered data for {len(accounts)} accounts.")
    print(f"Total transactions: {len(df_filtered)}")

    # Drop unnecessary columns
    df_filtered = df_filtered.drop(['description', 'id', 'customer_id',
                                    'trx_end_date', 'is_salary_related',
                                    'is_consistent_amount', 'is_salary_type'], axis=1)

    # Add feature engineering
    df_filtered = add_feature_engineering(df_filtered)

    # Aggregate monthly data with new features
    agg_funcs = {
        'amount': 'mean',
        'rolling_sum_3m': 'last',
        'rolling_avg_3m': 'last',
        'month': 'first'
    }
    encoded_cols = [col for col in df_filtered.columns if col.startswith('trx_type_')]
    for col in encoded_cols:
        agg_funcs[col] = 'sum'

    monthly_data = df_filtered.groupby(['accountid', 'month_seq']).agg(agg_funcs).reset_index()

    # Filter accounts with at least 12 months
    account_month_counts = monthly_data.groupby('accountid')['month_seq'].max()
    valid_accounts = account_month_counts[account_month_counts >= 12].index
    monthly_data = monthly_data[monthly_data['accountid'].isin(valid_accounts)]

    # Create training and testing sequences
    X_train, y_train, X_test, y_test = [], [], [], []
    feature_cols = ['accountid', 'amount', 'rolling_sum_3m', 'rolling_avg_3m',
                    'month'] + encoded_cols

    for account in valid_accounts:
        account_data = monthly_data[monthly_data['accountid'] == account].sort_values('month_seq')

        # Check if account has enough data for training and testing sequences
        if len(account_data) >= 12:  # Ensure at least 12 months of data
            for t in range(5, 8):
                X_train.append(account_data.iloc[t-5:t][feature_cols].values.flatten())
                y_train.append(account_data['amount'].iloc[t])
            for t in range(8, 12):
                X_test.append(account_data.iloc[t-5:t][feature_cols].values.flatten())
                y_test.append(account_data['amount'].iloc[t])
        else:
            print(f"Skipping account {account} due to insufficient data (less than 12 months).")

    return np.array(X_train), np.array(y_train), np.array(X_test), np.array(y_test)
InĀ [Ā ]:
def train_model(X_train, y_train, X_test, y_test):
    """
    Trains and evaluates a Random Forest Regressor for salary prediction.

    Scales the input features using StandardScaler, trains the model,
    predicts on the test set, and calculates evaluation metrics
    (MAE, RMSE, R-squared).

    Args:
        X_train (np.ndarray): Training data features.
        y_train (np.ndarray): Training data target (salary).
        X_test (np.ndarray): Testing data features.
        y_test (np.ndarray): Testing data target (salary).

    Returns:
        tuple: A tuple containing the trained model and the scaler object:
               (model, scaler).
    """

    # Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Train Random Forest model
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train_scaled, y_train)

    # Evaluate model
    y_pred = model.predict(X_test_scaled)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, R-squared: {r2:.2f}")
    return model, scaler

Consistent salary earners¶

InĀ [168]:
consistent_accounts = final_table['accountid'].unique()
X_train_cons, y_train_cons, X_test_cons, y_test_cons = prepare_data(df, consistent_accounts)
if len(X_train_cons) > 0:
    model_cons, scaler_cons = train_model(X_train_cons, y_train_cons, X_test_cons, y_test_cons)
    print("Model trained for consistent salary earners.")
else:
    print("No accounts with sufficient data for consistent salary earners.")
Filtered data for 383 accounts.
Total transactions: 22162
Skipping account 0846827013 due to insufficient data (less than 12 months).
Skipping account 1214711015 due to insufficient data (less than 12 months).
Skipping account 1218765016 due to insufficient data (less than 12 months).
Skipping account 1377321016 due to insufficient data (less than 12 months).
Skipping account 1486608013 due to insufficient data (less than 12 months).
Skipping account 1910312013 due to insufficient data (less than 12 months).
Skipping account 1949225012 due to insufficient data (less than 12 months).
MAE: 1909.79, RMSE: 4610.21, R-squared: 0.96
Model trained for consistent salary earners.
InĀ [158]:
X_test_cons_scaled = scaler_cons.transform(X_test_cons)
y_pred = model_cons.predict(X_test_cons_scaled)

plt.figure(figsize=(10, 5))
plt.scatter(y_test_cons, y_pred, alpha=0.5)
plt.xlabel("Actual Salary")
plt.ylabel("Predicted Salary")
plt.title("Actual vs. Predicted Salary")
plt.plot([min(y_test_cons), max(y_test_cons)], [min(y_test_cons), max(y_test_cons)], 'r--')
plt.show()
No description has been provided for this image

Inconsistent salary earners¶

InĀ [170]:
inconsistent_accounts = likely_salary_earner['accountid'].unique()
X_train_incons, y_train_incons, X_test_incons, y_test_incons = prepare_data(df, inconsistent_accounts)
if len(X_train_incons) > 0:
    print("\nTraining model for inconsistent salary earners...")
    model_incons, scaler_incons = train_model(X_train_incons, y_train_incons, X_test_incons, y_test_incons)
else:
    print("No accounts with sufficient data for inconsistent salary earners.")
Filtered data for 20797 accounts.
Total transactions: 1508309
Skipping account 0164479033 due to insufficient data (less than 12 months).
Skipping account 0252542021 due to insufficient data (less than 12 months).
Skipping account 0793130028 due to insufficient data (less than 12 months).
Skipping account 0811893018 due to insufficient data (less than 12 months).
Skipping account 0820374023 due to insufficient data (less than 12 months).
Skipping account 0833877010 due to insufficient data (less than 12 months).
Skipping account 0841163017 due to insufficient data (less than 12 months).
Skipping account 0844752012 due to insufficient data (less than 12 months).
Skipping account 0869552015 due to insufficient data (less than 12 months).
Skipping account 0881519038 due to insufficient data (less than 12 months).
Skipping account 0889291028 due to insufficient data (less than 12 months).
Skipping account 0900942018 due to insufficient data (less than 12 months).
Skipping account 0916761018 due to insufficient data (less than 12 months).
Skipping account 0969843017 due to insufficient data (less than 12 months).
Skipping account 0999364014 due to insufficient data (less than 12 months).
Skipping account 1084818010 due to insufficient data (less than 12 months).
Skipping account 1096189016 due to insufficient data (less than 12 months).
Skipping account 1126705014 due to insufficient data (less than 12 months).
Skipping account 1128137015 due to insufficient data (less than 12 months).
Skipping account 1143631017 due to insufficient data (less than 12 months).
Skipping account 1143803016 due to insufficient data (less than 12 months).
Skipping account 1164852015 due to insufficient data (less than 12 months).
Skipping account 1167885018 due to insufficient data (less than 12 months).
Skipping account 1170449014 due to insufficient data (less than 12 months).
Skipping account 1170914013 due to insufficient data (less than 12 months).
Skipping account 1183525013 due to insufficient data (less than 12 months).
Skipping account 1202640039 due to insufficient data (less than 12 months).
Skipping account 1204052012 due to insufficient data (less than 12 months).
Skipping account 1216036017 due to insufficient data (less than 12 months).
Skipping account 1219147015 due to insufficient data (less than 12 months).
Skipping account 1219701013 due to insufficient data (less than 12 months).
Skipping account 1222737014 due to insufficient data (less than 12 months).
Skipping account 1231595016 due to insufficient data (less than 12 months).
Skipping account 1235579018 due to insufficient data (less than 12 months).
Skipping account 1239695015 due to insufficient data (less than 12 months).
Skipping account 1240446015 due to insufficient data (less than 12 months).
Skipping account 1253643018 due to insufficient data (less than 12 months).
Skipping account 1256392018 due to insufficient data (less than 12 months).
Skipping account 1261631010 due to insufficient data (less than 12 months).
Skipping account 1271364010 due to insufficient data (less than 12 months).
Skipping account 1290617012 due to insufficient data (less than 12 months).
Skipping account 1298675023 due to insufficient data (less than 12 months).
Skipping account 1299084015 due to insufficient data (less than 12 months).
Skipping account 1303872016 due to insufficient data (less than 12 months).
Skipping account 1311306013 due to insufficient data (less than 12 months).
Skipping account 1323543017 due to insufficient data (less than 12 months).
Skipping account 1327803016 due to insufficient data (less than 12 months).
Skipping account 1336761011 due to insufficient data (less than 12 months).
Skipping account 1348526013 due to insufficient data (less than 12 months).
Skipping account 1358686015 due to insufficient data (less than 12 months).
Skipping account 1379381012 due to insufficient data (less than 12 months).
Skipping account 1381224019 due to insufficient data (less than 12 months).
Skipping account 1385952015 due to insufficient data (less than 12 months).
Skipping account 1391474017 due to insufficient data (less than 12 months).
Skipping account 1391810019 due to insufficient data (less than 12 months).
Skipping account 1399454019 due to insufficient data (less than 12 months).
Skipping account 1401542015 due to insufficient data (less than 12 months).
Skipping account 1403496019 due to insufficient data (less than 12 months).
Skipping account 1414734016 due to insufficient data (less than 12 months).
Skipping account 1425276024 due to insufficient data (less than 12 months).
Skipping account 1441261011 due to insufficient data (less than 12 months).
Skipping account 1459159010 due to insufficient data (less than 12 months).
Skipping account 1465847019 due to insufficient data (less than 12 months).
Skipping account 1468334017 due to insufficient data (less than 12 months).
Skipping account 1478452017 due to insufficient data (less than 12 months).
Skipping account 1483551015 due to insufficient data (less than 12 months).
Skipping account 1493191018 due to insufficient data (less than 12 months).
Skipping account 1496522019 due to insufficient data (less than 12 months).
Skipping account 1503760018 due to insufficient data (less than 12 months).
Skipping account 1507901017 due to insufficient data (less than 12 months).
Skipping account 1514231013 due to insufficient data (less than 12 months).
Skipping account 1515986011 due to insufficient data (less than 12 months).
Skipping account 1526777019 due to insufficient data (less than 12 months).
Skipping account 1533413016 due to insufficient data (less than 12 months).
Skipping account 1539422018 due to insufficient data (less than 12 months).
Skipping account 1539760011 due to insufficient data (less than 12 months).
Skipping account 1544487017 due to insufficient data (less than 12 months).
Skipping account 1550002013 due to insufficient data (less than 12 months).
Skipping account 1561203010 due to insufficient data (less than 12 months).
Skipping account 1574299013 due to insufficient data (less than 12 months).
Skipping account 1576866013 due to insufficient data (less than 12 months).
Skipping account 1581587017 due to insufficient data (less than 12 months).
Skipping account 1616267013 due to insufficient data (less than 12 months).
Skipping account 1623707012 due to insufficient data (less than 12 months).
Skipping account 1630380013 due to insufficient data (less than 12 months).
Skipping account 1631558011 due to insufficient data (less than 12 months).
Skipping account 1635474012 due to insufficient data (less than 12 months).
Skipping account 1638302017 due to insufficient data (less than 12 months).
Skipping account 1655657019 due to insufficient data (less than 12 months).
Skipping account 1657682019 due to insufficient data (less than 12 months).
Skipping account 1664382016 due to insufficient data (less than 12 months).
Skipping account 1669016011 due to insufficient data (less than 12 months).
Skipping account 1670290015 due to insufficient data (less than 12 months).
Skipping account 1692703016 due to insufficient data (less than 12 months).
Skipping account 1703110011 due to insufficient data (less than 12 months).
Skipping account 1713919015 due to insufficient data (less than 12 months).
Skipping account 1717765014 due to insufficient data (less than 12 months).
Skipping account 1720096019 due to insufficient data (less than 12 months).
Skipping account 1727968018 due to insufficient data (less than 12 months).
Skipping account 1728612019 due to insufficient data (less than 12 months).
Skipping account 1728756018 due to insufficient data (less than 12 months).
Skipping account 1738694016 due to insufficient data (less than 12 months).
Skipping account 1741834018 due to insufficient data (less than 12 months).
Skipping account 1746903018 due to insufficient data (less than 12 months).
Skipping account 1753350010 due to insufficient data (less than 12 months).
Skipping account 1762185010 due to insufficient data (less than 12 months).
Skipping account 1762728017 due to insufficient data (less than 12 months).
Skipping account 1770647014 due to insufficient data (less than 12 months).
Skipping account 1775250015 due to insufficient data (less than 12 months).
Skipping account 1777074015 due to insufficient data (less than 12 months).
Skipping account 1778623018 due to insufficient data (less than 12 months).
Skipping account 1793147014 due to insufficient data (less than 12 months).
Skipping account 1803595017 due to insufficient data (less than 12 months).
Skipping account 1806158019 due to insufficient data (less than 12 months).
Skipping account 1815552019 due to insufficient data (less than 12 months).
Skipping account 1818659012 due to insufficient data (less than 12 months).
Skipping account 1819457019 due to insufficient data (less than 12 months).
Skipping account 1831918011 due to insufficient data (less than 12 months).
Skipping account 1832218017 due to insufficient data (less than 12 months).
Skipping account 1833745019 due to insufficient data (less than 12 months).
Skipping account 1836474015 due to insufficient data (less than 12 months).
Skipping account 1840881018 due to insufficient data (less than 12 months).
Skipping account 1842141013 due to insufficient data (less than 12 months).
Skipping account 1845647011 due to insufficient data (less than 12 months).
Skipping account 1846326018 due to insufficient data (less than 12 months).
Skipping account 1847430013 due to insufficient data (less than 12 months).
Skipping account 1851305019 due to insufficient data (less than 12 months).
Skipping account 1860100018 due to insufficient data (less than 12 months).
Skipping account 1861508013 due to insufficient data (less than 12 months).
Skipping account 1865247013 due to insufficient data (less than 12 months).
Skipping account 1876925010 due to insufficient data (less than 12 months).
Skipping account 1883300019 due to insufficient data (less than 12 months).
Skipping account 1898189014 due to insufficient data (less than 12 months).
Skipping account 1898741018 due to insufficient data (less than 12 months).
Skipping account 1913075018 due to insufficient data (less than 12 months).
Skipping account 1915864012 due to insufficient data (less than 12 months).
Skipping account 1928120013 due to insufficient data (less than 12 months).
Skipping account 1928937013 due to insufficient data (less than 12 months).
Skipping account 1937668012 due to insufficient data (less than 12 months).
Skipping account 1938595016 due to insufficient data (less than 12 months).
Skipping account 1940073014 due to insufficient data (less than 12 months).
Skipping account 1943190011 due to insufficient data (less than 12 months).
Skipping account 1947357016 due to insufficient data (less than 12 months).
Skipping account 1954123017 due to insufficient data (less than 12 months).
Skipping account 1966814019 due to insufficient data (less than 12 months).
Skipping account 1971739017 due to insufficient data (less than 12 months).
Skipping account 1976668013 due to insufficient data (less than 12 months).
Skipping account 1985946014 due to insufficient data (less than 12 months).
Skipping account 1997076019 due to insufficient data (less than 12 months).
Skipping account 2000510018 due to insufficient data (less than 12 months).
Skipping account 2001644017 due to insufficient data (less than 12 months).
Skipping account 2005088013 due to insufficient data (less than 12 months).
Skipping account 2012981017 due to insufficient data (less than 12 months).
Skipping account 2013426012 due to insufficient data (less than 12 months).
Skipping account 2014958026 due to insufficient data (less than 12 months).
Skipping account 2015981018 due to insufficient data (less than 12 months).
Skipping account 2021883018 due to insufficient data (less than 12 months).
Skipping account 2023269012 due to insufficient data (less than 12 months).
Skipping account 2023914013 due to insufficient data (less than 12 months).
Skipping account 2024595011 due to insufficient data (less than 12 months).
Skipping account 2036492018 due to insufficient data (less than 12 months).
Skipping account 2042103012 due to insufficient data (less than 12 months).
Skipping account 2047034029 due to insufficient data (less than 12 months).
Skipping account 2051077016 due to insufficient data (less than 12 months).
Skipping account 2056660013 due to insufficient data (less than 12 months).
Skipping account 2058877019 due to insufficient data (less than 12 months).
Skipping account 2065160018 due to insufficient data (less than 12 months).
Skipping account 2065594019 due to insufficient data (less than 12 months).
Skipping account 2067607014 due to insufficient data (less than 12 months).
Skipping account 2069255017 due to insufficient data (less than 12 months).
Skipping account 2071562020 due to insufficient data (less than 12 months).
Skipping account 2071649028 due to insufficient data (less than 12 months).
Skipping account 2071850011 due to insufficient data (less than 12 months).
Skipping account 2074482015 due to insufficient data (less than 12 months).
Skipping account 2076951016 due to insufficient data (less than 12 months).
Skipping account 2078709015 due to insufficient data (less than 12 months).
Skipping account 2092373012 due to insufficient data (less than 12 months).
Skipping account 2093437012 due to insufficient data (less than 12 months).
Skipping account 2094447016 due to insufficient data (less than 12 months).
Skipping account 2097386015 due to insufficient data (less than 12 months).
Skipping account 2102831015 due to insufficient data (less than 12 months).
Skipping account 2109718016 due to insufficient data (less than 12 months).
Skipping account 2112284010 due to insufficient data (less than 12 months).
Skipping account 2113979023 due to insufficient data (less than 12 months).
Skipping account 2114096019 due to insufficient data (less than 12 months).
Skipping account 2114660014 due to insufficient data (less than 12 months).
Skipping account 2114679014 due to insufficient data (less than 12 months).
Skipping account 2118487019 due to insufficient data (less than 12 months).
Skipping account 2125105016 due to insufficient data (less than 12 months).
Skipping account 2132731013 due to insufficient data (less than 12 months).
Skipping account 2136439018 due to insufficient data (less than 12 months).
Skipping account 2138249013 due to insufficient data (less than 12 months).
Skipping account 2144192015 due to insufficient data (less than 12 months).
Skipping account 2151240015 due to insufficient data (less than 12 months).
Skipping account 2154072015 due to insufficient data (less than 12 months).
Skipping account 2155857017 due to insufficient data (less than 12 months).
Skipping account 2161540011 due to insufficient data (less than 12 months).
Skipping account 2199994013 due to insufficient data (less than 12 months).
Skipping account 2206614017 due to insufficient data (less than 12 months).
Skipping account 2213633010 due to insufficient data (less than 12 months).
Skipping account 2225690012 due to insufficient data (less than 12 months).
Skipping account 2234196019 due to insufficient data (less than 12 months).
Skipping account 2251039018 due to insufficient data (less than 12 months).
Skipping account 2267300012 due to insufficient data (less than 12 months).

Training model for inconsistent salary earners...
MAE: 3209.99, RMSE: 51406.22, R-squared: 0.97
InĀ [171]:
X_test_incons_scaled = scaler_incons.transform(X_test_incons)
y_pred = model_incons.predict(X_test_incons_scaled)

plt.figure(figsize=(10, 5))
plt.scatter(y_test_incons, y_pred, alpha=0.5)
plt.xlabel("Actual Salary")
plt.ylabel("Predicted Salary")
plt.title("Actual vs. Predicted Salary")
plt.plot([min(y_test_incons), max(y_test_incons)], [min(y_test_incons), max(y_test_incons)], 'r--')
plt.show()
No description has been provided for this image