Salary Prediction from Transaction Data¶
1. Introduction¶
This report details the process of identifying salary earners and predicting their monthly salary using transaction data from a database. The approach combines rule-based heuristics (based on financial domain knowledge) and machine learning to achieve this goal.
2. Data Source¶
The analysis uses transaction data from the customer_account_transaction_hx table in a PostgreSQL database. The table contains information about customer transactions, including transaction type, subtype, amount, description, and timestamps.
3. Methodology¶
3.1. Identifying Salary Earners¶
Four hypotheses are used to identify potential salary earners:
Keyword-Based: Transactions with descriptions containing salary-related keywords (e.g., "salary," "payroll") and initiated by 'C' (credit) are flagged as potential salary transactions.
Regular Intervals: Transactions occurring at regular monthly intervals (with some tolerance) and initiated by 'C' (credit) are considered indicative of salary payments.
Consistent Amounts: Accounts with low variance in transaction amounts for credit transactions are flagged as potentially receiving consistent salary payments.
Transaction Type/Subtype Combination: Transactions matching specific combinations of transaction type, subtype (e.g., 'T' for Transfer, 'BI' for Bank Initiated), and initiated by 'C' (credit) are considered potential salary transactions.
These hypotheses are evaluated individually and in combination using Venn diagrams to identify accounts with the strongest evidence of being salary earners. Accounts flagged by all three hypotheses are considered "verified salary earners," while those flagged by a subset of hypotheses are considered "likely salary earners."
3.2. Salary Prediction¶
For verified and likely salary earners, a machine learning model is trained to predict their monthly salary. This involves:
Feature Engineering:
- Extracting the month of the transaction.
- Creating a month sequence for each account.
- One-hot encoding the transaction type (
trx_type). - Calculating rolling sum and average of transaction amounts over a 3-month window.
Data Preparation:
- Filtering transactions for the selected accounts.
- Dropping unnecessary columns.
- Aggregating data monthly using the engineered features.
- Filtering accounts with at least 12 months of data.
- Creating training and testing sets using a sliding window approach.
Model Training and Evaluation:
- A Random Forest Regressor is trained on the scaled training data.
- Model performance is evaluated on the testing data using metrics like MAE, RMSE, and R-squared.
Import Libraries¶
!pip install sqlalchemy psycopg2-binary --quiet
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā 3.0/3.0 MB 27.3 MB/s eta 0:00:00
from sqlalchemy import create_engine, text
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
import calendar
from datetime import datetime
import re
import seaborn as sns
from matplotlib_venn import venn3, venn2
from wordcloud import WordCloud
from datetime import datetime, timedelta
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
Import Data from Database¶
- Connect to postgres engine
- Load table
customer_account_transaction_hxfrom database
DB_USER = "salaryloan"
DB_PASSWORD = "salaryloan"
DB_NAME = "salaryloan"
DB_PORT = "10532"
DB_HOST = "dev-data.simbrellang.net"
DATABASE_URL = f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}"
engine = create_engine(DATABASE_URL)
try:
with engine.connect() as conn:
result = conn.execute(text("SELECT version();"))
print("Connected successfully!")
except Exception as e:
print("Error connecting to database:", e)
Connected successfully!
table_name = "customer_account_transaction_hx"
df = pd.read_sql(f"SELECT * FROM {table_name}", engine)
df.head(10)
| id | accountid | trx_start_date | trx_end_date | amount | d1 | d2 | d3 | description | d4 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 511 | 1659404015 | 2024-03-24 | 2024-03-24 | 204.0 | T | CI | D | ATM PUR @ FCMBGOMBE NEWMKT ATM3 GOMBE ... | 1659404 |
| 1 | 512 | 1659404015 | 2024-03-22 | 2024-03-22 | 107.0 | T | CI | D | ATM PUR @ FCMBGOMBE NEWMKT ATM3 GOMBE ... | 1659404 |
| 2 | 513 | 1659404015 | 2024-03-18 | 2024-03-18 | 210.0 | T | CI | C | ATM PUR RR @ FCMB NEW MKT RD, GOMBE II ... | 1659404 |
| 3 | 514 | 1659404015 | 2024-03-21 | 2024-03-21 | 324.0 | T | CI | D | ATM PUR @ FCMB NEW MKT RD, GOMBE II ... | 1659404 |
| 4 | 515 | 1659404015 | 2024-03-18 | 2024-03-18 | 212.0 | T | CI | D | ATM PUR @ FCMB NEW MKT RD, GOMBE II ... | 1659404 |
| 5 | 516 | 1659404015 | 2024-03-17 | 2024-03-17 | 206.0 | T | CI | D | ATM PUR @ FCMB NEW MKT RD, GOMBE II ... | 1659404 |
| 6 | 517 | 1659404015 | 2024-03-17 | 2024-03-17 | 202.0 | T | CI | D | ATM PUR @ FCMB NEW MKT RD, GOMBE II ... | 1659404 |
| 7 | 518 | 1659404015 | 2024-03-22 | 2024-03-22 | 315.0 | T | CI | D | ATM PUR @ FCMB NEW MKT RD, GOMBE II ... | 1659404 |
| 8 | 519 | 1659404015 | 2024-03-21 | 2024-03-21 | 216.0 | T | CI | C | ATM PUR RR @ FCMB NEW MKT RD, GOMBE II ... | 1659404 |
| 9 | 520 | 2014836018 | 2024-03-22 | 2024-03-22 | 10200.0 | T | CI | C | NIP FRM ESIEN ESESIEN EDET-NIP FROM ESIEN ES | 2014836 |
Data Overiew and EDA¶
df.shape
(5354307, 10)
df.dtypes
| 0 | |
|---|---|
| id | int64 |
| accountid | object |
| trx_start_date | datetime64[ns] |
| trx_end_date | datetime64[ns] |
| amount | float64 |
| d1 | object |
| d2 | object |
| d3 | object |
| description | object |
| d4 | object |
df.nunique()
| 0 | |
|---|---|
| id | 5354307 |
| accountid | 57496 |
| trx_start_date | 182 |
| trx_end_date | 182 |
| amount | 82941 |
| d1 | 3 |
| d2 | 10 |
| d3 | 2 |
| description | 218834 |
| d4 | 57440 |
There are 57440 unique custormers
for i in ['d1', 'd2', 'd3']:
unique_values = df[i].unique()
print(i, ':', unique_values)
d1 : ['T' 'C' 'L'] d2 : ['CI' 'BI' 'SI' 'NP' 'NR' 'I' 'BS' 'O' 'IP' 'SC'] d3 : ['D' 'C']
T: Transfer. This might indicate a transfer of funds between accounts or within the same account.
C: Credit. This would represent funds being added to an account, like a deposit, salary payment, or refund.
L: Could stand for Loan. This would represent funds being disbursed as part of a loan agreement.
# Check for missing values
df.isna().sum()
| 0 | |
|---|---|
| id | 0 |
| accountid | 0 |
| trx_start_date | 0 |
| trx_end_date | 0 |
| amount | 0 |
| d1 | 0 |
| d2 | 0 |
| d3 | 0 |
| description | 0 |
| d4 | 0 |
# Change to date column to datetime
df['trx_start_date'] = pd.to_datetime(df['trx_start_date'])
df['trx_end_date'] = pd.to_datetime(df['trx_end_date'])
# Check for days difference
inconsistent_dates = df[df['trx_start_date'] != df['trx_end_date']]
print(inconsistent_dates.head(10))
Empty DataFrame Columns: [id, accountid, trx_start_date, trx_end_date, amount, d1, d2, d3, description, d4] Index: []
Apparently all transactions started and ended on the same day
# Rename columns
df = df.rename(columns={'d1': 'trx_type', 'd2': 'trx_subtype',
'd3': 'initiated_by', 'd4': 'customer_id'})
print(df.columns)
Index(['id', 'accountid', 'trx_start_date', 'trx_end_date', 'amount',
'trx_type', 'trx_subtype', 'initiated_by', 'description',
'customer_id'],
dtype='object')
Monthly Salary/Income Detection¶
Hypothesis 1: Keywords in Descriptions for a Credit Transaction¶
Criteria: Search for common salary/income-related keywords in the description column.
- Rationale: Employers often label salary payments with descriptive terms like "salary," "wages," "payroll", "income" or company-specific identifiers.
- Implementation: Use string matching or regular expressions to detect relevant keywords in the
descriptionfield. Plus it must be a credit.
keywords = [
"salary", "payroll", "income", "wage", "wages",
"earnings", "earning", "monthly pay", "net pay", "gross pay", "compensation",
"monthlypay", "netpay", "grosspay",
"remuneration", "stipend", "allowance", "bonus", "commission",
"pension", "retirement", "dividend", "benefits", "reimbursement",
"overtime", "incentive", "paycheck", "paycheque", "salary advance",
"monthly income", "income tax refund", "employer deposit",
"payroll deposit", "salary credit", "income credit", "salary transfer",
"income transfer", "salary received", "income received", "hr deposit",
"company deposit", "employer payment", "employee payment"
]
def identify_salary_transactions(df, keywords):
"""
Identifies potential salary-related transactions based on keywords
and month-year patterns in the 'description' column.
Args:
df (pd.DataFrame): The input DataFrame containing transaction data.
keywords (list): A list of salary/income-related keywords to search for.
Returns:
pd.DataFrame: The input DataFrame with an added 'is_salary_related' column
indicating potential salary transactions.
"""
month_year_patterns = [
r"\b(?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s?\d{2,4}\b",
r"\b(?:JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER)\s?\d{2,4}\b"
]
escaped_keywords = [re.escape(keyword.lower()) for keyword in keywords]
combined_pattern = (
r'\b(?:' + '|'.join(escaped_keywords) + r')\b|' +
'|'.join(month_year_patterns)
)
df['is_salary_related'] = df['description'].str.lower().str.contains(
combined_pattern,
na=False,
regex=True
)
return df
desc_df = identify_salary_transactions(df, keywords)
desc_data = desc_df[(desc_df['is_salary_related'] == True) & (desc_df['initiated_by'] == 'C')]
desc_data.head()
| id | accountid | trx_start_date | trx_end_date | amount | trx_type | trx_subtype | initiated_by | description | customer_id | is_salary_related | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 21 | 532 | 2089119018 | 2024-03-25 | 2024-03-25 | 62208.0000 | T | BI | C | March 2018 Allowance | 2089119 | True |
| 63 | 574 | 2099839010 | 2024-03-24 | 2024-03-24 | 45396.0000 | T | BI | C | March 2018 Allowance | 2099839 | True |
| 779 | 1034 | 1776036014 | 2024-03-25 | 2024-03-25 | 366.8219 | T | BS | C | SALARY FOR MAR 2018 | 1776036 | True |
| 1208 | 1460 | 2274196017 | 2024-03-21 | 2024-03-21 | 24401.5200 | T | CI | C | wages/ARIBIFO BABATUNDE | 2274196 | True |
| 1214 | 1466 | 2175103017 | 2024-03-23 | 2024-03-23 | 34300.0000 | T | BI | C | March 2018 Allowance | 2175103 | True |
desc_data.shape
(10808, 11)
desc_data['accountid'].nunique()
756
desc_data['customer_id'].nunique()
756
flagged_keywords = desc_data[desc_data['is_salary_related']]['description'].str.lower().str.cat(sep=' ')
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(flagged_keywords)
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Keywords in Flagged Transactions')
plt.show()
def plot_trx_type_and_subtype(data, trx_type_col='trx_type', trx_subtype_col='trx_subtype', figsize=(12, 6)):
"""
Creates count plots for `trx_type` and `trx_subtype` in the provided DataFrame.
Args:
data (pd.DataFrame): Input DataFrame containing transaction data.
trx_type_col (str): Column name for transaction type (default: 'trx_type').
trx_subtype_col (str): Column name for transaction subtype (default: 'trx_subtype').
figsize (tuple): Figure size for the plots (default: (12, 6)).
"""
# Validate input columns
if trx_type_col not in data.columns or trx_subtype_col not in data.columns:
raise ValueError(f"Columns '{trx_type_col}' and/or '{trx_subtype_col}' not found in the DataFrame.")
# Create subplots
fig, axes = plt.subplots(1, 2, figsize=figsize)
# Plot trx_type
sns.countplot(
data=data,
x=trx_type_col,
ax=axes[0],
palette='viridis',
order=data[trx_type_col].value_counts().index
)
axes[0].set_title(f'Count of {trx_type_col}')
axes[0].set_xlabel(trx_type_col)
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=45)
# Plot trx_subtype
sns.countplot(
data=data,
x=trx_subtype_col,
ax=axes[1],
palette='coolwarm',
order=data[trx_subtype_col].value_counts().index
)
axes[1].set_title(f'Count of {trx_subtype_col}')
axes[1].set_xlabel(trx_subtype_col)
axes[1].set_ylabel('Count')
axes[1].tick_params(axis='x', rotation=45)
# Adjust layout
plt.tight_layout()
plt.show()
plot_trx_type_and_subtype(desc_data)
Hypothesis 2: Regular Monthly Intervals¶
Goal: To identify potential salary earners by checking if they receive consistent payments at regular intervals, similar to how salaries are typically paid.
Logic:
- Focus on Credits: We only look at transactions where money is being added to an account (credits), as these are more likely to represent salary payments.
- Calculate Time Gaps: We measure the time between these credit transactions for each individual account.
- Check for Consistency: If the time gaps are fairly consistent (around 30 days, with some flexibility for weekends or holidays), it suggests a regular payment like a salary.
- Flag Regular Payments: We mark transactions that fall within this consistent pattern.
- Enough Regular Payments? For an account to be considered a likely salary earner, a significant portion of their credit transactions should be flagged as regular. We have set a minimum number of transactions and a threshold for the percentage of regular payments to be considered.
Example:
If a person consistently receives a payment around the 30th of every month for several months, this hypothesis would identify them as a potential salary earner.
def identify_regular_intervals(df, tolerance=0.15):
"""
Identifies transactions occurring at regular intervals based on 'C' initiated transactions.
Args:
df (pd.DataFrame): Transaction data with columns ['accountid', 'trx_start_date', 'initiated_by'].
tolerance (float): Relative tolerance for interval deviation (default: 15%).
Returns:
pd.DataFrame: Input DataFrame with added 'is_regular_interval' column.
"""
# Sort the DataFrame
df = df.sort_values(by=['accountid', 'trx_start_date'])
# Compute median interval based on 'C' transactions
c_df = df[df['initiated_by'] == 'C'].copy()
c_df['days_since_last_C_trx'] = c_df.groupby('accountid')['trx_start_date'].diff().dt.days
median_intervals = c_df.groupby('accountid')['days_since_last_C_trx'].median().reset_index(name='median_interval')
# Merge median_interval into the original DataFrame
df = df.merge(median_intervals, on='accountid', how='left')
# Track last 'C' transaction date for each account
# Change this line:
df['last_C_date'] = df.groupby('accountid')['trx_start_date'].transform(lambda x: x.where(df.loc[x.index, 'initiated_by'] == 'C').ffill())
# Compute days_since_last_C_trx
df['days_since_last_C_trx'] = (df['trx_start_date'] - df['last_C_date']).dt.days
df['days_since_last_C_trx'] = df['days_since_last_C_trx'].where(df['last_C_date'].notna())
# Determine if the interval is regular
df['is_regular_interval'] = False
valid_mask = df['days_since_last_C_trx'].notna() & df['median_interval'].notna()
df.loc[valid_mask, 'is_regular_interval'] = df[valid_mask].apply(
lambda row: abs(row['days_since_last_C_trx'] - row['median_interval']) <=
(tolerance * row['median_interval']),
axis=1
)
# Clean up temporary columns
df.drop(columns=['last_C_date', 'median_interval'], inplace=True, errors='ignore')
return df
def is_salary_earner(group, min_transactions=3, threshold=0.7):
"""
Determines if an account likely belongs to a salary earner.
Args:
group (pd.DataFrame): Transactions for a single account.
min_transactions (int): Minimum number of transactions required.
threshold (float): Minimum proportion of regular intervals required.
Returns:
bool: True if the account is likely a salary earner, False otherwise.
"""
if len(group) < min_transactions:
return False
valid_ratio = group['is_regular_interval'].mean()
return valid_ratio >= threshold
interval_df = identify_regular_intervals(df)
salary_earners = interval_df.groupby('accountid').apply(is_salary_earner).reset_index(name='is_salary_earner')
interval_data = interval_df[
(interval_df['is_regular_interval'] == True) &
(interval_df['initiated_by'] == 'C')
]
display(interval_data.head())
| id | accountid | trx_start_date | trx_end_date | amount | trx_type | trx_subtype | initiated_by | description | customer_id | is_salary_related | days_since_last_C_trx | is_regular_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 299180 | 102834 | 0979170020 | 2024-03-14 | 2024-03-14 | 29680.0 | T | CI | C | NIP FRM IBRAHIM AUDI-INTBNKRET MASUR | 0979170 | False | 0.0 | True |
| 299181 | 102835 | 0979170020 | 2024-03-15 | 2024-03-15 | 5050.0 | T | CI | C | NIP FRM ZAYYANU MUHAMMED-UnionMobile Trf from | 0979170 | False | 0.0 | True |
| 299182 | 102828 | 0979170020 | 2024-03-16 | 2024-03-16 | 5350.0 | T | CI | C | NIP FRM KABIRU AMINU DANTANI-USSD KABIRU AMINU | 0979170 | False | 0.0 | True |
| 299187 | 102833 | 0979170020 | 2024-03-17 | 2024-03-17 | 16350.0 | T | CI | C | NIP FRM 2348066269231-PWC USSD TF 78871581473 | 0979170 | False | 0.0 | True |
| 299188 | 102836 | 0979170020 | 2024-03-17 | 2024-03-17 | 10900.0 | T | CI | C | UNU,T,162583398982,ALIYU ABDULRAHMANLaNG | 0979170 | False | 0.0 | True |
interval_data.shape
(25928, 13)
interval_data['accountid'].nunique()
78
interval_data['customer_id'].nunique()
78
plot_trx_type_and_subtype(interval_data)
Hypothesis 3: Consistent Transaction Amounts¶
Goal: Identify potential salary earners by checking if they receive transactions with consistent amounts, similar to how salaries are typically paid.
Rationale: Salaries are often fixed or have minor fluctuations (taxes, deductions, etc.). We can use the coefficient of variation (CV) to measure the variability of transaction amounts for each account. A low CV suggests a stable income source, like a salary.
Implementation:
- Calculate CV: For each account, we calculate the coefficient of variation (CV) of their credit ('C') transactions. The CV is the ratio of the standard deviation to the mean of the amounts.
- Flag Consistency: If an account's CV is below a certain threshold (0%), we flag it as having consistent transaction amounts.
- Identify Salary Earners: Accounts flagged as having consistent amounts are considered potential salary earners.
def calculate_coefficient_of_variation(group):
"""
Calculates the coefficient of variation (CV) for a group of transactions
where 'initiated_by' is 'C'.
Args:
group (pd.DataFrame): Transactions for a single account (grouped by 'accountid').
Returns:
float: Coefficient of variation (std / mean). Returns NaN if mean is zero.
"""
amounts = group[group['initiated_by'] == 'C']['amount']
mean = amounts.mean()
std = amounts.std(ddof=0)
if mean == 0:
return float('nan')
return std / mean
def flag_consistent_amounts(group, cv_threshold=0.10):
"""
Flags accounts with low variance in transaction amounts where 'initiated_by' is 'C'.
Args:
group (pd.DataFrame): Transactions for a single account.
cv_threshold (float): Maximum allowed CV to flag as consistent (default: 0.10).
Returns:
pd.Series: Boolean series indicating if the transaction belongs to a consistent account.
"""
# Filter for transactions initiated by 'C' before calculating CV
filtered_group = group[group['initiated_by'] == 'C']
cv = calculate_coefficient_of_variation(filtered_group) # Pass filtered group
is_consistent = cv <= cv_threshold if not pd.isna(cv) else False
return pd.Series(
[is_consistent] * len(group),
index=group.index,
name='is_consistent_amount'
)
def identify_consistent_amount_accounts(df, cv_threshold=0.10):
"""
Identifies accounts with consistent transaction amounts where 'initiated_by' is 'C'.
Args:
df (pd.DataFrame): Transaction data with columns ['accountid', 'amount', 'initiated_by'].
cv_threshold (float): Maximum allowed CV (default: 0.10).
Returns:
pd.DataFrame: Input DataFrame with an added 'is_consistent_amount' column.
"""
df = df.groupby('accountid').apply(
lambda group: flag_consistent_amounts(group, cv_threshold)
).reset_index(level=0, drop=True)
return df
const_df = identify_consistent_amount_accounts(df, cv_threshold=0.10)
const_df = df.merge(const_df, left_index=True, right_index=True)
df['is_consistent_amount'] = const_df['is_consistent_amount']
const_data = const_df[
(const_df['is_consistent_amount']) &
(const_df['initiated_by']=='C')
]
display(const_data.head())
| id | accountid | trx_start_date | trx_end_date | amount | trx_type | trx_subtype | initiated_by | description | customer_id | is_salary_related | is_consistent_amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 513 | 1659404015 | 2024-03-18 | 2024-03-18 | 210.0 | T | CI | C | ATM PUR RR @ FCMB NEW MKT RD, GOMBE II ... | 1659404 | False | True |
| 8 | 519 | 1659404015 | 2024-03-21 | 2024-03-21 | 216.0 | T | CI | C | ATM PUR RR @ FCMB NEW MKT RD, GOMBE II ... | 1659404 | False | True |
| 9 | 520 | 2014836018 | 2024-03-22 | 2024-03-22 | 10200.0 | T | CI | C | NIP FRM ESIEN ESESIEN EDET-NIP FROM ESIEN ES | 2014836 | False | True |
| 11 | 522 | 2014836018 | 2024-03-17 | 2024-03-17 | 13200.0 | T | CI | C | NIP FRM ROSEMARY OKECHUKWU-FT Queen ESIEN Q | 2014836 | False | True |
| 54 | 565 | 1174712011 | 2024-03-21 | 2024-03-21 | 15450.0 | T | CI | C | NIP FRM NDALIMAN AISHATU SHABA-FBNMOBILE SAEED | 1174712 | False | True |
const_data.shape
(329979, 12)
const_data['accountid'].nunique()
21888
const_data['customer_id'].nunique()
21879
plot_trx_type_and_subtype(const_data)
Hypothesis 4: Transaction Type/Subtype/Initiator Combination¶
This hypothesis identifies salary earners based on specific combinations of transaction attributes (trx_type, trx_subtype, initiated_by).
Criteria:¶
Transaction Type (
trx_type):- Includes
T(Transfer) orC(Cash), capturing both electronic transfers and cash-based income.
- Includes
Transaction Subtype (
trx_subtype):- Includes
BI(Bank Initiated),I(Inbound), orBS(Basic Salary):BI: Bank-initiated transactions, often used for automated credits like interest or fees but also applicable to structured payments.I: Inbound payments, representing incoming funds from external sources (e.g., employer transfers).BS: Explicitly labeled as "Basic Salary," directly indicating wage or salary payments.CI: "Cash In", salaries pain in cash to a bank.
- Includes
Initiated By (
initiated_by):- Restricted to
C(Creditor/Bank/System), ensuring the transaction is initiated externally (e.g., by an employer or banking system) rather than by the account holder.
- Restricted to
Amount:
- Requires a positive value (
amount > 0), reflecting a credit to the account.
- Requires a positive value (
Rationale:¶
- Salaries are typically credited into accounts via structured mechanisms such as bank transfers or direct deposits.
def flag_salary_type_transactions(df):
"""
Flags transactions that match the salary criteria based on type, subtype, and initiator.
Args:
df (pd.DataFrame): Transaction data with columns ['trx_type', 'trx_subtype', 'initiated_by', 'amount'].
Returns:
pd.DataFrame: Input DataFrame with an added 'is_salary_type' column.
"""
df['is_salary_type'] = (
((df['trx_type'] == 'T') | (df['trx_type'] == 'C')) &
((df['trx_subtype'] == 'BI') | (df['trx_subtype'] == 'I') | (df['trx_subtype'] == 'BS') | (df['trx_subtype'] == 'CI')) &
(df['initiated_by'] == 'C') &
(df['amount'] > 0)
)
return df
def is_salary_earner_by_type(group, min_transactions=3, threshold=0.7):
"""
Determines if an account likely belongs to a salary earner based on transaction type criteria.
Args:
group (pd.DataFrame): Transactions for a single account.
min_transactions (int): Minimum transactions required to qualify (default: 3).
threshold (float): Minimum proportion of salary-type transactions (default: 0.7).
Returns:
bool: True if the account meets the criteria, False otherwise.
"""
if len(group) < min_transactions:
return False
valid_ratio = group['is_salary_type'].mean()
return valid_ratio >= threshold
trx_df = flag_salary_type_transactions(df)
trx_data = trx_df[trx_df['is_salary_type']]
trx_data.head()
| id | accountid | trx_start_date | trx_end_date | amount | trx_type | trx_subtype | initiated_by | description | customer_id | is_salary_related | is_consistent_amount | is_salary_type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 513 | 1659404015 | 2024-03-18 | 2024-03-18 | 210.0 | T | CI | C | ATM PUR RR @ FCMB NEW MKT RD, GOMBE II ... | 1659404 | False | True | True |
| 8 | 519 | 1659404015 | 2024-03-21 | 2024-03-21 | 216.0 | T | CI | C | ATM PUR RR @ FCMB NEW MKT RD, GOMBE II ... | 1659404 | False | True | True |
| 9 | 520 | 2014836018 | 2024-03-22 | 2024-03-22 | 10200.0 | T | CI | C | NIP FRM ESIEN ESESIEN EDET-NIP FROM ESIEN ES | 2014836 | False | True | True |
| 11 | 522 | 2014836018 | 2024-03-17 | 2024-03-17 | 13200.0 | T | CI | C | NIP FRM ROSEMARY OKECHUKWU-FT Queen ESIEN Q | 2014836 | False | True | True |
| 17 | 528 | 2089119018 | 2024-03-22 | 2024-03-22 | 35000.0 | T | CI | C | web:ann | 2089119 | False | False | True |
trx_data.shape
(991655, 13)
trx_data['accountid'].nunique()
35075
trx_data['customer_id'].nunique()
35051
plot_trx_type_and_subtype(trx_data)
Investigating all Hypothesis¶
def plot_hypothesis_overlap(hypothesis1_df, hypothesis2_df,
hypothesis3_df, hypothesis4_df,
account_col='accountid'):
"""
Plots a Venn diagram showing overlap between the hypotheses.
Args:
hypothesis1_df (pd.DataFrame): DataFrame with Hypothesis 1 results
hypothesis2_df (pd.DataFrame): DataFrame with Hypothesis 2 results
hypothesis3_df (pd.DataFrame): DataFrame with Hypothesis 3 results
hypothesis4_df (pd.DataFrame): DataFrame with Hypothesis 4 results
account_col (str): Account identifier column.
"""
set1 = set(hypothesis2_df[account_col][hypothesis2_df['is_regular_interval']])
set2 = set(hypothesis3_df[account_col][hypothesis3_df['is_consistent_amount']])
set3 = set(hypothesis1_df[account_col][hypothesis1_df['is_salary_related']])
set4 = set(hypothesis4_df[account_col][hypothesis4_df['is_salary_type']])
plt.figure(figsize=(10, 10))
venn3([set2, set3, set4], set_labels=('Consistent Amount',
'Salary Description', 'Transaction Type'))
plt.title('Overlap Between Hypotheses')
plt.show()
plot_hypothesis_overlap(desc_data, interval_data, const_data, trx_data)
def filter_venn_section(df, **kwargs):
"""
Filters accounts based on specified combinations of hypothesis flags.
Args:
df (pd.DataFrame): DataFrame with columns ['is_salary_related', 'is_consistent_amount', 'is_salary_type'].
**kwargs: Key-value pairs specifying the desired state of each hypothesis flag.
For example: {'is_salary_related': True, 'is_consistent_amount': False}.
Returns:
pd.DataFrame: Filtered accounts matching the specified Venn section.
"""
valid_columns = {'is_salary_related', 'is_consistent_amount', 'is_salary_type'}
df1 = df[df['initiated_by']=='C']
invalid_keys = set(kwargs.keys()) - valid_columns
if invalid_keys:
raise ValueError(f"Invalid keys: {invalid_keys}. Valid keys are {valid_columns}.")
condition = pd.Series([True] * len(df1), index=df1.index)
for key, value in kwargs.items():
condition &= (df1[key] == value)
return df1[condition]
green_section = filter_venn_section(
df,
is_salary_related=True,
is_consistent_amount=False,
is_salary_type=True
)
display(green_section.head(10))
plot_trx_type_and_subtype(green_section)
| id | accountid | trx_start_date | trx_end_date | amount | trx_type | trx_subtype | initiated_by | description | customer_id | is_salary_related | is_consistent_amount | is_salary_type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21 | 532 | 2089119018 | 2024-03-25 | 2024-03-25 | 62208.00 | T | BI | C | March 2018 Allowance | 2089119 | True | False | True |
| 63 | 574 | 2099839010 | 2024-03-24 | 2024-03-24 | 45396.00 | T | BI | C | March 2018 Allowance | 2099839 | True | False | True |
| 1208 | 1460 | 2274196017 | 2024-03-21 | 2024-03-21 | 24401.52 | T | CI | C | wages/ARIBIFO BABATUNDE | 2274196 | True | False | True |
| 1214 | 1466 | 2175103017 | 2024-03-23 | 2024-03-23 | 34300.00 | T | BI | C | March 2018 Allowance | 2175103 | True | False | True |
| 1652 | 1903 | 2037208018 | 2024-03-17 | 2024-03-17 | 49830.00 | T | CI | C | Online : March 2018 half salary | 2037208 | True | False | True |
| 1813 | 2066 | 2071298017 | 2024-03-21 | 2024-03-21 | 61056.00 | T | BI | C | March 2018 Allowance | 2071298 | True | False | True |
| 5184 | 5423 | 1741580016 | 2024-03-20 | 2024-03-20 | 30282.00 | T | BI | C | March 2018 Allowance | 1741580 | True | False | True |
| 5349 | 5587 | 1606101019 | 2024-03-22 | 2024-03-22 | 33000.00 | T | CI | C | web:January and February salary Abosi | 1606101 | True | False | True |
| 6161 | 6396 | 1986459014 | 2024-03-19 | 2024-03-19 | 70200.00 | T | BI | C | SALARY PAYMENTS | 1986459 | True | False | True |
| 6863 | 7097 | 1881876011 | 2024-03-22 | 2024-03-22 | 63360.00 | T | BI | C | March 2018 Allowance | 1881876 | True | False | True |
Majority of these are Allowwances and Wages.
They have Salary descriptions but are not consistent
yellow_section = filter_venn_section(
df,
is_salary_related=False,
is_consistent_amount=True,
is_salary_type=True
)
display(yellow_section.head(10))
plot_trx_type_and_subtype(yellow_section)
| id | accountid | trx_start_date | trx_end_date | amount | trx_type | trx_subtype | initiated_by | description | customer_id | is_salary_related | is_consistent_amount | is_salary_type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 513 | 1659404015 | 2024-03-18 | 2024-03-18 | 210.0000 | T | CI | C | ATM PUR RR @ FCMB NEW MKT RD, GOMBE II ... | 1659404 | False | True | True |
| 8 | 519 | 1659404015 | 2024-03-21 | 2024-03-21 | 216.0000 | T | CI | C | ATM PUR RR @ FCMB NEW MKT RD, GOMBE II ... | 1659404 | False | True | True |
| 9 | 520 | 2014836018 | 2024-03-22 | 2024-03-22 | 10200.0000 | T | CI | C | NIP FRM ESIEN ESESIEN EDET-NIP FROM ESIEN ES | 2014836 | False | True | True |
| 11 | 522 | 2014836018 | 2024-03-17 | 2024-03-17 | 13200.0000 | T | CI | C | NIP FRM ROSEMARY OKECHUKWU-FT Queen ESIEN Q | 2014836 | False | True | True |
| 54 | 565 | 1174712011 | 2024-03-21 | 2024-03-21 | 15450.0000 | T | CI | C | NIP FRM NDALIMAN AISHATU SHABA-FBNMOBILE SAEED | 1174712 | False | True | True |
| 60 | 571 | 1999974016 | 2024-03-20 | 2024-03-20 | 180.7032 | T | BI | C | REVERSAL ON ISSUER FEES CHARGE | 1999974 | False | True | True |
| 70 | 581 | 1353446016 | 2024-03-22 | 2024-03-22 | 20330.0000 | T | CI | C | Online : mezie | 1353446 | False | True | True |
| 121 | 632 | 1726415010 | 2024-03-16 | 2024-03-16 | 10833.2000 | T | CI | C | CSH DEPOSIT BY : OJO RONKE | 1726415 | False | True | True |
| 122 | 633 | 1510438010 | 2024-03-15 | 2024-03-15 | 84240.0000 | T | CI | C | CSH DEPOSIT BY : NWAIWU HAPPINESS | 1510438 | False | True | True |
| 123 | 634 | 1732503011 | 2024-03-23 | 2024-03-23 | 21200.0000 | T | CI | C | NIP FRM VINCENT CHUKWUDI MACFRIDO-Inflow from | 1732503 | False | True | True |
These are potential Salary/ Monthly income earners.
Majority of which are cash deposit, indicating cash handling
# Get accounts flagged by all three hypotheses
all_three_hypotheses = filter_venn_section(
df,
is_salary_related=True,
is_consistent_amount=True,
is_salary_type=True
)
display(all_three_hypotheses.head(10))
plot_trx_type_and_subtype(all_three_hypotheses)
| id | accountid | trx_start_date | trx_end_date | amount | trx_type | trx_subtype | initiated_by | description | customer_id | is_salary_related | is_consistent_amount | is_salary_type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 779 | 1034 | 1776036014 | 2024-03-25 | 2024-03-25 | 366.8219 | T | BS | C | SALARY FOR MAR 2018 | 1776036 | True | True | True |
| 2050 | 2301 | 2182865010 | 2024-03-18 | 2024-03-18 | 39375.0000 | T | BI | C | GenieNG STAFF DEC BONUS | 2182865 | True | True | True |
| 5324 | 5562 | 0853687015 | 2024-03-22 | 2024-03-22 | 16451.8120 | T | CI | C | N-10022744651/Sigma _Monthly pension from UPCL | 0853687 | True | True | True |
| 6126 | 6361 | 1533377013 | 2024-03-23 | 2024-03-23 | 22470.0000 | T | CI | C | MPCS DIVIDEND | 1533377 | True | True | True |
| 7080 | 7313 | 1757605017 | 2024-03-18 | 2024-03-18 | 206034.8288 | T | CI | C | COP08DPSS959770;54468994;NSUK FEB 2018 SALARY-NSU | 1757605 | True | True | True |
| 7865 | 8095 | 1893248017 | 2024-03-21 | 2024-03-21 | 7518.1284 | T | I | C | MARCH 18 PENSION PAYMENT=FPCNL | 1893248 | True | True | True |
| 8242 | 8471 | 1298420010 | 2024-03-21 | 2024-03-21 | 186329.6344 | T | CI | C | 137/MAR2018 Salary/3152861/ LANG | 1298420 | True | True | True |
| 9362 | 9586 | 1071766010 | 2024-03-18 | 2024-03-18 | 11698.6400 | T | CI | C | N-10022744037/Sigma _Monthly pension from UPCL | 1071766 | True | True | True |
| 9369 | 9593 | 0966021014 | 2024-03-19 | 2024-03-19 | 42230.0000 | T | CI | C | MPCS DIVIDEND | 0966021 | True | True | True |
| 9469 | 9693 | 1144064018 | 2024-03-18 | 2024-03-18 | 35999.0703 | T | I | C | FEB 2018 ALLOWANCE=NIGER STATE | 1144064 | True | True | True |
Selecting Monthly Income Earners¶
def generate_salary_earners_table(all_three_hypotheses):
results = []
for accountid, group in all_three_hypotheses.groupby('accountid'):
# Calculate required metrics
num_months = len(group)
last_6_months = group[group['trx_start_date'] >= (datetime.now() - timedelta(days=180))]
least_inflow = last_6_months['amount'].min()
avg_salary = group['amount'].mean()
# Estimated next salary
# Calculate days_since_last_trx within the loop
group['days_since_last_trx'] = group['trx_start_date'].diff().dt.days
median_interval = group['days_since_last_trx'].median()
last_date = group['trx_start_date'].max()
next_date = last_date + timedelta(days=median_interval)
next_amount = avg_salary
# Boolean flags
days_since_last = (datetime.now() - last_date).days
has_45d = days_since_last <= 45
has_2m = len(group[group['trx_start_date'] >= (datetime.now() - timedelta(days=60))]) >= 2
results.append({
'accountid': accountid,
'num_months': num_months,
'least_inflow_6m': least_inflow,
'avg_monthly_salary': avg_salary,
'estimated_next_amount': next_amount,
'estimated_next_date': next_date,
'45daysalary': has_45d,
'2monthssalary': has_2m
})
final_df = pd.DataFrame(results)
final_df = final_df.dropna()
return final_df
final_table = generate_salary_earners_table(all_three_hypotheses)
# Display results
print(f"Found {final_table['accountid'].nunique()} verified salary earners")
display(final_table.head(20))
Found 383 verified salary earners
| accountid | num_months | least_inflow_6m | avg_monthly_salary | estimated_next_amount | estimated_next_date | 45daysalary | 2monthssalary | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0171115023 | 14 | 4.251146e+03 | 4.402972e+03 | 4.402972e+03 | 2025-04-12 | True | False |
| 1 | 0194052022 | 14 | 2.658945e+04 | 2.738234e+04 | 2.738234e+04 | 2025-04-14 | True | False |
| 2 | 0532452020 | 14 | 8.000000e+04 | 8.371429e+04 | 8.371429e+04 | 2025-04-15 | True | False |
| 3 | 0623122023 | 14 | 4.100378e+03 | 4.297567e+03 | 4.297567e+03 | 2025-04-16 | True | False |
| 4 | 0676572026 | 14 | 4.232361e+04 | 4.459095e+04 | 4.459095e+04 | 2025-04-15 | True | False |
| 5 | 0761118025 | 14 | 1.020951e+04 | 1.072728e+04 | 1.072728e+04 | 2025-04-09 | True | False |
| 6 | 0820472015 | 14 | 2.023256e+05 | 2.105433e+05 | 2.105433e+05 | 2025-04-18 | True | True |
| 7 | 0835181025 | 14 | 4.288626e+04 | 4.376991e+04 | 4.376991e+04 | 2025-04-10 | True | False |
| 8 | 0835200018 | 14 | 4.085767e+04 | 4.211659e+04 | 4.211659e+04 | 2025-04-15 | True | False |
| 9 | 0846827013 | 14 | 3.794400e+02 | 3.882086e+02 | 3.882086e+02 | 2025-04-11 | True | False |
| 10 | 0852762012 | 14 | 2.064837e+04 | 2.138581e+04 | 2.138581e+04 | 2025-04-12 | True | False |
| 11 | 0853195013 | 14 | 9.297402e+03 | 9.476199e+03 | 9.476199e+03 | 2025-04-16 | True | False |
| 12 | 0853687015 | 14 | 1.597724e+04 | 1.646311e+04 | 1.646311e+04 | 2025-04-17 | True | False |
| 13 | 0856304027 | 14 | 7.754812e+04 | 8.092700e+04 | 8.092700e+04 | 2025-04-16 | True | False |
| 14 | 0875566019 | 14 | 5.871242e+03 | 6.048038e+03 | 6.048038e+03 | 2025-04-15 | True | False |
| 15 | 0889240019 | 14 | 1.150220e+06 | 1.189266e+06 | 1.189266e+06 | 2025-04-22 | True | False |
| 16 | 0890507013 | 14 | 2.233007e+04 | 2.354227e+04 | 2.354227e+04 | 2025-04-14 | True | False |
| 17 | 0897678011 | 14 | 5.514127e+04 | 5.663260e+04 | 5.663260e+04 | 2025-04-19 | True | False |
| 18 | 0903403013 | 14 | 6.701709e+04 | 6.924467e+04 | 6.924467e+04 | 2025-04-13 | True | False |
| 19 | 0903407011 | 14 | 4.643482e+04 | 4.958575e+04 | 4.958575e+04 | 2025-04-13 | True | False |
likely_salary_earner = pd.concat([yellow_section, green_section])
likely_salary_earner = likely_salary_earner.drop_duplicates(subset=['id'])
likely_salary_earner = generate_salary_earners_table(likely_salary_earner)
# Display results
print(f"Found {likely_salary_earner['accountid'].nunique()} likely salary earners")
display(likely_salary_earner.head(20))
Found 20797 likely salary earners
| accountid | num_months | least_inflow_6m | avg_monthly_salary | estimated_next_amount | estimated_next_date | 45daysalary | 2monthssalary | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0022324028 | 14 | 40800.00 | 4.162857e+04 | 4.162857e+04 | 2025-04-17 | True | False |
| 1 | 0022848049 | 14 | 1395360.00 | 1.450080e+06 | 1.450080e+06 | 2025-04-14 | True | False |
| 2 | 0025339021 | 14 | 553480.00 | 5.789229e+05 | 5.789229e+05 | 2025-04-18 | True | False |
| 3 | 0027643012 | 14 | 106000.00 | 1.051429e+05 | 1.051429e+05 | 2025-04-19 | True | False |
| 4 | 0050282020 | 14 | 15000.00 | 1.571786e+04 | 1.571786e+04 | 2025-04-19 | True | False |
| 5 | 0051390023 | 14 | 153000.00 | 1.588929e+05 | 1.588929e+05 | 2025-04-15 | True | False |
| 6 | 0060968040 | 14 | 202.00 | 2.081429e+02 | 2.081429e+02 | 2025-04-18 | True | False |
| 7 | 0103261028 | 14 | 92000.00 | 9.614000e+04 | 9.614000e+04 | 2025-04-16 | True | False |
| 8 | 0104926023 | 14 | 40000.00 | 4.222857e+04 | 4.222857e+04 | 2025-04-18 | True | False |
| 9 | 0108574013 | 14 | 1010.00 | 1.055000e+03 | 1.055000e+03 | 2025-04-19 | True | False |
| 10 | 0109992036 | 14 | 4040.00 | 4.177143e+03 | 4.177143e+03 | 2025-04-17 | True | False |
| 11 | 0110206027 | 28 | 10100.00 | 1.049286e+04 | 1.049286e+04 | 2025-03-27 | True | True |
| 12 | 0112600038 | 14 | 5100.00 | 5.275000e+03 | 5.275000e+03 | 2025-04-09 | True | False |
| 13 | 0160518039 | 14 | 40000.00 | 4.191429e+04 | 4.191429e+04 | 2025-04-18 | True | True |
| 14 | 0162780032 | 14 | 5000.00 | 5.175000e+03 | 5.175000e+03 | 2025-04-21 | True | False |
| 15 | 0163308020 | 14 | 2525.00 | 2.625000e+03 | 2.625000e+03 | 2025-04-11 | True | False |
| 16 | 0163704024 | 14 | 323200.00 | 3.344000e+05 | 3.344000e+05 | 2025-04-19 | True | False |
| 17 | 0164479033 | 14 | 10200.00 | 1.051429e+04 | 1.051429e+04 | 2025-04-16 | True | False |
| 18 | 0170195039 | 14 | 3923.16 | 4.127725e+03 | 4.127725e+03 | 2025-04-14 | True | False |
| 19 | 0171813037 | 14 | 10000.00 | 1.040000e+04 | 1.040000e+04 | 2025-04-10 | True | False |
def analyze_salary_earners(final_df):
"""
Analyzes salary earners and identifies high earners (>=10k predicted salary).
Args:
final_df (pd.DataFrame): DataFrame containing salary earner information.
Returns:
pd.DataFrame: DataFrame with high earner statistics, including count and minimum inflows.
"""
high_earners = final_df[final_df['estimated_next_amount'] >= 10000]
high_earners['least_inflow_6m'] = high_earners['least_inflow_6m']
count_high = len(high_earners)
high_earner_details = high_earners[['accountid', 'least_inflow_6m']].reset_index(drop=True)
return high_earner_details, count_high
high_earner_details_df, total_high_earners = analyze_salary_earners(final_table)
print(f"\nTotal High Earners: {total_high_earners}")
display(high_earner_details_df)
Total High Earners: 307
| accountid | least_inflow_6m | |
|---|---|---|
| 0 | 0194052022 | 26589.4500 |
| 1 | 0532452020 | 80000.0000 |
| 2 | 0676572026 | 42323.6100 |
| 3 | 0761118025 | 10209.5100 |
| 4 | 0820472015 | 202325.6088 |
| ... | ... | ... |
| 302 | 2253348011 | 25721.5286 |
| 303 | 2253502017 | 26780.0000 |
| 304 | 2253954018 | 25391.8343 |
| 305 | 2267310019 | 21000.0000 |
| 306 | 2271021019 | 52800.0000 |
307 rows Ć 2 columns
high_earner_details_df.to_csv('high_earner_details.csv', index=False)
likely_salary_earner.to_csv('likely_salary_earner.csv', index=False)
final_table.to_csv('final_table.csv', index=False)
ML Prediction¶
def add_feature_engineering(df):
"""
Engineers new features to the input DataFrame for salary prediction.
Adds features like month, month sequence, one-hot encoded transaction type,
3-month rolling sum, and 3-month rolling average of transaction amounts.
Args:
df (pd.DataFrame): The input DataFrame containing transaction data.
Returns:
pd.DataFrame: The DataFrame with engineered features added.
"""
df['month'] = df['trx_start_date'].dt.month
df['month_seq'] = df.groupby(['accountid', 'month']).ngroup() + 1
# Categorical encoding: one-hot encode trx_type
encoder = OneHotEncoder(sparse_output=False)
encoded_trx_type = encoder.fit_transform(df[['trx_type']])
encoded_df = pd.DataFrame(encoded_trx_type, columns=encoder.get_feature_names_out(['trx_type']))
df = pd.concat([df, encoded_df], axis=1)
# Rolling statistics: sort by account and date
df = df.sort_values(['accountid', 'trx_start_date'])
df['rolling_sum_3m'] = df.groupby('accountid')['amount'].rolling(window=3,
min_periods=1).sum().reset_index(0, drop=True)
df['rolling_avg_3m'] = df.groupby('accountid')['amount'].rolling(window=3,
min_periods=1).mean().reset_index(0, drop=True)
return df
def prepare_data(df_transactions, accounts):
"""
Prepares transaction data for training and testing a salary prediction model.
Filters transactions for specified accounts, performs feature engineering,
aggregates data monthly, filters for accounts with sufficient data, and
creates training and testing sets using a sliding window approach.
Args:
df_transactions (pd.DataFrame): The input DataFrame containing all transaction data.
accounts (list): A list of account IDs to include in the data preparation.
Returns:
tuple: A tuple containing the training and testing data as NumPy arrays:
(X_train, y_train, X_test, y_test).
"""
df_filtered = df_transactions[df_transactions['accountid'].isin(accounts)].copy()
print(f"Filtered data for {len(accounts)} accounts.")
print(f"Total transactions: {len(df_filtered)}")
# Drop unnecessary columns
df_filtered = df_filtered.drop(['description', 'id', 'customer_id',
'trx_end_date', 'is_salary_related',
'is_consistent_amount', 'is_salary_type'], axis=1)
# Add feature engineering
df_filtered = add_feature_engineering(df_filtered)
# Aggregate monthly data with new features
agg_funcs = {
'amount': 'mean',
'rolling_sum_3m': 'last',
'rolling_avg_3m': 'last',
'month': 'first'
}
encoded_cols = [col for col in df_filtered.columns if col.startswith('trx_type_')]
for col in encoded_cols:
agg_funcs[col] = 'sum'
monthly_data = df_filtered.groupby(['accountid', 'month_seq']).agg(agg_funcs).reset_index()
# Filter accounts with at least 12 months
account_month_counts = monthly_data.groupby('accountid')['month_seq'].max()
valid_accounts = account_month_counts[account_month_counts >= 12].index
monthly_data = monthly_data[monthly_data['accountid'].isin(valid_accounts)]
# Create training and testing sequences
X_train, y_train, X_test, y_test = [], [], [], []
feature_cols = ['accountid', 'amount', 'rolling_sum_3m', 'rolling_avg_3m',
'month'] + encoded_cols
for account in valid_accounts:
account_data = monthly_data[monthly_data['accountid'] == account].sort_values('month_seq')
# Check if account has enough data for training and testing sequences
if len(account_data) >= 12: # Ensure at least 12 months of data
for t in range(5, 8):
X_train.append(account_data.iloc[t-5:t][feature_cols].values.flatten())
y_train.append(account_data['amount'].iloc[t])
for t in range(8, 12):
X_test.append(account_data.iloc[t-5:t][feature_cols].values.flatten())
y_test.append(account_data['amount'].iloc[t])
else:
print(f"Skipping account {account} due to insufficient data (less than 12 months).")
return np.array(X_train), np.array(y_train), np.array(X_test), np.array(y_test)
def train_model(X_train, y_train, X_test, y_test):
"""
Trains and evaluates a Random Forest Regressor for salary prediction.
Scales the input features using StandardScaler, trains the model,
predicts on the test set, and calculates evaluation metrics
(MAE, RMSE, R-squared).
Args:
X_train (np.ndarray): Training data features.
y_train (np.ndarray): Training data target (salary).
X_test (np.ndarray): Testing data features.
y_test (np.ndarray): Testing data target (salary).
Returns:
tuple: A tuple containing the trained model and the scaler object:
(model, scaler).
"""
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train Random Forest model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Evaluate model
y_pred = model.predict(X_test_scaled)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, R-squared: {r2:.2f}")
return model, scaler
Consistent salary earners¶
consistent_accounts = final_table['accountid'].unique()
X_train_cons, y_train_cons, X_test_cons, y_test_cons = prepare_data(df, consistent_accounts)
if len(X_train_cons) > 0:
model_cons, scaler_cons = train_model(X_train_cons, y_train_cons, X_test_cons, y_test_cons)
print("Model trained for consistent salary earners.")
else:
print("No accounts with sufficient data for consistent salary earners.")
Filtered data for 383 accounts. Total transactions: 22162 Skipping account 0846827013 due to insufficient data (less than 12 months). Skipping account 1214711015 due to insufficient data (less than 12 months). Skipping account 1218765016 due to insufficient data (less than 12 months). Skipping account 1377321016 due to insufficient data (less than 12 months). Skipping account 1486608013 due to insufficient data (less than 12 months). Skipping account 1910312013 due to insufficient data (less than 12 months). Skipping account 1949225012 due to insufficient data (less than 12 months). MAE: 1909.79, RMSE: 4610.21, R-squared: 0.96 Model trained for consistent salary earners.
X_test_cons_scaled = scaler_cons.transform(X_test_cons)
y_pred = model_cons.predict(X_test_cons_scaled)
plt.figure(figsize=(10, 5))
plt.scatter(y_test_cons, y_pred, alpha=0.5)
plt.xlabel("Actual Salary")
plt.ylabel("Predicted Salary")
plt.title("Actual vs. Predicted Salary")
plt.plot([min(y_test_cons), max(y_test_cons)], [min(y_test_cons), max(y_test_cons)], 'r--')
plt.show()
Inconsistent salary earners¶
inconsistent_accounts = likely_salary_earner['accountid'].unique()
X_train_incons, y_train_incons, X_test_incons, y_test_incons = prepare_data(df, inconsistent_accounts)
if len(X_train_incons) > 0:
print("\nTraining model for inconsistent salary earners...")
model_incons, scaler_incons = train_model(X_train_incons, y_train_incons, X_test_incons, y_test_incons)
else:
print("No accounts with sufficient data for inconsistent salary earners.")
Filtered data for 20797 accounts. Total transactions: 1508309 Skipping account 0164479033 due to insufficient data (less than 12 months). Skipping account 0252542021 due to insufficient data (less than 12 months). Skipping account 0793130028 due to insufficient data (less than 12 months). Skipping account 0811893018 due to insufficient data (less than 12 months). Skipping account 0820374023 due to insufficient data (less than 12 months). Skipping account 0833877010 due to insufficient data (less than 12 months). Skipping account 0841163017 due to insufficient data (less than 12 months). Skipping account 0844752012 due to insufficient data (less than 12 months). Skipping account 0869552015 due to insufficient data (less than 12 months). Skipping account 0881519038 due to insufficient data (less than 12 months). Skipping account 0889291028 due to insufficient data (less than 12 months). Skipping account 0900942018 due to insufficient data (less than 12 months). Skipping account 0916761018 due to insufficient data (less than 12 months). Skipping account 0969843017 due to insufficient data (less than 12 months). Skipping account 0999364014 due to insufficient data (less than 12 months). Skipping account 1084818010 due to insufficient data (less than 12 months). Skipping account 1096189016 due to insufficient data (less than 12 months). Skipping account 1126705014 due to insufficient data (less than 12 months). Skipping account 1128137015 due to insufficient data (less than 12 months). Skipping account 1143631017 due to insufficient data (less than 12 months). Skipping account 1143803016 due to insufficient data (less than 12 months). Skipping account 1164852015 due to insufficient data (less than 12 months). Skipping account 1167885018 due to insufficient data (less than 12 months). Skipping account 1170449014 due to insufficient data (less than 12 months). Skipping account 1170914013 due to insufficient data (less than 12 months). Skipping account 1183525013 due to insufficient data (less than 12 months). Skipping account 1202640039 due to insufficient data (less than 12 months). Skipping account 1204052012 due to insufficient data (less than 12 months). Skipping account 1216036017 due to insufficient data (less than 12 months). Skipping account 1219147015 due to insufficient data (less than 12 months). Skipping account 1219701013 due to insufficient data (less than 12 months). Skipping account 1222737014 due to insufficient data (less than 12 months). Skipping account 1231595016 due to insufficient data (less than 12 months). Skipping account 1235579018 due to insufficient data (less than 12 months). Skipping account 1239695015 due to insufficient data (less than 12 months). Skipping account 1240446015 due to insufficient data (less than 12 months). Skipping account 1253643018 due to insufficient data (less than 12 months). Skipping account 1256392018 due to insufficient data (less than 12 months). Skipping account 1261631010 due to insufficient data (less than 12 months). Skipping account 1271364010 due to insufficient data (less than 12 months). Skipping account 1290617012 due to insufficient data (less than 12 months). Skipping account 1298675023 due to insufficient data (less than 12 months). Skipping account 1299084015 due to insufficient data (less than 12 months). Skipping account 1303872016 due to insufficient data (less than 12 months). Skipping account 1311306013 due to insufficient data (less than 12 months). Skipping account 1323543017 due to insufficient data (less than 12 months). Skipping account 1327803016 due to insufficient data (less than 12 months). Skipping account 1336761011 due to insufficient data (less than 12 months). Skipping account 1348526013 due to insufficient data (less than 12 months). Skipping account 1358686015 due to insufficient data (less than 12 months). Skipping account 1379381012 due to insufficient data (less than 12 months). Skipping account 1381224019 due to insufficient data (less than 12 months). Skipping account 1385952015 due to insufficient data (less than 12 months). Skipping account 1391474017 due to insufficient data (less than 12 months). Skipping account 1391810019 due to insufficient data (less than 12 months). Skipping account 1399454019 due to insufficient data (less than 12 months). Skipping account 1401542015 due to insufficient data (less than 12 months). Skipping account 1403496019 due to insufficient data (less than 12 months). Skipping account 1414734016 due to insufficient data (less than 12 months). Skipping account 1425276024 due to insufficient data (less than 12 months). Skipping account 1441261011 due to insufficient data (less than 12 months). Skipping account 1459159010 due to insufficient data (less than 12 months). Skipping account 1465847019 due to insufficient data (less than 12 months). Skipping account 1468334017 due to insufficient data (less than 12 months). Skipping account 1478452017 due to insufficient data (less than 12 months). Skipping account 1483551015 due to insufficient data (less than 12 months). Skipping account 1493191018 due to insufficient data (less than 12 months). Skipping account 1496522019 due to insufficient data (less than 12 months). Skipping account 1503760018 due to insufficient data (less than 12 months). Skipping account 1507901017 due to insufficient data (less than 12 months). Skipping account 1514231013 due to insufficient data (less than 12 months). Skipping account 1515986011 due to insufficient data (less than 12 months). Skipping account 1526777019 due to insufficient data (less than 12 months). Skipping account 1533413016 due to insufficient data (less than 12 months). Skipping account 1539422018 due to insufficient data (less than 12 months). Skipping account 1539760011 due to insufficient data (less than 12 months). Skipping account 1544487017 due to insufficient data (less than 12 months). Skipping account 1550002013 due to insufficient data (less than 12 months). Skipping account 1561203010 due to insufficient data (less than 12 months). Skipping account 1574299013 due to insufficient data (less than 12 months). Skipping account 1576866013 due to insufficient data (less than 12 months). Skipping account 1581587017 due to insufficient data (less than 12 months). Skipping account 1616267013 due to insufficient data (less than 12 months). Skipping account 1623707012 due to insufficient data (less than 12 months). Skipping account 1630380013 due to insufficient data (less than 12 months). Skipping account 1631558011 due to insufficient data (less than 12 months). Skipping account 1635474012 due to insufficient data (less than 12 months). Skipping account 1638302017 due to insufficient data (less than 12 months). Skipping account 1655657019 due to insufficient data (less than 12 months). Skipping account 1657682019 due to insufficient data (less than 12 months). Skipping account 1664382016 due to insufficient data (less than 12 months). Skipping account 1669016011 due to insufficient data (less than 12 months). Skipping account 1670290015 due to insufficient data (less than 12 months). Skipping account 1692703016 due to insufficient data (less than 12 months). Skipping account 1703110011 due to insufficient data (less than 12 months). Skipping account 1713919015 due to insufficient data (less than 12 months). Skipping account 1717765014 due to insufficient data (less than 12 months). Skipping account 1720096019 due to insufficient data (less than 12 months). Skipping account 1727968018 due to insufficient data (less than 12 months). Skipping account 1728612019 due to insufficient data (less than 12 months). Skipping account 1728756018 due to insufficient data (less than 12 months). Skipping account 1738694016 due to insufficient data (less than 12 months). Skipping account 1741834018 due to insufficient data (less than 12 months). Skipping account 1746903018 due to insufficient data (less than 12 months). Skipping account 1753350010 due to insufficient data (less than 12 months). Skipping account 1762185010 due to insufficient data (less than 12 months). Skipping account 1762728017 due to insufficient data (less than 12 months). Skipping account 1770647014 due to insufficient data (less than 12 months). Skipping account 1775250015 due to insufficient data (less than 12 months). Skipping account 1777074015 due to insufficient data (less than 12 months). Skipping account 1778623018 due to insufficient data (less than 12 months). Skipping account 1793147014 due to insufficient data (less than 12 months). Skipping account 1803595017 due to insufficient data (less than 12 months). Skipping account 1806158019 due to insufficient data (less than 12 months). Skipping account 1815552019 due to insufficient data (less than 12 months). Skipping account 1818659012 due to insufficient data (less than 12 months). Skipping account 1819457019 due to insufficient data (less than 12 months). Skipping account 1831918011 due to insufficient data (less than 12 months). Skipping account 1832218017 due to insufficient data (less than 12 months). Skipping account 1833745019 due to insufficient data (less than 12 months). Skipping account 1836474015 due to insufficient data (less than 12 months). Skipping account 1840881018 due to insufficient data (less than 12 months). Skipping account 1842141013 due to insufficient data (less than 12 months). Skipping account 1845647011 due to insufficient data (less than 12 months). Skipping account 1846326018 due to insufficient data (less than 12 months). Skipping account 1847430013 due to insufficient data (less than 12 months). Skipping account 1851305019 due to insufficient data (less than 12 months). Skipping account 1860100018 due to insufficient data (less than 12 months). Skipping account 1861508013 due to insufficient data (less than 12 months). Skipping account 1865247013 due to insufficient data (less than 12 months). Skipping account 1876925010 due to insufficient data (less than 12 months). Skipping account 1883300019 due to insufficient data (less than 12 months). Skipping account 1898189014 due to insufficient data (less than 12 months). Skipping account 1898741018 due to insufficient data (less than 12 months). Skipping account 1913075018 due to insufficient data (less than 12 months). Skipping account 1915864012 due to insufficient data (less than 12 months). Skipping account 1928120013 due to insufficient data (less than 12 months). Skipping account 1928937013 due to insufficient data (less than 12 months). Skipping account 1937668012 due to insufficient data (less than 12 months). Skipping account 1938595016 due to insufficient data (less than 12 months). Skipping account 1940073014 due to insufficient data (less than 12 months). Skipping account 1943190011 due to insufficient data (less than 12 months). Skipping account 1947357016 due to insufficient data (less than 12 months). Skipping account 1954123017 due to insufficient data (less than 12 months). Skipping account 1966814019 due to insufficient data (less than 12 months). Skipping account 1971739017 due to insufficient data (less than 12 months). Skipping account 1976668013 due to insufficient data (less than 12 months). Skipping account 1985946014 due to insufficient data (less than 12 months). Skipping account 1997076019 due to insufficient data (less than 12 months). Skipping account 2000510018 due to insufficient data (less than 12 months). Skipping account 2001644017 due to insufficient data (less than 12 months). Skipping account 2005088013 due to insufficient data (less than 12 months). Skipping account 2012981017 due to insufficient data (less than 12 months). Skipping account 2013426012 due to insufficient data (less than 12 months). Skipping account 2014958026 due to insufficient data (less than 12 months). Skipping account 2015981018 due to insufficient data (less than 12 months). Skipping account 2021883018 due to insufficient data (less than 12 months). Skipping account 2023269012 due to insufficient data (less than 12 months). Skipping account 2023914013 due to insufficient data (less than 12 months). Skipping account 2024595011 due to insufficient data (less than 12 months). Skipping account 2036492018 due to insufficient data (less than 12 months). Skipping account 2042103012 due to insufficient data (less than 12 months). Skipping account 2047034029 due to insufficient data (less than 12 months). Skipping account 2051077016 due to insufficient data (less than 12 months). Skipping account 2056660013 due to insufficient data (less than 12 months). Skipping account 2058877019 due to insufficient data (less than 12 months). Skipping account 2065160018 due to insufficient data (less than 12 months). Skipping account 2065594019 due to insufficient data (less than 12 months). Skipping account 2067607014 due to insufficient data (less than 12 months). Skipping account 2069255017 due to insufficient data (less than 12 months). Skipping account 2071562020 due to insufficient data (less than 12 months). Skipping account 2071649028 due to insufficient data (less than 12 months). Skipping account 2071850011 due to insufficient data (less than 12 months). Skipping account 2074482015 due to insufficient data (less than 12 months). Skipping account 2076951016 due to insufficient data (less than 12 months). Skipping account 2078709015 due to insufficient data (less than 12 months). Skipping account 2092373012 due to insufficient data (less than 12 months). Skipping account 2093437012 due to insufficient data (less than 12 months). Skipping account 2094447016 due to insufficient data (less than 12 months). Skipping account 2097386015 due to insufficient data (less than 12 months). Skipping account 2102831015 due to insufficient data (less than 12 months). Skipping account 2109718016 due to insufficient data (less than 12 months). Skipping account 2112284010 due to insufficient data (less than 12 months). Skipping account 2113979023 due to insufficient data (less than 12 months). Skipping account 2114096019 due to insufficient data (less than 12 months). Skipping account 2114660014 due to insufficient data (less than 12 months). Skipping account 2114679014 due to insufficient data (less than 12 months). Skipping account 2118487019 due to insufficient data (less than 12 months). Skipping account 2125105016 due to insufficient data (less than 12 months). Skipping account 2132731013 due to insufficient data (less than 12 months). Skipping account 2136439018 due to insufficient data (less than 12 months). Skipping account 2138249013 due to insufficient data (less than 12 months). Skipping account 2144192015 due to insufficient data (less than 12 months). Skipping account 2151240015 due to insufficient data (less than 12 months). Skipping account 2154072015 due to insufficient data (less than 12 months). Skipping account 2155857017 due to insufficient data (less than 12 months). Skipping account 2161540011 due to insufficient data (less than 12 months). Skipping account 2199994013 due to insufficient data (less than 12 months). Skipping account 2206614017 due to insufficient data (less than 12 months). Skipping account 2213633010 due to insufficient data (less than 12 months). Skipping account 2225690012 due to insufficient data (less than 12 months). Skipping account 2234196019 due to insufficient data (less than 12 months). Skipping account 2251039018 due to insufficient data (less than 12 months). Skipping account 2267300012 due to insufficient data (less than 12 months). Training model for inconsistent salary earners... MAE: 3209.99, RMSE: 51406.22, R-squared: 0.97
X_test_incons_scaled = scaler_incons.transform(X_test_incons)
y_pred = model_incons.predict(X_test_incons_scaled)
plt.figure(figsize=(10, 5))
plt.scatter(y_test_incons, y_pred, alpha=0.5)
plt.xlabel("Actual Salary")
plt.ylabel("Predicted Salary")
plt.title("Actual vs. Predicted Salary")
plt.plot([min(y_test_incons), max(y_test_incons)], [min(y_test_incons), max(y_test_incons)], 'r--')
plt.show()