Projects

PUBG Game Winner Prediction

Goal: Predict a player's win placement percentage (winPlacePerc) using in-game performance metrics to identify factors that influence winning probability.

Dataset

Source: Proprietary dataset from Rubixe AI Solutions, 4,436,306 rows × 33 columns. Real-world gameplay statistics. Target: winPlacePerc.

Tools & Techniques

Python, Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, CatBoost, XGBoost. Preprocessing, feature engineering (walkDistance, headshot_rate, totalDistance, healsnboosts), model training and comparison.
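
The derived features named above could be computed roughly as follows. This is a minimal sketch in plain Python, not the project's actual code: the raw column names (rideDistance, swimDistance, heals, boosts, kills, headshotKills) and the formulas themselves are assumptions inferred from the feature names.

```python
def add_derived_features(row):
    """Return a copy of one player record with three derived features added.

    The raw column names and formulas are assumptions, not taken from the project.
    """
    out = dict(row)
    # Combined movement across all travel modes.
    out["totalDistance"] = row["walkDistance"] + row["rideDistance"] + row["swimDistance"]
    # Combined use of healing and boost items.
    out["healsnboosts"] = row["heals"] + row["boosts"]
    # Share of kills that were headshots; 0.0 for players with no kills.
    out["headshot_rate"] = row["headshotKills"] / row["kills"] if row["kills"] > 0 else 0.0
    return out


# Hypothetical player record for illustration only.
player = {
    "walkDistance": 1200.0, "rideDistance": 300.0, "swimDistance": 0.0,
    "heals": 2, "boosts": 3, "kills": 4, "headshotKills": 1,
}
features = add_derived_features(player)
print(features["totalDistance"], features["healsnboosts"], features["headshot_rate"])
```

With Pandas, each of these would normally be a single vectorized column expression rather than a per-row function.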

Implementation

  • Exploratory analysis and sampling strategy to handle scale.
  • Feature engineering to normalize metrics across match types.
  • Trained CatBoostRegressor, RandomForest, and XGBoost models; tuned hyperparameters with cross-validation.

Key Findings

  • walkDistance and damageDealt each show a strong positive correlation with winPlacePerc.
  • Kills and headshot kills are high-impact features. Excessive healing correlates with defensive play.
  • Best model: CatBoostRegressor, R² = 0.93 (RMSE ≈ 0.08).
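
For reference, the two metrics quoted above follow the standard definitions: R² = 1 − SS_res / SS_tot, and RMSE is the square root of the mean squared error. The sketch below computes both in plain Python on made-up toy values, not on project data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy placement percentages, invented for illustration.
y_true = [0.0, 0.5, 1.0]
y_pred = [0.1, 0.5, 0.9]
print(r_squared(y_true, y_pred), rmse(y_true, y_pred))
```

Because winPlacePerc lies between 0 and 1, RMSE can be read directly as an average error in placement-percentage units.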

Challenges & Solutions

  • Scale: used representative sampling and optimized Pandas operations to keep iteration fast.
  • Match-type imbalance: created normalized features (killsNorm, damageDealtNorm, matchDurationNorm).
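
The exact formulas behind killsNorm and damageDealtNorm are not stated here. One possible approach, sketched below purely as an assumption, scales each raw count by how far the match is from a full 100-player lobby, so that kills in a smaller match count for more. The matchId column, the playersJoined helper field, and the scaling factor are all hypothetical.

```python
from collections import Counter


def add_normalized_features(rows):
    """Add size-normalized kills and damage to each player record.

    Assumed scaling (not confirmed by the project):
        scale = (100 - playersJoined) / 100 + 1
    A full 100-player match leaves values unchanged; smaller matches scale up.
    """
    # Number of player rows per match, used as the match size.
    players_joined = Counter(r["matchId"] for r in rows)
    out = []
    for r in rows:
        joined = players_joined[r["matchId"]]
        scale = (100 - joined) / 100 + 1
        new = dict(r)
        new["playersJoined"] = joined
        new["killsNorm"] = r["kills"] * scale
        new["damageDealtNorm"] = r["damageDealt"] * scale
        out.append(new)
    return out


# Two invented matches: one with 50 players, one with 100.
rows = [{"matchId": "m1", "kills": 2, "damageDealt": 100.0} for _ in range(50)]
rows += [{"matchId": "m2", "kills": 2, "damageDealt": 100.0} for _ in range(100)]
normalized = add_normalized_features(rows)
print(normalized[0]["killsNorm"], normalized[-1]["killsNorm"])
```

The same factor could be applied to matchDuration to obtain a matchDurationNorm column.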

Credit Score Classification

Goal: Classify customers into Good / Bad credit history to support lending decisions and reduce default risk.

Dataset

Source: Proprietary dataset from GoodCredit Bank via internship. Size: 23,896 rows × 92 columns. Target variable: Bad_label (0 = Good, 1 = Bad).

Tools & Techniques

Python, Pandas, Scikit-learn, Gradient Boosting. Data cleaning, feature selection, SMOTE for imbalance handling, model evaluation using confusion matrix, precision, recall, F1, and Gini.

Implementation

  • Feature selection and type conversion to address high dimensionality.
  • Applied SMOTE and robust cross-validation.
  • Tested Logistic Regression, Decision Tree, KNN, and Gradient Boosting; tuned hyperparameters for best performance.
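
The core idea of SMOTE can be sketched in a few lines: each synthetic minority sample is placed at a random point on the segment between a real minority sample and one of its nearest minority neighbours. The version below is a simplified plain-Python illustration with hypothetical parameter names; the project itself presumably relied on a library implementation, which is not shown here.

```python
import random


def smote_sketch(minority, n_synthetic, k=3, seed=0):
    """Generate synthetic minority samples by neighbour interpolation.

    minority: list of numeric feature vectors from the minority class
              (needs at least two samples).
    Simplified illustration of the SMOTE idea, not a library replacement.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        # Pick one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])
        # Interpolate a random fraction of the way from base to the neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Invented two-feature minority samples.
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]]
new_points = smote_sketch(minority, n_synthetic=5)
print(len(new_points))
```

Synthetic samples should be generated from the training folds only, so that no interpolated point leaks information from validation data.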

Key Findings

  • Gradient Boosting performed best, with accuracy near 93–95% and Gini = 1.0 on the evaluation dataset.
  • Top predictors: credit limit, current balance, and payment frequency.
  • EDA showed customers with low past due amounts and stable credit limits have better credit outcomes.
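
Assuming the Gini figure above is the coefficient commonly used in credit scoring, it relates to ROC AUC as Gini = 2 × AUC − 1, so a Gini of 1.0 corresponds to an AUC of 1.0, meaning every Bad customer is scored above every Good one. The sketch below computes both from labels and model scores in plain Python, using invented values.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def gini(labels, scores):
    """Gini coefficient derived from AUC: 2 * AUC - 1."""
    return 2.0 * roc_auc(labels, scores) - 1.0


# Invented labels (1 = Bad, 0 = Good) and predicted probabilities of Bad.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores), gini(labels, scores))
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a rank-based formula is preferable for large datasets.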

Challenges & Solutions

  • High dimensionality: used feature selection and removed low-variance fields.
  • Missing values: median/mode imputation and domain-driven replacements.
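
Median/mode imputation can be sketched as follows: numeric gaps are filled with the column median and categorical gaps with the most frequent value. The column names below are hypothetical, missing values are assumed to be represented as None, and the domain-driven replacements mentioned above are not covered.

```python
from statistics import median, mode


def impute(rows, numeric_cols, categorical_cols):
    """Fill None values with the column median (numeric) or mode (categorical)."""
    fills = {}
    for col in numeric_cols:
        fills[col] = median(r[col] for r in rows if r[col] is not None)
    for col in categorical_cols:
        fills[col] = mode(r[col] for r in rows if r[col] is not None)
    # Build new records, replacing None only in columns that have a fill value.
    return [
        {k: (fills[k] if v is None and k in fills else v) for k, v in r.items()}
        for r in rows
    ]


# Invented records with gaps; "credit_limit" and "status" are hypothetical columns.
records = [
    {"credit_limit": 1000, "status": "active"},
    {"credit_limit": None, "status": "active"},
    {"credit_limit": 3000, "status": None},
    {"credit_limit": 2000, "status": "closed"},
]
clean = impute(records, numeric_cols=["credit_limit"], categorical_cols=["status"])
print(clean[1]["credit_limit"], clean[2]["status"])
```

In a modelling pipeline the fill values would be computed on the training split and then reused on validation and test data.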

Hospital Stay Duration Prediction

Goal: Predict patient length of stay to help hospitals optimize bed management and resources.

Dataset

Source: Rubixe AI Solutions internship. Size: 318,438 rows × 18 columns. Mixed numerical and categorical healthcare data.

Tools & Techniques

Python, Pandas, Scikit-learn, Matplotlib, Seaborn. Preprocessing included label encoding for ordinal features, scaling, and resampling to address class imbalance.

Implementation

  • Exploratory Data Analysis to understand length-of-stay patterns by severity, admission type, and ward.
  • Tried Logistic Regression, Decision Tree, KNN, and ensembles; tuned with GridSearchCV and RandomizedSearchCV.

Key Findings

  • Emergency admissions and higher illness severity correlated with longer stays.
  • Department and ward type affect average stay duration significantly.
  • Ensemble model achieved ≈83% testing accuracy after tuning.

Challenges & Solutions

  • Class imbalance: applied resampling prior to training.
  • Overfitting: tuned tree-based hyperparameters to improve generalization.
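
The resampling method used for the class imbalance is not specified. As one possibility, the sketch below shows random oversampling in plain Python: rows from smaller classes are duplicated at random until every class matches the largest one. The "stay" label key and the sample rows are hypothetical.

```python
import random
from collections import Counter, defaultdict


def oversample(rows, label_key, seed=0):
    """Randomly duplicate rows of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for r in rows:
        by_class[r[label_key]].append(r)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the missing rows with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Invented, imbalanced length-of-stay classes.
stays = (
    [{"stay": "0-10", "severity": 1}] * 6
    + [{"stay": "11-20", "severity": 2}] * 3
    + [{"stay": "21-30", "severity": 3}] * 1
)
balanced = oversample(stays, label_key="stay")
print(Counter(r["stay"] for r in balanced))
```

Resampling belongs after the train/test split and on the training portion only; otherwise duplicated rows can appear on both sides and inflate test accuracy.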

Sales Performance Dashboard (Power BI)

Goal: Build an interactive dashboard to analyze revenue, customers, and product performance for business decision-making.

Dataset

Synthetic practice dataset with Sales_Data, Customer_Data, Products_Data, Regions_Table. Approximately 5,000 rows across tables.

Tools & Techniques

Power BI Desktop, Power Query, DAX for calculated measures, interactive visuals and slicers for ad-hoc analysis.

Implementation

  • Designed star-schema model and cleaned data in Power Query.
  • Implemented DAX measures for Profit, Revenue, and KPI tracking.
  • Built interactive visuals: bar charts, maps, KPI cards, and slicers for dynamic filtering.

Key Findings

  • Revenue was concentrated in the top customers and regions; the top products contributed the majority of sales.
  • The distributor channel produced the highest revenue share, ahead of export and wholesale.
  • Interactive slicers enabled fast ad-hoc answers for managers.

Challenges & Solutions

  • Missing profit column: created a Profit measure using DAX.
  • Formatting and layout: standardized visuals and alignment to improve consumption by business users.