Projects
PUBG Game Winner Prediction
Goal: Predict a player's win placement percentage (winPlacePerc) from in-game performance metrics and identify the factors that most influence the chance of winning.
Dataset
Source: Proprietary dataset of real-world gameplay statistics from Rubixe AI Solutions. Size: 4,436,306 rows × 33 columns. Target variable: winPlacePerc.
Tools & Techniques
Python, Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, CatBoost, XGBoost. Preprocessing, feature engineering (walkDistance, headshot_rate, totalDistance, healsnboosts), model training and comparison.
Implementation
- Exploratory analysis and sampling strategy to handle scale.
- Feature engineering to normalize metrics across match types.
- Trained and compared CatBoostRegressor, Random Forest, and XGBoost regressors; tuned hyperparameters with cross-validation.
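The engineered features named above could be derived roughly as follows. This is a minimal sketch, not the project's actual code: the raw column names (kills, headshotKills, walkDistance, rideDistance, swimDistance, heals, boosts) and the three formulas are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with three derived columns (assumed definitions)."""
    out = df.copy()
    # Share of kills that were headshots; players with zero kills get 0.0.
    kills = out["kills"].replace(0, np.nan)
    out["headshot_rate"] = (out["headshotKills"] / kills).fillna(0.0)
    # Distance covered on foot, in vehicles, and swimming combined.
    out["totalDistance"] = (
        out["walkDistance"] + out["rideDistance"] + out["swimDistance"]
    )
    # Healing items and boost items used, as a single count.
    out["healsnboosts"] = out["heals"] + out["boosts"]
    return out


# Tiny illustrative frame; the values are made up.
sample = pd.DataFrame(
    {
        "kills": [4, 0],
        "headshotKills": [1, 0],
        "walkDistance": [100.0, 50.0],
        "rideDistance": [200.0, 0.0],
        "swimDistance": [0.0, 10.0],
        "heals": [2, 0],
        "boosts": [3, 1],
    }
)
features = add_engineered_features(sample)
```

Replacing zero kills with NaN before dividing avoids infinite rates for players who never scored a kill.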
Key Findings
- walkDistance and damageDealt both show a strong positive correlation with winPlacePerc.
- Kills and headshot kills are high-impact features. Excessive healing correlates with defensive play.
- Best model: CatBoostRegressor, R² = 0.93 (RMSE ≈ 0.08).
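For reference, the two metrics quoted for the best model can be computed from predictions as sketched below. The numbers in the example are made up for illustration and are not project results.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative placement percentages in [0, 1].
actual = [0.0, 0.25, 0.5, 0.75, 1.0]
predicted = [0.1, 0.2, 0.5, 0.8, 0.9]

example_rmse = rmse(actual, predicted)    # about 0.071
example_r2 = r2_score(actual, predicted)  # about 0.96
```

Because winPlacePerc lies between 0 and 1, an RMSE of about 0.08 means predictions are typically off by roughly eight percentage points of placement.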
Challenges & Solutions
- Scale: used representative sampling and optimized Pandas operations to iterate fast.
- Match-type imbalance: created normalized features (killsNorm, damageDealtNorm, matchDurationNorm).
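The write-up does not give the formulas behind killsNorm, damageDealtNorm, and matchDurationNorm. One scheme often seen in public PUBG notebooks scales each metric by how full the match was; the sketch below assumes that scheme, plus a matchId column and a helper column playersJoined, none of which are confirmed by this project.

```python
import pandas as pd


def add_normalized_features(df: pd.DataFrame) -> pd.DataFrame:
    """Scale selected metrics by match fullness (assumed scheme)."""
    out = df.copy()
    # Number of player rows per match.
    out["playersJoined"] = out.groupby("matchId")["matchId"].transform("count")
    # Equals 1.0 for a full 100-player match and grows as the lobby shrinks,
    # so a kill in a small lobby is weighted slightly higher.
    scale = (100 - out["playersJoined"]) / 100 + 1
    for col in ("kills", "damageDealt", "matchDuration"):
        out[f"{col}Norm"] = out[col] * scale
    return out


# Made-up rows: match "a" has 2 players, match "b" has 4.
matches = pd.DataFrame(
    {
        "matchId": ["a", "a", "b", "b", "b", "b"],
        "kills": [2, 0, 1, 3, 0, 5],
        "damageDealt": [150.0, 0.0, 80.0, 310.0, 20.0, 500.0],
        "matchDuration": [1300, 1300, 1800, 1800, 1800, 1800],
    }
)
normalized = add_normalized_features(matches)
```

A per-match-type z-score would be another reasonable reading of "normalized across match types"; the scaling above is only one option.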
Repository: github.com/Afnitha701/PUBG_Winner_Prediction
Credit Score Classification
Goal: Classify customers as having a Good or Bad credit history to support lending decisions and reduce default risk.
Dataset
Source: Proprietary dataset from GoodCredit Bank via internship. Size: 23,896 rows × 92 columns. Target variable: Bad_label (0 = Good, 1 = Bad).
Tools & Techniques
Python, Pandas, Scikit-learn, Gradient Boosting. Data cleaning, feature selection, SMOTE for imbalance handling, model evaluation using confusion matrix, precision, recall, F1, and Gini.
Implementation
- Feature selection and type conversion to address high dimensionality.
- Applied SMOTE to balance the classes and evaluated models with cross-validation.
- Tested Logistic Regression, Decision Tree, KNN, and Gradient Boosting; tuned hyperparameters for best performance.
Key Findings
- Gradient Boosting performed best, with accuracy of roughly 93–95% and a Gini coefficient of 1.0 on the evaluation dataset.
- Top predictors: credit limit, current balance, and payment frequency.
- EDA showed that customers with low past-due amounts and stable credit limits tend to have better credit outcomes.
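The Gini coefficient used in credit scoring is conventionally defined as 2 × AUC − 1, so a value of 1.0 means every Bad-labelled customer was scored above every Good-labelled one. The sketch below shows that calculation on made-up scores; it is not the project's evaluation code, and the pairwise AUC is written for readability rather than speed.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def gini(y_true, scores):
    """Gini coefficient derived from AUC."""
    return 2.0 * roc_auc(y_true, scores) - 1.0


# Illustrative labels (1 = Bad) and model scores.
labels = [0, 0, 1, 1]
model_scores = [0.1, 0.4, 0.35, 0.8]

example_auc = roc_auc(labels, model_scores)  # 0.75
example_gini = gini(labels, model_scores)    # 0.5
perfect_gini = gini(labels, [0.1, 0.2, 0.8, 0.9])  # 1.0
```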
Challenges & Solutions
- High dimensionality: used feature selection and removed low-variance fields.
- Missing values: median/mode imputation and domain-driven replacements.
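A minimal sketch of the two cleaning steps above, assuming a Pandas DataFrame. The column names and the variance threshold are hypothetical, and the domain-driven replacements mentioned above are not reproduced here.

```python
import pandas as pd


def impute_median_mode(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the median and other gaps with the mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode(dropna=True)
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])
    return out


def drop_low_variance(df: pd.DataFrame, threshold: float = 0.0) -> pd.DataFrame:
    """Drop numeric columns whose variance is at or below the threshold."""
    variances = df.select_dtypes("number").var()
    to_drop = variances[variances <= threshold].index
    return df.drop(columns=to_drop)


# Hypothetical columns for illustration only.
raw = pd.DataFrame(
    {
        "credit_limit": [1000.0, None, 3000.0],
        "segment": ["A", None, "A"],
        "constant_flag": [1, 1, 1],
    }
)
cleaned = drop_low_variance(impute_median_mode(raw))
```

In a modeling pipeline the medians and modes would normally be learned from the training split only and then applied to the test split, so that no test information leaks into training.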
Repository: github.com/Afnitha701/Credit_Card_Fraud_Detection
Hospital Stay Duration Prediction
Goal: Predict patient length of stay to help hospitals optimize bed management and resources.
Dataset
Source: Mixed numerical and categorical healthcare data from the Rubixe AI Solutions internship. Size: 318,438 rows × 18 columns.
Tools & Techniques
Python, Pandas, Scikit-learn, Matplotlib, Seaborn. Preprocessing included label encoding for ordinal features, scaling, and resampling to address class imbalance.
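Label encoding an ordinal feature only preserves its meaning if the integer codes follow the real order of the levels. The sketch below shows one way to do that in Pandas with an explicit ordering; the severity levels used are hypothetical, not taken from the project data.

```python
import pandas as pd

# Hypothetical ordinal levels, listed from lowest to highest.
SEVERITY_ORDER = ["Minor", "Moderate", "Extreme"]


def encode_ordinal(series: pd.Series, ordered_levels: list) -> pd.Series:
    """Map each level to its position in ordered_levels (unknown -> -1)."""
    cat = pd.Categorical(series, categories=ordered_levels, ordered=True)
    return pd.Series(cat.codes, index=series.index, name=series.name)


severity = pd.Series(
    ["Moderate", "Minor", "Extreme", "Unknown"], name="severity"
)
severity_codes = encode_ordinal(severity, SEVERITY_ORDER)
# Codes follow the declared order: Minor=0, Moderate=1, Extreme=2.
```

Scikit-learn's LabelEncoder assigns codes in sorted order of the labels, which here would put Extreme before Minor, so an explicit ordering is the safer choice for ordinal columns.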
Implementation
- Exploratory Data Analysis to understand length-of-stay patterns by severity, admission type, and ward.
- Tried Logistic Regression, Decision Tree, KNN, and ensembles; tuned with GridSearchCV and RandomizedSearchCV.
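The tuning step can be illustrated with Scikit-learn's two search classes. This is a sketch on synthetic data with a single decision tree and an arbitrary small grid; the project's actual models, grids, and scoring are not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the hospital dataset.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Illustrative search space; the values are assumptions.
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 20]}

# Exhaustive search: fits every combination with 3-fold cross-validation.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
grid.fit(X, y)

# Randomized search: samples only n_iter combinations from the same space.
random_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_grid,
    n_iter=4,
    cv=3,
    scoring="accuracy",
    random_state=0,
)
random_search.fit(X, y)

best_params = grid.best_params_
best_cv_accuracy = grid.best_score_
```

Grid search is thorough but grows multiplicatively with each added parameter, while randomized search caps the number of fits, which matters on a dataset with several hundred thousand rows.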
Key Findings
- Emergency admissions and higher illness severity correlated with longer stays.
- Department and ward type affect average stay duration significantly.
- The tuned ensemble model achieved ≈83% accuracy on the test set.
Challenges & Solutions
- Class imbalance: applied resampling prior to training.
- Overfitting: tuned tree-based hyperparameters to improve generalization.
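The resampling method is not specified above. As one possibility, the sketch below randomly oversamples every minority class up to the size of the largest class; the column names and stay bands are hypothetical.

```python
import pandas as pd


def oversample_to_majority(
    train: pd.DataFrame, target: str, seed: int = 0
) -> pd.DataFrame:
    """Duplicate minority-class rows at random until all classes match."""
    counts = train[target].value_counts()
    n_max = counts.max()
    parts = []
    for label, n in counts.items():
        group = train[train[target] == label]
        if n < n_max:
            # Sample with replacement to fill the gap to the majority size.
            extra = group.sample(n=n_max - n, replace=True, random_state=seed)
            group = pd.concat([group, extra])
        parts.append(group)
    # Shuffle so the classes are not stored in contiguous blocks.
    balanced = pd.concat(parts).sample(frac=1.0, random_state=seed)
    return balanced.reset_index(drop=True)


# Made-up training rows with an imbalanced target.
train_rows = pd.DataFrame(
    {
        "age_band": [3, 4, 5, 2, 6, 7, 4, 5],
        "stay_band": ["0-10"] * 5 + ["11-20"] * 2 + ["21-30"],
    }
)
balanced_rows = oversample_to_majority(train_rows, "stay_band")
```

Resampling is best applied to the training split only, after the train/test split, so that duplicated rows cannot appear on both sides and inflate test accuracy.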
Repository: github.com/Afnitha701/Hospital_Stay_Duration_Prediction
Sales Performance Dashboard (Power BI)
Goal: Build an interactive dashboard to analyze revenue, customers, and product performance for business decision-making.
Dataset
Synthetic practice dataset with four tables (Sales_Data, Customer_Data, Products_Data, Regions_Table), approximately 5,000 rows in total.
Tools & Techniques
Power BI Desktop, Power Query, DAX for calculated measures, interactive visuals and slicers for ad-hoc analysis.
Implementation
- Designed a star-schema model and cleaned data in Power Query.
- Implemented DAX measures for Profit, Revenue, and KPI tracking.
- Built interactive visuals: bar charts, maps, KPI cards, and slicers for dynamic filtering.
Key Findings
- Revenue was concentrated in the top customers and regions; the top products contributed the majority of sales.
- The distributor channel produced the highest revenue share, ahead of export and wholesale.
- Interactive slicers enabled fast ad-hoc answers for managers.
Challenges & Solutions
- Missing profit column: created Profit measure using DAX.
- Formatting and layout: standardized visuals and alignment to make the report easier for business users to read.
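The Profit measure itself was written in DAX and is not reproduced here. As a rough Pandas analogue of the same idea, the sketch below joins a fact table to a product dimension and derives profit as revenue minus cost; the column names, values, and the profit formula are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical fact table (one row per sale).
sales = pd.DataFrame(
    {
        "ProductID": [1, 2, 1],
        "Quantity": [10, 5, 2],
        "UnitPrice": [20.0, 50.0, 20.0],
    }
)

# Hypothetical product dimension (one row per product).
products = pd.DataFrame(
    {
        "ProductID": [1, 2],
        "ProductName": ["Widget", "Gadget"],
        "UnitCost": [12.0, 30.0],
    }
)

# Star-schema style lookup: many sales rows to one product row.
fact = sales.merge(products, on="ProductID", how="left", validate="many_to_one")

# Derived columns, assuming profit = revenue - quantity * unit cost.
fact["Revenue"] = fact["Quantity"] * fact["UnitPrice"]
fact["Profit"] = fact["Revenue"] - fact["Quantity"] * fact["UnitCost"]

profit_by_product = fact.groupby("ProductName")["Profit"].sum()
```

In Power BI the equivalent logic lives in a measure, so it is re-evaluated for whatever slicer selection is active instead of being stored as a column.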
Repository: github.com/Afnitha701/Power-BI-Sales-Performance-Dashboard