This project explores the "U.S. Medical Insurance dataset" to understand which factors influence individual insurance pricing. The analysis includes data exploration, visualization, and a multiple linear regression model using `statsmodels`.
The dataset contains 1,338 observations with the following variables:
- `age`: Age of the individual
- `sex`: Male / Female
- `bmi`: Body mass index
- `children`: Number of children/dependents
- `smoker`: Yes / No
- `region`: Residential region in the U.S.
- `charges`: Medical costs billed

[Figure: example visualization]
I built a multiple linear regression model with `charges` as the response and all other variables as predictors (categorical variables converted to dummy variables).
Key results:
- `age`: Older individuals tend to have higher charges.
- `bmi`: Higher BMI is linked to higher charges.
- `children`: Having more children slightly increases charges.
- `smoker`: Smoking has the largest effect, increasing charges by about 23,850 USD on average, holding the other variables constant.

Libraries used: `pandas`, `os`, `matplotlib`, `seaborn`, `statsmodels`.