Feature Engineering Notes 1/17/2023

There are times when we need to create a few additional features from categorical data. Below is an example of a dataframe that I was working with:

id	gender	age	ever_married	work_type	Residence_type	avg_glucose_level	bmi	smoking_status	stroke
13244	Female	75	Yes	Private	Urban	100.29	30.5	never smoked	0
10109	Male	80	Yes	Self-employed	Rural	76.12	20.3	Unknown	1
3111	Female	14	No	Private	Urban	72.18	31.5	Unknown	0
11751	Female	35	Yes	Self-employed	Urban	89.88	28.8	never smoked	0
8041	Female	49	Yes	Private	Urban	102.97	25.5	smokes	0

As we can see, the columns gender, ever_married, work_type, Residence_type and smoking_status are categorical.

Value greater than threshold Let’s say we want to create a feature that takes the bmi values (continuous feature). This can be done as below:

df['Obese'] = np.where(df['bmi']>25.0, 1, 0)

It was quick using the np.where function and passing it a condition to check. If the condition is true, we assign it 1 and if it is not met, we assign it 0.
Multiple conditions Let’s say we want to create a feature for male or female with hypertension as another feature. This can be done as below:

df['MaleXhypertension']= np.where(((df['gender']=='Male') & (df['hypertension']==1)),1,0)

df['FemaleXhypertension']= np.where(((df['gender']=='Female') & (df['hypertension']==1)),1,0)

In this case, it is checking for the condition that the gender is ‘Male’ and also that the indicator is 1 for hypertension. Note the use of & to combine the two conditions.

Mixing greater than threshold and indicator flag This is another extension of the first two. We can check for values exceeding threshold and also check a flag.

df['OldXObese'] = np.where(((df['age']>45) & (df['Obese']==1)),1,0)
Creating a continuous feature from existing continuous features This one is very easy with pandas:

df['AgeRatioBMI'] = df['age'] / df['bmi']

df['AgeXBMI'] = df['age'] * df['bmi']

All the above code can be combined into a simple function as below:

def make_additional_features(df):
    df['Obese'] = np.where(df['bmi']>25.0, 1, 0)
    df['MaleXhypertension']= np.where(((df['gender']=='Male') & (df['hypertension']==1)),1,0)
    df['FemaleXhypertension']= np.where(((df['gender']=='Female') & (df['hypertension']==1)),1,0)
    df['OldXObese'] = np.where(((df['age']>45) & (df['Obese']==1)),1,0)
    df['AgeRatioBMI'] = df['age'] / df['bmi']
    df['AgeXBMI'] = df['age'] * df['bmi']

    return df


df = make_additional_features(df)

Checking the results that we get for some of the samples, we can see

id	gender	age	hypertension	ever_married	work_type	Residence_type	avg_glucose_level	bmi	smoking_status	stroke	Obese	FemaleXhypertension	OldXObese	AgeRatioBMI	AgeXBMI
2858	Female	79	0	Yes	Self-employed	Urban	70.58	25.6	Unknown	0	1	0	1	3.08594	2022.4
7338	Female	82	0	Yes	Self-employed	Urban	80.43	30.3	smokes	0	1	0	1	2.70627	2484.6
9404	Female	19	0	No	Private	Rural	110.72	25.4	smokes	0	1	0	0	0.748031	482.6
13868	Female	26	0	No	Private	Urban	112.54	33.1	Unknown	0	1	0	0	0.785498	860.6
2139	Male	67	0	Yes	Self-employed	Urban	69.61	27.3	formerly smoked	0	1	0	1	2.45421	1829.1
5439	Female	81	0	Yes	Private	Rural	78.16	29.6	formerly smoked	1	1	0	1	2.73649	2397.6
5323	Female	49	0	Yes	Private	Rural	85.33	25.5	smokes	0	1	0	1	1.92157	1249.5
10993	Male	8	0	No	children	Urban	89.44	18.4	Unknown	0	0	0	0	0.434783	147.2
14404	Female	59	0	No	Private	Urban	96.26	45.7	never smoked	1	1	0	1	1.29103	2696.3
884	Female	24	0	No	Private	Rural	70.01	27.9	never smoked	0	1	0	0	0.860215	669.6
3740	Female	63	1	Yes	Self-employed	Rural	95.06	34.3	never smoked	0	1	1	1	1.83673	2160.9

These types of features can help with improving our machine learning models.

#featureengineering #machinelearning

Gopakumar Gopinathan