# Machine Learning and ASER Pakistan Education DataSet

11 Sep 2016

I was looking to experiment with some of the machine learning techniques I had been learning over the past few months. I was particularly interested in whether I could glean any insights about the education system in Pakistan. Despite the efforts of the government, a large proportion of the nation is illiterate and the state of education delivery remains poor. I am of the opinion that a focused, data-driven approach would be a better way of resolving the crisis the country faces than spending tons of money on the distribution of laptops. Many problems plague the school system in the country, and the government needs to focus on the ones that will have the most impact if resolved.

Luckily, I found a very good, comprehensive dataset from ASER (Annual Status of Education Report). They have open-sourced their entire raw dataset, which I am now sifting through. Let’s get started.

I will be using the scikit-learn stack for most things here.

First we load the CSV data into memory. The structure of the CSV file is provided by ASER here.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import os
path = os.getcwd() + '/../../data/2HouseholdSurvey.csv'
data = pd.read_csv(path)
```

Let’s check if the data loaded correctly.

```
data.head()
```

| | ProvinceID | ProvinceName | DistrictID | DistrictName | VillageId | HouseholdId | HouseType | IsHouseOwned | IsElectricityConnectionAvailable | IsTVAvailable | ... | MotherAge | TotalChildren | TotalChildrenSeventeenAbove | MotherGoneSchool | MotherHighestClassCompletedClub | MotherHighestClassCompleted | FatherAge | FatherGoneSchool | FatherHighestClassCompleted | FatherHighestClassCompletedClub |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Punjab | 151 | Khanewal | 96 | 31 | 3 | -1.0 | -1.0 | -1.0 | ... | 29.0 | 3.0 | 0.0 | 0.0 | NaN | NaN | 31.0 | -1.0 | 8 | MIDDLE |
| 1 | 2 | Punjab | 151 | Khanewal | 96 | 31 | 3 | -1.0 | -1.0 | -1.0 | ... | 29.0 | 3.0 | 0.0 | 0.0 | NaN | NaN | 31.0 | -1.0 | 8 | MIDDLE |
| 2 | 2 | Punjab | 151 | Khanewal | 96 | 31 | 3 | -1.0 | -1.0 | -1.0 | ... | 29.0 | 3.0 | 0.0 | 0.0 | NaN | NaN | 31.0 | -1.0 | 8 | MIDDLE |
| 3 | 2 | Punjab | 151 | Khanewal | 106 | 30 | 2 | -1.0 | -1.0 | -1.0 | ... | 38.0 | 2.0 | NaN | 0.0 | NaN | NaN | 42.0 | 0.0 | NaN | NaN |
| 4 | 2 | Punjab | 151 | Khanewal | 96 | 32 | 3 | -1.0 | -1.0 | -1.0 | ... | 30.0 | 4.0 | NaN | 0.0 | NaN | NaN | 35.0 | -1.0 | 6 | MIDDLE |

5 rows × 52 columns

```
data.shape
```

```
(286570, 52)
```

Let’s check which dtypes pandas assigned to the columns. Since we already know that a lot of the data is categorical, it is a good idea to check whether pandas detected that when it loaded the data.

```
data.dtypes
```

```
ProvinceID int64
ProvinceName object
DistrictID int64
DistrictName object
VillageId int64
HouseholdId int64
HouseType int64
IsHouseOwned float64
IsElectricityConnectionAvailable float64
IsTVAvailable float64
IsMobileAvailable float64
IsSmartPhoneAvailable float64
Car float64
MotorCycle float64
ChildrenInMadrassah float64
ChildId int64
ChildAge int64
Gender int64
EducationStatus int64
SchoolDropoutClass object
SchoolDropoutReason float64
CurrentClassGrade object
InstituteType float64
IsChildGoSurveyedSchool float64
IsChildTakingPaidTution float64
TutionFee float64
ReadingHighestLevel float64
IsBLLBonusQ1 float64
IsBLLBonusQ2 float64
LanguageTested float64
MathHighestLevel float64
IsALBonusQ1 float64
IsALBonusQ3 float64
IsALBonusQ2 float64
EnglishReading float64
IsKnowsWords float64
IsKnowsSentence float64
EngReadPoem float64
EngQuestions float64
CanName float64
IsChildWasAvailable float64
ParentId int64
MotherAge float64
TotalChildren float64
TotalChildrenSeventeenAbove float64
MotherGoneSchool float64
MotherHighestClassCompletedClub object
MotherHighestClassCompleted object
FatherAge float64
FatherGoneSchool float64
FatherHighestClassCompleted object
FatherHighestClassCompletedClub object
dtype: object
```

We can see that none of the categorical data is actually loaded as it should be. Let’s fix that. We know that all predictors beginning with ‘Is’ are categorical so we can write a helper function to change the dtype of all such columns to ‘category’.

```
# Columns whose names start with 'Is' hold yes/no answers, so treat them as categorical.
b_names = [col for col in data.columns if col.startswith('Is')]
for col in b_names:
    data[col] = data[col].astype('category')
```
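The same conversion can also be wrapped in a small reusable helper. The following is only a sketch: the function name `to_category` and the toy frame are made up for illustration and are not part of the ASER data or of pandas.

```python
import pandas as pd

def to_category(df, columns):
    """Return a copy of df with the given columns cast to the 'category' dtype."""
    return df.astype({col: "category" for col in columns})

# Toy frame standing in for the survey data (values are invented for illustration).
toy = pd.DataFrame({
    "IsTVAvailable": [-1.0, 0.0, -1.0],
    "IsHouseOwned": [0.0, -1.0, -1.0],
    "ChildAge": [7, 9, 12],
})

# Pick the columns by name prefix, then convert them in one call.
is_columns = [c for c in toy.columns if c.startswith("Is")]
toy = to_category(toy, is_columns)
print(toy.dtypes)
```

Passing a dict to `astype` converts all selected columns at once and leaves the remaining columns untouched.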

We still need to convert the other categorical predictors to the category dtype. This involves a little bit of manual work. I identified the remaining categorical predictors from the data structure that ASER documents in their file here.

```
categoryList = ['HouseType','Gender','EducationStatus','SchoolDropoutClass','SchoolDropoutReason', 'CurrentClassGrade', 'InstituteType', 'ReadingHighestLevel', 'LanguageTested','MathHighestLevel', 'EnglishReading','EngReadPoem', 'EngQuestions', 'CanName', 'MotherGoneSchool', 'MotherHighestClassCompletedClub', 'MotherHighestClassCompleted','FatherGoneSchool', 'FatherHighestClassCompleted', 'FatherHighestClassCompletedClub' ]
for col in categoryList:
    data[col] = data[col].astype('category')
```

Let’s check if the dtypes are now correctly set.

```
data.dtypes
```

```
ProvinceID int64
ProvinceName object
DistrictID int64
DistrictName object
VillageId int64
HouseholdId int64
HouseType category
IsHouseOwned category
IsElectricityConnectionAvailable category
IsTVAvailable category
IsMobileAvailable category
IsSmartPhoneAvailable category
Car float64
MotorCycle float64
ChildrenInMadrassah float64
ChildId int64
ChildAge int64
Gender category
EducationStatus category
SchoolDropoutClass category
SchoolDropoutReason category
CurrentClassGrade category
InstituteType category
IsChildGoSurveyedSchool category
IsChildTakingPaidTution category
TutionFee float64
ReadingHighestLevel category
IsBLLBonusQ1 category
IsBLLBonusQ2 category
LanguageTested category
MathHighestLevel category
IsALBonusQ1 category
IsALBonusQ3 category
IsALBonusQ2 category
EnglishReading category
IsKnowsWords category
IsKnowsSentence category
EngReadPoem category
EngQuestions category
CanName category
IsChildWasAvailable category
ParentId int64
MotherAge float64
TotalChildren float64
TotalChildrenSeventeenAbove float64
MotherGoneSchool category
MotherHighestClassCompletedClub category
MotherHighestClassCompleted category
FatherAge float64
FatherGoneSchool category
FatherHighestClassCompleted category
FatherHighestClassCompletedClub category
dtype: object
```

Now that our dataset is ready for manipulation, let’s look at what insights we can gain from the dataset. We will start from some small things.

```
pd.crosstab(index=data["ReadingHighestLevel"], columns="count")
```

| ReadingHighestLevel | count |
|---|---|
| 1.0 | 36243 |
| 2.0 | 31866 |
| 3.0 | 37892 |
| 4.0 | 30496 |
| 5.0 | 64247 |

```
def plotCategorical(dataframe, xl, yl):
    ax = dataframe.value_counts().sort_index().plot(kind="bar")
    ax.set(xlabel=xl, ylabel=yl)
```

```
plotCategorical(data["ReadingHighestLevel"], "Reading Level - 5 is highest", "Students")
```

```
plotCategorical(data["SchoolDropoutClass"], "Class Dropped Out In (some invalid data which are floats)", "Students")
```

The charts above already show a few things at a glance. ReadingHighestLevel 5 is for students who can read a whole story, and the first chart shows that more students cannot read a whole story than can. The second chart shows that students most often drop out in 5th grade, by a significant margin.
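To put a number on the reading result, the counts from the ReadingHighestLevel crosstab can be turned into shares. This is a minimal sketch that retypes the counts by hand (the variable names are my own). The shares are relative to children with a recorded reading level, since the five counts add up to fewer rows than the full dataset.

```python
import pandas as pd

# Counts copied from the ReadingHighestLevel crosstab above.
reading_counts = pd.Series(
    {1.0: 36243, 2.0: 31866, 3.0: 37892, 4.0: 30496, 5.0: 64247},
    name="count",
)

# Share of each level among children with a recorded reading level.
shares = reading_counts / reading_counts.sum()
story_share = shares.loc[5.0]
below_story_share = 1 - story_share

print(shares.round(3))
print(f"can read a story: {story_share:.1%}, cannot: {below_story_share:.1%}")
```

On the full dataframe, `data["ReadingHighestLevel"].value_counts(normalize=True)` should give the same shares directly.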

Let’s now look at something else.

```
pd.crosstab(index=[data['Gender'],data['EducationStatus']], columns='count')
```

| Gender | EducationStatus | count |
|---|---|---|
| -1 | 1 | 37299 |
| -1 | 2 | 5619 |
| -1 | 3 | 75580 |
| 0 | 1 | 32161 |
| 0 | 2 | 6775 |
| 0 | 3 | 129136 |

Gender:

- Female -> -1
- Male -> 0

Here we can see that a lot more boys than girls are enrolled (3). On the other hand, the numbers of boys and girls who dropped out (2) or never enrolled (1) are close. This suggests that the gender data itself is skewed. Let’s take a look:

```
pd.crosstab(index=data['Gender'], columns='count')
```

| Gender | count |
|---|---|
| -1 | 118498 |
| 0 | 168072 |

There are nearly 50,000 more boys in the dataset than there are girls. This is probably due to cultural sensitivities where households do not report girls to surveys. This needs further investigation, and I don’t see any way I can corroborate this using data within this dataset.
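Because the raw counts are skewed towards boys, it can help to compare shares within each gender rather than absolute numbers. The sketch below retypes the counts from the Gender and EducationStatus crosstab by hand (the variable names are my own) and divides each row by its total.

```python
import pandas as pd

# Counts copied from the Gender x EducationStatus crosstab above.
# Rows: Gender (-1 = female, 0 = male); columns: EducationStatus.
counts = pd.DataFrame(
    {1: [37299, 32161], 2: [5619, 6775], 3: [75580, 129136]},
    index=pd.Index([-1, 0], name="Gender"),
)
counts.columns.name = "EducationStatus"

# Row totals reproduce the per-gender counts shown above.
gender_totals = counts.sum(axis=1)

# Divide each row by its total to get shares within each gender.
row_shares = counts.div(gender_totals, axis=0)

print(gender_totals)
print(row_shares.round(3))
```

Read this way, roughly 64% of girls in the dataset are enrolled compared with roughly 77% of boys, and about 31% of girls never enrolled compared with about 19% of boys. On the full dataframe, `pd.crosstab(data['Gender'], data['EducationStatus'], normalize='index')` should produce the same table.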

Let’s see whether we can build a model relating gender to education status. A logistic regression would be suitable in this case.

```
from sklearn.linear_model import LogisticRegression as LR
model = LR()
X = pd.get_dummies(data[['EducationStatus']]) #create dummy variables for X
y = data[['Gender']].replace(0,1) # recode boys from 0 to 1; girls stay -1
y = np.ravel(y)
model.fit(X,y)
```

```
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
```

```
pd.DataFrame(list(zip(X.columns, np.transpose(np.around(model.coef_, 3)))))
```

| | 0 | 1 |
|---|---|---|
| 0 | EducationStatus_1 | [-0.292] |
| 1 | EducationStatus_2 | [0.043] |
| 2 | EducationStatus_3 | [0.392] |

```
model.score(X,y)
```

```
0.60442474788009914
```

This gives us about 60% accuracy on the training data. The coefficients show that being enrolled (EducationStatus_3) pushes the prediction towards a boy (a positive coefficient), while never having enrolled (EducationStatus_1) pushes it towards a girl (a negative coefficient).
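With a single categorical predictor, any classifier can only make one prediction per EducationStatus value, so the best it can do is predict the majority gender within each status. The sketch below checks this against the crosstab counts from earlier in the post (retyped by hand; the variable names are my own) without fitting a model.

```python
import pandas as pd

# Counts copied from the Gender x EducationStatus crosstab earlier in the post.
# Rows: EducationStatus; columns: girl (-1) and boy (recoded from 0 to 1).
counts = pd.DataFrame(
    {"girl": [37299, 5619, 75580], "boy": [32161, 6775, 129136]},
    index=pd.Index([1, 2, 3], name="EducationStatus"),
)

# Empirical share of boys within each education status.
boy_share = counts["boy"] / counts.sum(axis=1)

# Rows that a per-status majority vote classifies correctly.
majority_correct = counts.max(axis=1).sum()
accuracy = majority_correct / counts.values.sum()

print(boy_share.round(3))
print(round(accuracy, 4))
```

The majority is girls for status 1 and boys for statuses 2 and 3, which matches the signs of the coefficients, and the resulting accuracy of about 0.6044 agrees with the `model.score` value reported above.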

Scoring the model on the same data it was trained on is not a robust test. Let’s do another regression, with a validation set this time.

```
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn import metrics
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, train_size=0.6, random_state=0)
```

```
model_validated = LR()
model_validated.fit(X_train, y_train)
```

```
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
```

```
predicted = model_validated.predict(X_test)
metrics.accuracy_score(y_test, predicted)
```

```
0.6039536587919182
```

With the validation set, the model still predicts with about 60% accuracy whether a student with a given EducationStatus is a boy or a girl. This is better than a coin flip (50%), but it is not a good result: guessing “boy” for every student would already give about 59% accuracy, since 168,072 of the 286,570 children in the dataset are boys.
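The always-guess-boy baseline can be made explicit with scikit-learn’s `DummyClassifier`. This is a sketch under a simplifying assumption: the labels are rebuilt from the overall gender counts shown earlier and scored on the whole dataset rather than on the 40% test split, so the number is only approximately comparable to the validation accuracy.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Gender counts copied from the crosstab earlier in the post;
# girls are -1 and boys are recoded to 1, as in the regression above.
n_girls, n_boys = 118498, 168072
y_all = np.concatenate([np.full(n_girls, -1), np.full(n_boys, 1)])

# The baseline ignores its features, so a column of zeros is enough.
X_dummy = np.zeros((y_all.size, 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_all)
baseline_accuracy = baseline.score(X_dummy, y_all)

print(round(baseline_accuracy, 4))
```

The logistic regression beats this baseline by less than two percentage points, which suggests EducationStatus alone carries little information about gender.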

```
pd.DataFrame(list(zip(X.columns, np.transpose(np.around(model_validated.coef_, 3)))))
```

| | 0 | 1 |
|---|---|---|
| 0 | EducationStatus_1 | [-0.299] |
| 1 | EducationStatus_2 | [0.066] |
| 2 | EducationStatus_3 | [0.385] |

-Taha