Machine Learning and ASER Pakistan Education DataSet


I was looking to experiment with some of the machine learning techniques I had been learning over the past few months. I was particularly interested in whether I could glean any insights about the education system in Pakistan. Despite the efforts of the government, a large proportion of the nation is illiterate and the state of education delivery remains poor. I am of the opinion that a focused, data-driven approach would be a better way of resolving the crisis the country faces than spending tons of money on the distribution of laptops. Many problems plague the school system in the country, and the government needs to focus on the ones whose resolution would have the most impact.

Luckily, I found a very comprehensive dataset from ASER, the Annual Status of Education Report. They have published their entire raw dataset openly, and I am now sifting through it. Let’s get started.

I will be using the scikit-learn stack for most things here.

First we load the CSV data into memory. The structure of the CSV file is provided by ASER here.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


import os
path = os.getcwd() + '/../../data/2HouseholdSurvey.csv'
data = pd.read_csv(path)

Let’s check if the data loaded correctly.

data.head()
ProvinceID ProvinceName DistrictID DistrictName VillageId HouseholdId HouseType IsHouseOwned IsElectricityConnectionAvailable IsTVAvailable ... MotherAge TotalChildren TotalChildrenSeventeenAbove MotherGoneSchool MotherHighestClassCompletedClub MotherHighestClassCompleted FatherAge FatherGoneSchool FatherHighestClassCompleted FatherHighestClassCompletedClub
0 2 Punjab 151 Khanewal 96 31 3 -1.0 -1.0 -1.0 ... 29.0 3.0 0.0 0.0 NaN NaN 31.0 -1.0 8 MIDDLE
1 2 Punjab 151 Khanewal 96 31 3 -1.0 -1.0 -1.0 ... 29.0 3.0 0.0 0.0 NaN NaN 31.0 -1.0 8 MIDDLE
2 2 Punjab 151 Khanewal 96 31 3 -1.0 -1.0 -1.0 ... 29.0 3.0 0.0 0.0 NaN NaN 31.0 -1.0 8 MIDDLE
3 2 Punjab 151 Khanewal 106 30 2 -1.0 -1.0 -1.0 ... 38.0 2.0 NaN 0.0 NaN NaN 42.0 0.0 NaN NaN
4 2 Punjab 151 Khanewal 96 32 3 -1.0 -1.0 -1.0 ... 30.0 4.0 NaN 0.0 NaN NaN 35.0 -1.0 6 MIDDLE

5 rows × 52 columns

data.shape
(286570, 52)

Let’s check which dtypes pandas assigned when it loaded the data. Since we already know that much of the data is categorical, it is worth checking whether pandas detected that on load.

data.dtypes
ProvinceID                            int64
ProvinceName                         object
DistrictID                            int64
DistrictName                         object
VillageId                             int64
HouseholdId                           int64
HouseType                             int64
IsHouseOwned                        float64
IsElectricityConnectionAvailable    float64
IsTVAvailable                       float64
IsMobileAvailable                   float64
IsSmartPhoneAvailable               float64
Car                                 float64
MotorCycle                          float64
ChildrenInMadrassah                 float64
ChildId                               int64
ChildAge                              int64
Gender                                int64
EducationStatus                       int64
SchoolDropoutClass                   object
SchoolDropoutReason                 float64
CurrentClassGrade                    object
InstituteType                       float64
IsChildGoSurveyedSchool             float64
IsChildTakingPaidTution             float64
TutionFee                           float64
ReadingHighestLevel                 float64
IsBLLBonusQ1                        float64
IsBLLBonusQ2                        float64
LanguageTested                      float64
MathHighestLevel                    float64
IsALBonusQ1                         float64
IsALBonusQ3                         float64
IsALBonusQ2                         float64
EnglishReading                      float64
IsKnowsWords                        float64
IsKnowsSentence                     float64
EngReadPoem                         float64
EngQuestions                        float64
CanName                             float64
IsChildWasAvailable                 float64
ParentId                              int64
MotherAge                           float64
TotalChildren                       float64
TotalChildrenSeventeenAbove         float64
MotherGoneSchool                    float64
MotherHighestClassCompletedClub      object
MotherHighestClassCompleted          object
FatherAge                           float64
FatherGoneSchool                    float64
FatherHighestClassCompleted          object
FatherHighestClassCompletedClub      object
dtype: object

We can see that none of the categorical data was loaded as categorical. Let’s fix that. All predictors beginning with ‘Is’ are categorical, so we can filter the column names on that prefix and change the dtype of every matching column to ‘category’.

b_names = filter(lambda s: s.startswith('Is'), data.columns.values)
for col in b_names:
    data[col] = data[col].astype('category')

We still need to convert the other categorical predictors. This involves a little bit of manual work. I went through all the categorical predictors in the dataset using the data structure file that ASER provides here.

categoryList = ['HouseType','Gender','EducationStatus','SchoolDropoutClass','SchoolDropoutReason', 'CurrentClassGrade', 'InstituteType', 'ReadingHighestLevel', 'LanguageTested','MathHighestLevel', 'EnglishReading','EngReadPoem', 'EngQuestions', 'CanName', 'MotherGoneSchool', 'MotherHighestClassCompletedClub', 'MotherHighestClassCompleted','FatherGoneSchool', 'FatherHighestClassCompleted', 'FatherHighestClassCompletedClub' ]
for col in categoryList:
    data[col] = data[col].astype('category')
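
The two conversion steps above could be folded into one small helper. This is only a sketch: `to_category` is a hypothetical name, and the toy frame stands in for the real survey data, reusing three of its column names with made-up values.

```python
import pandas as pd

def to_category(df, prefix="Is", extra=()):
    """Return a copy of df with prefix-matched and extra columns cast to 'category'."""
    cols = [c for c in df.columns if c.startswith(prefix)] + list(extra)
    out = df.copy()
    for col in cols:
        out[col] = out[col].astype("category")
    return out

# Toy stand-in for the survey data (values are made up for illustration)
toy = pd.DataFrame({
    "IsTVAvailable": [1.0, 0.0, -1.0],
    "Gender": [0, -1, 0],
    "ChildAge": [7, 9, 12],
})
converted = to_category(toy, extra=["Gender"])
print(converted.dtypes)
```

Returning a copy rather than modifying the frame in place keeps the original dtypes available for comparison, at the cost of some extra memory on a dataset of this size.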

Let’s check if the dtypes are now correctly set.

data.dtypes
ProvinceID                             int64
ProvinceName                          object
DistrictID                             int64
DistrictName                          object
VillageId                              int64
HouseholdId                            int64
HouseType                           category
IsHouseOwned                        category
IsElectricityConnectionAvailable    category
IsTVAvailable                       category
IsMobileAvailable                   category
IsSmartPhoneAvailable               category
Car                                  float64
MotorCycle                           float64
ChildrenInMadrassah                  float64
ChildId                                int64
ChildAge                               int64
Gender                              category
EducationStatus                     category
SchoolDropoutClass                  category
SchoolDropoutReason                 category
CurrentClassGrade                   category
InstituteType                       category
IsChildGoSurveyedSchool             category
IsChildTakingPaidTution             category
TutionFee                            float64
ReadingHighestLevel                 category
IsBLLBonusQ1                        category
IsBLLBonusQ2                        category
LanguageTested                      category
MathHighestLevel                    category
IsALBonusQ1                         category
IsALBonusQ3                         category
IsALBonusQ2                         category
EnglishReading                      category
IsKnowsWords                        category
IsKnowsSentence                     category
EngReadPoem                         category
EngQuestions                        category
CanName                             category
IsChildWasAvailable                 category
ParentId                               int64
MotherAge                            float64
TotalChildren                        float64
TotalChildrenSeventeenAbove          float64
MotherGoneSchool                    category
MotherHighestClassCompletedClub     category
MotherHighestClassCompleted         category
FatherAge                            float64
FatherGoneSchool                    category
FatherHighestClassCompleted         category
FatherHighestClassCompletedClub     category
dtype: object

Now that our dataset is ready for manipulation, let’s look at what insights we can gain from the dataset. We will start from some small things.

pd.crosstab(index=data["ReadingHighestLevel"], columns="count")
col_0 count
ReadingHighestLevel
1.0 36243
2.0 31866
3.0 37892
4.0 30496
5.0 64247

def plotCategorical(series, xl, yl):
    # Bar chart of the counts per category, ordered by category value
    ax = series.value_counts().sort_index().plot(kind="bar")
    ax.set(xlabel=xl, ylabel=yl)

plotCategorical(data["ReadingHighestLevel"], "Reading Level - 5 is highest", "Students")

[Figure: bar chart of student counts by reading level]

plotCategorical(data["SchoolDropoutClass"], "Class Dropped Out In (some invalid data which are floats)", "Students")

[Figure: bar chart of student counts by class dropped out in]

Some information can be gleaned from a cursory glance at the charts above. ReadingHighestLevel 5 is for students who can read a whole story. The first chart shows that more students cannot read a whole story than can. The second shows that students most often drop out in 5th grade, by a significant margin.
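
The reading-level comparison can be made concrete with a little arithmetic on the counts in the ReadingHighestLevel table above. This is a minimal sketch that uses only those published counts, so it covers only children with a recorded reading level.

```python
# Counts copied from the ReadingHighestLevel crosstab above
reading_counts = {1: 36243, 2: 31866, 3: 37892, 4: 30496, 5: 64247}

total = sum(reading_counts.values())
shares = {level: count / total for level, count in reading_counts.items()}

for level, share in sorted(shares.items()):
    print(f"level {level}: {share:.1%}")

# Share of tested children below level 5 (cannot read a whole story)
below_story = 1 - shares[5]
print(f"cannot read a whole story: {below_story:.1%}")
```

On these counts, roughly a third of the tested children reach level 5 and about two thirds do not.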

Let’s now look at something else.

pd.crosstab(index=[data['Gender'],data['EducationStatus']], columns='count')
col_0 count
Gender EducationStatus
-1 1 37299
2 5619
3 75580
0 1 32161
2 6775
3 129136

Gender:

  • Female -> -1
  • Male -> 0

Here we can see that a lot more boys than girls are enrolled (3). On the other hand, the numbers of boys and girls who dropped out (2) or never enrolled (1) are close. This suggests that the gender data itself is skewed. Let’s take a look:

pd.crosstab(index=data['Gender'], columns='count')
col_0 count
Gender
-1 118498
0 168072

There are nearly 50,000 more boys in the dataset than girls. This is probably due to cultural sensitivities that lead households not to report girls to surveyors. This needs further investigation, and I don’t see any way to corroborate it using the data within this dataset.
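
To separate the enrollment gap from the sample imbalance, the counts in the Gender by EducationStatus table above can be converted to within-gender shares. This is a minimal sketch using only those published counts; the `girls` and `boys` labels are mine, standing in for the -1 and 0 codes.

```python
# Counts copied from the Gender x EducationStatus crosstab above
# EducationStatus: 1 = never enrolled, 2 = dropped out, 3 = enrolled
counts = {
    "girls": {1: 37299, 2: 5619, 3: 75580},   # Gender == -1
    "boys": {1: 32161, 2: 6775, 3: 129136},   # Gender == 0
}

shares = {}
for gender, by_status in counts.items():
    total = sum(by_status.values())
    # Share of each status within this gender
    shares[gender] = {status: n / total for status, n in by_status.items()}
    print(gender, total, {s: round(v, 3) for s, v in shares[gender].items()})
```

Within each gender, the enrolled share comes out at about 64% for girls and 77% for boys, so the gap in enrollment is not only a consequence of there being more boys in the sample.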

Let’s see if we can build a model relating gender to education status. A logistic regression is suitable in this case.

from sklearn.linear_model import LogisticRegression as LR
model = LR()
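X = pd.get_dummies(data[['EducationStatus']])  # one dummy column per EducationStatus level
y = data[['Gender']].replace(0, 1)  # recode boys from 0 to 1; girls stay -1
y = np.ravel(y)  # flatten to the 1-D array scikit-learn expects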
model.fit(X,y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
pd.DataFrame(list(zip(X.columns, np.transpose(np.around(model.coef_, 3)))))
0 1
0 EducationStatus_1 [-0.292]
1 EducationStatus_2 [0.043]
2 EducationStatus_3 [0.392]
model.score(X,y)
0.60442474788009914

This gives us 60% accuracy on our training dataset. The coefficients of the regression show that being enrolled (EducationStatus_3) pushes the prediction towards a boy (a positive value). Conversely, never having enrolled (EducationStatus_1) pushes the prediction towards the negative, that is, towards a girl.

This approach is not very robust, since the model is scored on the same data it was trained on. Let’s do another regression, with a validation set this time.

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn import metrics
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, train_size=0.6, random_state=0)
model_validated = LR()
model_validated.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
predicted = model_validated.predict(X_test)
metrics.accuracy_score(y_test, predicted)
0.6039536587919182

On the validation set, the model is still about 60% accurate at predicting whether a student with a given EducationStatus is a boy or a girl. This is better than 50% (a coin flip). But it is still not good, because guessing boy for every student would already give about 59% accuracy (168,072 of the 286,570 children in the dataset are boys).
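
Both numbers can be checked by hand from the Gender by EducationStatus counts shown earlier. This is a sketch under one assumption of mine: that the best a model can do with EducationStatus as its only input is to guess the majority gender within each status.

```python
# Counts copied from the Gender x EducationStatus crosstab above
counts = {
    1: {"girl": 37299, "boy": 32161},   # never enrolled
    2: {"girl": 5619, "boy": 6775},     # dropped out
    3: {"girl": 75580, "boy": 129136},  # enrolled
}

total = sum(sum(c.values()) for c in counts.values())

# Baseline: always guess the overall majority class (boy)
boys = sum(c["boy"] for c in counts.values())
baseline = boys / total

# Best rule that sees only EducationStatus: guess the majority gender per status
best_rule = sum(max(c.values()) for c in counts.values()) / total

print(f"always boy: {baseline:.4f}")
print(f"majority per status: {best_rule:.4f}")
```

The per-status rule lands on the same 0.6044 as the training accuracy reported above. That suggests the regression has learned to predict a girl only for never-enrolled children and a boy otherwise, which is the ceiling for a single three-level predictor.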

pd.DataFrame(list(zip(X.columns, np.transpose(np.around(model_validated.coef_, 3)))))
0 1
0 EducationStatus_1 [-0.299]
1 EducationStatus_2 [0.066]
2 EducationStatus_3 [0.385]

-Taha
