Using Python to Prepare Data for Machine Learning Algos

Jason Ladd

So you might be used to working with rows of data that came from a nice, clean SQL table that enforced rules like NOT NULL or ensured that only certain data types went into certain fields. That’s probably the case if you had control over how the data was saved. But in the world of Data Science, there are times when data doesn’t look the way you want or need it to. For example, maybe you need numbers, but the data is stored as strings. Or maybe you have missing or NaN values that would throw errors, or quietly turn your results into NaN, if you tried to do arithmetic on them.
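As a quick illustration of the numbers-stored-as-strings problem, here is a minimal sketch using pd.to_numeric. The price column and its values are made up for the example.

```python
import pandas as pd

# Hypothetical data: a numeric column that was saved as strings
raw = pd.DataFrame({"price": ["10", "12.5", "n/a", "7"]})

# errors="coerce" turns anything that can't be parsed into NaN
# instead of raising an exception
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")

print(raw["price"].dtype)         # float64
print(raw["price"].isna().sum())  # 1
```

The unparseable "n/a" becomes a missing value, which you can then handle like any other missing value.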

Let’s take a look at a few common cases where we need to clean up data so that it can be ready to be processed by machine learning algorithms.

The first thing to do when you get some data is to load it into a Pandas DataFrame and do some preliminary inspection to see what your data looks like overall. A few methods are useful here: .head() shows the first 5 rows of the DataFrame, .info() lists the columns along with their data types and non-null counts, .describe() gives summary statistics such as the count, min, max, mean, and standard deviation of the numeric columns, and .isnull().sum() counts the null values in each column. Here’s an example of printing those to the console:

import pandas as pd

df = pd.read_csv('data/MyData.csv')

print(df.head())
df.info()  # prints its own summary and returns None, so no print() needed
print(df.describe())
print(df.isnull().sum())

This will give you a good idea of how your data looks and what you might need to do to massage it so that it can be digested by your machine learning algorithms. Something else you might notice is a column like user_id, where the value is different for every row. An identifier like that usually isn’t useful to an ML algorithm, because a value that is unique to each row carries no pattern the algorithm can learn from. In cases like that, we might want to just drop the column completely, like this:

df = df.drop(columns=['user_id'])

This line of code re-assigns our dataframe to one where the user_id column has been omitted.
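If you have a lot of columns, you may want to spot ID-like columns programmatically. Here is a rough sketch that flags any column whose number of distinct values equals the number of rows. The sample DataFrame is made up, and the check is only a heuristic: a continuous numeric feature can also be unique on every row, so look at the list before dropping anything.

```python
import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({
    "user_id": [101, 102, 103, 104],
    "Color": ["Red", "Blue", "Red", "Green"],
})

# Columns where every row has a different value
id_like = [col for col in sample.columns if sample[col].nunique() == len(sample)]
print(id_like)  # ['user_id']

# Drop them after reviewing the list
cleaned = sample.drop(columns=id_like)
print(cleaned.columns.tolist())  # ['Color']
```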

Missing Values

It’s pretty common to have missing values. Say, for example, you gave out a survey with an ‘Age’ column, but some participants declined to fill it in. .isnull() will show you which rows are affected. You could drop those rows if you wanted, or you could take another approach: find the average age of the participants who did answer and insert that for the missing ones. You could do that like this:

df["Age"] = df["Age"].fillna(df["Age"].mean())
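To make the two options concrete, here is a small sketch on made-up ages that shows both dropping the incomplete rows and filling them with the mean.

```python
import pandas as pd

# Hypothetical survey data with one missing age
ages = pd.DataFrame({"Age": [20.0, 30.0, None, 40.0]})

# Option 1: drop the rows where Age is missing
dropped = ages.dropna(subset=["Age"])
print(len(dropped))  # 3

# Option 2: fill with the mean of the known ages, (20 + 30 + 40) / 3 = 30
filled = ages.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
print(filled["Age"].tolist())  # [20.0, 30.0, 30.0, 40.0]
```

Dropping keeps only real answers but shrinks the dataset, while filling keeps every row at the cost of inventing a value.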

Missing Categorical Data

There might be cases where you have categorical data, for example, “car color”. If that’s missing, maybe you want to fill in the blanks with the most common color, since the most common value is also the most likely one for any row we don’t know. We could do that like this:

df["Color"] = df["Color"].fillna(df["Color"].mode()[0])
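Here is the same idea on a tiny made-up dataset, so you can see what .mode()[0] actually returns.

```python
import pandas as pd

# Hypothetical data with one missing color
cars = pd.DataFrame({"Color": ["Red", "Blue", "Red", None, "Green"]})

# .mode() returns a Series because there can be ties; [0] takes the first entry
most_common = cars["Color"].mode()[0]
print(most_common)  # Red

cars["Color"] = cars["Color"].fillna(most_common)
print(cars["Color"].tolist())  # ['Red', 'Blue', 'Red', 'Red', 'Green']
```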

It could also be the case that none of the options in a column are particularly common or uncommon in the dataset. In those cases, it could be a good idea to pick at random from the values available in your dataset. So for example, using color again:

You could first find out all the possible values for your column using the .unique() method:

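import numpy as np

print(df["Color"].unique())
# prints something like ['Red' 'Blue' 'Green' nan]

# Replace each missing color with a random pick; leave existing values alone
df["Color"] = df["Color"].apply(lambda x: np.random.choice(["Red", "Blue", "Green"]) if pd.isna(x) else x)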

This code uses a lambda function along with Pandas’ .apply() method to assign each value where Color is null to a random choice between Red, Blue, and Green.
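An alternative sketch, in case you would rather not hard-code the list of colors: pull the choices from the column itself and fill all the missing rows in one assignment. The sample data and the seed are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing colors
cars = pd.DataFrame({"Color": ["Red", None, "Blue", None, "Green"]})

rng = np.random.default_rng(seed=0)  # seeded so the result is repeatable

# The values that actually appear in the column (missing ones excluded)
choices = cars["Color"].dropna().unique()

# Boolean mask marking the rows that need a value
missing = cars["Color"].isna()

# Draw one random value per missing row and assign them all at once
cars.loc[missing, "Color"] = rng.choice(choices, size=missing.sum())

print(cars["Color"].isna().sum())  # 0
```

Every value in choices is equally likely here. Rows that already had a color are left untouched.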

One Hot Encoding

There are some cases where you can’t use categorical data at all and need your data to be represented only as numbers. In those scenarios, you can use one hot encoding to transform a single categorical column into multiple numerical columns. For example, instead of having one column where the available values are ‘Red’, ‘Green’, and ‘Blue’, you could have one column that is is_Red, another that is is_Blue, and one more that is is_Green. Each of these columns then holds only a one or a zero to indicate whether that value was set on that row. Here’s an example of how to do one hot encoding with Pandas’ get_dummies() function… I know, weird name lol. But yeah, note that it names the new columns with the original column name as a prefix (Color_Red, Color_Blue, and so on):

categorical_cols = ['Color', 'Vehicle_Type']

# dtype=int gives 1/0 columns; recent pandas versions default to True/False
df_encoded = pd.get_dummies(df, columns=categorical_cols, prefix=categorical_cols, dtype=int)
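To see what comes out the other side, here is a small sketch on a made-up three-row dataset. It passes dtype=int so the new columns hold ones and zeros rather than booleans.

```python
import pandas as pd

# Hypothetical data
cars = pd.DataFrame({
    "Color": ["Red", "Blue", "Green"],
    "Vehicle_Type": ["Car", "Truck", "Car"],
})

encoded = pd.get_dummies(cars, columns=["Color", "Vehicle_Type"], dtype=int)

print(encoded.columns.tolist())
# ['Color_Blue', 'Color_Green', 'Color_Red', 'Vehicle_Type_Car', 'Vehicle_Type_Truck']

# First row was a Red Car
print(encoded.iloc[0].tolist())  # [0, 0, 1, 1, 0]
```

The original Color and Vehicle_Type columns are gone, and each row has exactly one 1 per original column.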

After doing all this, our data will finally be ready to use to train a machine learning model.
