How to Clean Your Data in Python?

Wynand asked: 2024-07-23 Data Science , Data Cleaning , Python Guides And Tutorials , Deep Dives

Learn how to clean your data in Python with our comprehensive guide! Master data cleaning techniques & prepare your data for analysis. Read now!

Expert Answer

John Williams answered . 2025-07-04 10:23:59

I hope that this article serves as a starting point for you to learn how to clean your data efficiently to kickstart your personal projects.

Look into your data
Look at the proportion of missing data
Check the data type of each column
If you have columns of strings, check for trailing whitespaces
Dealing with Missing Values (NaN Values)
Extracting more information from your dataset to get more variables
Check the unique values of columns

Step 1: Look into your data

Before even performing any cleaning or manipulation of your dataset, you should take a glimpse at your data to understand what variables you’re working with, how the values are structured based on the column they’re in, and maybe you could have a rough idea of the inconsistencies that you’ll need to address or they’ll be cumbersome in the analysis phase. Here, you might also be able to eliminate certain columns that you won’t need depending on the analysis you want to do.

1. Print the first few rows of your dataset

Here, I printed the first 7 rows of my dataset, but you can print 5 or 10. I recommend keeping it to anything less than 10 or else it’ll be too overwhelming for what you’re currently trying to do–a quick glimpse of the dataset.

Doing this will give you a good idea of what data types you might be dealing with, what columns you need to perform transformations or cleaning, and other data you might be able to extract.

Before we look at this more closely, let’s perform the next step.

2. Save the variables to a list

You want to do this to have easy access to the different columns of the dataset, especially when you want to perform the same transformations to different subsets of columns.

3. Note down potential issues you will have to address in each column.

To stay organized, note the issues you see in your dataset (by taking a glimpse of your dataset like in Step 1).

This picture above represents what I can see just from glimpsing at the dataset and is something that you should think about when you’re looking at your dataset. Here are a few things that stand out to me:

There are some columns with missing values. This could cause a lot of problems for analysis and plotting if not addressed and resolved early in the process.
There are columns with words and numbers, such as date_added and duration. This can be a problem if we want to make time-series graphs by the date, or other plots to explore duration’s relationship with other variables.
There are 2 columns with multiple distinct words joined together by a comma. This is an issue if we want to make plots exploring the distribution of listed_in (genre) or the actors on Netflix.
Other columns could potentially have missing values. The next step looks at the way to check which columns have missing values and how much missing data they have.

Step 2: Look at the proportion of missing data

From this code chunk, you can easily look at the distribution of missing values in the dataset to get a good idea of which columns you’ll need to work with to resolve the missing values issue.

From the output, these are insights you can gather:

director column has the highest percentage of missing data ~ 30%
cast and country column also has a considerable percentage of missing data ~ 9%
date_added, rating and duration don’t have that much missing data ~ 0% - 0.1%
Fortunately, most other columns are not empty.

Your next question is probably, how do I deal with these columns with missing values?

There are a few ways to deal with it:

Drop the column completely. If the column isn’t that important to your analysis, just drop it.
Keep the column. In this case, because the director, cast and country columns are quite important to my analysis, I will keep them.
Imputation — the process of replacing missing data with substituted values. Here, it is not possible to do so because most of the data are string values and not numerical values. However, I will be writing an article that talks more about imputation in detail, why and when it should be used, and how you can use it in R and Python with the help of some packages.

Before I continue, I will bring up the issue of missing values across rows.

In some cases, you might want to examine the distribution of missing values across all the rows of your dataset (given that your dataset doesn’t have a large number of observations/rows). From here, you can choose from the choices above depending on how important the rows are to your analysis. For instance, your dataset contains recorded data of something that is changing over time. Even though a row can contain missing values, you might not want to eliminate it because there is important time information you want to retain.

Let’s continue to step 3 before I show you how to deal with the NaN values even after keeping the columns.

Step 3: Check the data type of each column

Here, you can see that all the columns have object as their datatype aside from release_year. In pandas, object means either string or mixed type (numerical and non-numerical type mixed). And from our dataset, you’ll be able to tell which columns are strictly string and mixed type.

Step 4: If you have columns of strings, check for trailing whitespaces

After we know which data types we are dealing with, let’s make sure we remove any trailing characters and whitespace using strip .

Step 5: Dealing with Missing Values (NaN Values)

Referring back to the columns of missing values, let’s take a look at the columns: director, cast, country, date_added, rating, duration. We can segment these columns by whether they are a string or mixed type.

String: director, cast, country, rating (here, it’s a string and not mixed because the numerical values won’t have any meaning if separated)

Mixed: date_added, duration

NaN means Not a Number in pandas. It is a special floating-point value that is different from NoneType in Python. NaN values can be annoying to work with, especially when you want to filter them out for plots or analysis. To make our lives easier, let’s replace these NaN values with something else.

For string type values, we can replace NaN values with “” or “None” or any string that can indicate to you that there isn’t any value in that entry. Here, I chose to replace it with “” using the fillna function. Because it’s not an in-place function, I reassigned the changed values to the column in the dataset.

Here, you must have noticed that I left out the duration column. This is because we’ll be doing something with that column later down the road.

Step 6: See if there are any other variables that you can obtain by extracting them from other variables

For mixed-type values, before we tackle the missing value issue, let’s see if we can extract any data to make our analysis richer or process easier.

Looking at date_added, we can see that it contains the month, date, and year that the film/show was added. Instead of having all this information in one column, why not try to separate them? That way, we can choose to isolate how month or year interacts with the other variables instead of looking at date_added where its granularity will make it difficult for any trend to be discovered.

Below, I’ve written code to not only separate the information into 2 other columns but also filtered out the rows with NaN values and replaced them with 0, just like what was done before with “”.

Now, the new dataset contains the month_added and year_added columns. This will allow us to do some trend analysis later.

Looking at duration, on top of it being a mixed type, there are also 2 different time units in this column. This is a problem because we are dealing with 2 different types of content that are measured differently for time. Thus, making graphs for durationwill be quite difficult to interpret if we keep them as it is. The good thing is that there are many ways to deal with this issue. The way I’ve chosen to deal with it is by separating the type of content into 2 different datasets and naturally, the duration column will just be numerical and just have 1 type of time unit. This way, you can easily and clearly plot using the values.

Because the duration column has both strings and numbers, I’ll also have to create a function to extract the number from that column so that it can be inserted into the columns of the 2 new datasets.

Step 7: Check the unique values of columns

Beyond potentially missing values, there could be corrupted values that you can run into once you perform analysis. To check this, we can check for unique values for some of the columns. Let’s refer to the first 5 rows of the datasets as our starting point.

It might not be strategic to check the unique values of all the columns, especially the title, director, and cast as there could be a large number of unique values to examine. Instead, let’s focus on a list of potential unique values that could be easier and more important to check given that it could be more insightful for future analysis. From a glimpse at the datasets, the columns country, rating, listed_in are probably the ones of interest. Let’s examine the rating column first as that seems to be the least complicated one to deal with.