I hope that this article serves as a starting point for you to learn how to clean your data efficiently to kickstart your personal projects.
Before even performing any cleaning or manipulation of your dataset, you should take a glimpse at your data to understand what variables you’re working with, how the values are structured based on the column they’re in, and maybe you could have a rough idea of the inconsistencies that you’ll need to address or they’ll be cumbersome in the analysis phase. Here, you might also be able to eliminate certain columns that you won’t need depending on the analysis you want to do.
Here, I printed the first 7 rows of my dataset, but you can print 5 or 10. I recommend keeping it to anything less than 10 or else it’ll be too overwhelming for what you’re currently trying to do–a quick glimpse of the dataset.
Doing this will give you a good idea of what data types you might be dealing with, what columns you need to perform transformations or cleaning, and other data you might be able to extract.
Before we look at this more closely, let’s perform the next step.
You want to do this to have easy access to the different columns of the dataset, especially when you want to perform the same transformations to different subsets of columns.
To stay organized, note the issues you see in your dataset (by taking a glimpse of your dataset like in Step 1).
This picture above represents what I can see just from glimpsing at the dataset and is something that you should think about when you’re looking at your dataset. Here are a few things that stand out to me:
date_added
and duration
. This can be a problem if we want to make time-series graphs by the date, or other plots to explore duration’s relationship with other variables.listed_in
(genre) or the actors on Netflix.From this code chunk, you can easily look at the distribution of missing values in the dataset to get a good idea of which columns you’ll need to work with to resolve the missing values issue.
From the output, these are insights you can gather:
director
column has the highest percentage of missing data ~ 30%cast
and country
column also has a considerable percentage of missing data ~ 9%date_added, rating
and duration
don’t have that much missing data ~ 0% - 0.1%Your next question is probably, how do I deal with these columns with missing values?
There are a few ways to deal with it:
director, cast
and country
columns are quite important to my analysis, I will keep them.Before I continue, I will bring up the issue of missing values across rows.
In some cases, you might want to examine the distribution of missing values across all the rows of your dataset (given that your dataset doesn’t have a large number of observations/rows). From here, you can choose from the choices above depending on how important the rows are to your analysis. For instance, your dataset contains recorded data of something that is changing over time. Even though a row can contain missing values, you might not want to eliminate it because there is important time information you want to retain.
Let’s continue to step 3 before I show you how to deal with the NaN values even after keeping the columns.
Here, you can see that all the columns have object
as their datatype aside from release_year
. In pandas, object means either string or mixed type (numerical and non-numerical type mixed). And from our dataset, you’ll be able to tell which columns are strictly string and mixed type.
After we know which data types we are dealing with, let’s make sure we remove any trailing characters and whitespace using strip
.
Referring back to the columns of missing values, let’s take a look at the columns: director, cast, country, date_added, rating, duration
. We can segment these columns by whether they are a string or mixed type.
String: director, cast, country, rating
(here, it’s a string and not mixed because the numerical values won’t have any meaning if separated)
Mixed: date_added, duration
NaN
means Not a Number in pandas. It is a special floating-point value that is different from NoneType
in Python. NaN
values can be annoying to work with, especially when you want to filter them out for plots or analysis. To make our lives easier, let’s replace these NaN values with something else.
For string type values, we can replace NaN
values with “” or “None” or any string that can indicate to you that there isn’t any value in that entry. Here, I chose to replace it with “” using the fillna
function. Because it’s not an in-place function, I reassigned the changed values to the column in the dataset.
Here, you must have noticed that I left out the duration column. This is because we’ll be doing something with that column later down the road.
For mixed-type values, before we tackle the missing value issue, let’s see if we can extract any data to make our analysis richer or process easier.
Looking at date_added
, we can see that it contains the month, date, and year that the film/show was added. Instead of having all this information in one column, why not try to separate them? That way, we can choose to isolate how month or year interacts with the other variables instead of looking at date_added
where its granularity will make it difficult for any trend to be discovered.
Below, I’ve written code to not only separate the information into 2 other columns but also filtered out the rows with NaN
values and replaced them with 0, just like what was done before with “”.
Now, the new dataset contains the month_added
and year_added
columns. This will allow us to do some trend analysis later.
Looking at duration
, on top of it being a mixed type, there are also 2 different time units in this column. This is a problem because we are dealing with 2 different types of content that are measured differently for time. Thus, making graphs for duration
will be quite difficult to interpret if we keep them as it is. The good thing is that there are many ways to deal with this issue. The way I’ve chosen to deal with it is by separating the type of content into 2 different datasets and naturally, the duration column will just be numerical and just have 1 type of time unit. This way, you can easily and clearly plot using the values.
Because the duration
column has both strings and numbers, I’ll also have to create a function to extract the number from that column so that it can be inserted into the columns of the 2 new datasets.
Beyond potentially missing values, there could be corrupted values that you can run into once you perform analysis. To check this, we can check for unique values for some of the columns. Let’s refer to the first 5 rows of the datasets as our starting point.
It might not be strategic to check the unique values of all the columns, especially the title, director, and cast as there could be a large number of unique values to examine. Instead, let’s focus on a list of potential unique values that could be easier and more important to check given that it could be more insightful for future analysis. From a glimpse at the datasets, the columns country, rating, listed_in
are probably the ones of interest. Let’s examine the rating column first as that seems to be the least complicated one to deal with.
Matlabsolutions.com provides guaranteed satisfaction with a
commitment to complete the work within time. Combined with our meticulous work ethics and extensive domain
experience, We are the ideal partner for all your homework/assignment needs. We pledge to provide 24*7 support
to dissolve all your academic doubts. We are composed of 300+ esteemed Matlab and other experts who have been
empanelled after extensive research and quality check.
Matlabsolutions.com provides undivided attention to each Matlab
assignment order with a methodical approach to solution. Our network span is not restricted to US, UK and Australia rather extends to countries like Singapore, Canada and UAE. Our Matlab assignment help services
include Image Processing Assignments, Electrical Engineering Assignments, Matlab homework help, Matlab Research Paper help, Matlab Simulink help. Get your work
done at the best price in industry.