Data Cleaning With Python

In this article, therefore, we will discuss data cleaning entails and how you could clean noises(dirt) step by step by using Python.

What is Data Cleaning?

According to Wikipedia, is the process of detecting and correcting corrupt or inaccurate records from a record-set, table, or database and refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data, and then replacing, modifying, or deleting the dirty or coarse data.

This definition is too big and certainly not easy to understand. To make it easier, we can see an example. Consider a scenario where a factory owner of Dairy Products is interested in knowing the frequent buyers of milk bottles to increase the customer base. But if the data is corrupted or has noise, then the decision will be misguided. In the below data, we have shown an example.

From the figure, we can depict that Data Cleaning is a technique which helps to convert improper data into meaningful data. In short, Machine Learning is data-driven. With data cleaning in place, your Machine Learning model will perform better. So, it is important to process data before use. Without quality data, it is foolish to expect a correct output.

How to perform Data Cleaning in Python?

For understanding, let’s take the example of a survey in which a company’s HR staff has to locate all of its area employees, and make sure they are safe at home. Before that, let’s understand that there are no Standard Data Cleaning Techniques. It is not possible to comment on which one is best. The only aspect which needs to be considered on cleaning methods depends on the nature of Data. This helps us choose which technique should be used.

We will get back to the example. To keep it simpler, we are looking at below fields.

Look at the table carefully. You’ll notice that certain fields are either blank or have irrelevant values. If we process such data, then our prediction will be in trouble. Thus, we will carry out the below steps for Data Cleaning. 

  1. Remove Repeated Row 
  2. Missing value treatment 
  3. Removal of Irrelevant Data 
  4. Manual error while typing 
  5. Renaming Columns 

As to process data, our first step would be to read data in Python. 

 

Observe the output table carefully, it’s the same table which we had in the first place.

The two important packages which we imported are Panda and Numpy. These are needed for a Python code to run. The next important thing is acronyms we adopted as a good practice. employee variable used to store data read from DataCleaning.csv file saved at the mentioned location. The command used to read data is read_csv and displayed on the screen by using print. For the fields having missing value, the system has filled it with NaN(Not a number). 

Remove Repeated Values

We know that there are duplicates in the dataset and that need to be removed. Row 5 and 7 have the same employee data.

 

We can delete the last row, and keep the first row.

 

Function drop_duplicates returns output with repeated rows removed. Below are the parameters used in a command. 

subset: We have assigned column name to subset parameters to check repeated values. By default, it takes all columns. 

keep: for keeping the first row(5) 

inplace: Boolean case, default false. By assigning true we asked the command to drop last value. 

Data with no repeated values.

 

Missing Value Treatment

In panda, missing data is represented in two ways. None or NaN. In our dataset, missing values are recognized as NaN. For checking missing values, we have used the function is.null().

The output of the command is boolean which is True for NaN values.

As shown in the output image, column Mobile no., Project and email have missing values. Missing values can either be filled or dropped.

In our example, we are working on employee data so filling it with any value will be inappropriate. Hence, we have dropped the missing values.

The command dropna drops rows/columns having at least null values in CSV file.

As shown in the output table, missing values are removed from the dataset.

Removal of irrelevant data

Sometimes, certain categories/columns in a dataset are not useful. In our case, column name, serial no are not important. Retaining it will take unnecessary space and consume time. 

Panda provides an easy command del to remove the unwanted column.

If we inspect the data we will see that the column is removed successfully.

Manual Error While Typing

In this step, we need to make sure the data has correct values whenever there are categories mentioned. In our case, the column Project has two possible values Client or Internal. But close observation of the output table can help us to point row 8 where the value is not following the possible case. (i.e, internal should be corrected to Internal). 

In python, we can replace the column with the corrected value.  

Renaming Columns

The dataset we are working on has a column name that does not speak about the importance of the data it holds. Thus, we will add some sensible labels by renaming the columns.

The output demonstrates the change in column name.

Recap to Data Cleaning

In this article, we have treated unnecessary information by using various steps. We have removed the duplicate rows, treated missing values, deleted irrelevant columns, and corrected the typing mistake. Lastly, the article also talks about renaming columns.

Matlabsolutions.com provides guaranteed satisfaction with a commitment to complete the work within time. Combined with our meticulous work ethics and extensive domain experience, We are the ideal partner for all your homework/assignment needs. We pledge to provide 24*7 support to dissolve all your academic doubts. We are composed of 300+ esteemed Matlab and other experts who have been empanelled after extensive research and quality check.

Matlabsolutions.com provides undivided attention to each Matlab assignment order with a methodical approach to solution. Our network span is not restricted to US, UK and Australia rather extends to countries like Singapore, Canada and UAE. Our Matlab assignment help services include Image Processing Assignments, Electrical Engineering Assignments, Matlab homework help, Matlab Research Paper help, Matlab Simulink help. Get your work done at the best price in industry.