{"id":8850,"date":"2019-12-10T16:02:11","date_gmt":"2019-12-10T08:02:11","guid":{"rendered":"https:\/\/www.finereport.com\/en\/?p=8850"},"modified":"2025-04-15T20:22:32","modified_gmt":"2025-04-15T12:22:32","slug":"data-cleaning-guide-saving-80-of-your-time-to-do-data-analysis","status":"publish","type":"post","link":"https:\/\/www.finereport.com\/en\/data-analysis\/data-cleaning-guide-saving-80-of-your-time-to-do-data-analysis.html","title":{"rendered":"Data Cleaning Example: Save 80% Time on Data Analysis"},"content":{"rendered":"<div id=\"toc_container\" class=\"toc_transparent no_bullets\"><p class=\"toc_title\">Contents<\/p><ul class=\"toc_list\"><li><a href=\"#Why_We_Need_Data_Cleaning\">Why Do We Need Data Cleaning?<\/a><\/li><li><a href=\"#Data_Cleaning_Example_Data_Quality_Guidelines\">Data Cleaning Example: Data Quality Guidelines<\/a><\/li><li><a href=\"#Data_Cleaning_Example_in_Python\">Data Cleaning Example in Python<\/a><ul><li><a href=\"#1_Completeness\">1. Completeness<\/a><\/li><li><a href=\"#2Comprehensiveness\">2. Comprehensiveness<\/a><\/li><li><a href=\"#3Legality\">3. Legality<\/a><\/li><li><a href=\"#4Uniqueness\">4. Uniqueness<\/a><\/li><\/ul><\/li><li><a href=\"#Get_in_the_Habit_of_Data_Cleaning\">Get in the Habit of Data Cleaning<\/a><ul><li><a href=\"#You_might_also_be_interested_in\">You might also be interested in\u2026<\/a><\/li><\/ul><\/li><\/ul><\/div>\n<h2><span id=\"Why_We_Need_Data_Cleaning\"><strong>Why Do We Need Data Cleaning?<\/strong><\/span><\/h2>\n<p>Data analysis is time-consuming, but before you start analyzing, are you fully prepared, or have you skipped the important step of data cleaning?<\/p>\n<p><a href=\"https:\/\/www.finereport.com\/en\/wp-content\/uploads\/2019\/12\/01.jpg\"><img loading=\"lazy\" class=\"aligncenter size-full wp-image-8851\" src=\"https:\/\/www.finereport.com\/en\/wp-content\/uploads\/2019\/12\/01.jpg\" alt=\"\" width=\"737\" height=\"309\" 
srcset=\"https:\/\/www.finereport.com\/en\/wp-content\/uploads\/2019\/12\/01.jpg 737w, https:\/\/www.finereport.com\/en\/wp-content\/uploads\/2019\/12\/01-300x126.jpg 300w\" sizes=\"(max-width: 737px) 100vw, 737px\" \/><\/a><\/p>\n<p style=\"text-align: center;\">From Google<\/p>\n<p>In the process of data analysis, data cleaning is the preliminary preparation that follows <a href=\"https:\/\/www.finereport.com\/en\/data-analysis\/four-basic-ways-to-automate-data-extraction.html\">data extraction<\/a>. As data scientists, we encounter all kinds of data, and before analyzing we need to invest a lot of time and energy to &#8220;organize and trim&#8221; the data into the shape we want or need.<\/p>\n<p>Why? Because the data we collect often has many problems.<\/p>\n<p>Let&#8217;s look at an example first. Suppose your boss hands you the following data and asks you to analyze it. How do you feel after seeing this data?<\/p>\n<p><a href=\"https:\/\/www.finereport.com\/en\/wp-content\/uploads\/2019\/12\/02-1.png\"><img loading=\"lazy\" class=\"aligncenter size-full wp-image-8853\" src=\"https:\/\/www.finereport.com\/en\/wp-content\/uploads\/2019\/12\/02-1.png\" alt=\"\" width=\"708\" height=\"441\" srcset=\"https:\/\/www.finereport.com\/en\/wp-content\/uploads\/2019\/12\/02-1.png 708w, https:\/\/www.finereport.com\/en\/wp-content\/uploads\/2019\/12\/02-1-300x187.png 300w\" sizes=\"(max-width: 708px) 100vw, 708px\" \/><\/a><\/p>\n<p>The data is probably confusing because it is missing annotations.<\/p>\n<p>When we collect and organize data, we must label it. The header is very important. This sheet lacks column names, so we do not know what each column means, what role the values play in business terms, or whether they are correct.<\/p>\n<p>Let me briefly explain what these numbers represent.<\/p>\n<p>This is the membership data of a clothing store. 
The top row gives column indexes and the leftmost column gives row indexes.<\/p>\n<p>Among the columns, column 0 is the serial number, column 1 is the member&#8217;s name, column 2 is the age, column 3 is the weight, columns 4-6 are the measurements of male members, and columns 7-9 are the measurements of female members.<\/p>\n<p>With the meaning clear, let&#8217;s look at the data itself. This dirty data has many problems: many of the values are NaN, and there are blank rows.<\/p>\n<p>This is only part of the membership data for one store. The data in daily work is much more complicated. We usually need to track many more dimensions, perhaps 100 indicators, and the data volume often reaches the TB or even EB level, so the difficulty of the entire analysis grows accordingly. At that scale, it is impossible to spot the problems with the naked eye.<\/p>\n<p>This simple example shows why it is important to clean data before analyzing it. Experienced analysts know that a good data analyst must also be a data cleaning expert: in the entire data analysis process, data cleaning accounts for about 80% of the time and effort.<\/p>\n<h2><span id=\"Data_Cleaning_Example_Data_Quality_Guidelines\"><strong>Data Cleaning Example: Data Quality Guidelines<\/strong><\/span><\/h2>\n<p>In the above case of clothing store membership data, you can certainly spot a few problems just by looking at the data. 
But are there guidelines that define the quality of such data?<\/p>\n<p>Here, I summarize the data cleaning rules into the following 4 key points:<\/p>\n<ol>\n<li><strong>Completeness<\/strong>: Whether individual records contain null values and whether all required fields are present.<\/li>\n<li><strong>Comprehensiveness<\/strong>: Observe all the values in a column. For example, in an Excel table, we can select a column and see its average, maximum, and minimum values, then use common sense to judge whether the column has problems with its definition, its units, or the values themselves.<\/li>\n<li><strong>Legality<\/strong>: Whether the type, content, and range of the data are valid. For example, the data contains non-ASCII characters, the gender is unknown, or the age is over 150.<\/li>\n<li><strong>Uniqueness<\/strong>: Whether the data has duplicate records. Because data is usually aggregated from different channels, duplicates are common. Both rows and columns need to be unique: a person should not appear in multiple rows, and an attribute such as weight should not be recorded in multiple columns.<\/li>\n<\/ol>\n<p>Following these principles, we can solve most of the problems encountered in data cleaning, making the data standard, clean, and consistent, ready for subsequent statistics and data mining. 
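<\/p>\n<p>As a quick sketch, each of the four guidelines maps to a simple <strong>Pandas<\/strong> check. This assumes the data has already been loaded into a DataFrame named <em>df<\/em> with an <em>Age<\/em> column (hypothetical names, for illustration only):<\/p>\n<blockquote><p><em># Completeness: count missing values per column<\/em><\/p>\n<p><em>print(df.isnull().sum())<\/em><\/p>\n<p><em># Comprehensiveness: inspect average, maximum, and minimum values<\/em><\/p>\n<p><em>print(df.describe())<\/em><\/p>\n<p><em># Legality: flag ages outside a plausible range<\/em><\/p>\n<p><em>print(df[(df['Age'] &lt; 0) | (df['Age'] &gt; 150)])<\/em><\/p>\n<p><em># Uniqueness: count fully duplicated rows<\/em><\/p>\n<p><em>print(df.duplicated().sum())<\/em><\/p><\/blockquote>\n<p>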
If you want to further optimize data quality, you also need to apply these guidelines flexibly in real cases.<\/p>\n<h2><span id=\"Data_Cleaning_Example_in_Python\"><strong>Data Cleaning Example in Python<\/strong><\/span><\/h2>\n<p><a href=\"https:\/\/www.finereport.com\/en\/wp-content\/uploads\/2019\/12\/04-1.png\"><img loading=\"lazy\" class=\"aligncenter size-full wp-image-8855\" src=\"https:\/\/www.finereport.com\/en\/wp-content\/uploads\/2019\/12\/04-1.png\" alt=\"\" width=\"1374\" height=\"772\" srcset=\"https:\/\/www.finereport.com\/en\/wp-content\/uploads\/2019\/12\/04-1.png 750w, https:\/\/www.finereport.com\/en\/wp-content\/uploads\/2019\/12\/04-1-300x169.png 300w, https:\/\/www.finereport.com\/en\/wp-content\/uploads\/2019\/12\/04-1-768x432.png 768w, https:\/\/www.finereport.com\/en\/wp-content\/uploads\/2019\/12\/04-1-1024x575.png 1024w\" sizes=\"(max-width: 1374px) 100vw, 1374px\" \/><\/a><\/p>\n<p>After understanding the data quality guidelines, let&#8217;s solve the problems in the above data case.<\/p>\n<p>Here you need Python&#8217;s <strong>Pandas<\/strong>, which we introduced <a href=\"https:\/\/medium.com\/@lyric09220\/python-getting-started-tutorial-scientific-calculation-with-pandas-17b2b38be6fc\">before<\/a>. It is a NumPy-based tool created specifically for data analysis tasks. <strong>Pandas<\/strong> provides a large number of functions that we can use to clean up data efficiently.<\/p>\n<p>Python has a wealth of libraries and is a great helper for data mining. Of course, there are many other tools for data cleaning; here we simply take Pandas as an example to help you apply the data cleaning guidelines and get a more intuitive sense of how data cleaning works.<\/p>\n<p>Below, we use <strong>Pandas<\/strong> to clean the data in accordance with the above four guidelines.<\/p>\n<h3><span id=\"1_Completeness\"><strong>1. 
Completeness<\/strong><\/span><\/h3>\n<p><strong>Question 1: Missing values<\/strong><\/p>\n<p>Some age and weight values are missing in the table. With large amounts of data, some values inevitably go uncollected. Generally, we can handle them in three ways:<\/p>\n<p>Delete: delete records with missing data;<\/p>\n<p>Mean: use the mean of the current column;<\/p>\n<p>High frequency: use the most frequently occurring value in the current column.<\/p>\n<p>For example, if we want to fill the missing values in <em>df['Age']<\/em> with the average age, we can write:<\/p>\n<blockquote><p><em>df['Age'].fillna(df['Age'].mean(), inplace=True)<\/em><\/p><\/blockquote>\n<p>If we fill with the most frequent value instead, we can first obtain the highest-frequency value <em>age_maxf<\/em> of the <strong>Age<\/strong> field through <em>value_counts<\/em>, and then fill the missing data in the <strong>Age<\/strong> field with <em>age_maxf<\/em>:<\/p>\n<blockquote><p><em>age_maxf = df['Age'].value_counts().index[0]<\/em><\/p>\n<p><em>df['Age'].fillna(age_maxf, inplace=True)<\/em><\/p><\/blockquote>\n<p><strong>Question 2: Blank lines<\/strong><\/p>\n<p>We found an empty row in the table: apart from the index, every value in it is <strong>NaN<\/strong>. 
<em>read_csv()<\/em> in <strong>Pandas<\/strong> skips fully blank lines by default (<em>skip_blank_lines=True<\/em>), but a row of empty fields is still read in as <strong>NaN<\/strong> values, so we use <em>dropna()<\/em> after reading to delete such blank rows.<\/p>\n<blockquote><p><em>df.dropna(how='all', inplace=True)<\/em><\/p><\/blockquote>\n<h3><span id=\"2Comprehensiveness\"><strong>2. Comprehensiveness<\/strong><\/span><\/h3>\n<p><strong>Problem: Units of column data are not uniform<\/strong><\/p>\n<p>Observing the <strong>weight<\/strong> column, we can see that its units are not uniform: some are <strong>kgs<\/strong> and some are <strong>lbs<\/strong>.<\/p>\n<p>Here I take <strong>kgs<\/strong> as the uniform unit of measurement and convert <strong>lbs<\/strong> to <strong>kgs<\/strong>:<\/p>\n<blockquote><p><em># Get the lbs rows in the weight column<\/em><\/p>\n<p><em>rows_with_lbs = df['weight'].str.contains('lbs').fillna(False)<\/em><\/p>\n<p><em>print(df[rows_with_lbs])<\/em><\/p>\n<p><em># Convert lbs to kgs, 2.2 lbs = 1 kg<\/em><\/p>\n<p><em>for i, lbs_row in df[rows_with_lbs].iterrows():<\/em><\/p>\n<p><em>\u00a0\u00a0\u00a0\u00a0# Strip the trailing 'lbs' (last three characters) before converting<\/em><\/p>\n<p><em>\u00a0\u00a0\u00a0\u00a0weight = int(float(lbs_row['weight'][:-3]) \/ 2.2)<\/em><\/p>\n<p><em>\u00a0\u00a0\u00a0\u00a0df.at[i, 'weight'] = '{}kgs'.format(weight)<\/em><\/p><\/blockquote>\n<h3><span id=\"3Legality\"><strong>3. Legality<\/strong><\/span><\/h3>\n<p><strong>Problem: Non-ASCII characters<\/strong><\/p>\n<p>We see that <strong>Firstname<\/strong> and <strong>Lastname<\/strong> contain some <strong>non-ASCII<\/strong> characters in the dataset. We can either delete or replace them; here we use the delete method:<\/p>\n<blockquote><p><em># Remove non-ASCII characters<\/em><\/p>\n<p><em>df['first_name'].
replace({r'[^\\x00-\\x7F]+': ''}, regex=True, inplace=True)<\/em><\/p>\n<p><em>df['last_name'].replace({r'[^\\x00-\\x7F]+': ''}, regex=True, inplace=True)<\/em><\/p><\/blockquote>\n<h3><span id=\"4Uniqueness\"><strong>4. Uniqueness<\/strong><\/span><\/h3>\n<p><strong>Question 1: Multiple parameters in a column<\/strong><\/p>\n<p>It is not hard to see that the <strong>Name<\/strong> column contains two fields, <strong>Firstname<\/strong> and <strong>Lastname<\/strong>. For data cleanliness, we split the <strong>Name<\/strong> column into <strong>Firstname<\/strong> and <strong>Lastname<\/strong> fields using Python&#8217;s <em>str.split(expand=True)<\/em>, which expands the split parts into new columns, and then delete the original <strong>Name<\/strong> column.<\/p>\n<blockquote><p><em>df[['first_name', 'last_name']] = df['name'].str.split(expand=True)<\/em><\/p>\n<p><em>df.drop('name', axis=1, inplace=True)<\/em><\/p><\/blockquote>\n<p><strong>Question 2: Duplicate data<\/strong><\/p>\n<p>Finally, let&#8217;s check whether the data contains duplicate records. If it does, we remove them with <em>drop_duplicates()<\/em> provided by <strong>Pandas<\/strong>:<\/p>\n<blockquote><p><em># Delete duplicate rows<\/em><\/p>\n<p><em>df.drop_duplicates(['first_name', 'last_name'], inplace=True)<\/em><\/p><\/blockquote>\n<p>With that, the membership data in the above case is cleaned up. Let&#8217;s look at the cleaned result. How is it? 
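<\/p>\n<p>Putting the four steps together, the whole routine above can be collected into one short script. This is only a sketch: the input file name <em>members.csv<\/em> is hypothetical, and the column names follow the example above:<\/p>\n<blockquote><p><em>import pandas as pd<\/em><\/p>\n<p><em>df = pd.read_csv('members.csv')<\/em><\/p>\n<p><em># 1. Completeness: fill missing ages, drop blank rows<\/em><\/p>\n<p><em>df['Age'].fillna(df['Age'].mean(), inplace=True)<\/em><\/p>\n<p><em>df.dropna(how='all', inplace=True)<\/em><\/p>\n<p><em># 2. Comprehensiveness: unify weight units to kgs<\/em><\/p>\n<p><em>rows_with_lbs = df['weight'].str.contains('lbs').fillna(False)<\/em><\/p>\n<p><em>for i, lbs_row in df[rows_with_lbs].iterrows():<\/em><\/p>\n<p><em>\u00a0\u00a0\u00a0\u00a0weight = int(float(lbs_row['weight'][:-3]) \/ 2.2)<\/em><\/p>\n<p><em>\u00a0\u00a0\u00a0\u00a0df.at[i, 'weight'] = '{}kgs'.format(weight)<\/em><\/p>\n<p><em># 3. Legality: strip non-ASCII characters from the name<\/em><\/p>\n<p><em>df['name'].replace({r'[^\\x00-\\x7F]+': ''}, regex=True, inplace=True)<\/em><\/p>\n<p><em># 4. Uniqueness: split the name column, then drop duplicates<\/em><\/p>\n<p><em>df[['first_name', 'last_name']] = df['name'].str.split(expand=True)<\/em><\/p>\n<p><em>df.drop('name', axis=1, inplace=True)<\/em><\/p>\n<p><em>df.drop_duplicates(['first_name', 'last_name'], inplace=True)<\/em><\/p><\/blockquote>\n<p>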
Is it clean and standard?<\/p>\n<p><a href=\"https:\/\/www.finereport.com\/en\/wp-content\/uploads\/2019\/12\/03.png\"><img loading=\"lazy\" class=\"aligncenter size-full wp-image-8852\" src=\"https:\/\/www.finereport.com\/en\/wp-content\/uploads\/2019\/12\/03.png\" alt=\"\" width=\"762\" height=\"210\" srcset=\"https:\/\/www.finereport.com\/en\/wp-content\/uploads\/2019\/12\/03.png 762w, https:\/\/www.finereport.com\/en\/wp-content\/uploads\/2019\/12\/03-300x83.png 300w\" sizes=\"(max-width: 762px) 100vw, 762px\" \/><\/a><\/p>\n<h2><span id=\"Get_in_the_Habit_of_Data_Cleaning\"><strong>Get in the Habit of Data Cleaning<\/strong><\/span><\/h2>\n<p>By now you can see that data problems are far from trivial: there are 6 errors in the simple example above. That&#8217;s why we often say that real-world data is &#8220;dirty&#8221; and needs to be cleaned.<\/p>\n<p>Third-party data needs cleaning, and so does the data of your own products. It can be said that without high-quality data there is no high-quality data mining, and data cleaning is the guarantee of high-quality data.<\/p>\n<p>When you&#8217;re new to data science, you will surely find data mining cool and valuable. 
This is true, but even the cool work cannot be separated from basics like the data cleaning we talked about today.<\/p>\n<h3 class=\"graf graf--h3\"><span id=\"You_might_also_be_interested_in\">You might also be interested in\u2026<\/span><\/h3>\n<p class=\"entry-title\"><a href=\"https:\/\/www.finereport.com\/en\/data-analysis\/data-analysis-practice-guide-how-to-begin.html\">Data Analysis Practice Guide\u2014\u2014How to begin<\/a><\/p>\n<p><a href=\"https:\/\/www.finereport.com\/en\/data-analysis\/what-data-analysis-tools-should-i-learn-to-start-a-career-as-a-data-analyst.html\">What Data Analysis Tools Should I Learn to Start a Career as a Data Analyst?<\/a><\/p>\n<p class=\"entry-title\"><a href=\"https:\/\/www.finereport.com\/en\/data-analysis\/6-key-skills-that-data-analysts-need-to-master.html\">6 Key Skills That Data Analysts Need to\u00a0Master<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Use Pandas in Python to clean your data and improve your efficiency in data 
analysis.<\/p>\n","protected":false},"author":1,"featured_media":8851,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[161],"tags":[151],"yst_prominent_words":[2742,2734,2732,2726,2700,1035,2727,291,646,484,2714,2716,2743,2722,2711,2741,2733,2731,2735],"_links":{"self":[{"href":"https:\/\/www.finereport.com\/en\/wp-json\/wp\/v2\/posts\/8850"}],"collection":[{"href":"https:\/\/www.finereport.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.finereport.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.finereport.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.finereport.com\/en\/wp-json\/wp\/v2\/comments?post=8850"}],"version-history":[{"count":5,"href":"https:\/\/www.finereport.com\/en\/wp-json\/wp\/v2\/posts\/8850\/revisions"}],"predecessor-version":[{"id":19548,"href":"https:\/\/www.finereport.com\/en\/wp-json\/wp\/v2\/posts\/8850\/revisions\/19548"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.finereport.com\/en\/wp-json\/wp\/v2\/media\/8851"}],"wp:attachment":[{"href":"https:\/\/www.finereport.com\/en\/wp-json\/wp\/v2\/media?parent=8850"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.finereport.com\/en\/wp-json\/wp\/v2\/categories?post=8850"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.finereport.com\/en\/wp-json\/wp\/v2\/tags?post=8850"},{"taxonomy":"yst_prominent_words","embeddable":true,"href":"https:\/\/www.finereport.com\/en\/wp-json\/wp\/v2\/yst_prominent_words?post=8850"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}