One day a client ask me, look we have a set of users and they got some data, but they are not sure what question to ask of it.
Now you have to analyze the data but I am not going to tell you what to analyze.
This is not an uncomman question, Often we use to get this question, “here is a data do some magic”. Is there any science for this or is it just that some people are good with data and can find some intresting things with this.
Atleast This process has name and that is called Exploratory Data Analysis also called EDA
EDA (Getting a better understanding of data)
Exploratory data analysis is the first and foremost step to analyzing any kind of data. EDA is an approach, which seeks to explore the most important and often hidden pattern in the data set.
While doing data exploration we form a hypothesis, which can prove using the hypothesis testing technique. Statisticians called this a bird’s eye view of data and try to make some sense of it.
To solve the business problems we required clean historical data. The better the data, the more insight you can get. Better data means we required a clean, normalized, and standardized data set. Data comes from different sources, for example, Relational data sources like Oracle, MySql, Postgres, all other relational databases systems. Web, IOT data, Marketing, Sales, and the list goes on and on.
There are two main sources of data. Private and Public data source.
- Private Data: Private data is more sensitive in nature and available in the public domain. Banking, Retail, telecom, and media are some of the sectors which rely on data to make decisions. An organization leverage data analytics to become customer-centric.
- Public Data: Government and public agencies have collected a large amount of data for research purposes. Such data is not required any special permission for access. public data available on internet and various other platform. Any one can you these dataset for analysis purpose.
Data Cleaning Approach
There is various type of data quality issue that comes in data. Data Cleaning is the most time-consuming job in the analytics process.
Fix Rows and Column
Checklist for Fixing Rows
- Delete summary rows: for example we can delete the summary rows which have info about the total, subtotal rows.
- Incorrect rows: for example unnecessary header rows, footer rows can delete.
- Extra rows: for example We can delete column number indicators row, blank rows, page numbers, etc.
Checklist for Fixing Columns
- Merge columns for creating unique identifiers if needed: for example Merge State, City into Full address
- Split columns for more data: for example Split address to get State and City to analyze each separately
- Add column names: for example Add column names if missing
- Rename columns: Abbreviations, encoded columns
- Delete columns: Delete unnecessary columns
- Align misaligned columns: Dataset may have shifted columns.
Fix Missing Values
- Set values as missing values: for example Identify values that indicate missing data. and yet are not recognized by the software as such, e.g treat blank strings, “NA”, “XX”, “999”, etc. as missing.
- Delete rows, columns: for example you can delete the row if the number of missing values is very large in number. As this would not impact the analysis. The same goes with Columns, we can delete columns could if the missing values are large in number.
- Fill partial missing values using business judgment. for example, Missing time zone, century, etc. These values are identifiable.
- we should try to get information from reliable external sources as much as possible. but if you can’t, then it is better to keep missing values as such rather than magnifying the existing rows/columns.
- Standardize units: Ensure all observations under a variable have a common and consistent unit. for example convert lbs to kgs, miles/hr to km/hr, etc.
- Scale values if required: make sure the observations under a variable have a common scale.
- Standardize precision for better presentation of data, e.g. 4.5312341 kgs to 4.53 kgs.
- Remove outliers: Remove high and low values that would disproportionately affect the results of your analysis.
- Remove extra characters such as common prefix/suffix. leading/trailing/many spaces, etc. These are irrelevant to the analysis.
- Standardize case: There are various cases that string variables may take. For example UPPERCASE, lowercase, Title Case, Sentence case, etc.
- Standardize format: E.g. 23/10/16 to 2016/10/20, “Modi, Narendra” to “Narendra Modi”, etc.
Fix Invalid Values
- Encode Unicode correct: In case the data is being read as junk characters, try to change the encoding, E.g. CP1252 instead of UTF-8.
- Convert incorrect data types: Correct the incorrect data types to the correct data types for ease of analysis. For example, if numeric values contain strings, it would not be possible to calculate metrics such as mean, median, etc. Some of the common data type corrections are — string to number: “12,300” to “12300”; string to date: “2013-Aug” to “2013/08”; number to a string: “PIN Code 231304” to “231304”; etc.
- Correct values that go beyond range: If some of the values are beyond logical range, e.g. temperature less than -273° C (0° K), you would need to correct them as required. A close look would help you check if there is scope for correction, or if possible to remove this value.
- Correct values not in the list: Remove values that don’t belong to a list. for example, In a data set containing blood groups of individuals, strings “E” or “F” are invalid values and we can remove this.
- Correct wrong structure: Values that don’t follow a defined structure we can remove. for example. In a data set containing pin codes of Indian cities, a pin code of 12 digits would be an invalid value, which we can remove. The same goes with, a phone number of 12 digits would be an invalid value.
After fixing the missing values, standardized the existing values, and fixed the invalid values. We came to the last stage of data cleaning.
- Deduplicate data: Remove identical rows, remove rows where some columns are identical
- Filter rows: Filter by segment, filter by date period to get only the rows relevant to the analysis
- Filter columns: Pick columns relevant to the analysis
- Aggregate data: Group by required keys, aggregate the rest
So This brings us to our first conclusion on the Data Cleaning process. After attempting these processes on the data set that you need, the data is ready for analysis.
In the next section, we will understand the very important topic, set of rules to analyze the data
- Univariate Analysis
- Bivariate analysis
- Derive Metrics
OK, that’s it, we are done now. If you have any questions or suggestions, please feel free to comment. I’ll come up with more Machine Learning topics soon.