Statistics is the cornerstone of data analysis. Once you have learned some statistics, you will notice that many everyday analyses are unreliable. For example, many people like to use averages to summarize results, but that is often rough and misleading. With a grounding in statistics, you can look at data from a more scientific perspective.

Most data analysis draws on the following areas of statistics, so these are worth focusing on:

  • Basic statistics: Mean, Median, Mode, Variance, Standard Deviation, Percentile, etc.
  • Probability Distribution: Geometric Distribution, Binomial Distribution, Poisson Distribution, Normal Distribution, etc.
  • Population and Sample: understanding the basic concepts, the concept of sampling
  • Confidence Interval and Hypothesis Testing: How to Perform Validation Analysis
  • Correlation and Regression Analysis: Basic Models for General Data Analysis

With basic statistics, you can build more diverse visualizations for more granular data analysis. You will also want to learn the relevant Excel functions for basic calculations, or the corresponding visualization methods in Python and R.

With the concepts of population and sample, you will know how to analyze a sample when faced with large-scale data. You can apply hypothesis tests to check intuitive assumptions more rigorously, and use regression analysis to make basic predictions about future or missing data.

Even after you understand the statistical principles, you may not yet know how to apply them with your tools. In that case, look up the relevant implementation methods online, or read some books.

In addition, you can study the principles behind some popular algorithms, such as linear regression, logistic regression, decision trees, neural networks, correlation analysis, clustering, collaborative filtering, and random forests. Going a little deeper, you can also explore related areas such as text analysis, deep learning, and image recognition. For these algorithms, you should not only understand the principles but also be able to explain them fluently, and know some of their application scenarios in various industries. If they are not essential in your current job, they need not be your focus.

This article is a summary of the key knowledge points:


  1. Central tendency
  2. Variability
  3. Normalization
  4. Normal distribution
  5. Sampling distribution
  6. Estimation
  7. Hypothesis testing
  8. T-test

1. Central Tendency

1.1 Mode

The number which appears most often in a set of numbers. 

1.2 Median

The median of a finite list of numbers can be found by arranging all the numbers from smallest to greatest.

If there is an odd number of numbers, the middle one is picked. If there is an even number of observations, then there is no single middle value; the median is then usually defined to be the mean of the two middle values.

1.3 Average

A calculated “central” value of a set of numbers. To calculate it: add up all the numbers, then divide by how many numbers there are.

The average is familiar to most people, but it can be heavily distorted by extreme values. For example, suppose there are 20 people in your class with similar incomes: 19 of them earn around 5,000, but one classmate has successfully started a business and earns 100 million a year. The average income in your class is now about 5 million, which describes nobody. In this case the median is more reasonable and reflects the real situation.
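As a quick illustration of the class-income example above, here is a minimal Python sketch (the figures are the article's hypothetical ones):

```python
from statistics import mean, median, mode

# Hypothetical class incomes: 19 people near 5,000, one outlier at 100 million
incomes = [5000] * 19 + [100_000_000]

print(mean(incomes))    # 5,004,750 -- dragged up by the single outlier
print(median(incomes))  # 5000 -- reflects the typical person
print(mode(incomes))    # 5000 -- the most common value
```

All three measures of central tendency agree at 5,000 except the mean, which the single extreme value pulls a thousandfold higher.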

2. Variability

2.1 Quartile

We mentioned the median above: it divides the sample into 2 equal parts. Quartiles divide the sample into 4 equal parts: the value at the 1/4 position is recorded as Q1, the value at 2/4 (the median) as Q2, and the value at 3/4 as Q3.
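A minimal sketch of quartiles using Python's standard library (note that different quartile methods interpolate slightly differently, so other tools may return slightly different values):

```python
from statistics import quantiles

data = list(range(1, 11))  # 1..10

# n=4 cut points divide the data into quartiles; Q2 equals the median
q1, q2, q3 = quantiles(data, n=4)  # default 'exclusive' method
print(q1, q2, q3)  # 2.75 5.5 8.25
```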

2.2 Interquartile Range

IQR = Q3 - Q1

2.3 Outliers

Values smaller than Q1 - 1.5 × IQR or greater than Q3 + 1.5 × IQR.

Outliers are usually removed, or at least investigated separately, during data processing.
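This 1.5 × IQR rule (Tukey's rule) can be sketched as follows; the sample data is made up for illustration:

```python
from statistics import quantiles

def iqr_outliers(data):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

sample = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print(iqr_outliers(sample))  # [102]
```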

2.4 Variance

The average of the squared deviations from the mean. For a population of n values with mean m, the variance is the sum of (xi - m)² divided by n.

2.5 Standard Deviation

The arithmetic square root of the variance.

2.6 Bessel's Correction: the Corrected Sample Variance

When calculating the sample variance, the denominator is n - 1 rather than the sample size n. The reason: suppose we draw a sample from a Gaussian distribution and use the sample variance to estimate the variance of the full population. Since the sampled values fall mostly near the center x = μ, and values from the tails of the distribution are rarely drawn, dividing by n systematically underestimates the population variance. To compensate for this, the denominator n is replaced by n - 1, which enlarges the estimate. This adjustment is called Bessel's correction.
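The effect of Bessel's correction can be seen by comparing the population and sample variance functions in Python's standard library:

```python
import statistics

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean = 5.0

# Population variance: divide the sum of squared deviations (32) by n = 8
print(statistics.pvariance(sample))  # 4.0

# Sample variance with Bessel's correction: divide by n - 1 = 7
print(statistics.variance(sample))   # ~4.571, slightly larger
```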

3. Normalization

3.1 Standard Score

The major purpose of standard scores is to place scores for any individual on any variable, having any mean and standard deviation, on the same standard scale so that comparisons can be made. Without some standard scale, comparisons across individuals and/or across variables would be difficult to make (Lomax, 2001, p. 68). In other words, a standard score is another way to compare a student’s performance to that of the standardization sample. A standard score (or scaled score) is calculated by taking the raw score and transforming it to a common scale. A standard score is based on a normal distribution with a mean and a standard deviation (see Figure 1). The black line at the center of the distribution represents the mean. The turquoise lines represent standard deviations.
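Computing a standard score (z-score) is a one-liner; the exam numbers below are hypothetical:

```python
def z_score(x, mu, sigma):
    """Standard score: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma

# A test score of 85 on an exam with mean 70 and standard deviation 10
print(z_score(85, 70, 10))  # 1.5 -- 1.5 standard deviations above the mean
```

Because z-scores share the same scale, a 1.5 here can be compared directly with a z-score from a completely different test.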

4. The Normal Distribution

Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known.

The normal distribution is sometimes informally called the bell curve.



The red curve is the standard normal distribution

Many things closely follow a Normal Distribution:

  • heights of people
  • size of things produced by machines
  • errors in measurements
  • blood pressure
  • marks on a test
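A quick check of the well-known 68-95-99.7 rule for the standard normal distribution, using Python's standard library:

```python
from statistics import NormalDist

std = NormalDist(mu=0, sigma=1)  # the standard normal distribution

# Probability mass within 1, 2, and 3 standard deviations of the mean
for k in (1, 2, 3):
    p = std.cdf(k) - std.cdf(-k)
    print(f"within {k} sd: {p:.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```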

5. Sampling Distribution

5.1 Central Limit Theorem

The Central Limit Theorem (CLT) is a statistical theory stating that, given a sufficiently large sample size from a population with finite variance, the mean of all samples from the same population will be approximately equal to the population mean; moreover, the distribution of those sample means approaches a normal distribution, whatever the shape of the original population. Check Understanding The Central Limit Theorem for more you want to know about the CLT.
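A small simulation sketch of the CLT: sample means drawn from a uniform (decidedly non-normal) population cluster tightly around the population mean of 0.5. The sample size and repetition count are arbitrary choices:

```python
import random

random.seed(42)

# Population: uniform on [0, 1), whose mean is 0.5
def sample_mean(n):
    return sum(random.random() for _ in range(n)) / n

# Draw many sample means; their average approaches the population mean,
# and their histogram (not shown) would look approximately normal
means = [sample_mean(50) for _ in range(2000)]
grand_mean = sum(means) / len(means)
print(round(grand_mean, 2))  # ≈ 0.5
```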

5.2 Sampling Distribution

A sampling distribution is a probability distribution of a statistic obtained through a large number of samples drawn from a specific population. The sampling distribution of a given population is the distribution of frequencies of a range of different outcomes that could possibly occur for a statistic of a population.

Sampling distributions are important in statistics because they provide a major simplification en route to statistical inference. More specifically, they allow analytical considerations to be based on the probability distribution of a statistic, rather than on the joint probability distribution of all the individual sample values.

6. Estimation

6.1 Error Bound (Margin of Error)

The maximum expected difference between the sample estimate and the true population parameter. For a mean, it is the critical value times the standard error: z × (s / √n).

6.2 Confidence Level

The probability (for example, 95%) that the interval-construction procedure produces an interval containing the true population parameter.

6.3 Confidence Interval

A range of plausible values for an unknown parameter, computed as the sample estimate plus or minus the error bound: mean ± z × (s / √n).
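A minimal sketch tying the three ideas together, using the normal approximation (for small samples a t critical value would be more appropriate); the data values are made up:

```python
from statistics import NormalDist, mean, stdev
from math import sqrt

def confidence_interval(data, confidence=0.95):
    """Normal-approximation CI for the mean: x-bar ± z * s/sqrt(n)."""
    n = len(data)
    m, s = mean(data), stdev(data)
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # ~1.96 for 95%
    margin = z * s / sqrt(n)  # the error bound (margin of error)
    return m - margin, m + margin

data = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0, 4.9, 5.1]
lo, hi = confidence_interval(data)
print(f"({lo:.3f}, {hi:.3f})")  # an interval centered on the sample mean 5.0
```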

7. Hypothesis Testing

Hypothesis testing is the use of statistics to determine the probability that a given hypothesis is true.

8. T-test

8.1 Introduction

A t-test is a type of inferential statistic used to determine whether there is a significant difference between the means of two groups, which may be related in certain features. It is mostly used when the data sets would follow a normal distribution and may have unknown variances. A t-test is used as a hypothesis-testing tool, which allows testing of an assumption applicable to a population.

You can check Investopedia’s T-Test Definition for more knowledge here.

8.2 Independent Sample T-test

Used to compare the means of two independent groups, for example to analyze whether the average heights of boys and girls differ. The defining feature is the source of the data: the two samples come from different, unrelated subjects.
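A hand-rolled sketch of the equal-variance (pooled) two-sample t-statistic; the height data is hypothetical. In practice you would typically reach for scipy.stats.ttest_ind instead:

```python
from statistics import mean, variance
from math import sqrt

def independent_t(a, b):
    """Two-sample t-statistic assuming equal variances (pooled)."""
    na, nb = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))

boys  = [172, 175, 170, 178, 174]   # hypothetical heights, cm
girls = [160, 165, 158, 163, 162]
print(round(independent_t(boys, girls), 2))  # ≈ 6.72, a large difference
```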

8.3 Paired Sample T-Test

To determine whether a person's height differs between morning and evening, we measure the same people at both times of day. Each person contributes two values, so the observations are paired (matched).
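The paired t-statistic works on the per-person differences; here is a minimal sketch with hypothetical morning/evening heights:

```python
from statistics import mean, stdev
from math import sqrt

def paired_t(x, y):
    """Paired t-statistic: t = mean(d) / (stdev(d) / sqrt(n)), d_i = x_i - y_i."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))

morning = [175.2, 168.5, 180.1, 172.3, 169.9]  # hypothetical heights, cm
evening = [174.5, 167.9, 179.2, 171.8, 169.1]
print(round(paired_t(morning, evening), 2))  # ≈ 9.9
```

Because each difference removes the person-to-person variation, a paired test can detect small but consistent effects that an independent-sample test would miss.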


8.4 Pooled Variance

When two samples have different means but their variances are considered to be the same, the two sample variances are combined into a single pooled variance.

Don’t be scared by the formula; in essence it is a weighted average of the two sample variances, weighted by their degrees of freedom.

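The weighted-average nature of pooled variance in a few lines (the sample data is made up; the weights are the degrees of freedom, n - 1):

```python
from statistics import variance

a = [2.0, 4.0, 6.0, 8.0]  # sample variance = 20/3 ≈ 6.67
b = [1.0, 2.0, 3.0]       # sample variance = 1.0

na, nb = len(a), len(b)
# Weighted average of the two sample variances, weights n_i - 1
sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
print(sp2)  # ≈ 4.4, which lies between the two sample variances
```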

8.5 Cohen’s d

Effect size: In statistics, an effect size is a quantitative measure of the magnitude of a phenomenon. Examples of effect sizes are the correlation between two variables, the regression coefficient in a regression, the mean difference, or even the risk with which something happens, such as how many people survive after a heart attack for every one person that does not survive. For most types of effect size, a larger absolute value always indicates a stronger effect, with the main exception being if the effect size is an odds ratio.

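Cohen's d divides the mean difference by the pooled standard deviation, expressing the difference in standard-deviation units; a minimal sketch with hypothetical treatment/control scores:

```python
from statistics import mean, variance
from math import sqrt

def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

treatment = [23, 25, 28, 30, 24]  # hypothetical scores
control   = [20, 22, 21, 23, 19]
print(round(cohens_d(treatment, control), 2))  # ≈ 2.13, a very large effect
```

A common rule of thumb treats d around 0.2 as small, 0.5 as medium, and 0.8 or more as large.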

What statistical technique do you use most to analyze data? Let me know in the comments below!

Follow FineReport Reporting Software on facebook to master data science together!




References:

Investopedia, "Sampling Distribution"

Wolfram MathWorld, "Hypothesis Testing"

Wikipedia, "Effect Size"

Wikipedia, "Statistical Hypothesis Testing"