
SDSC 5002

EDA

Categorical Variable

  • The values of a categorical variable are labels of categories.

  • Categorical Variable -> Numerical Summary: use the count or the percentage of individuals who fall in each category; counts and percentages give a categorical variable a numerical summary.

  • Categorical Variable -> Graphical Summary:

    • bar chart
    • Pareto chart -> a bar chart sorted by frequency
    • pie chart -> less useful if we want to compare actual counts (see the sketch below)
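A minimal sketch of these three charts with pandas and matplotlib, using a made-up Series of category labels (the data and names are only for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical categorical data
colors = pd.Series(['red', 'blue', 'red', 'green', 'blue', 'red', 'green', 'red'])

counts = colors.value_counts()                         # numerical summary: counts per category
percent = colors.value_counts(normalize=True) * 100    # numerical summary: percentages

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
counts.sort_index().plot.bar(ax=axes[0], title='Bar chart')        # categories in label order
counts.plot.bar(ax=axes[1], title='Pareto chart')                  # sorted by frequency
counts.plot.pie(ax=axes[2], title='Pie chart', autopct='%.0f%%')   # shares, not counts
plt.tight_layout()
plt.show()
```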

Numerical (Quantitative) Variables

  • The values of a numerical variable are numbers allowing arithmetic operations.
  • Graphical Summary
    • Histogram
      • Divide the range of the possible values into equal intervals.
      • Histogram bars are drawn adjacent with no gaps; bar chart bars are usually separated.
    • Histogram and density curve
      • A density curve does not exactly mark the proportion of values in each range; it is a smooth, continuous curve, typically obtained by kernel density estimation.
  • Numerical Summary
    • Mean = average
    • Median = 50th percentile
    • Quartiles
      • Q1 first quartile (25th percentile)
      • Q3 third quartile (75th percentile)
      • IQR is Q3 - Q1
    • Variance and Standard Deviation (SD)

Note: a boxplot or numerical summaries alone (mean, SD, etc.) cannot fully describe the shape of a distribution. For data with complex shapes (e.g., bimodal distributions), these statistics may hide important features, so combine them with graphical methods such as histograms.

Calculate the quartiles

Formula: $R = (P/100)(N+1)$

  • R: the percentile rank, i.e. the position in the sorted data at which the percentile is taken

  • P: the desired percentile (common values are 25, 50, 75)

  • N: the number of observations in the data set (a worked Python sketch follows below)
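A worked sketch of this formula on a small made-up data set, checked against numpy; the 'weibull' method of numpy.percentile (numpy >= 1.22) is believed to follow the same $(N+1)$ convention:

```python
import numpy as np

def percentile_rank(data, p):
    """Percentile via R = (P/100)(N+1), interpolating linearly between ranks."""
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    r = (p / 100) * (n + 1)              # 1-based rank
    lo = min(max(int(np.floor(r)), 1), n)
    hi = min(max(int(np.ceil(r)), 1), n)
    return x[lo - 1] + (r - np.floor(r)) * (x[hi - 1] - x[lo - 1])

data = [7, 15, 36, 39, 40, 41]           # made-up example
print(percentile_rank(data, 25))         # R = 1.75 -> 7 + 0.75*(15-7) = 13.0
print(np.percentile(data, 25, method='weibull'))   # same (N+1) convention in numpy
```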

Shapes of a Distribution


Boxplot

Box

  • Lower : first quartile (Q1)

  • Middle: Median (Md)

  • Upper: third quartile (Q3)

Whisker

  • Lower: the larger of $Q1 - 1.5 \times IQR$ and the minimum

  • Upper: the smaller of $Q3 + 1.5 \times IQR$ and the maximum


Relationship between two variables

  • Two categorical variables

    • Contingency Table
    • Joint distribution
      • Distribution of a single variable in a two-way table = marginal distribution
  • Two numerical variables

    • Scatterplot
    • Correlation (r) (see the sketch below)
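A minimal sketch of both cases on made-up data (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'gender':   ['M', 'F', 'F', 'M', 'F', 'M'],
    'admitted': ['yes', 'yes', 'no', 'no', 'yes', 'yes'],
    'hours':    [5, 8, 2, 4, 9, 6],
    'score':    [60, 82, 35, 51, 88, 70],
})

# two categorical variables: contingency table; margins=True adds the marginal distribution
table = pd.crosstab(df['gender'], df['admitted'], margins=True)
print(table)

# two numerical variables: scatterplot and correlation r
df.plot.scatter(x='hours', y='score')
print(df['hours'].corr(df['score']))
```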

Simpson’s paradox

A trend appears in several groups of data but disappears or reverses when the groups are combined.

Consider two schools at a US university, a law school and a business school. During admissions for a new term, people suspect the two schools of gender discrimination. The admissions data are summarized in the contingency tables below.

Law school

| Gender | Admitted | Rejected | Total | Admission rate |
| --- | --- | --- | --- | --- |
| Male | 8 | 45 | 53 | 15.1% |
| Female | 51 | 101 | 152 | 33.6% |
| Total | 59 | 146 | 205 | |

Business school

| Gender | Admitted | Rejected | Total | Admission rate |
| --- | --- | --- | --- | --- |
| Male | 201 | 50 | 251 | 80.1% |
| Female | 92 | 9 | 101 | 91.1% |
| Total | 293 | 59 | 352 | |

From the two tables above, women are admitted at a higher rate than men in both schools. Now pool the data from the two schools:

| Gender | Admitted | Rejected | Total | Admission rate |
| --- | --- | --- | --- | --- |
| Male | 209 | 95 | 304 | 68.8% |
| Female | 143 | 110 | 253 | 56.5% |
| Total | 352 | 205 | 557 | |

In the pooled table, the admission rate for women is lower than for men.

This example shows that naively summing grouped data can fail to reflect the true situation within each group (a pandas sketch follows below).
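The tables above can be reproduced with pandas; a sketch using the same numbers:

```python
import pandas as pd

df = pd.DataFrame({
    'school':   ['Law', 'Law', 'Business', 'Business'],
    'gender':   ['Male', 'Female', 'Male', 'Female'],
    'admitted': [8, 51, 201, 92],
    'rejected': [45, 101, 50, 9],
})
df['total'] = df['admitted'] + df['rejected']

# per-school admission rates: women are higher in both schools
per_school = df.assign(rate=df['admitted'] / df['total'])
print(per_school[['school', 'gender', 'rate']])

# pooled admission rates: the direction reverses (Simpson's paradox)
pooled = df.groupby('gender')[['admitted', 'total']].sum()
print(pooled['admitted'] / pooled['total'])
```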

Correlation (r)

Measures the direction and strength of the linear relationship between two numerical variables

$$r(X,Y) = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
  • between $-1$ and $1$
    • The extremes $r = -1$ and $r = 1$ occur if and only if the points on a scatterplot lie exactly along a straight line
  • the sign of $r$ denotes the direction of the relationship
  • $|r|$ denotes the strength of the relationship: the larger $|r|$, the more tightly the points cluster around a line; the smaller $|r|$, the more scattered they are
  • $r$ measures the strength of only the linear relationship; it does not describe curved relationships or the slope
  • $r$ has no units of measurement (see the sketch below)
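A quick sketch computing $r$ from the formula above and checking it against numpy, on made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])      # roughly linear in x

# correlation from the definition
r = ((x - x.mean()) * (y - y.mean())).sum() / (
    np.sqrt(((x - x.mean()) ** 2).sum()) * np.sqrt(((y - y.mean()) ** 2).sum())
)
print(r)
print(np.corrcoef(x, y)[0, 1])   # same value from numpy
```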

Random Variables

The Sharpe Ratio

$$\mathrm{Sharpe}(X) = \frac{\mu - r_f}{\sigma}$$
  • $\mu$ and $\sigma$ are the mean and SD of the return on the investment

  • $r_f$ stands for the return on a risk-free investment (the risk-free rate); see the sketch below
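A minimal sketch, assuming hypothetical per-period returns in a numpy array and a made-up risk-free rate:

```python
import numpy as np

returns = np.array([0.02, -0.01, 0.03, 0.015, 0.005])   # hypothetical investment returns
rf = 0.001                                               # assumed risk-free return per period

sharpe = (returns.mean() - rf) / returns.std(ddof=1)     # (mu - r_f) / sigma, sample SD
print(sharpe)
```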

Visualization and Data processing

Data Preprocessing

Major Tasks:

  • Data cleaning

    • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
    • Example: filling in missing values in a blood test
  • Data integration

    • Integration of multiple databases, data cubes, or files
    • Example: integrating electronic health records with nursing home data
  • Data transformation

    • Normalization and aggregation
    • Example: age, weight, height, income
  • Data reduction

    • Obtains reduced representation in volume but produces the same or similar analytical results
    • Example: under-sample in tumor detection; embedding method in deep learning
  • Data discretization

    • Part of data reduction but with particular importance
    • Example: young, middle-age, elderly

Handle missing data

  • Ignore the tuple
  • Fill in the value
    • manually
    • constant
    • mean
    • most probable value

Imputing missing data by k-Nearest Neighbors

Compute distances using the non-missing columns, find the K nearest neighbors, and impute the missing value with the mean of those neighbors' values in that column (see the sketch below).

Quiz 4: KNNImputer (imputation)

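A minimal sketch with scikit-learn's KNNImputer on made-up data; each missing entry is filled with the mean of that column over the 2 nearest rows, with distances computed on the non-missing features:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

imputer = KNNImputer(n_neighbors=2)        # average the 2 nearest neighbors
print(imputer.fit_transform(X))
```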

Handle noisy data

  • Binning method
    • first sort data and partition into (equal-depth) bins
    • then smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
  • Clustering
    • detect and remove outliers
  • detect suspicious values (possible outliers) and check them manually
  • Regression
    • smooth by fitting the data into regression functions

Data binning

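A sketch of equal-depth binning with smoothing by bin means and by bin boundaries, on a small made-up sorted list (3 bins of 4 values each):

```python
import numpy as np

# sorted (made-up) data, partitioned into equal-depth bins of 4 values each
data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = data.reshape(3, 4)

# smoothing by bin means: every value becomes its bin's mean
smoothed_by_means = np.repeat(bins.mean(axis=1, keepdims=True), 4, axis=1)

# smoothing by bin boundaries: replace each value by the closer of the bin's min/max
lo, hi = bins.min(axis=1, keepdims=True), bins.max(axis=1, keepdims=True)
smoothed_by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(bins)
print(smoothed_by_means)
print(smoothed_by_bounds)
```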

Data transformation

Normalization

  • Min-max normalization (see the sketch after this list)

    $v' = \frac{v - \min_A}{\max_A - \min_A}$

  • Z-score normalization

    $v' = \frac{v - \mathrm{mean}_A}{\mathrm{stddev}_A}$

  • Normalization by decimal scaling

    $v' = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$

  • Log transformation normalization

    $v' = \log(v)$
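A sketch of these transformations in plain numpy on made-up values (scikit-learn's MinMaxScaler and StandardScaler cover the first two):

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])     # made-up attribute values

min_max = (v - v.min()) / (v.max() - v.min())           # rescale to [0, 1]
z_score = (v - v.mean()) / v.std()                      # mean 0, SD 1

j = int(np.floor(np.log10(np.abs(v).max()))) + 1        # smallest j with max(|v'|) < 1
decimal_scaled = v / 10 ** j

log_transformed = np.log(v)

print(min_max)
print(z_score)
print(decimal_scaled)
print(log_transformed)
```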

Discretization

Discretization is the process of transferring continuous functions, models, variables, and equations into discrete counterparts.

Methods

  • Binning
  • Histogram analysis
  • Clustering analysis
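A minimal sketch of binning-based discretization with pandas (equal-width via cut, equal-depth via qcut); the age data and category labels are made up:

```python
import pandas as pd

ages = pd.Series([23, 31, 45, 52, 18, 67, 38, 74, 29, 60])

equal_width = pd.cut(ages, bins=3, labels=['young', 'middle-age', 'elderly'])   # equal-width intervals
equal_depth = pd.qcut(ages, q=3, labels=['young', 'middle-age', 'elderly'])     # equal-frequency bins

print(pd.DataFrame({'age': ages, 'equal_width': equal_width, 'equal_depth': equal_depth}))
```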

Supervised learning

Linear regression model

The simplest of all is a (simple) linear regression model; that is, the response or target y satisfies

$$y = \beta_0 + \beta_1 x + \epsilon$$

in which

  • $\beta_0$ is the intercept term
  • $\beta_1$ is the slope
  • $\beta_0$ and $\beta_1$ are coefficients or parameters of the linear model
  • $\epsilon$ is a noise term, which is assumed to have mean zero
  • It is often assumed to be normally distributed in theoretical analysis.
  • Often, we assume the standard deviation of $\epsilon$, denoted by $\sigma$, is unknown; $\sigma$ is another parameter of the simple linear regression model.

Decomposition of the Equation

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1\bar{x} + \hat{\beta}_1(x - \bar{x})$$
  • $\hat{\beta}_0 + \hat{\beta}_1\bar{x}$: represents the average child height.
  • $\hat{\beta}_1(x - \bar{x})$: represents the "regression to the mean" effect, which shows the adjustment for parents who are taller or shorter than average

Fitting the regression model

$$\min_{\beta_0,\beta_1}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 = \sum_{i=1}^{n}(y_i-\beta_0-\beta_1 x_i)^2$$

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} \quad\text{and}\quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$
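A sketch applying these closed-form estimates to made-up data:

```python
import numpy as np

# made-up data roughly following y = 1 + 2x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1 + 2 * x + rng.normal(0, 1, size=50)

beta1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)                          # should be close to 1 and 2

print(np.polyfit(x, y, deg=1))               # least-squares check: returns [slope, intercept]
```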

Performance

Residual sum of squares(RSS)

$$RSS = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$

Total sum of squares(TSS)

$$TSS = \sum_{i=1}^{n}(y_i-\bar{y})^2$$

Residual standard error (RSE)

$$RSE = \sqrt{\frac{1}{n-p-1}\,RSS}$$

where $p$ = number of features ($p = 1$ in simple linear regression).

R2 (a.k.a. coefficient of determination)

$$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$$

$R^2$ measures the proportion of variability in $Y$ that can be explained using $X$.

  • $R^2$ close to 1 indicates that a large proportion of the variability in the response is explained by the regression.
  • $R^2$ close to 0 indicates that the regression does not explain much of the variability in the response (this might occur because the linear model is wrong, or because the error variance $\sigma^2$ is high, or both).

When evaluated on the training data (in-sample $R^2$) with least-squares fitting, $R^2$ lies between 0 and 1 (see the sketch below).
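A self-contained sketch computing these performance measures directly from their definitions on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1 + 2 * x + rng.normal(0, 1, size=50)

# least-squares fit as in the previous sketch
beta1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

rss = ((y - y_hat) ** 2).sum()                # residual sum of squares
tss = ((y - y.mean()) ** 2).sum()             # total sum of squares
n, p = len(y), 1                              # p = 1 feature in simple linear regression
rse = np.sqrt(rss / (n - p - 1))              # residual standard error
r2 = 1 - rss / tss                            # coefficient of determination

print(rss, tss, rse, r2)
```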

Quiz

Given the regression equation

$$\log y = 1 + 3\log x,$$

How is change in y associated with change in x?

Calculation:

$$y = \exp(1 + 3\log x) = e\,x^3$$

If $x$ increases by 1%:

$$y_{\text{new}} = e\,(1.01\,x_{\text{old}})^3 = e\,(1.01^3\,x_{\text{old}}^3), \qquad \frac{y_{\text{new}}}{y_{\text{old}}} = \frac{e\,(1.01\,x_{\text{old}})^3}{e\,x_{\text{old}}^3} = 1.01^3$$

Approximating,

$$(1.01^3 - 1) \approx 0.03,$$

so $y$ increases by about 3%.
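A one-line numerical check of the approximation:

```python
print(1.01 ** 3 - 1)   # 0.030301, i.e. roughly a 3% increase in y
```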

Using Python

There are a few basic steps when implementing linear regression (a sketch of these steps follows the list):

  1. Import the packages and classes you need.

  2. Provide data to work with and do appropriate transformations if necessary.

  3. Create a regression model and fit it with existing data. Check the results of model fitting to know whether the model is satisfactory.

  4. Apply the model for predictions.
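A sketch of these steps with scikit-learn, on made-up data:

```python
# 1. Import the packages and classes you need.
import numpy as np
from sklearn.linear_model import LinearRegression

# 2. Provide data to work with (features must be 2-D: n_samples x n_features).
x = np.array([5, 15, 25, 35, 45, 55]).reshape(-1, 1)
y = np.array([5, 20, 14, 32, 22, 38])

# 3. Create a regression model, fit it, and check the results.
model = LinearRegression().fit(x, y)
print(model.intercept_, model.coef_, model.score(x, y))   # beta0, beta1, in-sample R^2

# 4. Apply the model for predictions.
print(model.predict(np.array([[65]])))
```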

Logistic regression

Sigmoid for Logistic regression model

$$P(Y=1 \mid X=x) = \sigma(\beta_0 + \beta_1 x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$

$$P(Y=0 \mid X=x) = 1 - P(Y=1 \mid X=x) = \frac{1}{1 + e^{\beta_0 + \beta_1 x}}$$

$$\log\left(\frac{P(Y=1 \mid X=x)}{P(Y=0 \mid X=x)}\right) = \beta_0 + \beta_1 x$$

where $\sigma(z) = \frac{e^z}{1+e^z}$ is the sigmoid function.

example

Suppose we collect data for a group of students in a statistics class with variables $X_1$ = hours studied, $X_2$ = undergrad GPA, and $Y$ = receive an A. We fit a logistic regression and produce estimated coefficients $\hat{\beta}_0 = -6$, $\hat{\beta}_1 = 0.05$, $\hat{\beta}_2 = 1$. (Take $Y = 1$ to mean receiving an A.)

  • Question 1: Estimate the probability that a student who studies for 40 hours and has an undergrad GPA of 3.5 will get an A in the class.
$$\log\left(\frac{P(Y=1\mid X)}{1-P(Y=1\mid X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$$

To estimate the probability that a student who studies 40 hours and has a GPA of 3.5 gets an A, i.e. $P(Y=1\mid X_1=40, X_2=3.5)$, substitute the estimated coefficients into the formula:

$$\log\left(\frac{P(Y=1\mid X_1=40,X_2=3.5)}{1-P(Y=1\mid X_1=40,X_2=3.5)}\right) = -6 + 0.05\times 40 + 1\times 3.5 = -0.5$$

$$P(Y=1\mid X_1=40,X_2=3.5) = \frac{e^{-0.5}}{1+e^{-0.5}} \approx 0.38$$
  • Question 2: How many hours would the student in the previous part need to study to have a 50% chance of getting an A in the class?

For the student to have a 50% chance of getting an A, we need $P(Y=1\mid X) = 0.5$, i.e. a log-odds of 0; substitute into the formula and solve for $X_1$:

$$\log\left(\frac{0.5}{1-0.5}\right) = 0 = -6 + 0.05 X_1 + 1\times 3.5 \;\Rightarrow\; X_1 = 50$$
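A quick numerical check of both answers, using the estimated coefficients from the example:

```python
import numpy as np

b0, b1, b2 = -6, 0.05, 1

# Question 1: probability of an A for 40 hours of study and a 3.5 GPA
log_odds = b0 + b1 * 40 + b2 * 3.5           # = -0.5
p = np.exp(log_odds) / (1 + np.exp(log_odds))
print(p)                                     # about 0.38

# Question 2: hours needed for a 50% chance (log-odds must equal 0)
x1 = (0 - b0 - b2 * 3.5) / b1
print(x1)                                    # 50.0
```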

Linear discriminant analysis

Linear Discriminant Analysis (LDA) is a classification method that assumes the feature distribution for each class follows a Gaussian (normal) distribution with different means but the same covariance matrix.

$$X \mid (Y=0) \sim N(\mu_0, \Sigma)$$

$$X \mid (Y=1) \sim N(\mu_1, \Sigma)$$

The LDA decision boundary is derived by assuming the two classes share the same covariance matrix and differ only in their means, which maximizes the separation between the classes.

Bayes’ Theorem

$$P(A\mid B) = \frac{P(B\mid A)\,P(A)}{P(B)}$$

by using LDA,

$$P(Y=1\mid X=x) = \frac{f_{X\mid Y=1}(x)\,P(Y=1)}{f_X(x)} = \frac{f_{X\mid Y=1}(x)\,P(Y=1)}{f_{X\mid Y=1}(x)\,P(Y=1) + f_{X\mid Y=0}(x)\,P(Y=0)}$$

where $f_{X\mid Y=k}(x)$ is the corresponding Gaussian density (see the sketch below).
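A minimal sketch with scikit-learn's LinearDiscriminantAnalysis on two made-up Gaussian classes that share a covariance matrix:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.3], [0.3, 1.0]])                 # shared covariance matrix
X0 = rng.multivariate_normal([0, 0], cov, size=100)      # class 0
X1 = rng.multivariate_normal([2, 2], cov, size=100)      # class 1
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict([[1.0, 1.0]]))                         # class label for a new point
print(lda.predict_proba([[1.0, 1.0]]))                   # posterior P(Y=k | X=x)
```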

Confusion matrix

This table represents a confusion matrix with the following labels:

| Actual \ Predicted | $\hat{Y}=1$ | $\hat{Y}=0$ |
| --- | --- | --- |
| $Y=1$ | True Positive (TP) | False Negative (FN) |
| $Y=0$ | False Positive (FP) | True Negative (TN) |
  • Type I error rate (false positive rate):

    $FPR = \frac{FP}{FP+TN}$

  • Type II error rate (false negative rate):

    $FNR = \frac{FN}{TP+FN}$

  • Specificity = 1 − type I error rate:

    $Specificity = \frac{TN}{TN+FP}$

  • Power (a.k.a. sensitivity, a.k.a. recall) = 1 − type II error rate:

    $Sensitivity = \frac{TP}{TP+FN}$

ROC

Receiver Operating Characteristic is a graphical plot used to evaluate the performance of a binary classifier.

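A sketch of an ROC curve with scikit-learn, assuming predicted scores from some binary classifier (the labels and scores below are made up):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]                          # true labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5, 0.6, 0.3]    # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)        # x-axis: FPR, y-axis: TPR
print(roc_auc_score(y_true, y_score))                    # area under the ROC curve

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--')                 # random-guess baseline
plt.xlabel('False positive rate')
plt.ylabel('True positive rate (sensitivity)')
plt.show()
```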

Accuracy vs Precision

  • Accuracy refers to the closeness of a measured value to a standard or known value.
  • Precision refers to the closeness of two or more measurements to each other.

example

Based on the following confusion matrix, compute specificity, sensitivity, and precision.

|  | Y=1 | Y=0 |
| --- | --- | --- |
| Ŷ=1 | 20 | 10 |
| Ŷ=0 | 4 | 25 |

True Positives (TP): 20

False Positives (FP): 10

False Negatives (FN): 4

True Negatives (TN): 25

1. Sensitivity (True Positive Rate)

$$Sensitivity = \frac{TP}{TP+FN} = \frac{20}{20+4} = \frac{20}{24} \approx 0.8333 \;(\text{or } 83.33\%)$$

2. Specificity (True Negative Rate)

$$Specificity = \frac{TN}{TN+FP} = \frac{25}{25+10} = \frac{25}{35} \approx 0.7143 \;(\text{or } 71.43\%)$$

3. Precision

$$Precision = \frac{TP}{TP+FP} = \frac{20}{20+10} = \frac{20}{30} \approx 0.6667 \;(\text{or } 66.67\%)$$
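The same numbers can be checked with scikit-learn; a sketch that rebuilds label arrays matching the counts in the table above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score, precision_score

# rebuild labels matching the table: TP=20, FP=10, FN=4, TN=25
y_true = np.array([1] * 20 + [0] * 10 + [1] * 4 + [0] * 25)
y_pred = np.array([1] * 20 + [1] * 10 + [0] * 4 + [0] * 25)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)

print(recall_score(y_true, y_pred))                 # sensitivity = 20/24
print(tn / (tn + fp))                               # specificity = 25/35
print(precision_score(y_true, y_pred))              # precision = 20/30
```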

Functions in Python


Handling missing data
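A minimal sketch of the usual pandas options, assuming a hypothetical DataFrame df with missing values (column names are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 31, 40], 'income': [3000, 4200, np.nan, 5000]})

print(df.isna().sum())                        # count missing values per column
print(df.dropna())                            # ignore (drop) rows with missing values
print(df.fillna(0))                           # fill with a constant
print(df.fillna(df.mean(numeric_only=True)))  # fill with the column mean
```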


Computing percentiles
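A sketch with numpy and pandas on made-up data:

```python
import numpy as np
import pandas as pd

data = pd.Series([7, 15, 36, 39, 40, 41])

print(data.quantile([0.25, 0.5, 0.75]))       # Q1, median, Q3 with pandas
print(np.percentile(data, [25, 50, 75]))      # same with numpy (default 'linear' method)
print(data.describe())                        # includes quartiles, mean, SD
```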


Computing correlation

First select the columns on which correlation can be computed (i.e. the numeric columns):

```python
import seaborn as sns

# compute the correlation matrix on the numeric columns only
corr = df.select_dtypes('number').corr()
sns.heatmap(corr, cmap="Blues", annot=True)
```

Plotting a density curve

  • seaborn.histplot (https://seaborn.pydata.org/generated/seaborn.histplot.html#seaborn-histplot): to draw a histogram together with a density curve, use this function with kde=True
    • data: pandas.DataFrame, numpy.ndarray, mapping, or sequence
    • bins: str, number, vector, or a pair of such values. Generic bin parameter that can be the name of a reference rule, the number of bins, or the breaks of the bins.

```python
sns.histplot(df['medv'], kde=True, bins=10)
```

```python
sns.kdeplot(df['medv'])
```

  • seaborn.displot (https://seaborn.pydata.org/generated/seaborn.displot.html#seaborn-displot): a general entry point that lets you choose the kind of plot and accepts the parameters valid for that plot type (see the usage sketch below)
    • data: pandas.DataFrame, numpy.ndarray, mapping, or sequence
    • kind: {"hist", "kde", "ecdf"}. Approach for visualizing the data. Selects the underlying plotting function and determines the additional set of valid parameters.
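A usage sketch of displot, assuming the same hypothetical df['medv'] column as in the snippets above:

```python
import seaborn as sns

# histogram with a KDE overlay via the generic interface
sns.displot(df['medv'], kind='hist', kde=True, bins=10)

# pure density curve
sns.displot(df['medv'], kind='kde')
```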