Introduction to statistics (Recap)

These notes come from a DataCamp course.

Statistics can tell you how likely something is, but it cannot answer every question, because its predictions are only as good as the data and the models built on them.

Types of statistics

Descriptive statistics (explaining the past)

Describe or summarize the data.
For example, how many papers each faculty publishes per year, or how many black and white cats live in this neighborhood.

Inferential statistics (predicting the future)

Use a sample to draw conclusions about a larger population.

Types of data

There are 2 main types of data

Numeric (quantitative)

Continuous (measured) -- temperature, car speed
Discrete (counted) -- the number of subjects of a particular type

Categorical (qualitative)

Nominal (unordered)

Example: active or inactive

Ordinal (ordered)

Example: survey ratings (good, fair, poor)

Things to keep in mind

Know the data type so that you can pick the right statistical tools to summarize and visualize the data.
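As a rough illustration of matching the summary to the data type, the sketch below uses only the Python standard library. The helper name `summarize` and the sample values are made up for this note; they are not from the course.

```python
from collections import Counter
import statistics


def summarize(values):
    """Pick a summary that suits the data type (illustrative helper)."""
    if all(isinstance(v, (int, float)) for v in values):
        # Numeric data: measures of center make sense.
        return {
            "mean": statistics.mean(values),
            "median": statistics.median(values),
        }
    # Categorical data: count how often each category appears.
    return {
        "counts": dict(Counter(values)),
        "mode": statistics.mode(values),
    }


speeds = [40, 55, 60, 62, 90]                # made-up car speeds (numeric)
ratings = ["good", "fair", "good", "poor"]   # made-up survey answers (ordinal)

print(summarize(speeds))
print(summarize(ratings))
```

A mean of survey ratings would not be meaningful, which is why the categorical branch reports counts and the mode instead.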

Measures of center

Mean, median, mode (the most frequent value)

Mean -- the measure most sensitive to extreme values. It is therefore best suited to symmetrical distributions.

Skewed data -- an asymmetrical distribution, where the median is usually the better choice.

Left skewed

When the data are distributed asymmetrically, the mean is pulled toward the side with the extreme values, which makes it a poor representative of the data, so the median is usually reported instead. This is common in papers on clinical trials, for example when describing the age distribution of the study population.

Right skewed

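What tells us whether the data are symmetric or asymmetric? The relationship between the mean and the median.

Mean = Median -- the distribution is roughly symmetric.

Mean < Median -- the frequencies pile up on the right and thin out on the left (left skewed).

Mean > Median -- the reverse: the frequencies pile up on the left and thin out on the right (right skewed).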

Applied to the age groups of a population, for example, this can tell us how the ages in the sample we collected are distributed.
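The mean-versus-median comparison can be sketched in a few lines of Python. The function name `describe_skew` and the ages are invented for illustration, and the comparison is only a rule of thumb, not a formal test of skewness.

```python
import statistics


def describe_skew(values):
    """Compare mean and median to guess the direction of skew (rule of thumb)."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean < median:
        shape = "left skewed"
    elif mean > median:
        shape = "right skewed"
    else:
        shape = "symmetric"
    return mean, median, shape


# Made-up ages: most people are around 60, one is much younger.
ages = [12, 58, 60, 61, 63, 64, 65, 66, 67]

mean, median, shape = describe_skew(ages)
print(f"mean={mean:.1f}, median={median}, shape={shape}")
```

The single young person drags the mean below the median, while the median stays close to where most of the ages sit.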

Source: https://www.dummies.com/education/math/statistics/how-to-identify-skew-and-symmetry-in-a-statistical-histogram/

Measures of spread

Spread describes how dispersed the data are. Several measures can express it.

Variance

Because the formula squares the deviations from the mean, the result is in squared units, which makes it a little hard to picture.

Standard deviation

The square root of the variance.

Mean absolute deviation

The average absolute distance of each value from the mean.

Quartile

Splits the data into 4 equal parts. Boxplots are usually built on quartiles.

Interquartile range (IQR)

The distance between the 25th and 75th percentiles (simply put, the height of the box in a boxplot). It describes how tightly the middle of the data is clustered. Outliers have little effect on it, yet it can be used to detect outliers.
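The measures above can be computed with Python's standard `statistics` module, as in the sketch below. Some choices here are assumptions that the notes do not specify: the population versions of variance and standard deviation (`pvariance`, `pstdev`), the default quartile method of `statistics.quantiles`, and the common 1.5 x IQR convention for flagging outliers. The helper name and the data are made up.

```python
import statistics


def spread_summary(values):
    """Return several measures of spread for a list of numbers."""
    mean = statistics.mean(values)
    # Quartiles: three cut points that split the data into 4 parts.
    q1, q2, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    # Assumed convention: points beyond 1.5 * IQR from the box are outliers.
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "variance": statistics.pvariance(values),  # mean squared deviation
        "std_dev": statistics.pstdev(values),      # square root of variance
        "mean_abs_dev": sum(abs(x - mean) for x in values) / len(values),
        "quartiles": (q1, q2, q3),
        "iqr": iqr,
        "outliers": [x for x in values if x < lower or x > upper],
    }


print(spread_summary([2, 4, 4, 4, 5, 5, 7, 9]))
print(spread_summary([1, 2, 3, 4, 5, 6, 7, 8, 100]))
```

In the second list the value 100 inflates the variance and standard deviation, but it barely moves the IQR, which is what lets the IQR be used to flag it.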

Ref: https://bolt.mph.ufl.edu/6050-6052/unit-1/one-quantitative-variable-introduction/measures-of-spread/

Conclusion

Measures of spread

Variance
Standard deviation
Mean absolute deviation
Quartile
Interquartile range

Measuring chance

Use probability (0-100%), ranging from impossible (0%) to certain (100%).
To get the same result every time you sample at random, set a random seed. The seed tells the computer how to draw from the population, so the sample, and anything calculated from it, can be reproduced: whenever the seed number is the same, the result calculated from the sample is the same.
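A minimal sketch of seeding with Python's built-in `random` module is shown below. The function name `draw_sample`, the population, and the seed value 42 are arbitrary choices for illustration.

```python
import random


def draw_sample(population, k, seed):
    """Draw k items at random, reproducibly, using a seeded generator."""
    rng = random.Random(seed)  # same seed -> same sequence of random picks
    return rng.sample(population, k)


population = list(range(1, 101))  # made-up population: the numbers 1..100

first = draw_sample(population, 5, seed=42)
second = draw_sample(population, 5, seed=42)

print(first)
print(second)
print(first == second)  # the two samples are identical
```

Using a local `random.Random(seed)` object rather than the global `random.seed()` keeps the reproducibility confined to this one function.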
There are two ways to sample, and they give different values unless the population has no diversity.

Sampling without replacement

Each item is removed once it has been picked.
With this kind of sampling, the probability of each draw depends on the draws before it: dependent events.

Sampling with replacement

Each item is put back after it has been picked.
With this kind of sampling, the probability of each draw is not affected by the draws before it: independent events.

Which one to choose is something to try out, to see which gives probabilities closer to reality. (This probably matters when building a predictive model: a model built on a sample drawn with replacement would likely give one result, and a model built on a sample drawn without replacement would give another.)
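The difference between dependent and independent draws shows up in a small worked example. The sketch below assumes a made-up bag of 10 balls, 3 of them red, and computes the chance of drawing two reds in a row under each scheme. The function name `prob_two_reds` is illustrative.

```python
from fractions import Fraction


def prob_two_reds(red, total, replacement):
    """Probability that two consecutive draws are both red."""
    first = Fraction(red, total)
    if replacement:
        # The ball goes back, so the second draw sees the same bag.
        second = Fraction(red, total)
    else:
        # One red ball is gone, so the second draw depends on the first.
        second = Fraction(red - 1, total - 1)
    return first * second


print(prob_two_reds(3, 10, replacement=True))   # 9/100
print(prob_two_reds(3, 10, replacement=False))  # 1/15
```

In Python's `random` module, `random.sample` draws without replacement and `random.choices` draws with replacement.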

Discrete distribution -- how the probability is spread across each possible event

Used for variables that can be counted (countable variables).
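A fair six-sided die is a simple countable case. The sketch below stores its distribution as a dictionary (an illustrative choice, not something from the course) and checks that the probabilities add up to 1.

```python
from fractions import Fraction

# Each face of a fair die has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

total = sum(die.values())                                      # all outcomes together
expected = sum(face * p for face, p in die.items())            # long-run average roll
p_at_most_2 = sum(p for face, p in die.items() if face <= 2)   # P(roll <= 2)

print(total)        # 1
print(expected)     # 7/2
print(p_at_most_2)  # 1/3
```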

Continuous distribution

It is like waiting for a bus: time is continuous rather than split into separate chunks.

Probability distributions can actually take many shapes, depending on the nature of the data.
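For the bus example, one simple model is a continuous uniform distribution. The sketch below assumes a bus arrives every 12 minutes and that the wait is equally likely to be anywhere between 0 and 12 minutes. Both the interval and the function names are assumptions made for illustration.

```python
def uniform_cdf(x, low, high):
    """P(wait <= x) when the wait is uniform between low and high."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def prob_between(a, b, low, high):
    """P(a <= wait <= b): the area under the flat density between a and b."""
    return uniform_cdf(b, low, high) - uniform_cdf(a, low, high)


print(prob_between(4, 7, 0, 12))  # waiting between 4 and 7 minutes
print(prob_between(0, 7, 0, 12))  # waiting at most 7 minutes
print(prob_between(5, 5, 0, 12))  # waiting exactly 5 minutes
```

With a continuous variable, probabilities belong to ranges of values. The chance of any single exact value is 0, as the last line shows.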

The binomial distribution

The probability depends on the nature of the event. A coin can land heads or tails, but if the coin is weighted toward heads, heads becomes more likely, so the probability changes and is no longer 50%.
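The standard binomial formula, P(k successes in n trials) = C(n, k) p^k (1 - p)^(n - k), can be written directly in Python. The weighted-coin probability of 0.7 below is a made-up value for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n flips when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


fair = binomial_pmf(7, 10, 0.5)      # fair coin
weighted = binomial_pmf(7, 10, 0.7)  # hypothetical coin weighted toward heads

print(f"P(7 heads in 10 flips), fair coin:     {fair:.4f}")
print(f"P(7 heads in 10 flips), weighted coin: {weighted:.4f}")

# The probabilities over every possible number of heads add up to 1.
print(sum(binomial_pmf(k, 10, 0.7) for k in range(11)))
```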

Area under the curve of a probability distribution = 1

If we know how the data are distributed, we can predict the probability that a given event will occur.

There are many kinds of distributions, and each has its own theory for calculating probabilities.
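Both points can be checked numerically for one familiar case, the normal distribution, using `statistics.NormalDist` from the standard library. The heights example (mean 170 cm, standard deviation 10 cm) is hypothetical, and the trapezoid-rule helper `area_under_pdf` is just one simple way to approximate an area.

```python
from statistics import NormalDist

# Hypothetical example: heights with mean 170 cm and standard deviation 10 cm.
heights = NormalDist(mu=170, sigma=10)


def area_under_pdf(dist, low, high, steps=10_000):
    """Approximate the area under dist.pdf between low and high (trapezoid rule)."""
    width = (high - low) / steps
    total = 0.5 * (dist.pdf(low) + dist.pdf(high))
    for i in range(1, steps):
        total += dist.pdf(low + i * width)
    return total * width


# Area over a very wide range (8 standard deviations each side) is about 1.
total_area = area_under_pdf(heights, 170 - 80, 170 + 80)

# Knowing the distribution lets us compute the chance of a specific event.
p_160_to_180 = heights.cdf(180) - heights.cdf(160)

print(round(total_area, 6))
print(round(p_160_to_180, 4))
```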

Central limit theorem

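Even when we cannot draw one very large sample, we can draw smaller samples many times and calculate a statistic such as the mean from each one. The distribution of those values approaches a normal distribution, and its center is close to the value we would calculate from the much larger population.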
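A small simulation gives a feel for this. The sketch below builds a made-up, strongly right-skewed population, draws many samples of 30 with replacement, and compares the sample means with the population. The population, sample size, number of samples, and seeds are all arbitrary assumptions.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, seed=0):
    """Draw many small samples and return the mean of each one."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]


# Made-up, strongly right-skewed population (exponential values).
rng = random.Random(1)
population = [rng.expovariate(1.0) for _ in range(10_000)]

means = sample_means(population, sample_size=30, n_samples=2_000)

print("population mean:      ", round(statistics.fmean(population), 3))
print("mean of sample means: ", round(statistics.fmean(means), 3))
print("population stdev:     ", round(statistics.pstdev(population), 3))
print("stdev of sample means:", round(statistics.pstdev(means), 3))
```

The sample means cluster around the population mean and vary much less than the individual values do. A histogram of `means` would look roughly bell-shaped even though the population itself is skewed.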
