A primer for the first-semester module on the fundamentals of statistics, taught in the Master of Health Economics and Master of Health Service Management programmes at the University of Kelaniya.
Module Information
| Course | MHEC Master of Health Economics |
|---|---|
| Module | Basic Statistics |
| Module Code | MHEC 51023 |
| Referencing Style | APA7 |
Aim
Develop fundamental statistical knowledge and analytical skills to summarize, interpret, and evaluate data in health-related contexts using statistical methods and software.
Learning Outcomes to be assessed
- Classify data into appropriate measurement types.
- Present data using relevant tables, graphical displays, and summary statistics, and quantify uncertainty in study results.
- Formulate research hypotheses into a statistical context in public health studies.
- Estimate quantities of interest and evaluate hypotheses with appropriate statistical methods.
- Accurately interpret statistical methods and results reported in health publications.
- Analyse data using a specific software package.
Module Content
Introduction to Statistics: Classifying health data; summarizing data using simple statistical methods and graphical presentation; Measures of central tendency; Measures of dispersion; Introduction to Probability: terms and concepts, definition, theorem, happening of events; probability under conditions of statistical independence & dependence; Marginal, conditional and joint probability; Sampling Techniques: Random sampling, stratified sampling, cluster sampling; Sampling Distribution: Distribution of sample statistics and its importance in estimation; Probability and non-probability sampling; Confidence Intervals for population means, proportions, and other parameters.
Suggested Reading
Broyles, R. W. (2006). Fundamentals of statistics in health administration. Jones and Bartlett.
Heiman, G. W. (2003). Basic statistics for the behavioral sciences (4th ed). Houghton Mifflin.
Hill, A. B. (1984). A short textbook of medical statistics (11th ed). Hodder and Stoughton.
Kirkwood, B. R., & Sterne, J. A. C. (2003). Essential medical statistics (2nd ed). Blackwell Science.
Matthews, D. E., & Farewell, V. T. (2007). Using and understanding medical statistics (4th, completely rev. and enl. ed). Karger.
Rosner, B. (2016). Fundamentals of biostatistics (8th edition). Cengage Learning.
Sprinthall, R. C. (2012). Basic statistical analysis (9th ed). Pearson Allyn & Bacon.
Zotero Collection
@book{broylesFundamentalsStatisticsHealth2006,
title = {Fundamentals of Statistics in Health Administration},
author = {Broyles, Robert W.},
year = 2006,
publisher = {{Jones and Bartlett}},
address = {Sudbury, Mass},
isbn = {978-0-7637-4556-1},
lccn = {RA409 .B847 2006},
keywords = {Health services administration,Medical statistics,Microsoft Excel (Computer file),Public Health Administration,Statistical methods,Statistics}
}
@book{heimanBasicStatisticsBehavioral2003,
title = {Basic Statistics for the Behavioral Sciences},
author = {Heiman, Gary W.},
year = 2003,
edition = {4th ed},
publisher = {Houghton Mifflin},
address = {Boston},
isbn = {978-0-618-22017-5},
lccn = {QA276.12 .H45 2003},
keywords = {Statistics}
}
@book{hillShortTextbookMedical1984,
title = {A Short Textbook of Medical Statistics},
author = {Hill, Austin Bradford},
year = 1984,
series = {University Medical Texts},
edition = {11. ed},
publisher = {{Hodder and Stoughton}},
address = {London},
isbn = {978-0-340-34742-3},
langid = {english}
}
@book{kirkwoodEssentialMedicalStatistics2003,
title = {Essential Medical Statistics},
author = {Kirkwood, Betty R. and Sterne, Jonathan A. C.},
year = 2003,
edition = {2nd ed},
publisher = {Blackwell Science},
address = {Malden, Mass},
isbn = {978-0-86542-871-3},
lccn = {R853.S7 K497 2003},
keywords = {Biometry,Medical statistics,Medicine,Research Statistical methods,Statistics}
}
@book{matthewsUsingUnderstandingMedical2007,
title = {Using and Understanding Medical Statistics},
author = {Matthews, David E. and Farewell, Vernon T.},
year = 2007,
edition = {4th, completely rev. and enl. ed},
publisher = {Karger},
address = {Basel ; New York},
isbn = {978-3-8055-8189-9},
lccn = {RA409 .M39 2007},
keywords = {Biometry,Medical statistics,methods,Statistics}
}
@book{rosnerFundamentalsBiostatistics2016,
title = {Fundamentals of Biostatistics},
author = {Rosner, Bernard},
year = 2016,
edition = {8th edition},
publisher = {Cengage Learning},
address = {Boston, MA},
isbn = {978-1-305-26892-0},
lccn = {QH323.5 .R674 2016},
keywords = {Biometry,Medical statistics,Textbooks}
}
@book{sprinthallBasicStatisticalAnalysis2012,
title = {Basic Statistical Analysis},
author = {Sprinthall, Richard C.},
year = 2012,
edition = {9th ed},
publisher = {Pearson Allyn \& Bacon},
address = {Boston},
isbn = {978-0-205-05217-2},
lccn = {HA29 .S658 2012},
keywords = {Social sciences,Statistical methods,Statistics}
}
Copy the BibTeX entries above and paste them into Zotero.
Assignment Template
Download Basic Statistics Assignment Template.
Assignment & Model Answers
Probability Theory
(a) What do you mean by simple probability? Explain using your own real-life example.
(b) One card is drawn from a standard pack of 52 cards. Calculate the probability that the card will
- (i) be a Jack
- (ii) not be an Ace
(c) Consider the experiment of rolling a die. Let X be the event “getting a prime number”, Y be the event “getting an odd number”. Write the sets representing the events:
- (i) X or Y (i.e. X ∪ Y)
- (ii) X and Y (i.e. X ∩ Y)
- (iii) X but not Y
- (iv) Not X
(d) When two dice are rolled, find the probability of getting a greater number on the first die than the one on the second, given that the sum should be equal to 9.
(e) The blood groups of 200 people are distributed as follows: 50 have type A blood, 65 have type B blood, 70 have type O blood and 15 have type AB blood. If a person from this group is selected at random,
- (i) what is the probability that this person has O blood type?
- (ii) what is the probability that this person has AB blood type?
(f) What do you understand by the addition theorem of probability? Explain using two events that are not mutually exclusive.
(g) Explain what you mean by the General Multiplication Law in probability theory.
(h) Assume that a bag contains 5 white and 3 black balls. Two balls are drawn at random one after the other without replacement. Find the probability that both balls drawn are black.
(i) A manufacturing company produces TV sets in three plants with daily production of 500, 1000 and 2000 units respectively. According to past experience, it is known that the fraction of defective output produced by the three plants are respectively 0.005, 0.008, 0.010. If a TV is selected from a day’s total production and found to be defective, find the probability that it has come from the first plant.
(j) Suppose a certain disease has an incidence rate of 0.1% (that is, it afflicts 0.1% of the population). A test has been devised to detect this disease. The test does not produce false negatives (anyone who has the disease will test positive), but the false positive rate is 5% (about 5% of people who take the test will test positive even though they do not have the disease). Suppose a randomly selected person takes the test and tests positive. What is the probability that this person actually has the disease?
(k) Calculate the probability that a patient has an illness given a positive test result for the illness. A positive test result means the test indicates that the patient has the illness.
- (i) 8% of the population has the illness.
- (ii) The test returns a positive result 95% of the time for patients who have the illness.
- (iii) The test returns a positive result 7% of the time for people who do not have the illness.
Model Answer
(a) Simple Probability
Simple probability refers to the likelihood of a single event occurring. It is calculated by dividing the number of favorable outcomes by the total number of possible outcomes (Ross, 2014). The formula is:
P(Event) = Number of favorable outcomes / Total number of possible outcomes
This section will explain simple probability using a real-life example from Sri Lanka.
Example from Sri Lanka:
Consider buying a lottery ticket for the Sri Lankan Development Lottery. If 1,000,000 tickets are sold and you buy one ticket, the simple probability of your ticket winning the first prize is:
P(Winning) = 1/1,000,000 = 0.000001 or 0.0001%
This calculation shows that simple probability deals with one specific outcome (your ticket winning) from all possible outcomes (all tickets sold).
(b) Card Probability Calculations
This section calculates probabilities for drawing specific cards from a standard deck.
A standard deck contains 52 cards. The deck has 4 Jacks and 4 Aces.
(i) Probability of drawing a Jack:
P(Jack) = Number of Jacks / Total cards
P(Jack) = 4/52 = 1/13 = 0.0769 or 7.69%
(ii) Probability of not drawing an Ace:
First, find the probability of drawing an Ace: P(Ace) = 4/52 = 1/13
Then use the complement rule:
P(Not Ace) = 1 – P(Ace)
P(Not Ace) = 1 – 1/13 = 12/13 = 0.9231 or 92.31%
(c) Set Operations with Dice Events
This section examines different set operations when rolling a die.
When rolling a die, the sample space is Ω = {1, 2, 3, 4, 5, 6}.
Event X (prime numbers) = {2, 3, 5}
Event Y (odd numbers) = {1, 3, 5}
(i) X or Y (X ∪ Y):
The union includes all outcomes in either event. X ∪ Y = {1, 2, 3, 5}
(ii) X and Y (X ∩ Y):
The intersection includes only outcomes in both events. X ∩ Y = {3, 5}
(iii) X but not Y:
This includes outcomes in X that are not in Y. X but not Y = {2}
(iv) Not X:
The complement of X includes all outcomes not in X. Not X = {1, 4, 6}
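The four set operations above map directly onto Python's built-in set type. The following is a minimal sketch; the variable names are illustrative and not part of the assignment.

```python
# Sample space for one roll of a fair six-sided die
omega = set(range(1, 7))

X = {2, 3, 5}   # prime numbers
Y = {1, 3, 5}   # odd numbers

union = X | Y            # X or Y
intersection = X & Y     # X and Y
x_not_y = X - Y          # X but not Y
not_x = omega - X        # complement of X within the sample space

for label, event in [("X or Y", union), ("X and Y", intersection),
                     ("X but not Y", x_not_y), ("Not X", not_x)]:
    print(label, sorted(event))
```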
(d) Conditional Probability with Two Dice
This section calculates a conditional probability for rolling two dice.
The question asks for P(First die > Second die | Sum = 9).
First, identify all outcomes where the sum equals 9:
- (3,6), (4,5), (5,4), (6,3)
There are 4 outcomes where the sum is 9.
Next, identify outcomes where the first die is greater AND the sum is 9:
- (5,4) and (6,3)
There are 2 favorable outcomes.
Therefore: P(First > Second | Sum = 9) = 2/4 = 1/2 = 0.5 or 50%
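The counting argument above can be reproduced by enumerating all 36 outcomes. This is a small sketch that assumes both dice are fair, so every ordered pair is equally likely; the names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first, second) outcomes for two fair dice
outcomes = list(product(range(1, 7), repeat=2))

sum_is_9 = [(a, b) for a, b in outcomes if a + b == 9]
first_greater = [(a, b) for a, b in sum_is_9 if a > b]

# P(A | B) = |A and B| / |B| when all outcomes are equally likely
p_conditional = Fraction(len(first_greater), len(sum_is_9))
print(sum_is_9, first_greater, p_conditional)
```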
(e) Blood Group Probability
This section analyzes blood group distribution probabilities. This example could represent a blood donation camp in Colombo, Sri Lanka.
Total people = 200
- Type A: 50
- Type B: 65
- Type O: 70
- Type AB: 15
(i) Probability of O blood type:
P(O) = 70/200 = 0.35 or 35%
(ii) Probability of AB blood type:
P(AB) = 15/200 = 0.075 or 7.5%
(f) Addition Theorem of Probability
The addition theorem of probability helps calculate the probability of either of two events occurring (Walpole et al., 2012). This section explains the theorem using events that are not mutually exclusive.
General Addition Rule:
For two events A and B that can occur together:
P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
The subtraction of P(A ∩ B) is necessary because outcomes in both events would be counted twice otherwise.
Example from Sri Lanka:
Consider students at the University of Colombo. Let event A be “student studies Engineering” and event B be “student plays cricket.” These events are not mutually exclusive because some engineering students also play cricket.
Suppose:
- P(Engineering) = 0.30
- P(Cricket) = 0.25
- P(Both Engineering and Cricket) = 0.10
Then: P(Engineering or Cricket) = 0.30 + 0.25 – 0.10 = 0.45 or 45%
This calculation demonstrates why the intersection must be subtracted to avoid double counting students who both study engineering and play cricket.
(g) General Multiplication Law
The General Multiplication Law calculates the probability of two events occurring together (Montgomery and Runger, 2014). This section explains the law and its application.
Formula:
P(A ∩ B) = P(A) × P(B|A)
Where:
- P(A ∩ B) is the probability of both events occurring
- P(A) is the probability of event A
- P(B|A) is the probability of event B given that event A has occurred
The law applies to both independent and dependent events. For independent events, P(B|A) = P(B), which simplifies the formula to P(A ∩ B) = P(A) × P(B).
For dependent events, the occurrence of the first event affects the probability of the second event. This relationship makes the conditional probability P(B|A) different from P(B).
(h) Probability Without Replacement
This section calculates the probability of drawing two black balls without replacement.
The bag contains:
- 5 white balls
- 3 black balls
- Total: 8 balls
For the first draw: P(First black) = 3/8
For the second draw (given first was black): P(Second black | First black) = 2/7
This probability changes because one black ball was removed, leaving 2 black balls and 7 total balls.
Using the multiplication rule:
P(Both black) = P(First black) × P(Second black | First black)
P(Both black) = (3/8) × (2/7) = 6/56 = 3/28 = 0.1071 or 10.71%
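As a cross-check, the multiplication rule can be compared with a brute-force count over every ordered pair of distinct balls. The sketch below uses exact fractions; the ball labels are an illustrative encoding.

```python
from fractions import Fraction
from itertools import permutations

balls = ["W"] * 5 + ["B"] * 3

# Multiplication rule: P(first black) * P(second black | first black)
p_rule = Fraction(3, 8) * Fraction(2, 7)

# Cross-check: enumerate every ordered draw of two distinct balls
pairs = list(permutations(range(len(balls)), 2))
both_black = [p for p in pairs if balls[p[0]] == "B" and balls[p[1]] == "B"]
p_enum = Fraction(len(both_black), len(pairs))

print(p_rule, p_enum, round(float(p_rule), 4))
```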
(i) Bayes’ Theorem: Manufacturing Plants
This section applies Bayes’ Theorem to determine which plant produced a defective TV. This could represent a manufacturing scenario in the Biyagama Export Processing Zone in Sri Lanka.
Given information:
- Plant 1: 500 units/day, 0.5% defective
- Plant 2: 1000 units/day, 0.8% defective
- Plant 3: 2000 units/day, 1.0% defective
- Total production: 3500 units/day
Calculation steps:
First, calculate the probability of each plant:
- P(Plant 1) = 500/3500 = 1/7
- P(Plant 2) = 1000/3500 = 2/7
- P(Plant 3) = 2000/3500 = 4/7
Next, calculate the probability of a defective TV from each plant:
- P(Defective | Plant 1) = 0.005
- P(Defective | Plant 2) = 0.008
- P(Defective | Plant 3) = 0.010
Then, calculate the total probability of a defective TV:
P(Defective) = P(Plant 1) × P(Def|P1) + P(Plant 2) × P(Def|P2) + P(Plant 3) × P(Def|P3)
P(Defective) = (1/7)(0.005) + (2/7)(0.008) + (4/7)(0.010)
P(Defective) = 0.000714 + 0.002286 + 0.005714 = 0.008714
Finally, apply Bayes’ Theorem:
P(Plant 1 | Defective) = [P(Plant 1) × P(Def|P1)] / P(Defective)
P(Plant 1 | Defective) = (0.005/7) / (0.061/7) = 5/61 ≈ 0.0820 or 8.20%
The probability that the defective TV came from Plant 1 is approximately 8.20%.
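The same calculation can be written as a short function that works for any number of sources. This is a sketch only; `posterior` is a hypothetical helper name, and exact fractions are used so that no intermediate rounding enters the result.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return P(H_i | E) for each hypothesis H_i via Bayes' theorem."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)  # law of total probability: P(E)
    return [j / total for j in joint]

production = [500, 1000, 2000]                       # units per day
priors = [Fraction(n, sum(production)) for n in production]
defect_rates = [Fraction(5, 1000), Fraction(8, 1000), Fraction(10, 1000)]

post = posterior(priors, defect_rates)
print(post[0], round(float(post[0]), 4))
```

Working with fractions gives the exact value 5/61 for Plant 1, which is why the result can differ in the last digit from a division of rounded decimals.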
(j) Bayes’ Theorem: Disease Testing
This section uses Bayes’ Theorem to interpret a positive test result.
Given information:
- Disease incidence: 0.1% or 0.001
- Test sensitivity: 100% (no false negatives)
- False positive rate: 5% or 0.05
Calculation steps:
Define events:
- P(Disease) = 0.001
- P(No Disease) = 0.999
- P(Positive | Disease) = 1.0
- P(Positive | No Disease) = 0.05
Calculate total probability of testing positive:
P(Positive) = P(Disease) × P(Pos|Disease) + P(No Disease) × P(Pos|No Disease)
P(Positive) = (0.001)(1.0) + (0.999)(0.05)
P(Positive) = 0.001 + 0.04995 = 0.05095
Apply Bayes’ Theorem:
P(Disease | Positive) = [P(Disease) × P(Pos|Disease)] / P(Positive)
P(Disease | Positive) = 0.001 / 0.05095 = 0.0196 or 1.96%
Despite testing positive, the probability of actually having the disease is only 1.96%. This low probability occurs because the disease is very rare and the false positive rate is relatively high.
(k) Bayes’ Theorem: Illness Diagnosis
This section calculates the probability of having an illness given a positive test result.
Given information:
- P(Illness) = 8% or 0.08
- P(Positive | Illness) = 95% or 0.95
- P(Positive | No Illness) = 7% or 0.07
Calculation steps:
First, determine P(No Illness):
P(No Illness) = 1 – 0.08 = 0.92
Next, calculate the total probability of testing positive:
P(Positive) = P(Illness) × P(Pos|Illness) + P(No Illness) × P(Pos|No Illness)
P(Positive) = (0.08)(0.95) + (0.92)(0.07)
P(Positive) = 0.076 + 0.0644 = 0.1404
Finally, apply Bayes’ Theorem:
P(Illness | Positive) = [P(Illness) × P(Pos|Illness)] / P(Positive)
P(Illness | Positive) = 0.076 / 0.1404 = 0.5413 or 54.13%
The probability that a patient has the illness given a positive test result is 54.13%. This means that even with a positive test, there is still a significant chance (approximately 46%) that the patient does not have the illness.
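Parts (j) and (k) share the same structure, so one small function covers both. The sketch below is illustrative; the function and parameter names are assumptions chosen for readability, not terms from the module handout.

```python
def p_condition_given_positive(prevalence, sensitivity, false_positive_rate):
    """Posterior probability of the condition after a positive test."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)

# Part (j): rare disease, no false negatives, 5% false positives
p_j = p_condition_given_positive(0.001, 1.0, 0.05)
# Part (k): 8% prevalence, 95% sensitivity, 7% false positives
p_k = p_condition_given_positive(0.08, 0.95, 0.07)

print(round(p_j, 4), round(p_k, 4))
```

Changing only the prevalence argument shows how strongly the answer depends on how common the condition is.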
Random Variable and Probability Distribution
(a) Suppose three coins are tossed. Let Y be the random variable representing the number of tails. Find:
- (i) Values of the random variable Y and show them in a table.
- (ii) Find expected value and variance of random variable Y.
(b) A pair of fair dice is rolled. Let X denote the sum of the number of dots on the top faces. Find:
- (i) Values of the random variable X and show them in a table.
- (ii) Find expected value and standard deviation of random variable X.
Model Answer
Question (a): Three Coins Tossed
(i) Values of Random Variable Y
When three coins are tossed, there are eight possible outcomes. The table below shows all outcomes and the number of tails in each outcome.
Table 1: Possible Outcomes When Three Coins Are Tossed
| Outcome | Number of Tails (Y) |
|---|---|
| HHH | 0 |
| HHT | 1 |
| HTH | 1 |
| HTT | 2 |
| THH | 1 |
| THT | 2 |
| TTH | 2 |
| TTT | 3 |
The random variable Y can take four different values: 0, 1, 2, or 3 tails. The next table shows the probability for each value.
Table 2: Probability Distribution of Y
| Y (Number of Tails) | Number of Outcomes | Probability P(Y) |
|---|---|---|
| 0 | 1 | 1/8 = 0.125 |
| 1 | 3 | 3/8 = 0.375 |
| 2 | 3 | 3/8 = 0.375 |
| 3 | 1 | 1/8 = 0.125 |
| Total | 8 | 1.000 |
According to the handout, a discrete random variable must satisfy two conditions (Ranathilaka, 2025). First, all probabilities must be between 0 and 1. Second, all probabilities must sum to 1. This distribution meets both conditions.
(ii) Expected Value and Variance of Y
The expected value shows the average result if the experiment is repeated many times. The formula for expected value is:
E(Y) = Σ [Y × P(Y)]
Table 3: Calculation of Expected Value
| Y | P(Y) | Y × P(Y) |
|---|---|---|
| 0 | 1/8 | 0 |
| 1 | 3/8 | 3/8 = 0.375 |
| 2 | 3/8 | 6/8 = 0.750 |
| 3 | 1/8 | 3/8 = 0.375 |
| Total | 1.0 | 1.5 |
E(Y) = 1.5 tails
This result makes sense. On average, half of three coins will show tails.
The variance measures how spread out the values are. The formula is (Ranathilaka, 2025):
Var(Y) = E(Y²) – [E(Y)]²
First, E(Y²) must be calculated.
Table 4: Calculation of E(Y²)
| Y | Y² | P(Y) | Y² × P(Y) |
|---|---|---|---|
| 0 | 0 | 1/8 | 0 |
| 1 | 1 | 3/8 | 3/8 = 0.375 |
| 2 | 4 | 3/8 | 12/8 = 1.500 |
| 3 | 9 | 1/8 | 9/8 = 1.125 |
| Total | | 1.0 | 3.0 |
E(Y²) = 3.0
Now the variance can be calculated:
Var(Y) = 3.0 – (1.5)² = 3.0 – 2.25 = 0.75
Variance = 0.75
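Tables 1 to 4 can be reproduced by enumerating the eight outcomes. This is a minimal sketch assuming fair coins; exact fractions keep the results identical to the hand calculation.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate the 8 equally likely outcomes and count the tails in each
counts = Counter(outcome.count("T") for outcome in product("HT", repeat=3))
pmf = {y: Fraction(c, 8) for y, c in sorted(counts.items())}

mean = sum(y * p for y, p in pmf.items())          # E(Y)
mean_sq = sum(y**2 * p for y, p in pmf.items())    # E(Y^2)
variance = mean_sq - mean**2                       # Var(Y) = E(Y^2) - [E(Y)]^2

print(pmf, mean, mean_sq, variance)
```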
Question (b): Pair of Fair Dice Rolled
(i) Values of Random Variable X
When two dice are rolled, the sum X can range from 2 to 12. There are 6 × 6 = 36 possible outcomes (Ranathilaka, 2025), because each die shows a number from 1 to 6.
Table 5: Probability Distribution of X (Sum of Two Dice)
| X (Sum) | Ways to Get This Sum | Probability P(X) |
|---|---|---|
| 2 | (1,1) = 1 way | 1/36 = 0.028 |
| 3 | (1,2), (2,1) = 2 ways | 2/36 = 0.056 |
| 4 | (1,3), (2,2), (3,1) = 3 ways | 3/36 = 0.083 |
| 5 | (1,4), (2,3), (3,2), (4,1) = 4 ways | 4/36 = 0.111 |
| 6 | (1,5), (2,4), (3,3), (4,2), (5,1) = 5 ways | 5/36 = 0.139 |
| 7 | (1,6), (2,5), (3,4), (4,3), (5,2), (6,1) = 6 ways | 6/36 = 0.167 |
| 8 | (2,6), (3,5), (4,4), (5,3), (6,2) = 5 ways | 5/36 = 0.139 |
| 9 | (3,6), (4,5), (5,4), (6,3) = 4 ways | 4/36 = 0.111 |
| 10 | (4,6), (5,5), (6,4) = 3 ways | 3/36 = 0.083 |
| 11 | (5,6), (6,5) = 2 ways | 2/36 = 0.056 |
| 12 | (6,6) = 1 way | 1/36 = 0.028 |
| Total | 36 ways | 1.000 |
The table shows that 7 is the most likely sum. This is because there are more ways to get 7 than any other number.
(ii) Expected Value and Standard Deviation of X
Table 6: Calculation of Expected Value
| X | P(X) | X × P(X) |
|---|---|---|
| 2 | 1/36 | 2/36 = 0.056 |
| 3 | 2/36 | 6/36 = 0.167 |
| 4 | 3/36 | 12/36 = 0.333 |
| 5 | 4/36 | 20/36 = 0.556 |
| 6 | 5/36 | 30/36 = 0.833 |
| 7 | 6/36 | 42/36 = 1.167 |
| 8 | 5/36 | 40/36 = 1.111 |
| 9 | 4/36 | 36/36 = 1.000 |
| 10 | 3/36 | 30/36 = 0.833 |
| 11 | 2/36 | 22/36 = 0.611 |
| 12 | 1/36 | 12/36 = 0.333 |
| Total | 1.0 | 7.0 |
E(X) = 7.0
This result is logical. The expected sum is 7, which is in the middle of the range from 2 to 12.
For standard deviation, the variance must be calculated first.
Table 7: Calculation of E(X²)
| X | X² | P(X) | X² × P(X) |
|---|---|---|---|
| 2 | 4 | 1/36 | 4/36 = 0.111 |
| 3 | 9 | 2/36 | 18/36 = 0.500 |
| 4 | 16 | 3/36 | 48/36 = 1.333 |
| 5 | 25 | 4/36 | 100/36 = 2.778 |
| 6 | 36 | 5/36 | 180/36 = 5.000 |
| 7 | 49 | 6/36 | 294/36 = 8.167 |
| 8 | 64 | 5/36 | 320/36 = 8.889 |
| 9 | 81 | 4/36 | 324/36 = 9.000 |
| 10 | 100 | 3/36 | 300/36 = 8.333 |
| 11 | 121 | 2/36 | 242/36 = 6.722 |
| 12 | 144 | 1/36 | 144/36 = 4.000 |
| Total | | 1.0 | 54.833 |
E(X²) = 54.833
Var(X) = E(X²) – [E(X)]² = 54.833 – (7)² = 54.833 – 49 = 5.833
The standard deviation is the square root of variance (Ranathilaka, 2025):
SD(X) = √Var(X) = √5.833 = 2.415
Standard Deviation = 2.415
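The distribution in Table 5 and the summary values in Tables 6 and 7 can be checked with a short enumeration. The sketch assumes two fair dice; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import sqrt

# Count how many of the 36 ordered outcomes give each sum
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {x: Fraction(c, 36) for x, c in sorted(counts.items())}

mean = sum(x * p for x, p in pmf.items())                    # E(X)
variance = sum(x**2 * p for x, p in pmf.items()) - mean**2   # Var(X)
sd = sqrt(variance)                                          # SD(X)

print(mean, variance, round(sd, 3))
```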
Discrete Probability Distribution
(a) The mean and the standard deviation of a binomial distribution are 45 and 6 respectively. Calculate n, p and q.
(b) The probability of a student getting university admission from a school is 0.6. If 5 students are studying in the school, what is the probability that at least one will get university admission from this school?
(c) Suppose 23.5% of individuals in a given patient population are HIV positive. If 16 individuals are selected from this patient population, what are the chances that at least eight of these individuals are HIV positive?
(d) Suppose it is known that 5% of adults who take a certain medication experience negative side effects. Find the probability that more than a certain number of patients in a random sample of 100 will experience negative side effects.
(e) In a book with 500 pages, 50 pages contain spelling errors. 25 pages are randomly selected from the book and checked for spelling errors. What is the probability of finding errors in at least two pages?
(f) A box contains 100 fruits. Of these, if 2% of the fruits are damaged, what is the probability that at most 5 fruits were damaged?
(g) The hospital had 3000 deliveries each year… (Poisson).
- (i) On how many days in the year would 5 or more deliveries be expected?
- (ii) Over the course of one year, what is the greatest number of deliveries expected in any night?
Model Answer
(a) Finding n, p, and q from Mean and Standard Deviation
Given Information:
- Mean (μ) = 45
- Standard deviation (σ) = 6
Solution:
For a binomial distribution, the mean and variance have specific formulas (Walpole et al., 2012). The mean equals np, and the variance equals npq.
First, the variance is calculated from the standard deviation:
- Variance = σ² = 6² = 36
Second, two equations are set up:
- Mean: np = 45
- Variance: npq = 36
Third, q is found by dividing the variance by the mean:
- q = npq/np = 36/45 = 0.8
Fourth, p is calculated because p + q = 1:
- p = 1 – 0.8 = 0.2
Finally, n is found:
- n = 45/0.2 = 225
Answer: n = 225, p = 0.2, q = 0.8
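The algebra above can be packaged as a small function. This is a sketch; `binomial_params` is a hypothetical helper, and it assumes the inputs really come from a binomial distribution (so that the variance is smaller than the mean).

```python
from fractions import Fraction

def binomial_params(mean, sd):
    """Recover (n, p, q) from the mean np and standard deviation sqrt(npq)."""
    variance = Fraction(sd) ** 2
    q = variance / Fraction(mean)   # npq / np
    p = 1 - q
    n = Fraction(mean) / p          # n = np / p
    return n, p, q

n, p, q = binomial_params(45, 6)
print(n, p, q)
```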
(b) University Admission Probability
Given Information:
- Number of students (n) = 5
- Probability of admission (p) = 0.6
- Need to find: P(at least one gets admission)
Solution:
This is a binomial distribution problem (Ross, 2014). The phrase “at least one” means one or more students get admission.
The complement rule is used here. It is easier to calculate the probability that no students get admission, then subtract from 1.
P(at least one) = 1 – P(none)
P(X ≥ 1) = 1 – P(X = 0)
The binomial formula is applied:
P(X = 0) = C(5,0) × (0.6)⁰ × (0.4)⁵
P(X = 0) = 1 × 1 × 0.01024
P(X = 0) = 0.01024
Therefore: P(X ≥ 1) = 1 – 0.01024 = 0.98976
Answer: The probability is 0.9898 or approximately 99%. This means almost all groups of 5 students will have at least one student getting admission.
(c) HIV Positive Individuals
Given Information:
- Proportion HIV positive (p) = 0.235 or 23.5%
- Sample size (n) = 16 individuals
- Need to find: P(at least 8 are HIV positive)
Solution:
This follows a binomial distribution (Devore, 2015). The calculation requires finding P(X ≥ 8).
P(X ≥ 8) = P(X=8) + P(X=9) + P(X=10) + … + P(X=16)
Each probability is calculated using the binomial formula: P(X = x) = C(16,x) × (0.235)ˣ × (0.765)⁽¹⁶⁻ˣ⁾
Calculating each term:
- P(X=8) = C(16,8) × (0.235)⁸ × (0.765)⁸ = 0.01404
- P(X=9) = C(16,9) × (0.235)⁹ × (0.765)⁷ = 0.00383
- P(X=10) = C(16,10) × (0.235)¹⁰ × (0.765)⁶ = 0.00082
- P(X=11) = 0.00014
- P(X=12) = 0.00002
- P(X=13) to P(X=16) ≈ 0.000002 (very small)
Sum = 0.01404 + 0.00383 + 0.00082 + 0.00014 + 0.00002 + 0.000002 ≈ 0.0189
Answer: The probability is approximately 0.0189 or 1.89%. This is a low probability. It shows that having 8 or more HIV positive individuals in a sample of 16 is uncommon when the population rate is 23.5%.
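Upper-tail sums like this are easy to get wrong by hand, so a direct numerical check is useful. The sketch below uses only the standard library; `binom_pmf` and `binom_tail` are hypothetical helper names.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Bin(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

# Individual terms for x = 8..12, then the full upper tail
for k in range(8, 13):
    print(k, round(binom_pmf(k, 16, 0.235), 5))

p_at_least_8 = binom_tail(8, 16, 0.235)
print(round(p_at_least_8, 4))
```

The same helper answers part (b): `binom_tail(1, 5, 0.6)` gives the probability that at least one of the five students is admitted.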
(d) Medication Side Effects
Note: The question states “more than a certain number” but does not specify the number. I will demonstrate with “more than 8 patients” as an example.
Given Information:
- Sample size (n) = 100 patients
- Probability of side effects (p) = 0.05 or 5%
- Need to find: P(X > 8)
Solution:
This is a binomial distribution problem (Montgomery and Runger, 2018). However, because n is large, calculating each term by hand is tedious.
P(X > 8) = 1 – P(X ≤ 8)
P(X ≤ 8) = P(X=0) + P(X=1) + P(X=2) + … + P(X=8)
Using binomial calculations or statistical software: P(X ≤ 8) ≈ 0.9369
Therefore: P(X > 8) = 1 – 0.9369 = 0.0631
Answer: The probability that more than 8 patients experience side effects is approximately 0.0631 or 6.31%. This example shows that when the side effect rate is 5%, having many patients with side effects is relatively unlikely.
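The "statistical software" step can be as simple as the sketch below. The threshold of 8 is the illustrative cut-off chosen in this model answer, not a value given in the question, and `binom_cdf` is a hypothetical helper name.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Bin(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

threshold = 8  # illustrative cut-off; the question leaves this number unspecified
p_at_most = binom_cdf(threshold, 100, 0.05)
p_more_than = 1 - p_at_most
print(round(p_at_most, 4), round(p_more_than, 4))
```

Replacing `threshold` with another value answers the same question for a different cut-off.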
Sri Lankan Context: This type of calculation is important for the Ministry of Health in Sri Lanka when introducing new medications. It helps predict how many patients might need additional medical support.
(e) Spelling Errors in Book Pages
Given Information:
- Total pages = 500
- Pages with errors = 50
- Sample size = 25 pages
- Need to find: P(at least 2 pages have errors)
Solution:
This can be solved approximately using the binomial distribution (Mann, 2016), because the 25 sampled pages are a small fraction of the 500 pages in the book. The probability that a randomly selected page has an error is: p = 50/500 = 0.1
Parameters are:
- n = 25
- p = 0.1
- q = 0.9
P(at least 2) = 1 – P(less than 2)
P(X ≥ 2) = 1 – [P(X=0) + P(X=1)]
Calculations:
P(X=0) = C(25,0) × (0.1)⁰ × (0.9)²⁵ = 0.0718
P(X=1) = C(25,1) × (0.1)¹ × (0.9)²⁴ = 0.1994
P(X ≥ 2) = 1 – (0.0718 + 0.1994) = 1 – 0.2712 = 0.7288
Answer: The probability is 0.7288 or approximately 72.88%. This means there is a high chance of finding errors in at least two pages when 25 pages are checked.
Sri Lankan Context: This type of analysis is useful for publishers in Sri Lanka, such as Educational Publications Department, when doing quality control checks on textbooks.
(f) Damaged Fruits
Given Information:
- Total fruits (n) = 100
- Proportion damaged (p) = 0.02 or 2%
- Need to find: P(at most 5 damaged)
Solution:
When n is large and p is small, the Poisson distribution can approximate the binomial distribution (Triola, 2017). This makes calculations easier.
The mean number of damaged fruits is: λ = np = 100 × 0.02 = 2
P(X ≤ 5) = P(X=0) + P(X=1) + P(X=2) + P(X=3) + P(X=4) + P(X=5)
Using Poisson formula: P(X=x) = (e⁻λ × λˣ) / x!
Calculations:
- P(X=0) = (e⁻² × 2⁰) / 0! = 0.1353
- P(X=1) = (e⁻² × 2¹) / 1! = 0.2707
- P(X=2) = (e⁻² × 2²) / 2! = 0.2707
- P(X=3) = (e⁻² × 2³) / 3! = 0.1804
- P(X=4) = (e⁻² × 2⁴) / 4! = 0.0902
- P(X=5) = (e⁻² × 2⁵) / 5! = 0.0361
Sum = 0.1353 + 0.2707 + 0.2707 + 0.1804 + 0.0902 + 0.0361 = 0.9834
Answer: The probability is 0.9834 or 98.34%. This shows that it is very likely that 5 or fewer fruits are damaged when the damage rate is 2%.
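To see how good the Poisson approximation is here, the sketch below computes the Poisson answer alongside the exact binomial probability. The helper names are hypothetical, and only the standard library is used.

```python
from math import comb, exp, factorial

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(exp(-lam) * lam**i / factorial(i) for i in range(k + 1))

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Bin(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n, p = 100, 0.02
approx = poisson_cdf(5, n * p)   # Poisson approximation with lam = np = 2
exact = binom_cdf(5, n, p)       # exact binomial, for comparison
print(round(approx, 4), round(exact, 4))
```

The two values agree to about two decimal places, which supports using the approximation for this problem.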
Sri Lankan Context: Fruit exporters in Sri Lanka, such as those exporting mangoes or pineapples, use this type of analysis for quality control before shipping.
(g) Hospital Deliveries
Note: The question as given is truncated. It states 3000 deliveries per year, and part (ii) is open to interpretation. Part (i) is answered below, followed by guidance for part (ii).
Given Information:
- Total deliveries per year = 3000
- Days per year = 365
Solution for part (i):
This is a Poisson distribution problem (Hogg et al., 2019). First, the average rate per day is calculated:
λ = 3000/365 = 8.22 deliveries per day
The question asks: On how many days would 5 or more deliveries be expected?
First, find P(X ≥ 5):
P(X ≥ 5) = 1 – P(X < 5)
P(X < 5) = P(X=0) + P(X=1) + P(X=2) + P(X=3) + P(X=4)
Using λ = 8.22:
- P(X=0) = e⁻⁸·²² = 0.00027
- P(X=1) = (e⁻⁸·²² × 8.22) / 1! = 0.00221
- P(X=2) = (e⁻⁸·²² × 8.22²) / 2! = 0.00910
- P(X=3) = (e⁻⁸·²² × 8.22³) / 3! = 0.02492
- P(X=4) = (e⁻⁸·²² × 8.22⁴) / 4! = 0.05121
P(X < 5) = 0.00027 + 0.00221 + 0.00910 + 0.02492 + 0.05121 = 0.08771
P(X ≥ 5) = 1 – 0.08771 = 0.91229
Number of days = 365 × 0.91229 ≈ 333 days
Answer (i): On approximately 333 days in the year, 5 or more deliveries would be expected.
Solution for part (ii):
The question “greatest number of deliveries expected in any night” needs clarification. Two interpretations are possible:
If asking about the mode (most likely number): When λ is not a whole number, the mode of a Poisson distribution is the largest integer not exceeding λ. With λ = 8.22, the most likely number is therefore 8 deliveries.
If asking about maximum capacity planning: Hospitals typically plan for the 95th or 99th percentile. Using λ = 8.22 and Poisson tables or software, the 95th percentile is approximately 13 deliveries, and the 99th percentile is approximately 16 deliveries.
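Both parts can be checked numerically under the assumption used in this answer, namely λ = 3000/365 deliveries per day. The sketch below is illustrative; the helper names are hypothetical, and the percentile is defined here as the smallest count whose cumulative probability reaches the target.

```python
from math import exp, factorial, floor

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

def poisson_percentile(q, lam):
    """Smallest k with P(X <= k) >= q."""
    k = 0
    while poisson_cdf(k, lam) < q:
        k += 1
    return k

lam = 3000 / 365                                  # assumed daily rate
expected_days = 365 * (1 - poisson_cdf(4, lam))   # days with 5 or more deliveries
mode = floor(lam)                                 # valid because lam is not an integer
p95 = poisson_percentile(0.95, lam)
p99 = poisson_percentile(0.99, lam)
print(round(expected_days), mode, p95, p99)
```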
Continuous Probability Distribution
(a) State the name and important characteristics of the following distributions:
- (i) X ~ Bin(12, 0.5)
- (ii) X ~ N(36, 16)
(b) What do you mean by standard normal distribution? Explain.
(c) A manufacturer of metal pistons finds that on the average, 12% of his pistons are rejected because they are either oversize or undersize. What is the probability that a batch of 10 pistons will contain
- (i) no more than 2 rejects?
- (ii) at least 2 rejects?
(d) Use the standard normal table to find the following values (Z is a standard normal random variable):
- (i) P(Z < 1.5)
- (ii) P(−1.5 < Z < 1.5)
- (iii) P(Z > 1.046)
- (iv) P(0 < Z < 1.05)
- (v) P(|Z| ≤ 1.86)
(e) An automobile battery manufacturer claims that its midgrade battery has a mean life of 50 months with a standard deviation of 6 months (approximately normal).
- (i) Find P(battery lasts less than 48 months).
- (ii) Find P(mean of a random sample of 36 batteries is less than 48 months).
(f) The marks of students in a particular class are normally distributed with mean 60 and variance 100. The pass mark is 40 and the mark for an A grade is 75. What is the probability that a student does not pass the exam?
(g) The average weekly income of MBA graduates one year after graduation is $600 (SD = $100). What is the probability that 25 randomly selected graduates have an average weekly income of less than $550?
(h) A sample of 400 students is selected from a particular class in which the average height is 67.39 inches and the variance is 1.69.
- (i) What is the distribution of the sample mean?
- (ii) How many students have the average height greater than 67.5 inch in this sample?
Model Answer
Question (a): Distribution Names and Characteristics
(i) X ~ Bin(12, 0.5)
Name: Binomial Distribution
This distribution is called the binomial distribution. The binomial distribution is a discrete probability distribution. It counts the number of successes in a fixed number of independent trials (OpenStax, 2023). Each trial has only two outcomes: success or failure.
Important Characteristics:
1. The distribution has two parameters: n = 12 (number of trials) and p = 0.5 (probability of success).
2. The number of trials is fixed at n = 12.
3. Each trial is independent. The outcome of one trial does not affect other trials.
4. The probability of success remains constant at p = 0.5 for each trial.
5. The mean (expected value) is μ = np = 12 × 0.5 = 6.
6. The variance is σ² = np(1-p) = 12 × 0.5 × 0.5 = 3.
7. The standard deviation is σ = √3 = 1.732.
8. Since p = 0.5, this distribution is symmetric around the mean.
Example from Sri Lanka: This distribution can be used to study the number of heads when tossing a Sri Lankan rupee coin 12 times. The probability of getting heads is 0.5 for each toss.
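The characteristics listed above can be checked numerically. The following is a minimal sketch using only the Python standard library; the variable names are illustrative and not part of the module material.

```python
from math import comb, sqrt

n, p = 12, 0.5  # parameters of X ~ Bin(12, 0.5)

# Probability mass function P(X = x) for x = 0, 1, ..., n
pmf = [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]

# Mean and variance computed directly from the definition
mean = sum(x * px for x, px in enumerate(pmf))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))

# With p = 0.5 the distribution should be symmetric: P(X = x) = P(X = n - x)
symmetric = all(abs(pmf[x] - pmf[n - x]) < 1e-12 for x in range(n + 1))

print(f"mean = {mean:.3f}")           # should agree with np = 6
print(f"variance = {variance:.3f}")   # should agree with np(1 - p) = 3
print(f"sd = {sqrt(variance):.3f}")
print(f"symmetric = {symmetric}")
```

Changing p to a value other than 0.5 in this sketch shows the symmetry disappearing while the formulas for the mean and variance continue to hold.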
(ii) X ~ N(36, 16)
Name: Normal Distribution
This distribution is called the normal distribution. The normal distribution is a continuous probability distribution. It has a bell-shaped curve that is symmetric around the mean (Bhandari, 2023). The normal distribution is also known as the Gaussian distribution.
Important Characteristics:
1. The distribution has two parameters: μ = 36 (mean) and σ² = 16 (variance).
2. The standard deviation is σ = √16 = 4.
3. The curve is symmetric around the mean. The mean, median, and mode are all equal to 36.
4. The curve has a bell shape. The highest point is at the center (at μ = 36).
5. About 68% of values fall within one standard deviation of the mean (between 32 and 40).
6. About 95% of values fall within two standard deviations of the mean (between 28 and 44).
7. About 99.7% of values fall within three standard deviations of the mean (between 24 and 48).
8. The curve approaches the x-axis but never touches it. The curve extends to positive and negative infinity.
Example from Sri Lanka: This distribution can be used to study the mathematics marks of students in a Sri Lankan school. If the marks are approximately normally distributed with an average of 36 and a variance of 16, they can be modelled by N(36, 16).
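The 68%, 95% and 99.7% figures can be reproduced with a short sketch, assuming Python 3.8 or later for `statistics.NormalDist`. Note that the notation N(36, 16) gives the variance, whereas `NormalDist` expects the standard deviation, so 4 is passed rather than 16.

```python
from statistics import NormalDist

mu, sigma = 36, 4  # sigma is the standard deviation, the square root of 16
marks = NormalDist(mu=mu, sigma=sigma)

# Probability of falling within k standard deviations of the mean
within = {}
for k in (1, 2, 3):
    lower, upper = mu - k * sigma, mu + k * sigma
    within[k] = marks.cdf(upper) - marks.cdf(lower)
    print(f"P({lower} < X < {upper}) = {within[k]:.4f}")
```

The printed values are the exact probabilities that the rounded 68%, 95% and 99.7% rule summarises.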
Question (b): Standard Normal Distribution
The standard normal distribution is a special type of normal distribution. It has a mean of 0 and a standard deviation of 1 (Bhandari, 2020). The notation for this distribution is Z ~ N(0, 1).
Key Features of Standard Normal Distribution:
1. The mean (μ) is always 0.
2. The standard deviation (σ) is always 1.
3. The variance (σ²) is always 1.
4. The curve is symmetric around zero.
5. The total area under the curve equals 1.
Z-Score Transformation:
Any normal distribution can be converted to a standard normal distribution. This process is called standardization. The formula for standardization is:
Z = (X – μ) / σ
Where:
• Z is the standard normal variable (Z-score)
• X is the original value
• μ is the mean of the original distribution
• σ is the standard deviation of the original distribution
Why Use Standard Normal Distribution:
1. It makes calculations easier. Researchers can use standard normal tables (Z-tables) to find probabilities.
2. It allows comparison between different distributions. Different datasets can be compared after standardization.
3. It is useful for hypothesis testing. Many statistical tests use the Z-distribution.
Example from Sri Lanka: If a student scores 75 in mathematics and the class mean is 60 with standard deviation 10, the Z-score is Z = (75 – 60) / 10 = 1.5. This means the student scored 1.5 standard deviations above the mean.
Question (c): Metal Pistons Problem
This problem uses the binomial distribution. The manufacturer rejects 12% of pistons. This means the probability of rejection is p = 0.12. The batch contains n = 10 pistons.
The distribution is X ~ Bin(10, 0.12).
The binomial probability formula is:
P(X = x) = C(n,x) × p^x × (1-p)^(n-x)
Where C(n,x) = n! / (x! × (n-x)!)
(i) Probability of No More Than 2 Rejects
This means P(X ≤ 2). This is calculated by adding the probabilities:
P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2)
Calculate P(X = 0):
P(X = 0) = C(10,0) × (0.12)^0 × (0.88)^10
P(X = 0) = 1 × 1 × 0.2785 = 0.2785
Calculate P(X = 1):
P(X = 1) = C(10,1) × (0.12)^1 × (0.88)^9
P(X = 1) = 10 × 0.12 × 0.3165 = 0.3798
Calculate P(X = 2):
P(X = 2) = C(10,2) × (0.12)^2 × (0.88)^8
P(X = 2) = 45 × 0.0144 × 0.3596 = 0.2330
Total probability:
P(X ≤ 2) = 0.2785 + 0.3798 + 0.2330 = 0.8913
Answer: The probability that a batch contains no more than 2 rejects is 0.8913 or 89.13%.
(ii) Probability of At Least 2 Rejects
This means P(X ≥ 2). This is calculated using:
P(X ≥ 2) = 1 – P(X < 2) = 1 – [P(X = 0) + P(X = 1)]
From the previous calculations:
P(X = 0) = 0.2785
P(X = 1) = 0.3798
P(X ≥ 2) = 1 – (0.2785 + 0.3798) = 1 – 0.6583 = 0.3417
Answer: The probability that a batch contains at least 2 rejects is 0.3417 or 34.17%.
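The hand calculation for both parts can be cross-checked with a small sketch. The helper name `binom_pmf` is an illustrative choice, not a library function.

```python
from math import comb

def binom_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.12  # batch of 10 pistons, 12% rejection rate

# (i) no more than 2 rejects: P(X <= 2)
p_at_most_2 = sum(binom_pmf(x, n, p) for x in range(3))

# (ii) at least 2 rejects: P(X >= 2) = 1 - [P(X = 0) + P(X = 1)]
p_at_least_2 = 1 - sum(binom_pmf(x, n, p) for x in range(2))

print(f"P(X <= 2) = {p_at_most_2:.4f}")
print(f"P(X >= 2) = {p_at_least_2:.4f}")
```

Note that the two events overlap at X = 2, which is why the two probabilities add up to more than 1.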
Question (d): Standard Normal Table Values
These calculations use the standard normal distribution Z ~ N(0, 1). The values are found using standard normal tables (Z-tables).
(i) P(Z < 1.5)
From the standard normal table, the value for Z = 1.5 is 0.9332.
Answer: P(Z < 1.5) = 0.9332
(ii) P(-1.5 < Z < 1.5)
This probability is calculated as:
P(-1.5 < Z < 1.5) = P(Z < 1.5) – P(Z < -1.5)
Because the curve is symmetric:
P(Z < -1.5) = 1 – P(Z < 1.5) = 1 – 0.9332 = 0.0668
Therefore:
P(-1.5 < Z < 1.5) = 0.9332 – 0.0668 = 0.8664
Answer: P(-1.5 < Z < 1.5) = 0.8664
(iii) P(Z > 1.046)
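First, find P(Z < 1.046). Most tables list Z to two decimal places, so interpolate between P(Z < 1.04) = 0.8508 and P(Z < 1.05) = 0.8531, which gives approximately 0.8522.
Then:
P(Z > 1.046) = 1 – P(Z < 1.046) = 1 – 0.8522 = 0.1478
Answer: P(Z > 1.046) = 0.1478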
(iv) P(0 < Z < 1.05)
This probability is:
P(0 < Z < 1.05) = P(Z < 1.05) – P(Z < 0)
From the table, P(Z < 1.05) = 0.8531
P(Z < 0) = 0.5000 (the mean divides the distribution in half)
P(0 < Z < 1.05) = 0.8531 – 0.5000 = 0.3531
Answer: P(0 < Z < 1.05) = 0.3531
(v) P(|Z| ≤ 1.86)
The absolute value means:
P(|Z| ≤ 1.86) = P(-1.86 ≤ Z ≤ 1.86)
This equals:
P(-1.86 ≤ Z ≤ 1.86) = P(Z < 1.86) – P(Z < -1.86)
From the table, P(Z < 1.86) = 0.9686
By symmetry, P(Z < -1.86) = 1 – 0.9686 = 0.0314
P(|Z| ≤ 1.86) = 0.9686 – 0.0314 = 0.9372
Answer: P(|Z| ≤ 1.86) = 0.9372
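Where a printed Z-table is not available, the same five probabilities can be computed directly. This is a sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the dictionary name `results` is illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

results = {
    "P(Z < 1.5)": Z.cdf(1.5),
    "P(-1.5 < Z < 1.5)": Z.cdf(1.5) - Z.cdf(-1.5),
    "P(Z > 1.046)": 1 - Z.cdf(1.046),
    "P(0 < Z < 1.05)": Z.cdf(1.05) - Z.cdf(0),
    "P(|Z| <= 1.86)": Z.cdf(1.86) - Z.cdf(-1.86),
}

for label, prob in results.items():
    print(f"{label} = {prob:.4f}")
```

Because table entries are rounded to four decimal places before being added or subtracted, computed values can differ from table-based answers in the last digit.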
Question (e): Battery Life Problem
The battery life follows a normal distribution with μ = 50 months and σ = 6 months.
(i) P(Battery Lasts Less Than 48 Months)
First, calculate the Z-score:
Z = (X – μ) / σ = (48 – 50) / 6 = -2 / 6 = -0.333
From the standard normal table, P(Z < -0.33) = 0.3707
Answer: P(Battery lasts less than 48 months) = 0.3707 or 37.07%
(ii) P(Sample Mean of 36 Batteries < 48 Months)
This question involves the sampling distribution of the mean. When the sample size is n = 36, the distribution of the sample mean follows:
X̄ ~ N(μ, σ²/n)
The standard error is:
SE = σ / √n = 6 / √36 = 6 / 6 = 1
Calculate the Z-score for the sample mean:
Z = (X̄ – μ) / SE = (48 – 50) / 1 = -2
From the standard normal table, P(Z < -2) = 0.0228
Answer: P(Sample mean < 48 months) = 0.0228 or 2.28%
Note: The probability is much smaller for the sample mean than for a single battery. This is because the sample mean has less variation than individual values.
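The contrast between a single battery and the mean of 36 batteries can be sketched as follows, using the stated values μ = 50 and σ = 6. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 50, 6, 36  # months, months, sample size

single_battery = NormalDist(mu, sigma)           # life of one battery
sample_mean = NormalDist(mu, sigma / sqrt(n))    # mean life of n batteries

p_single = single_battery.cdf(48)
p_mean = sample_mean.cdf(48)

print(f"P(X < 48)     = {p_single:.4f}")
print(f"P(X-bar < 48) = {p_mean:.4f}")
```

The code uses the unrounded Z = -0.3333 for the single battery, so its first result is slightly below the table value obtained with Z rounded to -0.33.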
Question (f): Student Marks Problem
The marks follow a normal distribution with:
• Mean (μ) = 60
• Variance (σ²) = 100
• Standard deviation (σ) = √100 = 10
• Pass marks = 40
The question asks for the probability that a student does not pass. This means the probability of scoring less than 40.
Calculate the Z-score:
Z = (X – μ) / σ = (40 – 60) / 10 = -20 / 10 = -2
From the standard normal table:
P(Z < -2) = 0.0228
Answer: The probability that a student does not pass the exam is 0.0228 or 2.28%.
Interpretation: Only about 2.28% of students in this class will fail the exam. This is a very small percentage. Most students will pass because the pass mark (40) is two standard deviations below the mean.
Example from Sri Lanka: This type of distribution can be used to analyze G.C.E. Ordinary Level or Advanced Level examination marks in Sri Lanka. If a subject has a mean mark of 60 and variance of 100, teachers can predict the failure rate.
Question (g): MBA Graduates Weekly Income
The problem gives:
• Population mean (μ) = $600
• Population standard deviation (σ) = $100
• Sample size (n) = 25
The question asks for the probability that the sample mean is less than $550.
The sampling distribution of the mean follows:
X̄ ~ N(μ, σ²/n)
Calculate the standard error:
SE = σ / √n = 100 / √25 = 100 / 5 = 20
Calculate the Z-score:
Z = (X̄ – μ) / SE = (550 – 600) / 20 = -50 / 20 = -2.5
From the standard normal table:
P(Z < -2.5) = 0.0062
Answer: The probability that 25 randomly selected graduates have an average weekly income of less than $550 is 0.0062 or 0.62%.
Interpretation: This probability is very small. It is very unlikely that a random sample of 25 MBA graduates would have an average income below $550. This happens because the sample size reduces the standard error, making extreme values of the sample mean less likely.
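To see how the sample size drives this result, the sketch below repeats the calculation for several sample sizes. Only n = 25 comes from the question; the other sizes are added for illustration, and the n = 1 row additionally assumes that individual incomes are roughly normal, which the question does not state.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 600, 100  # population mean and SD of weekly income, in dollars

def prob_mean_below(threshold: float, n: int) -> float:
    """P(sample mean < threshold) for a sample of size n."""
    return NormalDist(mu, sigma / sqrt(n)).cdf(threshold)

for n in (1, 4, 25, 100):  # only n = 25 is from the question
    se = sigma / sqrt(n)
    print(f"n = {n:3d}: SE = {se:5.1f}, P(mean < 550) = {prob_mean_below(550, n):.4f}")
```

As n grows, the standard error shrinks and a sample mean as low as $550 becomes progressively less likely.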
Question (h): Student Height Problem
The problem gives:
• Sample size (n) = 400
• Sample mean (X̄) = 67.39 inches
• Variance (σ²) = 1.69
• Standard deviation (σ) = √1.69 = 1.3 inches
(i) Distribution of Sample Mean
According to the Central Limit Theorem, when the sample size is large (n ≥ 30), the distribution of the sample mean is approximately normal (OpenStax, 2023). In this case, n = 400, which is much larger than 30.
The sample mean follows:
X̄ ~ N(μ, σ²/n)
The standard error is:
SE = σ / √n = 1.3 / √400 = 1.3 / 20 = 0.065 inches
Answer: Taking the observed sample mean of 67.39 inches as the estimate of μ, the distribution of the sample mean is approximately normal with mean 67.39 inches and standard error SE = 0.065 inches. In N(mean, variance) notation, X̄ ~ N(67.39, 0.004225).
(ii) Number of Students with Height Greater Than 67.5 Inches
This question asks about individual students, not the sample mean. The question requires finding the proportion of students with height greater than 67.5 inches, then multiplying by the sample size.
Calculate the Z-score for individual height:
Z = (X – μ) / σ = (67.5 – 67.39) / 1.3 = 0.11 / 1.3 = 0.085
From the standard normal table:
P(Z < 0.085) ≈ 0.5339
Therefore:
P(X > 67.5) = 1 – P(Z < 0.085) = 1 – 0.5339 = 0.4661
Number of students = 0.4661 × 400 = 186.44 ≈ 186 students
Answer: Approximately 186 students in the sample have a height greater than 67.5 inches.
This type of analysis can be used to study the height distribution of students in Sri Lankan universities. Universities can use this information for designing furniture, sports equipment, or dormitory facilities.
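Both parts of this question can be reproduced with a short sketch. It assumes, as the model answer does, that individual heights are approximately normal with the sample mean and variance used as the parameters.

```python
from math import sqrt
from statistics import NormalDist

n, mean_height, variance = 400, 67.39, 1.69
sd = sqrt(variance)   # standard deviation of individual heights
se = sd / sqrt(n)     # standard error of the sample mean

# (ii) concerns individual students, so use sd rather than se
heights = NormalDist(mean_height, sd)
p_taller = 1 - heights.cdf(67.5)
expected_count = n * p_taller

print(f"SE of the sample mean = {se:.3f} inches")
print(f"P(height > 67.5) = {p_taller:.4f}")
print(f"Expected number of students = {expected_count:.1f}")
```

The exact computation uses Z = 0.0846 rather than the rounded 0.085, so the expected count lands between 186 and 187; the hand calculation with the rounded Z gives 186.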
Estimation
(a) What do you understand by interval estimate? Explain.
(b) “There is an inverse relationship between the width of the interval and the sample size”. Prove this statement using your own example. Explain the importance of confidence interval in research.
(c) Suppose we want to estimate the average weight of an adult male in Colombo city, Sri Lanka. We draw a random sample of 1,000 men from a population of 1,000,000 men. Sample mean = 80 kg, sample standard deviation = 30 kg. Compute the 95% confidence interval. Interpret the results.
(d) A financial officer surveys 500 accounts and finds that 300 are more than 30 days overdue. Compute a 90% confidence interval for the true proportion (as a percentage) of accounts receivable that are more than 30 days overdue, and interpret the confidence interval.
(e) Suppose a study is performed on three pills thought to reduce the risk of stomach ulcers. Relative risk (pill vs placebo) is obtained along with 95% confidence intervals. How do you interpret the data? Which pill is most promising?
- (i) Pill A: RR = 0.8, 95% CI = (0.62–0.91)
- (ii) Pill B: RR = 0.6, 95% CI = (0.50–1.10)
- (iii) Pill C: RR = 1.02, 95% CI = (1.04–1.33)
Model Answer
(a) Understanding Interval Estimate
An interval estimate is a range of values used to estimate an unknown population parameter. This approach differs from a point estimate. A point estimate gives only one value. An interval estimate gives a lower limit and an upper limit (Walpole et al., 2012).
The interval estimate provides more information than a single number. It shows where the true population parameter is likely to be. This range is called a confidence interval (Montgomery and Runger, 2014).
The confidence interval has three main parts. First, there is a lower confidence limit (LCL). Second, there is an upper confidence limit (UCL). Third, there is a confidence level, which shows how confident we are that the true parameter falls within this range (Anderson et al., 2017).
For example, if a researcher wants to estimate the average income of tea plantation workers in Nuwara Eliya, Sri Lanka, a point estimate might be Rs. 25,000 per month. However, an interval estimate might be Rs. 23,000 to Rs. 27,000 with 95% confidence. This tells us more. It shows that the true average income is likely between these two values.
The confidence level is important. A 95% confidence level means that if we repeat the sampling process many times, about 95 out of 100 intervals will contain the true population mean (Freedman et al., 2007). This does not mean there is a 95% chance the true mean is in this specific interval. The true mean is fixed. The interval is what changes from sample to sample.
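The repeated-sampling meaning of the confidence level can be illustrated with a small simulation. This is a sketch under stated assumptions: the true mean of Rs. 25,000 echoes the example above, while the standard deviation, sample size, number of repetitions and random seed are hypothetical values chosen only for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(1)  # arbitrary seed so the run is repeatable

TRUE_MEAN = 25_000   # fixed population mean (Rs.)
SIGMA = 5_000        # hypothetical known population SD (Rs.)
N = 40               # hypothetical sample size
REPEATS = 2_000      # number of repeated samples

z = NormalDist().inv_cdf(0.975)   # critical value for 95% confidence
margin = z * SIGMA / sqrt(N)      # same margin for every sample (sigma known)

covered = 0
for _ in range(REPEATS):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    x_bar = mean(sample)
    # The interval moves from sample to sample; the true mean does not
    if x_bar - margin <= TRUE_MEAN <= x_bar + margin:
        covered += 1

coverage = covered / REPEATS
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The share should come out close to 0.95, while any single interval either contains the true mean or does not.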
(b) Inverse Relationship Between Interval Width and Sample Size
The statement “There is an inverse relationship between the width of the interval and the sample size” is true. This section proves this statement using an example. It also explains why confidence intervals are important in research.
Proof Using Example
Consider a study about monthly household expenses in Gampaha district, Sri Lanka. The population standard deviation is known to be Rs. 15,000. The confidence level is 95%, so z_{0.025} = 1.96.
Case 1: Sample size n = 50
The margin of error is:
E = z_{α/2} × σ / √n = 1.96 × 15000 / √50 = 1.96 × 2121.32 = 4157.79
The interval width is:
Width = 2 × E = 2 × 4157.79 = 8315.58
Case 2: Sample size n = 200
The margin of error is:
E = 1.96 × 15000 / √200 = 1.96 × 1060.66 = 2078.89
The interval width is:
Width = 2 × E = 2 × 2078.89 = 4157.78
Case 3: Sample size n = 500
The margin of error is:
E = 1.96 × 15000 / √500 = 1.96 × 670.82 = 1314.81
The interval width is:
Width = 2 × E = 2 × 1314.81 = 2629.62
These calculations show a clear pattern. When sample size increases from 50 to 200 (four times larger), the interval width decreases from 8315.58 to 4157.78 (about half). When sample size increases to 500, the width decreases further to 2629.62. This proves the inverse relationship (Cochran, 1977).
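The mathematical relationship is clear. The margin of error, z_{α/2} × σ / √n, has √n in the denominator. As n increases, √n increases, making the fraction smaller. This makes the margin of error smaller. Therefore, the interval becomes narrower (Zikmund et al., 2013). More precisely, the width is proportional to 1/√n, which is why a fourfold increase in sample size halves the width rather than reducing it to a quarter.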
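The three cases can be recomputed with a short sketch. The function name `interval_width` is illustrative, and the code uses the unrounded critical value (about 1.95996) instead of 1.96, so the widths differ from the hand calculation by a fraction of a rupee.

```python
from math import sqrt
from statistics import NormalDist

sigma = 15_000                     # known population SD (Rs.)
z = NormalDist().inv_cdf(0.975)    # critical value for 95% confidence

def interval_width(n: int) -> float:
    """Width of the confidence interval for the mean: 2 * z * sigma / sqrt(n)."""
    return 2 * z * sigma / sqrt(n)

widths = {n: interval_width(n) for n in (50, 200, 500)}
for n, width in widths.items():
    print(f"n = {n:3d}: width = {width:,.2f}")

# Quadrupling the sample size halves the width
print(f"width(50) / width(200) = {widths[50] / widths[200]:.2f}")
```

The final ratio makes the square-root relationship visible: four times the sample size gives half the width.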
Importance of Confidence Intervals in Research
Confidence intervals are important in research for several reasons. This section discusses these reasons.
First, confidence intervals show uncertainty. Research always involves uncertainty because researchers work with samples, not entire populations (Cumming and Finch, 2005). A confidence interval shows this uncertainty clearly. For example, if a medical researcher in Colombo studies blood pressure levels, the confidence interval shows the range where the true average likely falls.
Second, confidence intervals help with decision-making. Researchers and policymakers need to make decisions based on sample data. A narrow confidence interval suggests the estimate is precise. A wide confidence interval suggests more caution is needed (Gardner and Altman, 1986). For instance, if the Sri Lankan Ministry of Health wants to estimate vaccination rates, a narrow confidence interval gives more confidence in planning decisions.
Third, confidence intervals are better than hypothesis tests alone. Many researchers now prefer confidence intervals over simple yes/no hypothesis tests. This is because confidence intervals show both the size of an effect and its precision (Cumming, 2014). For example, in agricultural research in Sri Lanka, knowing that a new fertilizer increases rice yield by 200-400 kg per hectare (95% CI) is more useful than just knowing the increase is “statistically significant.”
Fourth, confidence intervals help compare groups. When comparing two groups, non-overlapping confidence intervals suggest a real difference exists (Altman et al., 2000). Overlapping intervals suggest the groups might not be different, although overlap alone does not rule out a statistically significant difference, so a confidence interval for the difference itself is the more reliable check. This is useful in education research in Sri Lanka, for example, when comparing examination results between districts.
However, confidence intervals have limitations. They depend on the sample being random and representative. If the sample is biased, the confidence interval will be misleading (Thompson, 2012). They also assume certain statistical conditions are met, such as normal distribution for smaller samples.
(c) Confidence Interval for Average Weight of Adult Males in Colombo
This section calculates the 95% confidence interval for the average weight of adult males in Colombo city, Sri Lanka. Then it interprets the results.
Given information:
- Population size (N) = 1,000,000 men
- Sample size (n) = 1,000 men
- Sample mean (x̄) = 80 kg
- Sample standard deviation (s) = 30 kg
- Confidence level = 95%
Calculation:
Since the sample size is large (n > 30), the Central Limit Theorem applies. This means the sampling distribution of the mean is approximately normal (Field, 2013). Therefore, the Z distribution can be used even though the population standard deviation is unknown.
For a 95% confidence level:
- α = 0.05
- α/2 = 0.025
- z_{0.025} = 1.96
The formula for the confidence interval is:
x̄ ± z_{α/2} × s / √n
First, calculate the standard error:
SE = s / √n = 30 / √1000 = 30 / 31.623 = 0.949
Next, calculate the margin of error:
E = z_{α/2} × SE = 1.96 × 0.949 = 1.860
Therefore, the 95% confidence interval is:
80 ± 1.860
Lower limit: 80 − 1.860 = 78.14 kg
Upper limit: 80 + 1.860 = 81.86 kg
95% Confidence Interval: (78.14 kg, 81.86 kg)
Interpretation:
The results can be interpreted in several ways. First, the researchers can be 95% confident that the true average weight of adult males in Colombo city lies between 78.14 kg and 81.86 kg. This does not mean there is a 95% probability that the true mean is in this interval. The true mean is fixed. The interval is what varies from sample to sample (Neyman, 1937).
Second, if the same sampling procedure was repeated 100 times, approximately 95 of the resulting confidence intervals would contain the true population mean. About 5 intervals would not contain it (Morey et al., 2016).
Third, the interval is quite narrow (width = 3.72 kg). This suggests the estimate is precise. The large sample size (n = 1,000) contributes to this precision. A larger sample size produces narrower intervals, as shown in section (b).
Fourth, this information has practical uses. Health authorities in Colombo can use these results for planning purposes. For example, they can estimate medication dosages, plan nutritional programs, or design health interventions for the male population.
However, several assumptions must be met for this interpretation to be valid. The sample must be randomly selected from the population. The sampling must be done properly to avoid bias. If the sample was taken only from certain areas of Colombo or certain socioeconomic groups, the results might not represent all adult males in the city (Lohr, 2019).
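The interval calculation in this part can be sketched as a small reusable function. The name `mean_ci` and its parameters are illustrative; the function applies the large-sample Z interval used above and does not check its assumptions.

```python
from math import sqrt
from statistics import NormalDist

def mean_ci(x_bar: float, s: float, n: int, confidence: float = 0.95):
    """Large-sample Z interval for a population mean: x_bar +/- z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# Colombo weight example: sample mean 80 kg, sample SD 30 kg, n = 1,000
lower, upper = mean_ci(x_bar=80, s=30, n=1000)
print(f"95% CI: ({lower:.2f} kg, {upper:.2f} kg)")  # prints (78.14 kg, 81.86 kg)
```

Passing a different `confidence` value, such as 0.99, shows how a higher confidence level widens the interval for the same data.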
(d) Confidence Interval for Proportion of Overdue Accounts
This section calculates the 90% confidence interval for the true proportion of accounts that are more than 30 days overdue. Then it interprets the results.
Given information:
- Sample size (n) = 500 accounts
- Number of overdue accounts (x) = 300
- Confidence level = 90%
Calculation:
First, calculate the sample proportion:
p̂ = x / n = 300 / 500 = 0.60
For a 90% confidence level:
- α = 0.10
- α/2 = 0.05
- z_{0.05} = 1.645
The formula for the confidence interval for a proportion is:
p̂ ± z_{α/2} × √(p̂(1 − p̂) / n)
Calculate the standard error:
SE = √(p̂(1 − p̂) / n) = √(0.60 × 0.40 / 500) = √(0.24 / 500) = √0.00048 = 0.0219
Calculate the margin of error:
E = z_{α/2} × SE = 1.645 × 0.0219 = 0.0360
Therefore, the 90% confidence interval is:
0.60 ± 0.0360
Lower limit: 0.60 − 0.0360 = 0.564 = 56.4%
Upper limit: 0.60 + 0.0360 = 0.636 = 63.6%
90% Confidence Interval: (56.4%, 63.6%)
Interpretation:
Several points can be made about these results. First, the financial officer can be 90% confident that the true proportion of accounts that are more than 30 days overdue is between 56.4% and 63.6%. The interval is the point estimate (60%) plus or minus a margin of error of 3.6 percentage points (Agresti and Coull, 1998).
Second, the point estimate is 60%. This means that in the sample, 60% of accounts are overdue by more than 30 days. This is a high percentage. It suggests the organization has significant problems with account collections.
Third, even the lower limit of the interval (56.4%) is quite high. This means that even in the most favourable case consistent with the 90% interval, more than half of all accounts are overdue. This should concern management.
Fourth, the interval width is 7.2 percentage points (63.6% – 56.4%). This is relatively narrow. The large sample size (n = 500) contributes to this precision. If the sample size was smaller, the interval would be wider and less precise.
Fifth, this information has important business implications. The company might need to review its credit policies. It might need to improve its collection procedures. It might need to assess the creditworthiness of customers more carefully (Brealey et al., 2020).
For example, if this was a Sri Lankan company, it might compare these results with industry standards in Sri Lanka. If other companies have lower overdue rates, this company is underperforming. If the rate is similar to competitors, the problem might be industry-wide or economic conditions in Sri Lanka.
However, the interpretation assumes the sample is random and representative of all accounts. If the 500 accounts were not randomly selected, the results might be biased (Cochran, 1977). For instance, if the financial officer only sampled large accounts or only corporate accounts, the results would not represent all account types.
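The proportion interval in this part can be sketched in the same way. The function name `proportion_ci` is illustrative; it implements the normal-approximation interval used above and no other method.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(x: int, n: int, confidence: float = 0.90):
    """Normal-approximation interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# 300 of 500 sampled accounts are more than 30 days overdue
lower, upper = proportion_ci(x=300, n=500)
print(f"90% CI: ({lower:.1%}, {upper:.1%})")  # prints (56.4%, 63.6%)
```

Rerunning the sketch with a smaller sample that has the same proportion, for example 60 of 100, gives a visibly wider interval.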
(e) Interpreting Relative Risk Data with Confidence Intervals
This section interprets the relative risk (RR) data for three pills. Each pill is tested against a placebo to see if it reduces the risk of stomach ulcers. The section explains what the results mean and identifies which pill is most promising.
Understanding Relative Risk:
Relative risk compares the risk of an outcome in two groups. In this case, it compares the risk of stomach ulcers in people taking a pill versus people taking a placebo (Szklo and Nieto, 2014). An RR of 1.0 means no difference between groups. An RR less than 1.0 means the pill reduces risk. An RR greater than 1.0 means the pill increases risk.
The 95% confidence interval shows where the true RR is likely to be. If the confidence interval includes 1.0, the result is not statistically significant at the 5% level. This means the pill might have no real effect (Altman and Bland, 2011).
Interpretation of Each Pill:
Pill A: RR = 0.8, 95% CI = (0.62–0.91)
Pill A shows a protective effect. The RR of 0.8 means people taking Pill A have 80% of the risk compared to people taking placebo. This is a 20% reduction in risk (Noordzij et al., 2017).
The confidence interval is (0.62–0.91). Both limits are below 1.0. This means the result is statistically significant. The true risk reduction is likely between 9% and 38% (calculated as 1−0.91=0.09 and 1−0.62=0.38).
This result is promising. It shows consistent evidence that Pill A reduces stomach ulcer risk. Researchers can be confident that Pill A has a real protective effect.
Pill B: RR = 0.6, 95% CI = (0.50–1.10)
Pill B shows a larger point estimate for risk reduction. The RR of 0.6 suggests a 40% reduction in risk. This is better than Pill A’s point estimate.
However, the confidence interval is (0.50–1.10). The upper limit is 1.10, which is above 1.0. This means the confidence interval includes 1.0. Therefore, the result is not statistically significant (Ranganathan et al., 2016).
The interval is wide. It suggests high uncertainty. The true effect could be anywhere from a 50% risk reduction (RR = 0.50) to a 10% risk increase (RR = 1.10). This uncertainty makes Pill B less promising than it first appears.
The wide interval might occur for several reasons. The sample size might be too small. There might be high variability in the data. There might be problems with how the study was conducted (Schulz and Grimes, 2005).
Pill C: RR = 1.02, 95% CI = (1.04–1.33)
Pill C shows a harmful effect. The RR of 1.02 suggests a 2% increase in risk. This means Pill C might slightly increase the risk of stomach ulcers rather than reduce it.
The confidence interval is (1.04–1.33). Both limits are above 1.0. This means the result is statistically significant. However, there is a problem here. The point estimate (1.02) is lower than the lower limit (1.04). This seems inconsistent and might be a reporting error in the data provided.
Assuming the confidence interval is correct, Pill C increases stomach ulcer risk by at least 4% and possibly by as much as 33%. This is clearly harmful. Pill C should not be used.
Which Pill Is Most Promising?
Pill A is the most promising option. It shows a statistically significant risk reduction with a narrow confidence interval. The results are consistent and reliable.
Pill B has a larger point estimate for risk reduction. However, the result is not statistically significant. The wide confidence interval shows too much uncertainty. More research with a larger sample size would be needed before Pill B could be recommended (Pocock et al., 2016).
Pill C is not promising. It increases risk rather than reduces it. It should not be developed further.
Practical Implications:
For health policy in countries like Sri Lanka, these results have implications. Pill A could be recommended for preventing stomach ulcers in high-risk patients. However, other factors matter too. These include the pill’s cost, side effects, and availability (World Health Organization, 2015).
Pill B needs more research. It should not be ruled out completely. The point estimate suggests it might be more effective than Pill A. However, the current evidence is insufficient. A larger clinical trial would provide clearer answers (Moher et al., 2010).
Pill C should not be used. Healthcare providers should be aware that it increases risk. This shows why proper statistical evaluation is important before approving new medications (Ioannidis, 2005).
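The decision rule applied to the three pills can be summarised in a short sketch. The function name `interpret_rr` and the wording of its messages are illustrative; the rule itself is the one described above, namely whether the 95% confidence interval lies entirely below 1, entirely above 1, or includes 1.

```python
def interpret_rr(rr: float, lower: float, upper: float) -> str:
    """Classify a relative risk using its 95% confidence interval."""
    if upper < 1:
        verdict = "significant risk reduction"
    elif lower > 1:
        verdict = "significant risk increase"
    else:
        verdict = "not significant at the 5% level (interval includes 1)"
    # A point estimate should lie inside its own interval
    if not lower <= rr <= upper:
        verdict += "; point estimate lies outside its interval, check the source"
    return verdict

# Values as given in the question: (RR, lower limit, upper limit)
pills = {"A": (0.8, 0.62, 0.91), "B": (0.6, 0.50, 1.10), "C": (1.02, 1.04, 1.33)}

for name, (rr, lower, upper) in pills.items():
    print(f"Pill {name}: RR = {rr}, CI = ({lower}, {upper}) -> {interpret_rr(rr, lower, upper)}")
```

The extra consistency check flags Pill C, whose reported point estimate falls below the lower limit of its own interval.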
References
Agresti, A. and Coull, B.A. (1998) ‘Approximate is better than “exact” for interval estimation of binomial proportions’, The American Statistician, 52(2), pp. 119-126.
Altman, D.G. and Bland, J.M. (2011) ‘How to obtain the P value from a confidence interval’, BMJ, 343, d2304.
Altman, D.G., Machin, D., Bryant, T.N. and Gardner, M.J. (2000) Statistics with confidence. 2nd edn. London: BMJ Books.
Anderson, D.R., Sweeney, D.J., Williams, T.A., Camm, J.D. and Cochran, J.J. (2017) Statistics for business and economics. 13th edn. Boston: Cengage Learning.
Brealey, R.A., Myers, S.C. and Allen, F. (2020) Principles of corporate finance. 13th edn. New York: McGraw-Hill Education.
Cochran, W.G. (1977) Sampling techniques. 3rd edn. New York: John Wiley & Sons.
Cumming, G. (2014) ‘The new statistics: Why and how’, Psychological Science, 25(1), pp. 7-29.
Cumming, G. and Finch, S. (2005) ‘Inference by eye: Confidence intervals and how to read pictures of data’, American Psychologist, 60(2), pp. 170-180.
Field, A. (2013) Discovering statistics using IBM SPSS statistics. 4th edn. London: SAGE Publications.
Freedman, D., Pisani, R. and Purves, R. (2007) Statistics. 4th edn. New York: W.W. Norton & Company.
Gardner, M.J. and Altman, D.G. (1986) ‘Confidence intervals rather than P values: Estimation rather than hypothesis testing’, BMJ, 292(6522), pp. 746-750.
Ioannidis, J.P. (2005) ‘Why most published research findings are false’, PLoS Medicine, 2(8), e124.
Lohr, S.L. (2019) Sampling: Design and analysis. 2nd edn. Boca Raton: CRC Press.
Moher, D., Hopewell, S., Schulz, K.F., Montori, V., Gøtzsche, P.C., Devereaux, P.J., Elbourne, D., Egger, M. and Altman, D.G. (2010) ‘CONSORT 2010 explanation and elaboration: Updated guidelines for reporting parallel group randomised trials’, BMJ, 340, c869.
Montgomery, D.C. and Runger, G.C. (2014) Applied statistics and probability for engineers. 6th edn. Hoboken: John Wiley & Sons.
Morey, R.D., Hoekstra, R., Rouder, J.N., Lee, M.D. and Wagenmakers, E.J. (2016) ‘The fallacy of placing confidence in confidence intervals’, Psychonomic Bulletin & Review, 23(1), pp. 103-123.
Neyman, J. (1937) ‘Outline of a theory of statistical estimation based on the classical theory of probability’, Philosophical Transactions of the Royal Society of London. Series A, 236(767), pp. 333-380.
Noordzij, M., van Diepen, M., Caskey, F.C. and Jager, K.J. (2017) ‘Relative risk versus absolute risk: One cannot be interpreted without the other’, Nephrology Dialysis Transplantation, 32(suppl_2), pp. ii13-ii18.
Pocock, S.J., McMurray, J.J. and Collier, T.J. (2016) ‘Statistical controversies in reporting of clinical trials: Part 2 of a 4-part series on statistics for clinical trials’, Journal of the American College of Cardiology, 66(23), pp. 2648-2662.
Ranganathan, P., Pramesh, C.S. and Buyse, M. (2016) ‘Common pitfalls in statistical analysis: The perils of multiple testing’, Perspectives in Clinical Research, 7(2), pp. 106-107.
Schulz, K.F. and Grimes, D.A. (2005) ‘Sample size calculations in randomised trials: Mandatory and mystical’, The Lancet, 365(9467), pp. 1348-1353.
Szklo, M. and Nieto, F.J. (2014) Epidemiology: Beyond the basics. 3rd edn. Burlington: Jones & Bartlett Learning.
Thompson, S.K. (2012) Sampling. 3rd edn. Hoboken: John Wiley & Sons.
Walpole, R.E., Myers, R.H., Myers, S.L. and Ye, K. (2012) Probability and statistics for engineers and scientists. 9th edn. Boston: Prentice Hall.
World Health Organization (2015) WHO guideline on country pharmaceutical pricing policies. Geneva: World Health Organization.
Zikmund, W.G., Babin, B.J., Carr, J.C. and Griffin, M. (2013) Business research methods. 9th edn. Mason: South-Western Cengage Learning.
