- Level: AS and A Level
- Subject: Maths
- Word count: 4305
Identifying Relationships - Introduction to Statistical Inference
Lecture 6 - MG2007 Data Analysis
BOX 1 Further Analysis ( chi-squared )
Categorical response variable and categorical factor
RV | C/M | Factor | C/M | Analysis | Further Analysis |
Accepts package tours PACKAGES | C | Type of Hotel TYPE | C | Crosstabs | Chi-Square |
Cross-tabulations
These give the frequency count for each combination of categories of the two variables of interest
Convention for constructing the table
The column variable is the independent variable ( the factor )
The row variable is the dependent variable ( the RV )
Accepts package tours * Type of Hotel Crosstabulation
Accepts package tours | Class 1 Luxury | Class 2 Medium | Class 3 Basic | Class 4 B and B | Total |
Yes | 5 | 2 | 14 | 25 | 46 |
No | 2 | 10 | 15 | 12 | 39 |
Total | 7 | 12 | 29 | 37 | 85 |
Example information given in the table
How many of the sample hotels take package customers
How many of the sample hotels do not take package customers
How many of the sample hotels are medium sized
How many of the sample hotels are bed and breakfast only
How many of the luxury hotels also take packages
The raw counts on their own are not very helpful
Of the 7 luxury hotels, 5 take packages
Of the 39 hotels that do not take packages, 12 are bed and breakfast establishments
But are these figures in any way significant?
The first step to analysing the table is to ask SPSS to calculate the % of cases in each cell. The percentages are calculated in the direction of the factor - the factor is the column variable.
Accepts package tours * Type of Hotel Crosstabulation
Accepts package tours | | Class 1 Luxury | Class 2 Medium | Class 3 Basic | Class 4 B and B | Total |
Yes | Count | 5 | 2 | 14 | 25 | 46 |
Yes | % within Type of Hotel | 71.4% | 16.7% | 48.3% | 67.6% | 54.1% |
No | Count | 2 | 10 | 15 | 12 | 39 |
No | % within Type of Hotel | 28.6% | 83.3% | 51.7% | 32.4% | 45.9% |
Total | Count | 7 | 12 | 29 | 37 | 85 |
Total | % within Type of Hotel | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
Out of the 85 respondents taking part in the survey:
7 hotels fall into the group CLASS 1 LUXURY
of those, 5 (71.4%) take package customers, compared to 54.1% for all hotels.
2 (28.6%) do not take package customers, compared to 45.9% for all hotels.
12 hotels fall into the group CLASS 2 MEDIUM
of those, 2 (16.7%) take package customers, compared to 54.1% for all hotels.
10 (83.3%) do not take package customers, compared to 45.9% for all hotels.
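The column percentages that SPSS reports can be reproduced by hand. The following is a minimal Python sketch, not part of the original lecture; the variable and function names are illustrative assumptions, and only the counts come from the crosstab above.

```python
# Observed counts from the crosstab. Rows are the response variable
# (Accepts package tours); columns are the factor (Type of Hotel).
hotel_types = ["Class 1 Luxury", "Class 2 Medium", "Class 3 Basic", "Class 4 B and B"]
observed = {
    "Yes": [5, 2, 14, 25],
    "No": [2, 10, 15, 12],
}

# Column totals: the number of sample hotels of each type.
column_totals = [sum(column) for column in zip(*observed.values())]
grand_total = sum(column_totals)


def column_percentages(counts, totals):
    """Percentage of each column total, rounded to one decimal place."""
    return [round(100 * count / total, 1) for count, total in zip(counts, totals)]


pct_yes = column_percentages(observed["Yes"], column_totals)
pct_no = column_percentages(observed["No"], column_totals)

# Overall percentages, used as the benchmark "for all hotels".
overall_yes = round(100 * sum(observed["Yes"]) / grand_total, 1)
overall_no = round(100 * sum(observed["No"]) / grand_total, 1)

for name, yes, no in zip(hotel_types, pct_yes, pct_no):
    print(f"{name}: {yes}% yes, {no}% no")
print(f"All hotels: {overall_yes}% yes, {overall_no}% no")
```

Percentages are taken within each column because the column variable is the factor, so each hotel type is compared with the overall figure for all hotels.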
Accepts package tours | | Class 1 Luxury | Class 2 Medium | Class 3 Basic | Class 4 B and B | Total |
Yes | Count | 5 | 2 | 14 | 25 | 46 |
Yes | Expected Count | 3.8 | 6.5 | 15.7 | 20.0 | 46.0 |
No | Count | 2 | 10 | 15 | 12 | 39 |
No | Expected Count | 3.2 | 5.5 | 13.3 | 17.0 | 39.0 |
Total | Count | 7 | 12 | 29 | 37 | 85 |
Total | Expected Count | 7.0 | 12.0 | 29.0 | 37.0 | 85.0 |
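The expected counts in the table can be checked with the standard rule for a crosstab: expected count = (row total × column total) / overall total. The Python sketch below is an illustration added here, not the lecture's own method; the variable names are assumptions.

```python
# Observed counts: first row is "Yes", second row is "No";
# columns are the four hotel types in table order.
observed = [
    [5, 2, 14, 25],
    [2, 10, 15, 12],
]

row_totals = [sum(row) for row in observed]               # 46 and 39
column_totals = [sum(column) for column in zip(*observed)]  # 7, 12, 29, 37
n = sum(row_totals)                                       # 85 hotels in the sample

# Expected count for each cell if there were no relationship
# between hotel type and accepting package tours.
expected = [
    [row_total * column_total / n for column_total in column_totals]
    for row_total in row_totals
]

expected_rounded = [[round(value, 1) for value in row] for row in expected]
for label, row in zip(["Yes", "No"], expected_rounded):
    print(label, row)
```

Each row of expected counts sums to its row total, which is why the Total column shows 46.0 and 39.0.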
χ² calculation:
Small differences between observed and expected counts produce a small contribution to the χ² statistic
Large differences between observed and expected counts produce a large contribution to the χ² statistic
SPSS calculates the value of the χ² statistic for us:
Chi-Square Tests
Test | Value | df | Asymp. Sig. (2-sided) |
Pearson Chi-Square | 10.717 | 3 | .013 |
Likelihood Ratio | 11.274 | 3 | .010 |
Linear-by-Linear Association | 2.615 | 1 | .106 |
N of Valid Cases | 85 | | |
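The Pearson Chi-Square row can be reproduced from the observed counts. The sketch below is an illustration only: it sums (observed - expected)² / expected over all cells, and for the significance value it uses a closed-form upper-tail probability that holds only for 3 degrees of freedom (an assumption made here to avoid extra libraries; SPSS uses a general routine). All names are hypothetical.

```python
import math

# Observed counts: first row is "Yes", second row is "No".
observed = [
    [5, 2, 14, 25],
    [2, 10, 15, 12],
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
n = sum(row_totals)

# Pearson chi-square: add up each cell's contribution.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * column_totals[j] / n
        chi_square += (count - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)


def chi_square_upper_tail_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi_square_upper_tail_df3(chi_square)
print(round(chi_square, 3), df, round(p_value, 3))
```

Rounded to three decimal places, the results should agree with the Value, df and Asymp. Sig. columns of the Pearson Chi-Square row.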
Making a decision:
In effect the null hypothesis is presumed innocent until proven guilty
We require a decision rule to help us to test the hypothesis we have stated
The decision rule
We set up a decision rule. The theory behind the rule is not covered in this module; the rule is used simply as a method for deciding whether a relationship exists when you are unsure.
There are many chi-square distributions. The one used is determined by degrees of freedom.
Degrees of freedom are calculated using the following formula:
df = (no. of rows in the table - 1) × (no. of columns in the table - 1) = (2 - 1)(4 - 1) = (1)(3) = 3 df (the SPSS output also gives the df for you)
DIAGRAM OF CHI-SQUARED DISTRIBUTION and DECISION RULE
The 95% decision point or the critical value is taken from pre-printed chi-square tables using the degrees of freedom (df) given.
Using the tables provided, what is the critical value for this example?
Applying the decision rule
If the null hypothesis is true, it is unlikely that we will get a value of the test statistic in the 5% region.
Given that a value lying in the 5% region is very unlikely, we shall reject the null hypothesis if the value of the chi-squared statistic calculated from the sample data falls in this region.
What is the value of the χ² statistic calculated by SPSS from the sample data? (see SPSS output)
If the value of our test statistic falls inside the 5% region on the diagram
reject H0 in favour of the alternative hypothesis
i.e. on the basis of the sample data there is a relationship between the two variables
If the value of our test statistic falls inside the 95% region on the diagram
we do not reject H0
i.e. on the basis of the sample data, there is insufficient evidence of a relationship between the two variables
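The decision rule can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, and the critical value 7.815 is the standard printed-table figure for the upper 5% point of the chi-square distribution with 3 df, which is not stated in the lecture itself.

```python
# Value of the chi-square statistic reported by SPSS for this example.
CHI_SQUARE_CALC = 10.717

# Assumption: upper 5% critical value for 3 degrees of freedom,
# as found in standard chi-square tables.
CRITICAL_VALUE_DF3 = 7.815


def chi_square_decision(statistic, critical_value):
    """Reject H0 when the statistic falls in the 5% (upper-tail) region."""
    if statistic > critical_value:
        return "reject H0"
    return "do not reject H0"


decision = chi_square_decision(CHI_SQUARE_CALC, CRITICAL_VALUE_DF3)
print(decision)
```

The same conclusion follows from the SPSS significance value: .013 is below .05, so the statistic lies in the 5% region.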
Analysis Conclusion
Strength of the relationship
Type I and Type II Errors
Hypothesis testing is not foolproof!
It is possible to make an error, but we would be unlucky to do so, because values in this region occur only rarely when the null hypothesis is true
Statistical hypothesis testing is a reasonable decision procedure in the face of two types of unavoidable ignorance
a) we will never know the truth
- we will never know whether our decision is correct or incorrect
CHI-SQUARED OUTPUT OF PACKAGES and TYPE
SPSS for Windows
Get the HOTELS data file
Select Analyze
Select Descriptive Statistics
Select Crosstabs
Move the variable Accepts package tours (packages) into the Rows box
Move the variable Type of Hotel (type) into the Columns box
Select Statistics button at the foot of the window
Select Chi-square
Select Continue
Select Cells button at the foot of the window
Select Expected in the counts box to give the expected frequencies.
Select Column in the Percentages box to give the column percentages
Select Continue
Then OK to get the required output.
Lecture 7 - MG2007 Data Analysis
Plotting combinations of variables - categorical & measured
The Car Ownership Study
Objective 1 IDA Plan
Response variable | C/M | Factor | C/M | Initial method of analysis | Further analysis | Result |
Number of cars NUMCARS | C | INCOME | C | Crosstabs | Chi-squared | |
 | | No. of Family members FAMILY | C | Crosstabs | Chi-squared | |
 | | No. years in Education EDUCATE | M | Comparison of means | T-test | |
 | | Region of residence REGION | C | Crosstabs | Chi-squared | |
Type 0 (the code for group 1 of the attribute variable) into the Group 1 box
Type 1 (the code for group 2 of the attribute variable) into the Group 2 box
Select Continue
OK
Unequal or equal variances?
Levene test for equality of Variances
Sig. >0.05 | Equal Variances assumed |
Sig. <0.05 | Equal Variances not assumed |
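The rule for choosing which row of the SPSS t-test output to read can be written as a small helper. This is a hedged sketch: the function name is hypothetical, and the treatment of a significance value of exactly 0.05 (read here as "not assumed") is an assumption, since the table above does not cover that case.

```python
def variance_row(levene_sig, alpha=0.05):
    """Pick the row of the t-test output to read, given Levene's Sig. value."""
    if levene_sig > alpha:
        return "Equal variances assumed"
    # Assumption: a value of exactly alpha is treated as "not assumed".
    return "Equal variances not assumed"


print(variance_row(0.30))   # Sig. above 0.05
print(variance_row(0.01))   # Sig. below 0.05
```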
The value of the test statistic t-calc (short for the value of t calculated from the sample data) is calculated by SPSS. For this example it is -4.758, with 380 degrees of freedom.
Looking up the threshold value of t in statistical tables:
We are working at a 95% level of confidence and completing a two-tailed test.
Is the sample data compatible with the null hypothesis?
If the null hypothesis is true, we are unlikely to get a value of t in the critical region. The critical region consists of all those values of the test statistic that provide strong evidence in favour of the alternative hypothesis. There is only a 5% probability of observing a value in this region when the null hypothesis is true, so a value here leads to rejection of the null hypothesis.
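For a two-tailed test the comparison uses the size of t-calc, ignoring its sign. The Python sketch below illustrates this under a stated assumption: the threshold 1.97 is an approximate table value for a two-tailed 5% test with about 380 degrees of freedom and is not quoted in the lecture. The function name is hypothetical.

```python
# Values reported by SPSS for this example.
T_CALC = -4.758
DF = 380

# Assumption: approximate two-tailed 5% critical value of t for about 380 df,
# read from standard tables (the large-sample value is 1.96).
T_CRITICAL_APPROX = 1.97


def two_tailed_decision(t_calc, t_critical):
    """Reject H0 when t-calc lies in either tail beyond the critical value."""
    if abs(t_calc) > t_critical:
        return "reject H0"
    return "do not reject H0"


decision = two_tailed_decision(T_CALC, T_CRITICAL_APPROX)
print(decision)
```

Because the test is two-tailed, a t-calc of +4.758 would lead to the same decision as -4.758.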
The conclusions?
Describing the relationship:
Confidence Intervals - calculated using the SPSS output
We are 95% confident that the difference between the mean amount borrowed by owner-occupiers and the mean amount borrowed by those renting lies between -51.56 and -21.41.
That is, we are 95% confident that owner-occupiers borrow, on average, between £21.41 and £51.56 less than those renting. Values anywhere in this range are possible.
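A quick arithmetic check links the confidence interval to the t-test. The sketch below is illustrative and rests on two assumptions: that the interval is symmetric about the sample mean difference, and that the critical value of t is roughly 1.97 (an approximate table value, not given in the lecture). All names are hypothetical.

```python
# 95% confidence interval for the difference in means, from the SPSS output.
lower, upper = -51.56, -21.41

# If zero lies outside the interval, the data are not compatible
# with "no difference" at the 5% level.
excludes_zero = not (lower <= 0 <= upper)

# Assumption: the interval is symmetric, so its midpoint is the
# sample mean difference and half its width is the margin of error.
mean_difference = (lower + upper) / 2
half_width = (upper - lower) / 2

# Assumption: approximate two-tailed 5% critical value of t for about 380 df.
T_CRITICAL_APPROX = 1.97
standard_error = half_width / T_CRITICAL_APPROX
implied_t = mean_difference / standard_error

print(excludes_zero, round(mean_difference, 3), round(implied_t, 2))
```

The implied value of t comes out close to the t-calc of -4.758 reported by SPSS, so the interval and the test tell the same story.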
Remember:
- we cannot be 100% confident unless we carry out a census
- hypothesis testing is not foolproof!
- we can make one of two errors, known as Type I and Type II errors
Type I error: rejecting the null hypothesis when in fact it is true
Type II error: not rejecting the null hypothesis when in fact it is false and should be rejected
The only way to reduce the probability of both Type I and Type II errors at the same time is to use large samples ( >30 )
For information only,
 | Decision: do not reject H0 | Decision: reject H0 |
H0 true | Correct decision | Type I error |
H0 false | Type II error | Correct decision |
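The decision table can be expressed as a small lookup. This is a sketch with a hypothetical function name; in practice the truth of H0 is never known, so it only illustrates the four possible outcomes.

```python
def classify_outcome(h0_is_true, reject_h0):
    """Return the cell of the decision table for a given truth and decision."""
    if h0_is_true and reject_h0:
        return "Type I error"
    if not h0_is_true and not reject_h0:
        return "Type II error"
    return "Correct decision"


for h0_is_true in (True, False):
    for reject_h0 in (False, True):
        print(h0_is_true, reject_h0, classify_outcome(h0_is_true, reject_h0))
```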
MG2007 Page A.Haines