
# Identifying Relationships - Introduction to Statistical Inference


Lecture 6 - MG2007 Data Analysis

Identifying Relationships - Introduction to Statistical Inference

## BOX 1: Further Analysis (chi-squared)

Categorical response variable and categorical factor

| RV | C/M | Factor | C/M | Analysis | Further Analysis |
| --- | --- | --- | --- | --- | --- |
| Accepts package tours (PACKAGES) | C | Type of Hotel (TYPE) | C | Crosstabs | Chi-Square |

## Cross-tabulations

Giving frequency count for each combination of categories on the two variables of interest

Convention for constructing the table

The column variable is the independent variable ( the factor )

The row variable is the dependent variable ( the RV )

Accepts package tours * Type of Hotel Crosstabulation

| Accepts package tours | Class 1 Luxury | Class 2 Medium | Class 3 Basic | Class 4 B and B | Total |
| --- | --- | --- | --- | --- | --- |
| Yes | 5 | 2 | 14 | 25 | 46 |
| No | 2 | 10 | 15 | 12 | 39 |
| Total | 7 | 12 | 29 | 37 | 85 |

Example information given in the table:

How many of the sample hotels take package customers?

How many of the sample hotels do not take package customers?

How many of the sample hotels are medium sized?

How many of the sample hotels are bed and breakfast only?

How many of the luxury hotels also take packages?

The actual number of cases is not too helpful on its own:

Of the 7 luxury hotels, 5 take packages.

Of the 39 hotels that do not take packages, 12 are bed and breakfast establishments.

But are these figures in any way significant?

The first step to analysing the table is to ask SPSS to calculate the % of cases in each cell.  The percentages are calculated in the direction of the factor - the factor is the column variable.

Accepts package tours * Type of Hotel Crosstabulation

| Accepts package tours | | Class 1 Luxury | Class 2 Medium | Class 3 Basic | Class 4 B and B | Total |
| --- | --- | --- | --- | --- | --- | --- |
| Yes | Count | 5 | 2 | 14 | 25 | 46 |
| | % within Type of Hotel | 71.4% | 16.7% | 48.3% | 67.6% | 54.1% |
| No | Count | 2 | 10 | 15 | 12 | 39 |
| | % within Type of Hotel | 28.6% | 83.3% | 51.7% | 32.4% | 45.9% |
| Total | Count | 7 | 12 | 29 | 37 | 85 |
| | % within Type of Hotel | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |

Out of the 85 hotels taking part in the survey:

7 hotels fall into the group Class 1 Luxury. Of those:

5 or 71.4% do take package customers, compared to 54.1% for all hotels.

2 or 28.6% do not take package customers, compared to 45.9% for all hotels.

12 hotels fall into the group Class 2 Medium. Of those:

2 or 16.7% do take package customers, compared to 54.1% for all hotels.

10 or 83.3% do not take package customers, compared to 45.9% for all hotels.
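The column percentages quoted above can be checked by hand. The following is a minimal Python sketch, not part of the SPSS workflow described in these notes; the variable names are illustrative only. It divides each cell count by its column total (the number of hotels of that type).

```python
# Observed counts from the crosstabulation.
# Rows: accepts package tours (Yes / No). Columns: type of hotel.
hotel_types = ["Class 1 Luxury", "Class 2 Medium", "Class 3 Basic", "Class 4 B and B"]
observed = {
    "Yes": [5, 2, 14, 25],
    "No": [2, 10, 15, 12],
}

# Column totals: the number of hotels of each type.
column_totals = [sum(column) for column in zip(*observed.values())]

# Percentage of each hotel type (column) that falls in each row.
column_percentages = {
    answer: [round(100 * count / total, 1) for count, total in zip(counts, column_totals)]
    for answer, counts in observed.items()
}

for answer, percentages in column_percentages.items():
    print(answer, dict(zip(hotel_types, percentages)))
```

Percentaging within the factor (the column variable) is what makes the hotel types comparable even though the groups differ in size.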

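| Accepts package tours | | Class 1 Luxury | Class 2 Medium | Class 3 Basic | Class 4 B and B | Total |
| --- | --- | --- | --- | --- | --- | --- |
| Yes | Count | 5 | 2 | 14 | 25 | 46 |
| | Expected Count | 3.8 | 6.5 | 15.7 | 20.0 | 46.0 |
| No | Count | 2 | 10 | 15 | 12 | 39 |
| | Expected Count | 3.2 | 5.5 | 13.3 | 17.0 | 39.0 |
| Total | Count | 7 | 12 | 29 | 37 | 85 |
| | Expected Count | 7.0 | 12.0 | 29.0 | 37.0 | 85.0 |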
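The formula behind the expected counts is not shown in this extract. The sketch below assumes the standard rule for a crosstabulation, expected count = (row total × column total) / grand total, and checks that it reproduces the SPSS figures above. It is an illustration in Python, not SPSS syntax.

```python
# Observed counts: rows are Yes / No, columns are the four hotel types.
observed = [
    [5, 2, 14, 25],   # Yes
    [2, 10, 15, 12],  # No
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under "no relationship":
# (row total * column total) / grand total for every cell.
expected = [
    [row_total * column_total / grand_total for column_total in column_totals]
    for row_total in row_totals
]

expected_rounded = [[round(value, 1) for value in row] for row in expected]
for row in expected_rounded:
    print(row)
```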

χ² calculation:

Small differences between observed and expected counts produce a small contribution to the χ² statistic.

Large differences between observed and expected counts produce a large contribution to the χ² statistic.

SPSS calculates the value of the χ² statistic for us:

Chi-Square Tests

| | Value | df | Asymp. Sig. (2-sided) |
| --- | --- | --- | --- |
| Pearson Chi-Square | 10.717 | 3 | .013 |
| Likelihood Ratio | 11.274 | 3 | .010 |
| Linear-by-Linear Association | 2.615 | 1 | .106 |
| N of Valid Cases | 85 | | |
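The Pearson Chi-Square value in the SPSS output can be reproduced from the observed counts. The sketch below assumes the usual Pearson formula, the sum over all cells of (observed - expected)² / expected, with expected counts taken as (row total × column total) / grand total. It is a hand check in Python, not a replacement for the SPSS procedure.

```python
# Observed counts: rows are Yes / No, columns are the four hotel types.
observed = [
    [5, 2, 14, 25],   # Yes
    [2, 10, 15, 12],  # No
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, observed_count in enumerate(row):
        expected_count = row_totals[i] * column_totals[j] / grand_total
        # Each cell contributes (O - E)^2 / E: small gaps add little, large gaps add a lot.
        chi_square += (observed_count - expected_count) ** 2 / expected_count

print(round(chi_square, 3))
```

The largest contributions come from the Class 2 Medium column, where the observed counts (2 and 10) are furthest from the expected counts (6.5 and 5.5).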

Making a decision:

In effect the null hypothesis is presumed innocent until proven guilty

We require a decision rule to help us to test the hypothesis we have stated

The decision rule

We set up a decision rule. The theory behind this rule is not covered in this module; the rule is used only as a method for deciding whether a relationship exists when you are unsure.

There are many chi-square distributions. The one used is determined by degrees of freedom.

Degrees of freedom are calculated using the following formula:

df = (no. of rows in the table - 1)(no. of columns in the table - 1) = (2 - 1)(4 - 1) = (1)(3) = 3

This gives 3 df (the SPSS output also calculates the df for you).

DIAGRAM OF CHI-SQUARED DISTRIBUTION and DECISION RULE

The 95% decision point or the critical value is taken from pre-printed chi-square tables using the degrees of freedom (df) given.

Using the tables provided, what is the critical value for this example?

Applying the decision rule

If the null hypothesis is true, it is unlikely that we will get a value of the test statistic in the 5% region.

Given that a value lying in the 5% region is very unlikely under the null hypothesis, we shall reject the null hypothesis if the value of the chi-squared statistic calculated from the sample data falls in this region.

What is the value of the χ² statistic calculated by SPSS from the sample data? (See the SPSS output.)

If the value of our test statistic falls inside the 5% region on the diagram, we reject H0 in favour of the alternative hypothesis, i.e. on the basis of the sample data there is a relationship between the two variables.

If the value of our test statistic falls inside the 95% region on the diagram, we do not reject H0, i.e. on the basis of the sample data there is no evidence of a relationship between the two variables.
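The decision rule can be written out as a short sketch. The critical values below are the upper 5% points of the chi-squared distribution as printed in standard statistical tables for 1 to 5 degrees of freedom; the function names and the dictionary are illustrative assumptions, not part of SPSS.

```python
# Upper 5% critical values of the chi-squared distribution,
# as found in standard printed tables (first five df only).
CRITICAL_VALUES_5_PERCENT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def degrees_of_freedom(n_rows, n_columns):
    """df = (rows - 1)(columns - 1) for a crosstabulation."""
    return (n_rows - 1) * (n_columns - 1)


def decide(chi_square, df):
    """Reject H0 when the statistic falls in the 5% region (beyond the critical value)."""
    critical_value = CRITICAL_VALUES_5_PERCENT[df]
    if chi_square > critical_value:
        return "reject H0"
    return "do not reject H0"


df = degrees_of_freedom(2, 4)   # 2 rows (Yes/No), 4 hotel types
decision = decide(10.717, df)   # 10.717 is the Pearson Chi-Square from the SPSS output
print(df, CRITICAL_VALUES_5_PERCENT[df], decision)
```

For the hotels example the statistic lies beyond the critical value for 3 df, so the sample data point to a relationship between type of hotel and acceptance of package tours.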

Analysis Conclusion

Strength of the relationship

Type I and Type II Errors

Hypothesis testing is not foolproof!

It is possible to make an error, but we would be unlucky to do so, because values in this region occur so rarely when the null hypothesis is true.

Statistical hypothesis testing is a reasonable decision procedure in the face of two kinds of unavoidable ignorance:

a) we will never know the truth

b) we will never know whether our decision is correct or incorrect

CHI-SQUARED OUTPUT OF PACKAGES and TYPE

SPSS for Windows

Get the HOTELS data file

Select Analyze

Select Descriptive Statistics

Select Crosstabs

Move the variable Accepts package tours (packages) into the Rows box

Move the variable Type of Hotel (type) into the Columns box

Select Statistics button at the foot of the window

Select Chi-square

Select Continue

Select Cells button at the foot of the window

Select Expected in the counts box to give the expected frequencies.

Select Column in the Percentages box to give the column percentages

Select Continue

Then select OK to get the required output.

Lecture 7 - MG2007 Data Analysis

### Plotting combinations of variables - categorical & measured

#### Objective 1 IDA Plan

| Response variable | C/M | Factor | C/M | Initial method of analysis | Further analysis | Result |
| --- | --- | --- | --- | --- | --- | --- |
| Number of cars (NUMCARS) | C | INCOME | C | Crosstabs | Chi-squared | |
| | | No. of family members (FAMILY) | C | Crosstabs | Chi-squared | |
| | | No. of years in education (EDUCATE) | M | Comparison of means | T-test | |
| | | Region of residence (REGION) | C | Crosstabs | Chi-squared | |


Type 0 (the code for group 1 of the attribute variable) into the Group 1 box

Type 1 (the code for group 2 of the attribute variable) into the Group 2 box

Select Continue

OK

Unequal or equal variances?

Levene's test for equality of variances:

Sig. > 0.05: equal variances assumed

Sig. < 0.05: equal variances not assumed

SPSS calculates the value of the test statistic t-calc (short for the value of t calculated from the sample data). For this example it is -4.758, with 380 degrees of freedom.

Looking up the threshold value of t in statistical tables:

We are working at a 95% level of confidence and completing a two-tailed test.

Is the sample data compatible with the null hypothesis?

If the null hypothesis is true, we are unlikely to get a value of t in the critical region. The critical region consists of all those values of the test statistic that provide strong evidence for the alternative hypothesis. There is only a 5% probability that we will observe a value in this region when the null hypothesis is true. Hence a value in this region leads to rejection of the null hypothesis.
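As a rough sketch of this comparison: for a two-tailed test at the 95% level, t-calc is compared in absolute value with the threshold from the tables. The threshold used below, 1.97, is an approximation assumed for illustration (printed tables give values between 1.96 and 1.98 for large degrees of freedom); the exact figure for 380 df should be read from the tables.

```python
t_calc = -4.758   # value of t reported by SPSS for this example
df = 380          # degrees of freedom reported by SPSS

# Assumption: approximate two-tailed 5% critical value for large df.
# Look up the exact value for the df in statistical tables.
approx_critical_t = 1.97


def two_tailed_decision(t_value, critical_value):
    """Two-tailed test: reject H0 if t lies in either tail beyond the critical value."""
    if abs(t_value) > critical_value:
        return "reject H0"
    return "do not reject H0"


decision = two_tailed_decision(t_calc, approx_critical_t)
print(decision)
```

Because the test is two-tailed, the sign of t-calc does not affect the decision; it only indicates which group has the larger mean.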

The conclusions?

Describing the relationship:

Confidence Intervals - calculated using the SPSS output

We are 95% confident that the difference between the mean amount borrowed by owner-occupiers and the mean amount borrowed by those renting lies between -51.56 and -21.41.

In other words, we are 95% confident that owner-occupiers borrow, on average, between £21.41 and £51.56 less than those renting. Values anywhere in this range are possible.
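A short sketch of how the interval can be read, using only the two limits from the SPSS output. The midpoint of a symmetric interval is the estimated difference in means, and an interval that excludes zero agrees with rejecting the null hypothesis of no difference. The variable names are illustrative.

```python
# 95% confidence limits for (mean borrowed by owner-occupiers - mean borrowed by renters).
lower, upper = -51.56, -21.41

# The interval is symmetric about the estimated difference in means.
point_estimate = (lower + upper) / 2
margin_of_error = (upper - lower) / 2

# If zero lies outside the interval, "no difference" is not a plausible value.
excludes_zero = not (lower <= 0 <= upper)

print(round(point_estimate, 3), round(margin_of_error, 3), excludes_zero)
```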

Remember:

• we cannot be 100% confident unless we carry out a census
• hypothesis testing is not foolproof!
• we can make one of two errors, known as Type I and Type II errors

Type I error: reject the null hypothesis when in fact it is true

Type II error: fail to reject the null hypothesis when in fact it is false and should be rejected

The only way to reduce the probability of both Type I and Type II errors at the same time is to use large samples (> 30).

For information only,

| | Decision: do not reject H0 | Decision: reject H0 |
| --- | --- | --- |
| H0 true | Correct decision | Type I error |
| H0 false | Type II error | Correct decision |

MG2007, A. Haines
