Creating Fake Data Sets to Explore Hypotheses

Random data set simulating whale and boat presence/absence

The size of each was 1000, which is smaller than the actual data but large enough to give a good simulation.
This is a smaller example of the expected data. It was hypothesized that whales are present less frequently when boats are also present, but that boats have a much lower abundance than whales.
Both whale and boat abundance were categorical variables.

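# Row 1: whale absence counts (boats absent, boats present)
vec0 <- c(50,15)
# Row 2: whale presence counts (boats absent, boats present)
vec1 <- c(100, 85)
# Stack the two rows into a 2 x 2 matrix of counts
dataMatrix <- rbind(vec0,vec1)
rownames(dataMatrix) <- c("WhaleAbsence", "WhalePresence")
colnames(dataMatrix) <- c("BoatAbsence", "BoatPresence")
str(dataMatrix)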

##  num [1:2, 1:2] 50 100 15 85
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:2] "WhaleAbsence" "WhalePresence"
##   ..$ : chr [1:2] "BoatAbsence" "BoatPresence"
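To see which way the association in this table runs, the counts can be converted to proportions within each boat category. The sketch below rebuilds the same matrix and uses base R's prop.table; the name whaleByBoat is introduced here only for illustration and is not part of the original analysis.

```r
# Rebuild the 2 x 2 table of counts used above
dataMatrix <- rbind(c(50, 15), c(100, 85))
rownames(dataMatrix) <- c("WhaleAbsence", "WhalePresence")
colnames(dataMatrix) <- c("BoatAbsence", "BoatPresence")

# Proportion of whale absence/presence within each boat column
# (margin = 2 makes every column sum to 1)
whaleByBoat <- prop.table(dataMatrix, margin = 2)
print(round(whaleByBoat, 3))
```

Comparing the WhalePresence row across the two columns shows whether whales were recorded more or less often when boats were present.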

Non-random numbers were used because randomly generated numbers did not look similar to the original data collected. It was suggested that specific numbers be chosen and a matrix created from those numbers, instead of using random generation.
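If some randomness is still wanted, one option is to draw the four cell counts from a multinomial distribution whose probabilities come from the hand-picked numbers. This is only a sketch of an alternative, not the approach used in this homework; the seed and the names targetCounts, cellProbs, and simMatrix are assumptions made for the example.

```r
set.seed(123)  # arbitrary seed so the draw is repeatable

# Hand-picked counts in column-major order, as stored in dataMatrix
targetCounts <- c(50, 100, 15, 85)
cellProbs <- targetCounts / sum(targetCounts)

# One random 2 x 2 table with the same total sample size
simCounts <- rmultinom(n = 1, size = sum(targetCounts), prob = cellProbs)
simMatrix <- matrix(simCounts, nrow = 2,
                    dimnames = list(c("WhaleAbsence", "WhalePresence"),
                                    c("BoatAbsence", "BoatPresence")))
print(simMatrix)
```

Each run with a different seed gives a table that varies around the chosen counts while keeping the total at 250.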

Code to analyze the data

The independent variable was boats (x).
The dependent variable was whales (y).
A chi-squared test was used to analyze the data because both variables were categorical.

print(chisq.test(dataMatrix))

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  dataMatrix
## X-squared = 9.5504, df = 1, p-value = 0.001999
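The statistic reported above can be checked by hand from the expected counts under independence. The following is a minimal sketch of that calculation; the names expected, yatesStat, and pValue are introduced here for illustration.

```r
# Same counts as dataMatrix (rows: whale absence/presence)
dataMatrix <- rbind(c(50, 15), c(100, 85))

# Expected counts if whale and boat presence were independent:
# (row total * column total) / grand total
expected <- outer(rowSums(dataMatrix), colSums(dataMatrix)) / sum(dataMatrix)

# Yates' correction subtracts 0.5 from each |observed - expected|
# (valid here because every |observed - expected| exceeds 0.5)
yatesStat <- sum((abs(dataMatrix - expected) - 0.5)^2 / expected)
pValue <- pchisq(yatesStat, df = 1, lower.tail = FALSE)

print(c(statistic = yatesStat, p.value = pValue))
```

The result should agree with the X-squared and p-value printed by chisq.test for the same matrix.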

mosaicplot(x=dataMatrix,
           col=c("cornflowerblue","tomato","black"),
           shade=FALSE)

When the values of the table were whale + boat absence = 50, only whale absence = 150, only whale presence = 700, and both presence = 100, the p-value was < 2.2e-16.
To check how small the differences between groups could be while still detecting a significant pattern, I manually changed the values for each category.
The smallest difference obtained while maintaining a significant p-value was whale + boat absence = 210, only whale absence = 190, only whale presence = 355, and both presence = 245, which gave a p-value of 0.04357. When the overall sizes of the two “groups” (whales vs boats) were very different, differences within those groups did not significantly change the p-value. However, the more the values of the 4 categories differed from one another, the more significant the result became and the lower the p-value.
When the overall sample size was decreased tenfold, small changes in the counts made large differences in the p-value. This was not the case with a larger sample size. However, small sample sizes needed large differences between the categories to generate a small p-value. A sample size of 250 was the lowest sample size tried that yielded significant results, and at this sample size the categories within the matrix had to have very different values.
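The manual exploration described above can be repeated with a small helper that takes the four counts and returns the p-value. This is a sketch, not the code originally used: pFromCounts and its argument names are hypothetical, and it assumes that "only whale absence" means whales absent with boats present. The statistic is the same if the table is read the other way around, because transposing a 2 x 2 table does not change the chi-squared test.

```r
# Hypothetical helper: p-value of a chi-squared test on four cell counts.
# Assumed layout: rows = whale absence/presence, columns = boat absence/presence.
pFromCounts <- function(bothAbsent, whaleAbsentOnly, whalePresentOnly, bothPresent) {
  counts <- rbind(c(bothAbsent, whaleAbsentOnly),
                  c(whalePresentOnly, bothPresent))
  dimnames(counts) <- list(c("WhaleAbsence", "WhalePresence"),
                           c("BoatAbsence", "BoatPresence"))
  chisq.test(counts)$p.value
}

# The two tables described in the text (total sample size 1000 each)
pLargeDiff <- pFromCounts(50, 150, 700, 100)
pSmallDiff <- pFromCounts(210, 190, 355, 245)
print(c(largeDifference = pLargeDiff, smallDifference = pSmallDiff))
```

Calling pFromCounts with other counts, for example the same values scaled to a smaller total, makes it quicker to repeat the comparison across sample sizes.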

Homework_08

Grace Durant

3/16/2022
