Biological Statistics weeks 1-5 notes
This 11 page Class Notes was uploaded by Rafia notetaker on Tuesday September 27, 2016. The Class Notes belongs to STAT 3615 at Virginia Polytechnic Institute and State University taught by Adam Edwards in Fall 2016. Since its upload, it has received 7 views. For similar materials see BIOLOGICAL STATISTICS in Statistics at Virginia Polytechnic Institute and State University.
Date Created: 09/27/16
Exam 1: Study Guide
Rafia Chaudhary

Lecture 2:
Statistics is the art and science of learning from data.
Individuals (also known as subjects, instances, or observations) are the things we get our data from. Who or what do we get our data from?
  o These are often people, animals, or other living things, but they don't need to be.
Variable: any characteristic or piece of information about an individual.
  o It is the answer to a question, like "How tall are you?" or "What's your name?"
Ex: you work for a drug company that is developing a new drug to treat the common cold.
  o Individuals: humans
  o Variables: a measure of the patient's symptoms, the time the drug takes to become effective, and so on
A discrete distribution is one in which the data can only take on certain values, for example integers.
  o Ex: the number of students in a class. You can't have "half" a student, so the values must be whole numbers.
A continuous distribution is one in which the data can take on any value within a specified range (which may be infinite).
  o Ex: a person's height, the time it takes to complete a race, weight. Think measurements.
Nominal scales could simply be called "labels."
With ordinal scales, the order of the values is what is important and significant, but the differences between the values are not really known. Ex: the position of something on a list.
Small sample size: allows a few extreme observations to skew the conclusions.
A large sample is not enough if it is collected badly. Ann Landers example: she released a poll asking whether parents regretted having kids, and 70% of the parents who responded said they regretted having children.
  o She had a large sample size of 10,000.
  o It wasn't indicative of the American population, since not every parent was given a chance to be sampled.
  o The poll was re-done, and 91% of the parents said they would have children again.
Lurking variables: sometimes our data can lead us to make false conclusions. The data isn't lying; it's the researcher's lack of understanding that can cause this.
  o Ex: post-menopausal women lack hormones like estrogen, so are there benefits to taking synthetic hormones?
  o Early studies suggested that women who took synthetic hormones had a 35-50% lower risk of heart attack.
  o But what are the lurking variables? For one, women who can afford to take synthetic hormones tend to be richer and more educated. Women take care of their bodies in many different ways, so some of them may have been doing other things to look after their health. The reduced risk may not have been directly related to taking the hormones.
  o NIH controlled for these lurking variables and observed that there is in fact no link between the hormones and reduced heart attacks.
Variation: the variability in the data that we collect.
  o Variation that we can't account for or explain is referred to as noise.
  o Almost everything varies over time.
Conclusions are not certain:
  o No conclusion is ever 100% certain.
  o Conclusions are "on average" statements.
  o Statistics gives us a language for talking about and quantifying uncertainty.

Lecture 3:
Distribution: the distribution of a variable tells us what values it takes and how often it takes these values.
  o A distribution can be illustrated with different types of plots.
  o There are also a handful of very useful numbers used to describe the distribution of quantitative variables.
  o These include: mean, median, quartiles, and standard deviation.
Mean: the arithmetic average value of the data.
  o $\bar{x} = \frac{x_1 + x_2 + \dots + x_N}{N}$
Median: the point of the distribution at which half the data points are smaller and half the data points are larger.
  o Order the data from smallest to largest.
  o If N is odd, the median is the center observation.
  o If N is even, the median is the mean of the two center observations.
A little bit about Mean vs. Median:
  o The median is robust to outliers, meaning that one extreme observation will move the median very little, if at all.
  o Meanwhile, the mean is NOT robust to outliers. One extreme observation will certainly move the mean.
  o If a data set has a symmetric distribution (a similar amount of high and low values), then the mean and median will be close together in value.
  o If a data set has a skewed distribution (where most of the data lies on one end), then the mean and median will be pretty different in value.
  o (Figure: a symmetric distribution, where the mean and median are close together.)
  o (Figure: examples of skewed distributions.) The one with a higher frequency of small X values, and a long tail toward large values, is right skewed. The one with a higher frequency of large X values, and a long tail toward small values, is left skewed.
Quartiles: the quarters of your data.
  o Median (Q2): your 2nd quartile, where 50% of your data is less and 50% of your data is more.
  o First Quartile (Q1): located where 25% of your data is less and 75% of your data is more. It is the median of the data between your starting point and your median. THE MEDIAN IS NOT INCLUDED WHEN CALCULATING THIS. ONLY DATA SMALLER THAN THE MEDIAN.
  o Third Quartile (Q3): located where 75% of your data is less and 25% is more. It is the median of the data above the median, between the median and your last data point. Again, don't include the median.
  o Example: find Q1 and Q3 for: 85, 91, 66, 75, 88, 92, 94, 77, 82, 84
    Sorted: 66, 75, 77, 82, 84, 85, 88, 91, 92, 94. The lower half is 66, 75, 77, 82, 84 and the upper half is 85, 88, 91, 92, 94.
    Answer: Q1: 77 and Q3: 91
Five Number Summary and IQR: Minimum, Q1, Median, Q3, and Maximum
  o IQR: interquartile range = Q3 - Q1
  o So for the example above, IQR = 91 - 77 = 14
  o This is important for forming boxplots: the box spans the IQR, the line is the median, and the whiskers are the min/max.
Variance: $s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_N - \bar{x})^2}{N - 1} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$
Standard Deviation: $s = \sqrt{s^2}$
  o Why do we divide by N-1? It provides an unbiased estimator of the variance. The N-1 value is known as the Degrees of Freedom.
Population: the entire group of individuals about which we want information.
Sample: the individuals selected from the population that we'll get our data from. Conclusions will be made about the population, from the sample.
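The summary statistics defined above can be checked on the ten scores from the quartile example. This is a minimal sketch using Python's standard `statistics` module; the helper name `five_number_summary` is made up for illustration, and it follows the "median of each half, excluding the median" rule from these notes (other software may use slightly different quartile rules). For these scores the lower half is 66, 75, 77, 82, 84 and the upper half is 85, 88, 91, 92, 94.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]            # data below the median position
    upper = xs[half + n % 2:]    # data above it; skips the median when n is odd
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])

scores = [85, 91, 66, 75, 88, 92, 94, 77, 82, 84]

minimum, q1, med, q3, maximum = five_number_summary(scores)
iqr = q3 - q1

mean = statistics.mean(scores)
variance = statistics.variance(scores)   # sample variance: divides by N - 1
std_dev = statistics.stdev(scores)       # square root of the sample variance

print("five-number summary:", minimum, q1, med, q3, maximum)
print("IQR:", iqr)
print("mean:", mean, "variance:", round(variance, 2), "sd:", round(std_dev, 2))
```

Note that `statistics.pvariance` would divide by N instead of N-1, which is the population version rather than the sample version described in these notes.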
It is important that the sample represents the population accurately.
Parameter: a value of interest calculated from the POPULATION
  o P for P
Statistic: a value of interest calculated from the SAMPLE
  o S for S
If our sample is representative of the population, we hope that the statistic is close to the parameter.
How well does our sample represent our population?
  o Only a few studies use samples that are drawn directly from the entire population of interest.
  o It is up to the researcher to modify the population of interest or to argue why the chosen sample is representative.
Sampling Design: the process of selecting the subjects that make up the sample.
  o The design is biased if the sample systematically favors certain outcomes; otherwise it is unbiased.
Convenience sampling: a poor sample design that chooses individuals who are close at hand, which typically underrepresents the total population.
  o Back to the Ann Landers example, where she asked her readers if they would have children again. Since this was a poll in her column, it is considered a convenience sample: only her readers could respond, and each reader chose whether or not to respond.
Voluntary Response Sample: a sampling design that lets individuals decide whether or not they want to participate. These include write-in, call-in, and online polls.
  o Think about it: these types of samples are most likely made up of individuals who have strong opinions and are willing to take the time to respond.
  o Another example: the teacher evaluation forms you fill out for your teacher. The questions on the back are optional, so they are likely to be filled out only by students with strong opinions. Students who really liked the class or really hated the class are likely to answer.
Purposive Sampling: the act of intentionally (so, not randomly) selecting individuals in an attempt to create a representative sample.
  o This is considered a convenience sample and will suffer from some bias.
  o Sometimes this is picked as a sample design when a true random sample would require data that is too hard to collect.
Sampling Frame: a list of the subjects in a population that you can actually sample from.
  o Element: the smallest unit within a sampling frame.
Probability sampling: each subject in a sampling frame has a chance or probability of being chosen. These probabilities do not need to be evenly distributed.
  o Samples chosen by chance allow for neither favoritism by the sampler nor self-selection by the respondents.
SRS (Simple Random Sampling): the fundamental sampling procedure, which gives every subject in the sampling frame an equal probability of being chosen for a sample of the desired size. This is a preferred method when the subjects are all roughly the same. It's like putting names in a hat.
  o Example: your sampling frame is N = 28 subjects and you randomly choose 7 individuals to be a part of your sample.
  o Sometimes SRS can be difficult to carry out in practice. For instance, what if the subjects aren't uniform?
  o Example: I ask about the avg. math SAT score; this will likely favor STEM majors and overestimate the score.
Stratified Random Sampling: this sample design 1) identifies the "important" groups/strata within a population, then 2) takes a separate SRS within each group/stratum, and finally 3) combines the samples.
  o This is important if strata characteristics affect the research question.
  o This is most likely to achieve an unbiased sample if the subjects are diverse; it guarantees that each group is represented.
Cluster Random Sampling: this sampling method also first identifies groups (clusters) within the population, but it performs the SRS on the groups themselves. Meaning, you have your groups, and you randomly select, for example, 2 groups out of 5. Instead of taking just a few individuals from each group, you take every individual in the selected groups.
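The three designs above (SRS, stratified, and cluster sampling) can be sketched with Python's standard `random` module. This is an illustration only: the subject and group names are made up, the helper function names are hypothetical, and a fixed seed is used just so the run is repeatable.

```python
import random

def simple_random_sample(frame, k, rng):
    """SRS: every subject in the frame is equally likely to be chosen."""
    return rng.sample(frame, k)

def stratified_sample(strata, k_per_stratum, rng):
    """Take a separate SRS inside each stratum, then combine the samples."""
    combined = []
    for members in strata.values():
        combined.extend(rng.sample(members, k_per_stratum))
    return combined

def cluster_sample(clusters, n_clusters, rng):
    """SRS of whole groups; keep every member of each selected group."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(3615)  # arbitrary seed, for repeatability only

# SRS example from the notes: frame of N = 28 subjects, sample of 7.
frame = [f"subject_{i:02d}" for i in range(1, 29)]
srs = simple_random_sample(frame, 7, rng)

# Five hypothetical groups of six members each.
groups = {f"group_{g}": [f"g{g}_member_{i}" for i in range(1, 7)]
          for g in range(1, 6)}
strat = stratified_sample(groups, 2, rng)  # 2 members from every group
clus = cluster_sample(groups, 2, rng)      # all members of 2 of the 5 groups

print(srs)
print(strat)
print(clus)
```

The stratified sample always contains members of all five groups, while the cluster sample contains members of only two, which is the practical difference between the two designs.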
Multistage Sample (SRS within SRS): combines ideas from cluster and stratified sampling.
  o Multistage samples do multiple SRSs, starting with groups/strata and then working down to individuals.
  o It is used when there are a large number of groups. More for practicality.
  o Step 1: SRS of all of the possible groups
  o Step 2: another SRS of the individuals within each chosen group
  o Example: you want to survey the nutrition program in public schools in VA. This requires sending an FDA agent to every school.
  o First option: send an FDA agent to each of the 300 schools in VA
  o Second option:
    SRS 1: randomly select 30 counties in VA
    SRS 2: then randomly select 10 schools in each county
Under coverage: some groups in the population are left out of the sampling frame. Ex: random digit dialing only works for landlines.
Non Response: when a selected individual cannot be contacted or refuses to cooperate with data collection.
Response Bias: when an individual's answer to a question is not the complete truth. How many sexual partners have you had? How often do you drink? These are really awkward questions that will likely cause people to conceal some of the truth to make themselves look better or more socially acceptable. Also, answers about the past are often inaccurate. Response bias also includes question wording: a confusing or leading question produces answers that do not reflect the truth.
Response variable: aka the dependent variable. This is the main variable that the research question is about.
Explanatory variable: aka the independent variable. This variable may "change" the response.
Data can be collected by 1) an observational study or 2) a designed experiment.
Observational Study: study/observe individuals and measure/record variables of interest, but do not attempt to influence the responses. So you're not really testing anything, you're just describing what already exists in a group or situation. The researcher does not assign or control any explanatory variable.
Ex: smokers, overweight patients, majors
  o Key: we don't want people to change their behavior because they know they're being studied; this is a potential source of bias. Ex: your boss lets you know that he will be performing a drug test in two months. Think of the bias this will cause.
Confounding: when the effects of different explanatory (or lurking) variables on the response cannot be separated from each other. Observational studies can suffer from confounding due to the effect of lurking variables.
  o Ex: studies show that moderate alcohol use is associated with better health. Some suggest wine is better than other alcoholic beverages. If the individuals choose what they drink, what are the confounded lurking variables? Possible explanation: people who drink wine tend to be richer and better educated, so they might just have better health in the first place.
Observational studies allow us to establish the existence of a relationship or correlation. CORRELATION DOES NOT IMPLY CAUSATION (causation: the act of causing something), so just because two variables are correlated doesn't mean that one has caused the other. Ex: really ridiculous things might be correlated, but that doesn't mean one causes the other. The money we spend on science is correlated with the number of deaths by suicide.
Types of Observational Studies:
  o Sample Survey: an observational study that relies on a random sample drawn from the entire population. This is useful when describing characteristics of an entire population.
  o Comparative Observational Studies: studies aimed at comparing different populations, or comparing individuals within a population who are exposed to different conditions. Ex: comparing the hearing impairment of blue-eyed and brown-eyed Dalmatians.
  o Case-Control Study: a random sample of individuals with a condition (the cases) is compared with a random sample of individuals without the condition (the controls).
    Other than the condition itself, the individuals should be as similar as possible to reduce confounding.
  o Retrospective Study: collecting data from events that have happened in the past.
  o Historical-Control: case-control studies that use existing data from previous studies to make up the control group. These are fast and convenient, but the case and control groups are bound to be different.
  o Cohort Studies: subjects with common demographics are observed at regular intervals over an extended period of time. This is prospective, because the data recorded at regular intervals do not look into the past. This is costly and best suited for common outcomes, when we know some people in the group will be afflicted. It accumulates a lot of information and is less prone to confounding. However, it is also prone to a loss of subjects over time, which can actually increase potential confounding.

Lecture 4: Designed Experiments
Designed Experiment: deliberately imposing some treatment on individuals in order to observe potential changes in their response. It is a data collection procedure in which the researcher, after accounting for variation, intervenes in the subjects' normal behavior and imposes an assigned treatment.
  o Does the treatment cause a change in the response?
Subjects: basically the same as individuals; who you study.
Factors: the explanatory variables in an experiment, your different independent variables.
Treatment: any specific experimental condition applied to the subjects. Treatments usually consist of a combination of specific values of the factors.
Why are experimental designs beneficial? They provide more overall control over experimental and environmental variables. You decide the treatments.
Interactions: how the setting of one factor changes the effect of another factor on the response.
  o Ex: High protein: known to cause weight gain.
    High amino acids: also known to cause weight gain.
  o If a treatment includes feeding chicks BOTH the high protein and high amino acid diets, and the results indicate that the chicks did not grow any larger, this suggests that there must be an interaction between the two factors.
Example of a bad experimental design: gastric freezing.
  o Gastric freezing involves swallowing a deflated balloon and later filling it with a cold liquid to reduce pain.
  o Design A: a group of patients are all exposed to the treatment, and the response variable is a reduction in pain. But can we say that gastric freezing really caused it? NO. The experimental design didn't have a control (no procedure), and it didn't include a sham procedure (placebo effect) to see if the patients got better just from receiving medical care. The treatment effect is confounded with natural improvement and with psychological improvement due to medical care.
Ex. of a GOOD design: to remove confounding, include a control and a sham procedure.
  o The control: the no-procedure patients, who give us a comparison of what to expect in terms of natural improvement.
  o The sham procedure: this gives us a comparison for whether gastric freezing did anything, or whether it was just the benefit of receiving medical care (the psychological impact of treatment), aka the placebo effect.
Experimental Group: the group of individuals receiving the treatment.
Control: a treatment meant to serve as a baseline with which the experimental group is compared.
Placebo: a specific type of control treatment that is fake and intended to have no significant effect.
Randomized, controlled experiment: compares one or more treatments and includes a chance assignment of subjects to treatments. It should include random assignment to form groups that are similar in all respects before proceeding with treatment; this is crucial for proving causation. It also includes a comparative design that ensures that all other influences (environmental, psychological, etc.) are operating equally on all groups.
  o This allows us to conclude that any difference between groups is due either to random chance or to the treatments, so a difference larger than chance alone would explain can be attributed to the treatments (causation).
  o Experiments should have a control, assign randomly, and replicate (have enough subjects to reduce variation).
Completely randomized designs: all subjects are randomly allocated among all the treatments.
Matched Pairs Design: compares exactly 2 treatments, either by using a pair of closely matched individuals or by using each individual twice. It then randomizes the two treatments within each pair, or randomizes the order of the treatments if it's the same individual.
Sometimes it makes sense to group subjects based on similar, known characteristics. This is the idea behind a block design.
Block Designs: designs in which the random assignment of individuals to treatments is done separately within each block.
Block: a group of individuals that are known before the experiment to be similar in some way that is expected to affect the response to the treatments.
  o Treatments are not assigned to blocks; they are assigned to the individuals within each block.
Other Experimental Considerations:
  o Blind study: don't tell the subjects what treatment they're getting.
  o Double-blind study: neither the subjects nor the people administering the treatments know which treatment is being given.
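The three randomization schemes above (completely randomized, block, and matched pairs) can be sketched with Python's standard `random` module. This is a rough illustration, not a prescribed procedure: the patient IDs, the two blocks, the pairing, and the helper function names are all hypothetical, and the treatment labels are borrowed from the gastric freezing example.

```python
import random
from collections import Counter

def completely_randomized(subjects, treatments, rng):
    """Shuffle all subjects, then deal them out to the treatments in turn."""
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    return {s: treatments[i % len(treatments)] for i, s in enumerate(shuffled)}

def randomized_block(blocks, treatments, rng):
    """Randomize separately inside each block of similar subjects."""
    assignment = {}
    for members in blocks.values():
        assignment.update(completely_randomized(members, treatments, rng))
    return assignment

def matched_pairs(pairs, two_treatments, rng):
    """Within each matched pair, randomly decide who gets which treatment."""
    assignment = {}
    for first_subject, second_subject in pairs:
        t1, t2 = rng.sample(two_treatments, 2)  # the two treatments in random order
        assignment[first_subject] = t1
        assignment[second_subject] = t2
    return assignment

rng = random.Random(4)  # arbitrary seed, for repeatability only
treatments = ["gastric freezing", "sham procedure", "no procedure"]
patients = [f"patient_{i:02d}" for i in range(1, 13)]  # 12 hypothetical patients

crd = completely_randomized(patients, treatments, rng)

# Hypothetical blocks: two groups of six patients known to be similar.
blocks = {"block_A": patients[:6], "block_B": patients[6:]}
rbd = randomized_block(blocks, treatments, rng)

# Hypothetical matched pairs comparing exactly two treatments.
pairs = [(patients[i], patients[i + 1]) for i in range(0, 12, 2)]
mp = matched_pairs(pairs, ["gastric freezing", "sham procedure"], rng)

print(Counter(crd.values()))
print(rbd)
print(mp)
```

In the block version each treatment is guaranteed to appear equally often inside every block, which a completely randomized design only guarantees across the whole group.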