PhD Postcomprehensive Registration

by: Perry Reinger



About this Document

Class Notes

This 145-page set of class notes was uploaded by Perry Reinger on Friday, October 23, 2015. The notes belong to 000 000 at the University of Iowa, taught by Staff in Fall. Since its upload, it has received 15 views. For similar materials see /class/228002/000-000-university-of-iowa in Graduate College Post Comp, Etc at the University of Iowa.


Date Created: 10/23/15
Introduction to Objects

Motivation
- Vast and growing repositories of data available:
  - genomic sequence and annotation: http://genome.ucsc.edu, http://www.ensembl.org
  - literature: http://www.ncbi.nih.gov/pubmed
  - expression (microarray): http://www.ncbi.nih.gov/geo
  - structure: http://www.ncbi.nih.gov/Structure
- Still being sorted out: 100 / 22,000 / 2,200,000 data points (locally)
- Exelixis (Geoff Duyk): does 1,000/year; extrapolates to 3,456,000/day
- More than 12,000,000 citations in PubMed
  [Chart: PubMed searches per month, roughly 35,000-45,000, Jan 1997 through Jul 2003]

Structure
- MMDB (Molecular Modeling DB) currently contains 15,000 structure entries, corresponding to 35,000 chains and 50,000 3D domains.
- And, oh by the way, the information was only contained in the literature.
- Therefore, in some instances there is strong interconnection between these data resources, but there are also many false/erroneous relationships.
  (http://nar.oupjournals.org/cgi/content/full/30/1/249)

Structure Example
- Web page (pex5): http://www.ncbi.nih.gov/Structure
- Structure viewer; data file

What does this all have to do with "Bioinformatics Techniques"?
- Accessing data: most of this data resides in databases.
- Extracting/accessing data: manually, or programmatically (write programs to perform operations on sets of data that could not be performed any other way):
  - download data, mirror data, process, analyze
  - query databases directly for smaller sets
  - "spider" data: write a program that "follows" html links to access data being served in web pages (HGMD)
  - locally generated data, to contribute to the community
  - pipeline processing

caBIG: cancer Biomedical Informatics Grid (cabig.nci.nih.gov)
- The cancer Biomedical Informatics Grid, or caBIG, is an informatics infrastructure that is connecting teams of cancer and biomedical researchers together to enable them to better develop and share tools and data in an open environment with common standards. caBIG is creating a voluntary virtual network, or grid, that links
individuals and institutions, both nationally and internationally, effectively forming a World Wide Web of cancer research. caBIG will allow researchers to answer research questions more rapidly and efficiently, thereby promising to accelerate progress in all aspects of cancer research, from etiologic research to prevention, early detection, and treatment. Ultimately, because caBIG will provide a common, unifying force that facilitates progress in cancer research and care, the most important beneficiaries will be cancer patients and the public at large.

caBIG: another model
- 63 Centers (2007)
- The model lends itself well to ideas such as "sharing" of workflows, automatic capture of workflows, and auto-detection of data, tools, and analytical resources.
- However, there is still a large gap between theory and reality.
  [Map: caBIG member centers across the United States]

Modules: a brief intro
- A Perl module is a collection of subroutines and variables that typically implements some common functionality, "packaged" to facilitate reusability.
- Modules exist to perform tasks such as:
  - access databases (Ensembl)
  - read data formats (GenBank records, BLAST files)
  - special/custom processing
  - generate an image (of a database)

Example of Invocation Arguments
- From unix: grep pattern file
- grep is the application; pattern is an invocation argument, and file is an invocation argument.

DEMO: grep foreach array/Array.pl

  foreach $i (@numbers)   # $i becomes a reference to an array
  foreach $j (@$i)        # @$i dereferences $i, a reference to an array

Invocation Arguments: command line parameters
- Run program: program.pl flag1 flag2 param1 param2
- Ex: translate.pl sequence 50 10
- Special variable: all command line parameters are automatically stored in the array @ARGV, which behaves just like a normal array.

Invocation Arguments in Perl
- @ARGV captures the invocation variables when the
program is executed from the command line:

  #!/usr/bin/perl
  print "The invocation arguments were:\n";
  foreach $i (@ARGV) {
      print "$i\n";
  }

How do we implement something like 'grep' that uses invocation arguments?

  #!/usr/bin/perl
  # grep.pl - example of invocation arguments
  $pattern = shift @ARGV;   # get the pattern
  $file = shift @ARGV;      # get the file name
  open(FH, $file);
  while ($line = <FH>) {
      if ($line =~ m/$pattern/) {
          print "$line";
      }
  }

@INC: how Perl "finds" things
- The "search path" is given in a special variable called @INC.
- perl -V prints (among lots of other stuff) the contents of @INC: here, a set of local bioperl/ensembl module directories under the user's home directory, the system directories such as /usr/lib/perl5/5.8.0, /usr/lib/perl5/site_perl, and /usr/lib/perl5/vendor_perl (each with i386-linux-thread-multi variants), and finally "." (dot).

Sample Output
DEMO: grep.pl foreach array/Array.pl

  foreach $i (@numbers)   # $i becomes a reference to an array
  foreach $j (@$i)        # @$i dereferences $i, a reference to an array

Invocation Arguments: the "-n" flag
- grep -n foreach array/Array.pl

  19: foreach $i (@numbers)   # $i becomes a reference to an array
  21: foreach $j (@$i)        # @$i dereferences $i, a reference to an array

- grep foreach array/Array.pl -n produces the same numbered output: the flag's position does not matter to grep.

Added Complexity

  #!/usr/bin/perl
  # grep1.pl - example of invocation arguments
  $linenums = 0;
  $pattern = 0;
  $file = 0;
  # this ASSUMES correct ORDER of arguments
  while ($param = shift @ARGV) {
      if ($param eq "-n") {
          $linenums = 1;        # flag to be used to generate line numbers
      } elsif (!$pattern) {
          $pattern = $param;
      } elsif (!$file) {
          $file = $param;
      } else {
          print "Error\n";
          exit(1);
      }
  }
  $count = 0;
  open(FH, $file);
  while ($line = <FH>) {
      $count++;
      if ($line =~ m/$pattern/) {
          if ($linenums) {
              print "$count: $line";
          } else {
              print "$line";
          }
      }
  }

DEMO: grep1.pl foreach array/Array.pl
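The grep.pl/grep1.pl logic above is Perl; as a cross-check of the same idea (pull a pattern and a file name out of the invocation arguments and honor an optional -n flag), here is a minimal Python sketch. The function names are mine, not from the slides, and the argument list is simulated rather than read from sys.argv.

```python
import re

def parse_args(argv):
    """grep1.pl-style parsing: optional -n flag, then pattern, then file name."""
    line_numbers = "-n" in argv                 # flag may appear anywhere
    rest = [a for a in argv if a != "-n"]
    return rest[0], rest[1], line_numbers       # assumes pattern-then-file order

def grep(pattern, lines, line_numbers=False):
    """Return the lines matching pattern, numbered like grep -n if asked."""
    hits = []
    for count, line in enumerate(lines, start=1):
        if re.search(pattern, line):
            hits.append(f"{count}: {line}" if line_numbers else line)
    return hits

# Simulated invocation arguments (in a real script they come from sys.argv[1:]):
pattern, filename, numbered = parse_args(["-n", "foreach", "array/Array.pl"])

demo = ["my @numbers;", "foreach $i (@numbers) {", "  print $i;", "}"]
print(grep(pattern, demo, numbered))    # ['2: foreach $i (@numbers) {']
```

As in grep1.pl, positional arguments must arrive in order, but the flag itself can appear anywhere, which is exactly the bookkeeping Getopt::Long automates.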
Perl Module: Getopt::Long

  #!/usr/bin/perl
  # grep2.pl - example of invocation arguments
  use Getopt::Long;
  my $linenums = 0;
  my $pattern = 0;
  my $file = 0;
  &GetOptions("n"   => \$linenums,  # -n is a command line argument; no value, $linenums set to 1
              "e=s" => \$pattern,   # -e takes a string (=s), stored in $pattern
              "f=s" => \$file);     # -f takes a string, stored in $file (=i for integer)
  $count = 0;
  open(FH, $file);
  while ($line = <FH>) {
      $count++;
      if ($line =~ m/$pattern/) {
          if ($linenums) {
              print "$count: $line";
          } else {
              print "$line";
          }
      }
  }

man Getopt::Long

  NAME
      Getopt::Long - Extended processing of command line options
  SYNOPSIS
      use Getopt::Long;
      my $data   = "file.dat";
      my $length = 24;
      my $verbose;
      $result = GetOptions("length=i" => \$length,    # numeric
                           "file=s"   => \$data,      # string
                           "verbose"  => \$verbose);  # flag
  DESCRIPTION
      The Getopt::Long module implements an extended getopt function called
      GetOptions. This function adheres to the POSIX syntax for command ...

Getopt::Long
- Parses command line arguments off of @ARGV.
- "length=i": a command line argument expecting an integer; $length is the option variable.
- "file=s": a command line argument expecting a string; $data is the option variable.
- "verbose": a command line argument with no value; $verbose gets a 1.

Sample Output
DEMO: grep2.pl -n -f array/Array.pl -e foreach

  19: foreach $i (@numbers)   # $i becomes a reference to an array
  21: foreach $j (@$i)        # @$i dereferences $i, a reference to an array

Building Larger Programs
- Write a subroutine called print_formatted_sequence: input is a sequence string and a line length; output is the sequence printed in numbered, fixed-width lines (positions 1, 51, 101, 151, ... at the left margin, with the bases in groups of 10).
- Now I use this subroutine in 100's of programs.
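The slide describes print_formatted_sequence only in outline. Here is a sketch of what such a formatter could look like, in Python for brevity; the parameter names, and the assumption that bases are grouped in tens (inferred from the 1/51/101 numbering of the sample output), are mine.

```python
def print_formatted_sequence(seq, width=50, group=10):
    """Format seq into fixed-width lines, each prefixed with its 1-based
    starting position and with the bases split into groups of `group`."""
    out_lines = []
    for start in range(0, len(seq), width):
        chunk = seq[start:start + width]
        groups = [chunk[i:i + group] for i in range(0, len(chunk), group)]
        out_lines.append(f"{start + 1:>4} {' '.join(groups)}")
    return "\n".join(out_lines)

# A small demo: 75 bases printed 50 to a line.
print(print_formatted_sequence("ATGCCCCCCCGGGCG" * 5, width=50))
```

Keeping the formatting in one shared function is exactly the slide's point: when the layout needs to change, only one definition is edited, not 100's of programs.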
- Later I determine that it is annoying how the first line is offset from the second, etc. To fix this I have to edit 100's of programs!
- Need a way to save the code ONCE and share it with 100's of programs.

Sharing Code
- Library of sequence processing subroutines in a file called "sequenceprop".
- These subroutines may be accessed by "do":

  do "sequenceprop";

- This is the same as if the subroutine were in the new program:

  #!/usr/bin/perl
  # formattedseq.pl
  do "sequenceprop";
  $seq = "ATGCCCCCCCGGGCG";
  &print_formatted_sequence($seq, 50);

Packages / Modules / Libraries
- Library: generic term for shared code (subroutines, variables).
- Package: a Perl declaration; often called a Perl Module.
- do "filename": incorporate code (subroutines, variables) into a program.
- require "filename": incorporate code into a program, with checking for duplication (e.g. require "MySeq.pl").
- use File: incorporate code that has been declared as a module in File.pm (e.g. use Seq, where Seq.pm is a module file).


22S:30/105 Statistical Methods and Computing
Probability Distributions
Lecture 10, Feb. 23, 2009
Kate Cowles, 374 SH, 335-0727, kcowles@stat.uiowa.edu

Random variables
- variable: a characteristic that can be measured or categorized; it takes on different values for different members of a population or of a sample.
- random variable: a numeric quantity that takes different values depending on chance; i.e., the value of a random variable is a numeric outcome of a random phenomenon. Usually denoted by uppercase letters near the end of the alphabet.
- discrete random variable: a random variable for which there exists a discrete set of possible values.
  - Example: Let X be a random variable that represents the number of episodes of otitis media in the first 2 years of a child's life. Then X is a discrete random variable, which can take on values 0, 1, 2, ...
- continuous random variable: a random variable that is not discrete; i.e., whose values fall naturally on a continuum.
  - Example: Let X be the random variable that measures cumulative lifetime exposure of shipyard
workers to radiation. In one study, values of this random variable varied from 0.000 rem to 91.414 rem, which may be regarded as taking on an essentially infinite number of values.

Probability distributions
- A probability distribution describes the behavior of a random variable.
- The probability distribution of a discrete random variable specifies all the possible values of the variable, along with the probability of each.
  - Example: number of middle ear infections experienced by children between the ages of birth and 2 years:

      x    P(X = x)
      0    0.129
      1    0.264
      2    0.271
      3    0.185
      4    0.095
      5    0.039
      6    0.017

  - Note that the probabilities for all the possible values must sum to 1.

Assigning probabilities to intervals of outcomes
- For a continuous random variable, possible values are on a continuum. We cannot assign an individual probability to each possible value, because there are infinitely many possible values.
- Solution: use a density curve to assign probabilities to intervals of values.
  - Example: Suppose we draw a birth record at random from a nationwide medical database. What is the probability that the birth weight of the infant was between 80 and 96 ounces, or between 5 and 6 pounds?

Using density curves to describe the distribution of values of a quantitative variable
- density curve: a curve that describes the overall pattern of a distribution.
- The total area under a probability density curve is 1.0.
- The curve never drops below the horizontal axis.
- The normal curve is only one example.

Measures of center and spread can be used to describe density curves
- To distinguish between these measures in the idealized curve vs. in actual sample data, we use different symbols:
  - μ for the mean of a density curve
  - σ for the standard deviation of a density curve
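The ear-infection table can be checked directly in code: a valid discrete distribution must have probabilities summing to 1, and its population mean (expected value) is the sum of x times P(X = x). This is a quick sketch, not part of the lecture.

```python
# Probabilities copied from the slide's middle-ear-infection table.
dist = {0: 0.129, 1: 0.264, 2: 0.271, 3: 0.185, 4: 0.095, 5: 0.039, 6: 0.017}

total = sum(dist.values())                   # must be 1 for a valid distribution
mean = sum(x * p for x, p in dist.items())   # expected number of infections

print(round(total, 3), round(mean, 3))
```

The probabilities do sum to 1, and the expected number of infections works out to about 2.04 per child.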
Example: For women in the US between 18 and 74 years of age, diastolic blood pressure follows a normal distribution with mean μ = 77 mm Hg and standard deviation σ = 11.6 mm Hg. We want to know the proportion of US women in this age group who have dbp between 60 and 100.

1. Call the variable representing a woman's dbp X, and call the specific value for an individual woman x. X has a normal distribution with μ = 77 and σ = 11.6. We want to compute the proportion of women such that 60 ≤ X ≤ 100.
2. Standardize x to produce z, a draw from a standard normal distribution:

   60 ≤ X ≤ 100
   (60 - 77)/11.6 ≤ (X - μ)/σ ≤ (100 - 77)/11.6
   -1.47 ≤ Z ≤ 1.98

3. Use Table A to find the proportion of Z values ≤ -1.47, which is .0708, and the proportion of Z values ≤ 1.98, which is .9761.
4. So the percent of women with diastolic blood pressure between 60 and 100 is about .9761 - .0708 = .9053, i.e. 90.5%.

Normal calculations: going the other direction
- What is the value of dbp such that 10% of women have values greater than or equal to it?
  1. Use Table A to find the z-score such that 10% of a standard normal population would have values greater than or equal to it. This is the same value such that 90% of values are less than or equal to it, namely z = 1.28.
  2. Convert z = 1.28 into x: x = μ + zσ = 77 + 11.6(1.28) = 91.85.

Comparing values from two populations
- We have been using normal distributions to describe populations.
- We can use z-scores to compare values from 2 different populations that follow normal distributions.
- Example:
  - Former NBA superstar Michael Jordan is 78 in tall.
  - WNBA basketball player Rebecca Lobo is 76 in tall.
  - Data from the National Health Survey indicate that men's heights are roughly normally distributed with a mean of 69.0 in and standard deviation of 2.8 in, while women's heights are approximately normally distributed with mean 63.6 in and standard deviation 2.5 in.
  - General formula for unstandardizing a z-score: x = μ + zσ.
  - Is Michael Jordan's height among men more or less extreme than Rebecca Lobo's height among women? Show any calculations that justify your answer.

Another density curve for continuous outcomes
- The uniform density curve distributes probability evenly over the interval from 0 to 1.
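The blood-pressure calculation above can be reproduced without Table A by writing the standard normal CDF in terms of math.erf. This is a sketch to verify the slide's arithmetic, not part of the lecture.

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 77.0, 11.6            # dbp for US women aged 18-74, from the slide

z_lo = (60 - mu) / sigma          # about -1.47
z_hi = (100 - mu) / sigma         # about  1.98
prop = phi(z_hi) - phi(z_lo)      # about 0.905, matching .9761 - .0708

# Going the other direction: the dbp cutting off the top 10% is
# x = mu + z * sigma with z = 1.28 from Table A.
cutoff = mu + 1.28 * sigma        # about 91.85

print(round(prop, 3), round(cutoff, 2))
```

The same phi function answers the Jordan/Lobo question: compute z = (78 - 69.0)/2.8 for Jordan and z = (76 - 63.6)/2.5 for Lobo and compare the two z-scores directly.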
Where do probability distributions come from?
- Informally, we may think of a probability distribution as a model based on an infinitely large sample.
- In some cases there is previous data on the same type of random variable, in a large enough number of observations that we can use it to compute a probability distribution.
- In other cases, we try to use a well-known theoretical probability distribution and see how well it fits with some sample data.

Review of terminology
- A population is the entire set of items that we would like to investigate or draw conclusions about.
- A population parameter is a numeric quantity that describes a characteristic of a population.
  - The exact value of a parameter can be obtained only if the values of a variable are known for every single item in a population.
  - Population parameters are usually designated by Greek letters.

The location and spread of a random variable
- The population mean, or expected value, of a random variable is the average value assumed by a random variable; it is analogous to the arithmetic mean x̄ in a sample.
- The population variance and population standard deviation are measures of the dispersion of values of the random variable around the population mean.

- A sample is a subset of items that is selected from a population.
  - The sample is of a manageable size, so we can actually measure the values of the variable of interest for all members of the sample.
- A simple random sample is a sample drawn in such a way that every item in the population has an equal chance of being selected.
- Statistical inference is the process of drawing conclusions and making decisions about a population based on information contained in a sample drawn from that population.
  - The methods of statistical inference that we will study in this class assume that the sample being used is a simple random sample.
- A sample statistic is a number that can be computed from sample data without our having to know any unknown parameters. We often use a statistic to estimate the value of an unknown parameter.

Example 2: We wish to study body fat levels
in Chinese adult males. The particular variable that we are interested in is the upper arm skinfold thickness, in mm.
- The population of interest is all Chinese males aged 18-24. There are approximately 300,000,000 of them.
- The parameter of interest is μ, the population mean of upper arm skinfold thickness. We will never know the exact value of this parameter, because we cannot measure all members of the population.
- We will take a random sample of Chinese males and determine the sample mean x̄ of upper arm skinfold thickness.
- We will use x̄ to estimate the unknown μ.

Example 1: The Current Population Survey reported the mean income of the sample of households they interviewed to be x̄ = $49,692.
- The number $49,692 is a statistic because it describes the particular sample of households included in the CPS.
- The population that the poll wants to draw conclusions about is all 103 million US households.
- The parameter of interest is the mean income of all these 103 million households. We do not know the value of this parameter.


55:035 Computer Architecture and Organization
Lecture 8

Outline: Virtual Memory
- Basics
- Address Translation
- Cache vs. VM
- Paging
- Replacement
- TLBs
- Segmentation
- Page Tables

The Full Memory Hierarchy (capacity, access time, cost)
- CPU Registers: 100s of bytes, <10 ns (staging unit: instr. operands, prog./compiler, 1-8 bytes)
- Cache: K bytes, 10-100 ns, .1-.01 cents/bit (staging unit: blocks, cache controller, 8-128 bytes)
- Main Memory: M bytes, 200 ns-500 ns, .0001-.00001 cents/bit (staging unit: pages, OS, 4K-16K bytes)
- Disk: G bytes, 10 ms (10,000,000 ns), 10^-5 - 10^-6 cents/bit (staging unit: files, user/operator, Mbytes)
- Tape: infinite capacity, sec-min, 10^-8 cents/bit
- Upper levels are faster; lower levels are larger.

Virtual Memory
- Some facts of computer life:
  - Computers run lots of processes simultaneously.
  - There is no full address space of memory for each process.
  - Processes must share smaller amounts of physical memory
among many processes.
- Virtual memory is the answer: it divides physical memory into blocks and assigns them to different processes.

Virtual Memory
- Virtual memory (VM) allows main memory (DRAM) to act like a cache for secondary storage (magnetic disk).
- VM address translation provides a mapping from the virtual address of the processor to the physical address in main memory or on disk.
- The compiler assigns data to a virtual address (VA); the VA is translated to a real/physical address somewhere in memory. This allows any program to run anywhere; "where" is determined by a particular machine and OS.

VM Benefits
- Allows multiple programs to share the same physical memory.
- Allows programmers to write code as though they have a very large amount of main memory.
- Automatically handles bringing in data from disk.

Virtual Memory Basics
- Programs reference virtual addresses in a non-existent memory.
  - These are then translated into real physical addresses.
  - The virtual address space may be bigger than the physical address space.
- Divide physical memory into blocks called pages: anywhere from 512 bytes to 16 MB (4K is typical).
- Virtual-to-physical translation is done by indexed table lookup.
  - Add another cache for recent translations: the TLB.
- Invisible to the programmer: it looks to your application like you have a lot of memory.

VM Page Mapping
[Figure: Process 1's and Process 2's virtual address spaces map onto pages of physical memory and of the disk.]

VM Address Translation
- A virtual address splits into a 20-bit virtual page number and a 12-bit page offset (log2 of the page size).
- A per-process page table, located via the page table base register, holds for each virtual page: a valid bit, protection bits, a dirty bit, a reference bit, and the physical page number.
- The physical page number concatenated with the page offset gives the address sent to physical memory.

Example of Virtual Memory
- Relieves the
problem of making a program that was too large to fit in physical memory fit.
- Allows a program to run in any location in physical memory: this is called relocation. Really useful, as you might want to run the same program on lots of machines.
- [Figure: virtual addresses 0-28K map through the page table to physical main memory (0-16K) and to disk. The logical program is in contiguous VA space, here consisting of 4 pages: A, B, C, D. Three of the pages are in main memory and 1 is located on the disk.]

Cache terms vs. VM terms
- Some definitions/analogies:
  - A page or segment of memory is analogous to a block in a cache.
  - A page fault or address fault is analogous to a cache miss: if we go to main (real/physical) memory and our data isn't there, we need to get it from disk.

More definitions and cache comparisons
- These are more definitions than analogies:
  - With VM, the CPU produces virtual addresses that are translated by a combination of HW/SW to physical addresses.
  - The physical addresses access main memory.
  - The process described above is called memory mapping or address translation.

Cache vs. VM comparisons (1/2)

  Parameter          First-level cache    Virtual memory
  Block (page) size  12-128 bytes         4096-65,536 bytes
  Hit time           1-2 clock cycles     40-100 clock cycles
  Miss penalty       8-100 clock cycles   700,000-6,000,000 clock cycles
    (access time)    6-60 clock cycles    500,000-4,000,000 clock cycles
    (transfer time)  2-40 clock cycles    200,000-2,000,000 clock cycles
  Miss rate          0.5-10%              0.00001-0.001%
  Data memory size   0.016-1 MB           4 MB-4 GB

Cache vs. VM comparisons (2/2)
- Replacement policy:
  - Replacement on cache misses is primarily controlled by hardware.
  - Replacement with VM (i.e., which page do I replace?) is usually controlled by the OS.
  - Because of the bigger miss penalty, we want to make the right choice.
- Sizes:
  - The size of the processor address determines the size of VM.
  - Cache size is independent of the processor address size.
Virtual Memory timing
- Timing's tough with virtual memory:
  - AMAT = Tmem + (1 - h) × Tdisk = 100 ns + (1 - h) × 25,000,000 ns
  - The hit rate h would have to be incredibly (almost unattainably) close to perfect for this to work.
- So VM is a cache, but an odd one.

Pages

Paging Hardware
[Figure: the CPU issues a virtual address; the page number indexes the page table to obtain a frame number, which together with the offset addresses physical memory. How big is a page? How big is the page table?]

Address Translation in a Paging System
[Figure: virtual address = page number + offset; the page table register points at the per-process page table; the selected entry supplies the frame number, which is combined with the offset to form the main-memory address of the page frame.]

How big is a page table?
- Suppose a 32-bit architecture with a page size of 4 kilobytes.
- Then the address splits into a page number of 2^20 and an offset of 2^12.

Test Yourself
- A processor asks for the contents of virtual memory address 0x10020. The paging scheme in use breaks this into a VPN of 0x010 and an offset of 0x020. PTR (a CPU register that holds the address of the page table) has a value of 0x100, indicating that this process's page table starts at location 0x100. The machine uses word addressing, and the page table entries are each one word long.

  ADDR      CONTENTS
  0x00000   0x00000
  0x00100   0x00010
  0x00110   0x00022
  0x00120   0x00045
  0x00130   0x00078
  0x00145   0x00010
  0x10000   0x03333
  0x10020   0x04444
  0x22000   0x01111
  0x22020   0x02222
  0x45000   0x05555
  0x45020   0x06666

- Memory reference: VPN 0x010, offset 0x020.
- What is the physical address calculated?
  a) 0x10020  b) 0x22020  c) 0x45000  d) 0x45020  e) none of the above
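The Test Yourself walk can be checked in code. This sketch copies the memory contents and register values from the slide and performs the lookup with plain shifts and masks (the 12-bit page offset follows the 4 KB page size used throughout the lecture).

```python
# Memory contents copied from the slide's Test Yourself table.
memory = {
    0x00000: 0x00000, 0x00100: 0x00010, 0x00110: 0x00022, 0x00120: 0x00045,
    0x00130: 0x00078, 0x00145: 0x00010, 0x10000: 0x03333, 0x10020: 0x04444,
    0x22000: 0x01111, 0x22020: 0x02222, 0x45000: 0x05555, 0x45020: 0x06666,
}

PAGE_BITS = 12                          # 4 KB pages: the low 12 bits are the offset

def translate(vaddr, ptr):
    """One page-table walk: look up the entry, then form the physical address."""
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    frame = memory[ptr + vpn]           # memory access 1: the page-table entry
    return (frame << PAGE_BITS) | offset

paddr = translate(0x10020, ptr=0x100)
data = memory[paddr]                    # memory access 2: the data itself
print(hex(paddr), hex(data))            # 0x22020 0x2222
```

So the physical address is 0x22020 (entry 0x100 + 0x10 = 0x110 holds frame 0x22), the contents returned are 0x2222, and two memory accesses were needed in total: one for the page-table entry and one for the data. That doubling of accesses is exactly the cost the TLB, introduced below, exists to hide.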
- What are the contents of this address, returned to the processor?
- How many memory accesses in total were required to obtain the contents of the desired address?

Another Example
[Figure: logical memory pages A-P are mapped through a page table into physical memory frames.]

Replacement policies: block replacement
- Which block should be replaced on a virtual memory miss?
  - Again, we'll stick with the strategy that it's a good thing to eliminate page faults.
  - Therefore, we want to replace the LRU block.
  - Many machines use a "use" or "reference" bit, periodically reset, which gives the OS an estimation of which pages are referenced.

Writing a block
- What happens on a write?
  - We don't even want to think about a write-through policy: the time for accesses involving VM (hard disk, etc.) is so great that this is not practical.
  - Instead, a write-back policy is used, with a dirty bit to tell whether a block has been written.

Mechanism vs. Policy
- Mechanism: paging hardware; trap on page fault.
- Policy:
  - fetch policy: when should we bring in the pages of a process? (1) load all pages at the start of the process, or (2) load only on demand (demand paging).
  - replacement policy: which page should we evict, given a shortage of frames?

Replacement Policy
- Given a full physical memory, which page should we evict? What policy?
  - Random
  - FIFO (first-in-first-out)
  - LRU (least-recently-used)
  - MRU (most-recently-used)
  - OPT (will-not-be-used-farthest-in-future)

Replacement Policy Simulation
- Example sequence of page numbers: 0 1 2 3 4 2 2 3 7 1 2 3
- FIFO? LRU? OPT?
- How do you keep track of LRU info? (Another data structure question.)
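The simulation suggested above can be run in a few lines. This sketch counts page faults for the slide's reference string under FIFO and LRU; the choice of 4 frames is mine (the slide leaves the frame count open), and a deque doubles as the "which page is oldest / least recent" data structure the slide asks about.

```python
from collections import deque

def count_faults(refs, frames, policy):
    """Count page faults for a reference string under FIFO or LRU."""
    resident, order, faults = set(), deque(), 0
    for page in refs:
        if page in resident:
            if policy == "LRU":            # a hit refreshes recency under LRU
                order.remove(page)
                order.append(page)
            continue
        faults += 1
        if len(resident) >= frames:
            victim = order.popleft()       # FIFO: oldest in; LRU: least recent
            resident.remove(victim)
        resident.add(page)
        order.append(page)
    return faults

refs = [0, 1, 2, 3, 4, 2, 2, 3, 7, 1, 2, 3]   # the slide's example sequence
print(count_faults(refs, 4, "FIFO"), count_faults(refs, 4, "LRU"))
```

With 4 frames, FIFO takes 9 faults while LRU takes 7: the hits on pages 2 and 3 mid-sequence keep them resident under LRU but do nothing for them under FIFO.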
Page tables and lookups
- Problem 1: it's slow! We've turned every access to memory into two accesses to memory.
  - Solution: add a specialized cache, called a translation lookaside buffer (TLB), inside the processor.
- Problem 2: it's still huge!
  - Even worse, we're ultimately going to have a page table for every process. Suppose 1024 processes: that's 4 GB of page tables!

Paging/VM (1/3)
[Figure: the operating system maintains the page table that maps the program's virtual addresses to physical memory.]

Paging/VM (2/3)
- Place the page table in physical memory.
- However: this doubles the time per memory access!

Paging/VM (3/3)
- Add a special-purpose cache for translations.
- Historically called the TLB: Translation Lookaside Buffer.

Translation Cache
- Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped.
- TLBs are usually small: typically not more than 128-256 entries, even on high-end machines. This permits fully associative lookup on those machines. Most mid-range machines use small n-way set-associative organizations.
- Note: 128-256 entries times 4KB-16KB/entry spans only 512KB-4MB; the L2 cache is often bigger than the span of the TLB.
[Figure: CPU issues a VA; on a TLB hit the translation is immediate; on a miss the page table supplies the translation; the translated address then goes to the cache/main memory for data.]

Translation Cache
- A way to speed up translation is to use a special cache of recently used page table entries. This has many names, but the most frequently used is Translation Lookaside Buffer, or TLB.
- A TLB entry holds: Virtual Page | Physical Frame | Dirty | Ref | Valid | Access.
- It is really just a cache (a special-purpose cache) on the page table mappings.
- TLB access time is comparable to cache access time: much less than main memory access time.
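The "cache on the page table mappings" idea can be sketched as a small dictionary of recent VPN-to-frame translations checked before any page-table walk. This is a toy model only: real TLBs are set-associative hardware with valid, dirty, and protection bits per entry, and the capacity and class shape here are my choices.

```python
from collections import OrderedDict

class TLB:
    """A toy fully-associative TLB with LRU eviction."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()          # VPN -> frame, in LRU order

    def lookup(self, vpn):
        if vpn in self.entries:
            self.entries.move_to_end(vpn)     # refresh recency on a hit
            return self.entries[vpn]          # TLB hit: no page-table walk
        return None                           # TLB miss: must walk the page table

    def insert(self, vpn, frame):
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        self.entries[vpn] = frame

tlb = TLB(capacity=2)
tlb.insert(0x10, 0x22)
print(tlb.lookup(0x10), tlb.lookup(0x11))     # 0x22 hits, 0x11 misses (None)
```

On a miss the handler would walk the page table, insert the translation, and retry, so repeated references to the same page pay the two-access cost only once.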
An example of a TLB
- Page address <30 bits> | offset <13 bits>; read/write policies and permissions: V, R, W bits <1><2><2>; Tag <30 bits>; PhysAddr <21 bits>.
- The low-order 13 bits of the address are the offset; the 21-bit physical page number plus the 13 offset bits give a 34-bit physical address.

The big picture and TLBs
- Address translation is usually on the critical path, which determines the clock cycle time of the uP.
- Even in the simplest cache, TLB values must be read and compared.
- The TLB is usually smaller and faster than the cache address-tag memory, so that multiple TLB reads don't increase the cache hit time.
- TLB accesses are usually pipelined because they are so important.
[Flowchart: virtual address → TLB access; on a TLB miss, try to read from the page table; on a page fault, replace the page from disk; on a cache miss, stall; finally deliver data to the CPU.]

Pages are Cached in a Virtual Memory System
- We can ask the same four questions we did about caches.
- Q1: Block placement.
  - The choice: lower miss rates with complex placement, or vice versa.
  - The miss penalty is huge, so choose a low miss rate: place a page anywhere in physical memory (similar to a fully associative cache model).
- Q2: Block addressing: use an additional data structure.
  - Fixed-size pages: use a page table.
  - Virtual page number → physical page number, then concatenate the offset.
  - A tag bit indicates presence in main memory.

Normal Page Tables
- The size is the number of virtual pages.
- The purpose is to hold the translation of VPN to PPN.
  - Permits ease of page relocation.
  - Make sure to keep tags to indicate that a page is mapped.
- Potential problem:
  - Consider a 32-bit virtual address and 4K pages: 4GB/4KB = 1M words required just for the page table!
  - We might have to page in the page table. Consider how the problem gets worse on 64-bit machines, with even larger virtual address spaces.
  - We might have multi-level page tables.
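The sizing arithmetic behind that "potential problem" is worth making explicit: a flat page table needs one entry per virtual page, which is 2^(address bits - offset bits) entries. A quick sketch:

```python
def page_table_entries(address_bits, page_bytes):
    """Entries in a flat page table: one per virtual page."""
    offset_bits = page_bytes.bit_length() - 1     # log2 of the page size
    return 2 ** (address_bits - offset_bits)

# 32-bit addresses, 4 KB pages: 4 GB / 4 KB = 1M entries (the slide's 1 MW).
print(page_table_entries(32, 4096))      # 1048576

# The same flat structure on a 64-bit machine is hopeless (2^52 entries),
# which is why multi-level and inverted page tables exist.
print(page_table_entries(64, 4096))      # 4503599627370496
```

Larger pages shrink the table (16 KB pages quarter it), which is one of the page-size trade-offs discussed later in the lecture.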
Inverted Page Tables
- Similar to a set-associative mechanism.
- Make the page table reflect the number of physical pages, not virtual pages.
- Use a hash mechanism: virtual page number → HPN index into the inverted page table.
- Compare the virtual page number with the tag to make sure it is the one you want.
  - If yes, check to see that it is in memory: OK if yes; page fault if not.
  - If not (a miss), go to the full page table on disk to get the new entry. This implies 2 disk accesses in the worst case; it trades an increased worst-case penalty for a decrease in the capacity-induced miss rate, since there is now more room for real pages with a smaller page table.
[Figure: page + offset; only entries for pages in physical memory are stored; on a tag match, frame + offset.]

Address Translation Reality
- The translation process using page tables takes too long!
- Use a cache to hold recent translations: the Translation Lookaside Buffer.
  - Typically 8-1024 entries.
  - Block size: the same as a page table entry, 1 or 2 words.
  - Only holds translations for pages in memory.
  - 1-cycle hit time; highly or fully associative; miss rate < 1%.
  - A miss goes to main memory, where the whole page table lives.
  - Must be purged on a process switch.

Back to the 4 Questions
- Q3: Block replacement (pages in physical memory).
  - LRU is best, so use it to minimize the horrible miss penalty.
  - However, real LRU is expensive. The page table contains a "use" tag; on access, the use tag is set. The OS checks them every so often, records what it sees, and resets them all. On a miss, the OS decides who has been used the least.
  - Basic strategy: the miss penalty is so huge, you can spend a few OS cycles to help reduce the miss rate.

Last Question
- Q4: Write policy.
  - Always write-back, due to the access time of the disk.
  - So you need to keep tags to show when pages are dirty and need to be written back to disk when they're swapped out.
  - Anything else is pretty silly. Remember: the disk is SLOW!

Page Sizes
- An architectural choice.
- Large pages are good:
  - reduces page table size
  - amortizes the long disk access
  - if spatial locality is good, then the hit rate will improve.
- Large pages are bad:
  - more internal fragmentation: if everything is random, each structure's last page is only half full. Half of bigger is still bigger: if there are 3 structures per process (text, heap, and control stack), then 1.5 pages are wasted per process.
  - process start-up time takes longer, since at least 1 page of each type is required prior to start; the transfer-time penalty aspect is higher.
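The Q3 use-bit scheme above is the classic "aging" approximation of LRU, and it is small enough to sketch. The class shape, the 8-bit counter width, and the page names are my choices for illustration; the mechanism (hardware sets a reference bit, the OS periodically folds it into a counter and clears it) is the one the slide describes.

```python
class AgingLRU:
    """Approximate LRU via periodically-sampled reference bits."""
    def __init__(self, pages):
        self.use = {p: 0 for p in pages}   # reference bits, set by "hardware"
        self.age = {p: 0 for p in pages}   # OS-maintained aging counters

    def touch(self, page):
        self.use[page] = 1                 # hardware sets the bit on each access

    def os_tick(self):
        # Periodic OS pass: shift each counter right, OR the reference bit
        # into the top bit, then reset all the reference bits.
        for p in self.age:
            self.age[p] = (self.age[p] >> 1) | (self.use[p] << 7)
            self.use[p] = 0

    def victim(self):
        # The page with the smallest counter has been used least recently.
        return min(self.age, key=self.age.get)

lru = AgingLRU(["A", "B", "C"])
lru.touch("A"); lru.touch("B"); lru.os_tick()   # C not referenced this interval
lru.touch("A"); lru.os_tick()                   # only A referenced this interval
print(lru.victim())                             # C
```

This captures the slide's trade-off exactly: a few OS cycles per tick buy a usable recency estimate without per-access LRU bookkeeping.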
Organization 45

Page Sizes
- An architectural choice
- Large pages are good:
  - reduces page table size
  - amortizes the long disk access
  - if spatial locality is good, then the hit rate will improve
- Large pages are bad:
  - more internal fragmentation
    - if everything is random, each structure's last page is only half full
    - half of bigger is still bigger
    - if there are 3 structures per process (text, heap, and control stack), then 1.5 pages are wasted per process
  - process start-up time takes longer, since at least 1 page of each type is required prior to start, and the transfer-time penalty is higher

More on TLBs
- The TLB must be on chip; otherwise it is worthless
  - small TLBs are worthless anyway
  - large TLBs are expensive
  - high associativity is likely
  - => the price of CPUs is going up
  - OK as long as performance goes up faster

Selecting a Page Size
- Reasons for a larger page size:
  - Page table size is inversely proportional to the page size; therefore memory is saved
  - Fast cache hit time is easy when cache size <= page size (VA caches); a bigger page makes this feasible as cache size grows
  - Transferring larger pages to or from secondary storage, possibly over a network, is more efficient
  - The number of TLB entries is restricted by clock cycle time, so a larger page size maps more memory, thereby reducing TLB misses
- Reasons for a smaller page size:
  - Want to avoid internal fragmentation: don't waste storage (data must be contiguous within a page)
  - Quicker process start for small processes: don't need to bring in more memory than needed

Memory Protection
- With multiprogramming, a computer is shared by several programs or processes running concurrently
  - Need to provide protection
  - Need to allow sharing
- Mechanisms for providing protection:
  - Provide base and bound registers: Base <= Address <= Bound
  - Provide both user and supervisor (operating system)
modes
  - Provide CPU state that the user can read but cannot write: base and bound registers, user/supervisor bit, exception bits
  - Provide a method to go from user to supervisor mode and vice versa: system call (user to supervisor), system return (supervisor to user)
  - Provide permissions for each page or segment in memory

Pitfall: Address Space Too Small
- One of the biggest mistakes that can be made when designing an architecture is to devote too few bits to the address
  - address size limits the size of virtual memory
  - difficult to change, since many components depend on it (e.g., PC, registers, effective-address calculations)
- As program size increases, larger and larger address sizes are needed:
  - 8-bit: Intel 8080 (1975)
  - 16-bit: Intel 8086 (1978)
  - 24-bit: Intel 80286 (1982)
  - 32-bit: Intel 80386 (1985)
  - 64-bit: Intel Merced (1998)

Virtual Memory Summary
- Virtual memory (VM) allows main memory (DRAM) to act like a cache for secondary storage (magnetic disk)
- The large miss penalty of virtual memory leads to different strategies from cache: fully associative placement, TLB, page tables, LRU, write-back
- Designed as:
  - paged: fixed-size blocks
  - segmented: variable-size blocks
  - hybrid: segmented paging, or multiple page sizes
- Avoid small address sizes!

Summary 2: Typical Choices

Option             TLB                  L1 Cache         L2 Cache         VM (page)
Block Size         4-8 bytes (1 PTE)    4-32 bytes       32-256 bytes     4k-16k bytes
Hit Time           1 cycle              1-2 cycles       6-15 cycles      10-100 cycles
Miss Penalty       10-30 cycles         8-66 cycles      30-200 cycles    700k-6M cycles
Local Miss Rate    0.1-2%               0.5-20%          13-15%           0.00001-0.001%
Size               32B-8KB              1-128 KB         256KB-16MB
Backing Store      L1 Cache             L2 Cache         DRAM             Disks
Q1: Placement      Fully or set assoc.  DM               DM or SA         Fully associative
Q2: Block ID       Tag/block            Tag/block        Tag/block        Table
Q3: Replacement    Random (not last)    N/A (for DM)     Random (if SA)   LRU/LFU
Q4: Writes         Flush on PTE write   Through or back  Writeback        Writeback

551035 Computer Architecture and
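The page-table-size arithmetic behind the "1M entries" figure and the address-space pitfall is easy to verify. A small sketch (the 4-byte PTE size is an illustrative assumption; the point is the growth rate, which is why real systems use multi-level or inverted tables):

```python
def flat_page_table_size(addr_bits, page_bits=12, pte_bytes=4):
    """Entries and total bytes for a one-level (flat) page table:
    one PTE per virtual page."""
    entries = 1 << (addr_bits - page_bits)
    return entries, entries * pte_bytes

# 32-bit virtual addresses, 4 KB pages: 2^20 = ~1M entries, 4 MB of PTEs
entries32, bytes32 = flat_page_table_size(32)

# 64-bit virtual addresses: 2^52 entries -- a flat table is hopeless
entries64, _ = flat_page_table_size(64)
```

This reproduces the slide's 4 GB / 4 KB = 1M figure and shows why the problem explodes on 64-bit machines.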
Organization

Future Role of Genome Amplification Testing in Transfusion Medicine
C. Michael Knudson, MD, PhD
DeGowin Blood Center, University of Iowa Hospitals and Clinics, Department of Pathology
Feb 1999
Slides originally obtained from Susan L. Stramer, PhD, American Red Cross

Future Role of Genome Amplification Testing in Transfusion Medicine: Outline
- Background information
- Discuss the window period
- Discuss the IND by ARC and Gen-Probe
- Pooling methodologies
- GAT testing methods

What is GAT?
GAT = Genomic Amplification Testing, equivalent to NAT (Nucleic Amplification Testing): the use of molecular biology techniques to detect the presence of DNA/RNA from infectious agents.

Why GAT?
- Sensitive and specific, if done properly
- Does not depend on the immune response
- Decreases the window period for detection of an infection
- High sensitivity allows pooling, which decreases the number of samples tested
- To meet the European regulatory requirements (7/1/99) for GAT-tested plasma for further manufacture

Factors which account for residual risk of viral infection following a blood transfusion
- Donation during the window period
- Failure of donors to generate an immune response
- Infection by variant viruses which do not elicit a detectable immune response
- Testing failure

Definition of Window Periods (WP)
[Figure: timeline from exposure to infectivity to serologic detection -- WP 1 is the eclipse period between exposure and viremia; WP 2 is the viremic, pre-seroconversion period. Open questions: are viremic donations infectious? Are non-viremic donations infectious (absence of infectivity)?]

Example of HIV Window Period
[Figure: HIV PCR quantitation (copies/mL, log scale) over days 0-60 after infection, showing sequential detection by PCR, then HIV-1 p24 antigen, then HIV-1/2 antibody.]

Example of HCV Window Period
[Figure: HCV viral load over time, showing PCR detection preceding antibody seroconversion.]

Window Period Risk

Virus   Window (d)   Risk (per 10^6)
HIV     16           1.48
HTLV    51           1.56
HCV     82           9.70
HBV     59           15.83

Schreiber et al., NEJM 334:1685, 1996

Projected Impact of Nucleic Acid Testing

Virus   Reduction in Window (d)   Risk (per 10^6)   % Gain
HIV     5                         1.01              23
HCV     59                        2.72              72
HBV     25                        9.12              42

Adapted from Schreiber et al., NEJM 334:1685, 1996

Window Period Donations
- Estimated number of GAT-reactive donations detected among seronegative donations: HIV 1/1,000,000; HCV 1/100,000
- Therefore, the annual yield for ARC is estimated at: HIV 6; HCV 60-300

Investigational New Drug (IND): ARC and Gen-Probe
- To evaluate the efficacy, feasibility, and performance characteristics of the Gen-Probe HIV-1/HCV TMA assay
- To meet the European regulatory requirements (7/1/99) for GAT-tested plasma for further manufacture
- To initiate testing in a way that does not compromise the availability of blood products but generates information in support of eventual GAT-based control of labile products
- Joint 2-part IND, which requires IRB approval; clinical data to be provided to FDA validating the specific intended use

Two Phases Under IND
- First phase considered evaluative
  - Conservative policies
  - Assess logistics, system impacts
  - One million donations
- Second phase dependent upon outcome of first phase
  - Some decisions yet to be made: red cells, platelets
  - Respond to reactive pool
  - Ramp up to include entire system

Product Management under the ARC IND
- Cellular components released: current serological testing
- Plasma released: GAT complete, reactives managed

GAT Reactive Donor Follow-Up Study
Purpose: to define the meaning of a GAT-reactive result
- Confirm original GAT reactivity through seroconversion (required by IND)
  - HIV-1: maximum of 3 months, weekly sampling
  - HCV: maximum of 12 months, monthly sampling
- Provide accurate health information to the donor

Recipient Management under the ARC IND
- Recipients, if transfused with a GAT-reactive product
  - Notify only when a single donation is reactive, with follow-up information provided as available (supplemental GAT results; follow-up donor GAT and serology)
  - Consider early treatment for HIV, HCV
- Recipients of prior collections from a GAT-reactive donor
  - Look-back if supplemental GAT confirms reactivity, using the GAT-reactive donation as the last seronegative

Rationale for Pooling of Samples
- High
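The projected-impact figures above follow from a simple proportionality assumption: if infections accrue at a roughly constant rate, residual risk scales with the length of the infectious window. A sketch of that arithmetic (an assumed model, not the published calculation; the HCV and HBV rows reproduce exactly under it, while the HIV row's % gain differs slightly, presumably from rounding or model details in the source):

```python
def nat_projection(window_days, risk_per_million, reduction_days):
    """Residual risk and percent gain if NAT shortens the infectious
    window by `reduction_days`, assuming risk proportional to window length."""
    remaining = window_days - reduction_days
    new_risk = risk_per_million * remaining / window_days
    pct_gain = 100.0 * reduction_days / window_days
    return round(new_risk, 2), round(pct_gain)

# HCV: 82-day window, 9.70 per 10^6; a 59-day reduction leaves 2.72 (72% gain)
# HBV: 59-day window, 15.83 per 10^6; a 25-day reduction leaves 9.12 (42% gain)
```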
sensitivity of this testing allows detection of viral nucleic acids even after extensive dilution
- Low incidence of positive results means most (nearly all) pools will be negative
- GAT remains labor intensive, so individual sampling is not currently feasible

Pooling Scheme
[Figure: hierarchical pooling -- start with 128 samples; a reactive primary pool triggers resolution testing to identify the reactive sub-pool, followed by testing of the single primary donation.]

Alternative approaches to pooling samples
[Figure: matrix pooling -- samples are arrayed on a grid; each horizontal row and each vertical column of samples is pooled and tested, so each sample is tested twice. Reactive samples are identified by noting where a reactive column and a reactive row intersect.]

GAT Steps: General Schemes
- Step One: Extraction of nucleic acid
- Step Two: Amplification of nucleic acid
- Step Three: Detection of amplified product

Considerations for the Extraction Step
- Unlike antibodies and viral antigens, viral nucleic acid is more sensitive to degradation during storage
- Thus, it may require special tubes and handling procedures
- The extraction step will likely be done initially at reference labs, where pooling and automation procedures are being developed

Considerations for the Amplification Step
- At least 4 methodologies are competing for this business:
  - PCR: Polymerase Chain Reaction
  - TMA: Transcription-Mediated Amplification
  - LCR: Ligase Chain Reaction
  - bDNA: Branched DNA assay
- Focus on highly conserved regions, and amplification of multiple regions, to assure detection of all viral variants

Considerations for the Detection Step
- Historically these have involved radioactively labeled probes for maximum sensitivity
- Non-radioactive methods are catching up and will likely be employed for GAT testing
- The Hybridization Protection Assay will be used by ARC/Gen-Probe under the IND -- the same technology used for Chlamydia/Gonorrhoeae testing

Gen-Probe HIV-1/HCV Assay Protocol
- Step One: Sample Processing (Extract RNA; 90 minutes): hybridized target captured onto microparticles
- Step Two: TMA
  - Add Amplification Reagent, Oil Reagent (10 minutes, 41.5 C)
  - Add Reverse
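The matrix-pooling scheme described above can be sketched as code. This is an illustrative model (grid size and reactivity pattern are made up): every row pool and column pool is tested once, and the candidates for individual retesting are the row/column intersections. With a single positive, the intersection is unique; with several positives, some intersections are false candidates, which is why the single primary donation is always retested.

```python
def matrix_pool_resolve(reactive, n_cols):
    """reactive: list of booleans, one per sample, laid out row-major on a
    grid n_cols wide. Returns (number of pool tests, candidate sample
    indices at reactive-row x reactive-column intersections)."""
    rows = [reactive[i:i + n_cols] for i in range(0, len(reactive), n_cols)]
    hot_rows = [i for i, row in enumerate(rows) if any(row)]
    hot_cols = [j for j in range(n_cols)
                if any(row[j] for row in rows if j < len(row))]
    tests = len(rows) + n_cols                  # one test per row/column pool
    candidates = [i * n_cols + j for i in hot_rows for j in hot_cols]
    return tests, candidates
```

For 64 samples on an 8x8 grid with one positive, 16 pool tests replace 64 individual tests and point to a single candidate, illustrating why pooling is attractive when nearly all pools are negative.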
Transcriptase, RNA Polymerase (60 minutes, 41.5 C)
- Step Three: HPA
  - Add Probe Reagent (hybridizes to amplicon; 15 minutes, 60 C)
  - Add Selection Reagent (10 minutes, 60 C)
  - Read in Luminometer

TMA amplification (IND with Gen-Probe/ARC)
[Figure: the TMA cycle -- a promoter-primer binds the viral RNA; Reverse Transcriptase forms a DNA:RNA hybrid, RNase H degrades the RNA strand, and second-strand synthesis yields dsDNA carrying an RNA polymerase promoter; RNA polymerase then transcribes 100-1000 RNA copies, each of which re-enters the cycle.]

Hybridization Protection Assay
[Figure: HPA steps -- an acridinium ester (AE)-labeled RNA probe hybridizes to the amplicon; hydrolysis destroys the AE label on unhybridized probe; detection is by chemiluminescence. AE = Acridinium Ester.]

Future Issues for Consideration
- Test methodology chosen
- Yield vs. projected yield; claims of sensitivity/specificity
- Strategies for replacement of current serologic tests: what may we expect? ALT, p24 Ag
- Impact of GAT reactives on labile product release
- Turnaround time
- False positive rate
- Outlook and projected utility of single-donation GAT testing

Conclusions
- European requirements for plasma testing are driving much of this development
- The US blood supply is very safe, and additional testing will thus provide only modest benefit
- Pooled GAT testing can achieve window-period reductions for HIV and HCV
- Initial testing will be performed on pooled samples at reference labs with slow turnaround time
- Continued improvement in methodologies will likely result in GAT testing on individual components with overnight turnaround time

Cohort Data Analysis (171:242/243)
Section 1: Role of Cohort Studies
Brian J. Smith, PhD
April 4, 2005

Table of Contents
1.1 Study Designs
  1.1.1 Cross-Sectional Study
  1.1.2 Case-Control Study
  1.1.3 Cohort Study
    Historical Cohort
    Prospective Cohort
1.2 Historical Role of Cohort Studies
  1.2.1 British Doctors Study
    Comments
  1.2.2 Bladder Cancer Study in British Chemical Industry
1.3 Strengths and Limitations
  1.3.1 Strengths
  1.3.2 Limitations
  1.3.3 Summary
1.4 Implementation
1.5 Interpretation
  Dose-Response
  Risk over Time
  1.5.1 Problems with Interpretation
1.6 Proportional Mortality Studies

1.1 Study Designs

1.1.1 Cross-Sectional Study
At one point in time, data
are collected on a sample of the population Exposure and disease prevalence information are obtained and correlations computed Such population correlation or ecological studies are useful in generating interesting hypotheses but are not normally useful in assessing basic causality in an exposuredisease relationship 112 CaseControl Study A sample of individuals with the disease cases and a sample of those without controls make up the study group Then their past exposure experience is obtained retrospectively 113 Cohort Study First identify a study group or cohort of people about whom you will collect exposure information Follow them forward in time and note disease occurrence for each individual Historical Cohort By historical records identify a group with certain exposure characteristics at some specific point of time in the past and then follow them forward towards the present recording their disease experience Example Want to study effects of exposure to levels of a carcinogen which is no longer found in manufacturing and for which historical data exist and in a group which is such a small fraction of the general population that a case control study would miss them Advantage Results may be obtained in a short amount of time Prospective Cohort Assemble cohort in the present and follow them prospectively into the future Advantage Collect exactly that information which is needed The records for a historical cohort study may have been collected for very different reasons and some information may be spotty 12 Historical Role of Cohort Studies Two landmark papers 1 Prospective cohort study of British doctors by Doll and Smith 1954 a preliminary report on tobacco smoking and lung cancer 2 Historical cohort study of Case et al 1954 and Case and Pearson 1954 on bladder cancer in the British chemical industry 121 British Doctors Study Around 1950 results of several casecontrol studies had been published including Doll and Hill 1950 demonstrating an association between lung 
cancer and cigarette smoking. In their 1954 paper, Doll and Hill made the case for further prospective studies of the exposure-disease relationship, stating that:

"In the last five years a number of studies have been made of the smoking habits of patients with and without lung cancer. All these studies agree in showing that there are more heavy smokers and fewer non-smokers among patients with lung cancer than among patients with other diseases. While, therefore, the various authors have all shown that there is an association between lung cancer and the amount of tobacco smoked, they have differed in their interpretation. Some have considered that the only reasonable explanation is that smoking is a factor in the production of the disease; others have not been prepared to deduce causation and have left the association unexplained."

Thus, a prospective cohort study was begun in 1951 to study lung cancer occurrence in a population whose smoking habits were already known.

                  Case-Control       Cohort Study
Start             April 1948         October 1951
Lung Cancers      1,488              411 men, 27 women
Total Enrollment  4,342              34,440 men, 6,194 women
Final Results     December 1952      1978 (men), 1980 (women)
References        Doll and Hill      Doll and Peto
                  1950, 1952         1976, 1978, 1980

Comments
- The case-control design was cheaper, quicker, and able to enroll more cases
- The cohort design acquired more detailed information on the health effects of smoking

1.2.2 Bladder Cancer Study in British Chemical Industry
The purpose was to determine whether the manufacture or use of aniline, benzidine, beta-naphthylamine, or alpha-naphthylamine could be shown to produce tumors of the urinary bladder in exposed males. The cohort design was chosen because:
- Only a small percentage of all bladder cancers are due to the chemical industry; a general case-control design would be uninformative
- An answer was needed urgently; current exposure levels were less than past exposure levels, so a prospective cohort study wouldn't work
A historical cohort study was the only possible approach.

1.3 Strengths and
Limitations 131 Strengths This section gives the strengths of the cohort study relative to the casecontrol design 1 Cohort study is better at establishing full range of health effects related to a particular exposure After all cohort study starts with exposed and unexposed subjects follows them through time and records all disease experiences Casecontrol starts with a particular disease and a backward look at exposure history 2 Biases a Cquot 0 Recall Bias The results of a casecontrol study are questionable if there is a possibility of recall bias Recall bias should not occur in a properly carried out cohort study Precision of Recall Suppose we have an ordinal exposure variable and there is some unbiased random error in the recalled level of exposure Suppose also that the variability of this error differs between cases and controls in a case control study Then the apparent odds ratio can be quite different from unity even when it shouldn t be Again recall bias should be minimized in a cohort study Selection Bias ln casecontrol studies this is possible if a high proportion of those contacted to be population based controls refuse lf hospital controls are used which disease categories are eligible ln cohort studies the healthyworker effect may introduce bias if the employed population is healthier has lower morbidity rates than the unemployed population Also the chances of a highlysensitive individual quitting work in a risky industry are probably higherthan an insensitive individual 80 Percent of General Population Followup Years Figure 1 Evolution of the healthy worker effect following 3 4 01 entry into a study of Swedish building workers Efficiency Cohort studies are more efficient than casecontrol when the exposure is both rare in the general population and responsible for only a small proportion of the cases This latter case rules out efficiency of the casecontrol study the first case adds to this Predisease exposure information may be impossible to determine 
retrospectively One may need blood samples to determine exposure These are rarely available for casecontrol studies but can be handled routinely in prospective cohort studies Retrospective information may be too inaccurate to be useful eg dietary recall chemical exposure recall 07 V 132 N 00 4 Cohort studies allow serial measurements of exposure This will allow not only presenceabsence of exposure but timedependent levels of exposure This increased accuracy of the exposure over time should improve our inference concerning the exposuredisease relationship Casecontrol studies are good for estimating odds ratios If one also wants to know the actual incidence morbidity rates or the absolute risk measurements a cohort study is necessary Limitations Prospective cohort studies require a great commitment over a long period of time Few people or funding agencies have such patience for any but the most important health issues Expensive Variations on cohort design are cheaper eg nested casecontrol and casecohort Historical cohort studies can only be done when the cohort of interest exists and complete accurate information on exposure as well as important confounding variables is available If the disease is sufficiently rare even a very large cohort may not develop a sufficient number of cases to make the cohort approach worth while In this case consideration of the effect or cost per case will favor the casecontrol over the cohort approach Cohorts not representative of the general population cannot give estimates even extrapolated ones of the population attributable risk Populationbased case control studies can estimate this 133 Summary Cohort 0 Provides welldefined population from which cases arise in an unbiased fashion 0 Complete covariate and exposure experience duration times levels etc are available for entire study period CaseControl o Concentrate effort on informative individuals cases and controls for whom extensive information is collected 0 Inexpensive New 
procedures that combine advantages of these two designs are being developed and implemented 14 Implementation The two main issues to consider when planning a cohort study are 1 Is the planned cohort size adequate for detection of real differences 2 How to implement the study Implementation includes consideration of 1 Inclusionexclusion criteria rules for including and excluding individuals should be clear 2 Dates for subjects 0 Date of enrollment 0 Date of first exposure often different from date of entry 0 Date last seen and vital status 3 Followup mechanism the percentage of individuals lost to followup is a measure of the quality of the study The study will be called into question if that percentage is high The purpose of followup is a Determination of personyears information who is still under observation and who is lost to followup 0 The followup mechanism may vary from country to country c Groupbased cohort labor union insurance plan pension plan professional society etc b Identification of cases 0 Death certificates 0 Cancer registry more accurate more cases more information Table 1 Number of deaths occurring from ve through 35 years after onset of work in an amosite asbestos factory 19411945 Cause of death coded in two different ways a DC death certi cate BE best evidence available b From Hammond et al 1979 c Confirmation of case information 0 Use of additional information to refine death certificate eg Xrays and asbestosrelated disease d Coding of disease 0 World Health Organization WHO members code death certificates according to current International Classification of Disease ICD Disease codes can change from one revision to another Be aware of different codings in a cohort spanning different revision periods e Assessment of disease 0 Coding an exposure variable as yesno is insufficient for a doseresponse relationship cannot infer causality or set safety standards Should quantify level of exposure as much as possible and when exposure occurred for how 
long and when it stopped Such exposure information is needed on an individual level Mean values for an entire cohort though not valueless cannot give doseresponse estimates 0 Starting and stopping dates of exposure are often easily obtained 0 Exact level of exposure may be difficult especially in historical cohorts One may have to use a categorical measurement of exposure eg low medium high Demonstrating a doseresponse relationship on such an ordinal exposure variable is possible f Information on possible confounding factors 0 Spurious results arise when confounding factors are not adjusted for in the analysis We use the term misclassification to denote that incorrect information has been collected on a variable 0 For dichotomous variables misclassification rates of 30 for the confounder can result in very little of the confounding effect being 11 removed lfthe misclassification rate is 10 then in certain situations nearly half the effect of confounding is still in place 0 Collect as accurate information as possible If this is not possible it may be better to try a less expensive approach like a casecontrol design and spend the extra money and time on gaining more accurate data g Construction of special comparison groups 0 Occasionally one needs to construct a special group apart from the cohort For example cohort consists of smoking and nonsmoking asbestos workers Need two groups smoking and nonsmoking people unexposed to asbestos Unexposed and exposed may be matched in such situations h Power considerations 0 Unless your data are merged into a larger study if your study has too low a power to detect realistic levels of excess risk your study is most likely not worth doing i Other designs i Synthetic casecontrol At each failure time consider the failing person as the case and take a random sample of the rest of the cohort at risk to be the controls Risk set at time tconsists of the failing subjects plus the timematched controls Use Cox regression to analyze ii 
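The risk-set sampling in design (i) can be sketched directly. This is a schematic with made-up data structures (tuples of id, follow-up time, failure indicator), not a replacement for a survival-analysis package: at each failure time the failing subject is the case, and a random sample of those still at risk serves as the time-matched controls.

```python
import random

def nested_case_control(cohort, n_controls=2, seed=1):
    """Build risk sets for a synthetic (nested) case-control analysis.
    cohort: list of (subject_id, followup_time, failed) tuples."""
    rng = random.Random(seed)
    risk_sets = []
    for sid, t, failed in sorted(cohort, key=lambda s: s[1]):
        if not failed:
            continue
        # everyone whose follow-up time reaches t is still at risk at t
        at_risk = [s[0] for s in cohort if s[1] >= t and s[0] != sid]
        controls = rng.sample(at_risk, min(n_controls, len(at_risk)))
        risk_sets.append({"time": t, "case": sid, "controls": controls})
    return risk_sets
```

Each risk set (case plus sampled controls) would then enter a Cox partial-likelihood or conditional logistic analysis, as the notes describe.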
Case-cohort: At the beginning of the study, randomly pick a sub-cohort of the complete cohort. The risk set at time t consists of those in the sub-cohort still at risk, together with the subject failing at t.

Example: The Women's Health Study was to study 15,000 women, looking for an association between breast cancer and dietary fat. Dietary forms were to be done and blood drawn on a routine basis. The dietary coding and blood analyses would cost millions. It is cheaper if done on a sub-cohort of, say, 20-25% of the full cohort, plus those that develop breast cancer. The case-cohort design is a natural choice for this.

1.5 Interpretation
A discussion of Hill's criteria for assessing whether an association is causal can be found in most introductory epidemiology textbooks:
1. Strength of association
2. Biologic credibility
3. Consistency with other investigations
4. Time sequence
5. Dose-response relationship

More and more, what is expected is not just qualitative evidence but quantification of the degree of risk. Two major aspects of excess risk are the dose-response relationship and risk as a function of time.

Dose-Response
Dose response can be assessed when exposure is quantified as a nominal, categorical, or numerical variable.

Risk over Time
Incidence or mortality rates often are functions of time since exposure (e.g., excess leukemia rates 5 years after radiation) or of duration of exposure (e.g., lung cancer incidence rates rise with the 4th power of smoking duration among continuing smokers).

Also of great importance is the change in risk after exposure stops:
- Further evidence of a causal relationship
- Shows the effect of intervention
So it is a good idea that the design of a cohort study accounts for subjects that are formerly exposed.

CAUTION: Need to know why someone stopped smoking (e.g., very poor health, and their physician told them they had to stop).

Example: Women treated by radiation for cancer of the cervix have four times the risk of lung cancer as expected.

[Figure 2: Observed-to-expected ratios of lung cancer by time since diagnosis of cervical cancer (<1, 1-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30+ years), for women treated with and without radiotherapy.]

Upon first inspection, it may seem that the excess lung cancer cases are due to the radiotherapy. However, when compared to patients treated without radiotherapy, the same trend is observed. An alternative explanation might be that the excess lung cancers are due to the misclassification of metastases from the original cervical cancer.

1.5.1 Problems with Interpretation
1. Healthy worker effect
   - Can make comparison with an external standard population difficult to interpret. Comparisons between different groups within the cohort should be less affected.
   - Special consideration should also be given to changes in employment status due to ill health (e.g., retire, change jobs, move to an area of lighter work). Mortality is often high a year or two after an employment change. One solution is to lag employment status by 2 or 3 years.
   - An analog to the healthy worker effect: those who respond to questionnaires. In the British doctors study, those who failed to respond had greater mortality rates. In a NY breast cancer screening trial, those accepting the invitation had half the risk of mortality of those not accepting.
2. Loss to follow-up
   - Incidence rates can be biased downwards if there are people lost to follow-up and we don't know that.
3. Recall bias and misclassification of exposure rates
   - Cohort studies have the advantage of measuring exposure before disease status is ascertained.
4. Lack of information on confounding factors
5. Multiple comparisons
   - The level of the test is destroyed by the number of comparisons. A priori, you should have a few hypotheses you want to test. The rest of the many, many things you can test are not strict statistical tests but by-products of hypothesis-generating data mining.
6. Identification of forerunners of disease rather than causes
   - An association that looks causal may only
reflect an early state of the disease (e.g., cough is the "cause" of lung cancer; or low serum cholesterol levels in people subsequently developing cancer).
7. Conclusions from negative results
   - Can bias or confounding be ruled out?
   - What levels of risk are included within the confidence intervals?
   - How do the levels of exposure in the study compare with the levels in other exposed populations?
   - Had sufficient time elapsed between the start of exposure and the end of follow-up?
   - Is there any reason to suspect that the cohort is at a lower risk than the general population?
   - Are the results consistent with other studies?

1.6 Proportional Mortality Studies
Absolute mortality rates are unknown (e.g., we don't know the annual mortality rate for pancreatic cancer) but we do know the proportional mortality rate: for example, 0.1% of all deaths were due to pancreatic cancer. We could also have proportional incidence rates, possibly from a cancer registry.

Study Design: Case-control, where cases are persons dying from the disease of interest and controls are selected from persons dying of other causes.

Advantage: A quick, cheap look at the data. May generate some hypotheses in the initial stage of investigation.

Disadvantage: An excess proportion of one cause of death may mean (1) absolute risk increased for that cause, or (2) a decrease in the rate for some other cause. For example, more hypertensive men survive heart disease and can die of prostate cancer at higher rates; hypertension is not protective of prostate cancer. Serious biases are possible.

Cohort Data Analysis (171:242/243)
Section 2: Rates and Rate Standardization
Brian J. Smith, PhD
April 12, 2005

Table of Contents
2.1 Rates
  2.1.1 Crude Rate
  2.1.2 Calculation of Person-Years
  2.1.3 Stratum-Specific Rates
2.2 Rate Standardization
  Notation
  2.2.1 Direct Standardization
    External Standard Population
    Internal Standard Population
    Comparability of Direct Standardized Rates
  2.2.2 Standard Errors for the DSR
    Smelter Workers Example
    Comments
  2.2.3 Indirect Standardization
    Down's Syndrome
Example
    Comments
2.3 Comparative Measures of Incidence and Mortality
  2.3.1 Comparative Mortality Figure
    Comments
  2.3.2 Standard Error of the CMF
  2.3.3 Standardized Mortality Ratio
    Comments
  2.3.4 Standard Error of the SMR
  2.3.5 Hypothesis Testing for the SMR
    Conventional Test
    Exact Method
    Byar's Method
    Variance Stabilizing Transformation
  2.3.6 Confidence Intervals for the SMR
    Exact Method
    Byar's Method
    Comments
  2.3.7 Comparison of CMF and SMR
    Example
    Unbiasedness of CMF
    Biasedness of SMR
    Comments

2.1 Rates

2.1.1 Crude Rate
We need to estimate the disease rate among cohort members during the study period, e.g.,

    disease incidence rate = (incident cases) / (person-years at risk).

Suppose there are N subjects in the cohort and the i-th subject is at risk for n_i years. Then the number of person-years at risk for the entire cohort is

    n = sum_{i=1}^{N} n_i.

If d individuals are diagnosed with the disease during the study period, then the overall, or crude, incidence rate is

    lambda-hat = d / n    (cases per person-year).

This crude rate ignores any stratification existing within the cohort. It is often of interest to calculate the stratum-specific rates. The cohort may be stratified by age intervals and calendar-year periods. First we need to be able to calculate the number of person-years at risk in each stratum.

2.1.2 Calculation of Person-Years
Suppose that subjects are stratified by 5-year age intervals and 5-year calendar periods. Consider a subject who entered the study in 1972.2 at age 24.6 and exited the study in 1984.6 at age 37.0.

[Figure: Lexis diagram -- the subject's diagonal life-line through the age x calendar-year grid, 1970-1985.]

The following table demonstrates the calculation of the subject's contribution to the person-years spent in each stratum, where the strata are derived from two factors. (Of course, there could be more than two factors.)

Exact           Approximate      Stratum                  Person-Years
Year    Age     Year    Age      Period    Age group      Exact   Approx
1972.2  24.6    1972    24
1972.6  25.0    1972    25       1970-75   20-25          0.4     0.5
1975.0  27.4    1975    27       1970-75   25-30          2.4     2.0
1977.6  30.0    1977    30       1975-80   25-30          2.6     3.0
1980.0  32.4    1980    32       1975-80   30-35          2.4     2.0
1982.6  35.0    1982    35       1980-85   30-35          2.6     3.0
1984.6  37.0    1984    37       1980-85   35-40          2.0     2.5
Totals                                                    12.4    13.0

When using integer dates and ages, assign 1/2 year to the first and last years of age and 1 year to every age in between. Someone entering and exiting in the same year gets 1/2 year. The exact and approximate methods for computing person-years usually produce similar results.

2.1.3 Stratum-Specific Rates
Suppose there are j = 1, ..., J strata, and let d_j and n_j denote the stratum-specific number of incident cases and person-years, respectively. We calculate the number of person-years within each stratum as

    n_j = sum_{i=1}^{N} n_ij,

where N is the total number of subjects in the cohort and n_ij is the amount of time the i-th person spent in stratum j. Then the stratum-specific incidence rate is calculated as

    lambda-hat_j = d_j / n_j.

If d_j represents the number of deaths, then this is interpreted as a mortality rate. lambda-hat_j is an estimate of the true, unknown rate lambda_j. Note that the crude rate is

    lambda-hat = d / n = sum_j (n_j / n) lambda-hat_j,

a person-year-weighted average of the stratum-specific rates.
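The exact person-years bookkeeping above can be automated: walk the subject's life-line, splitting at every age-group and calendar-period boundary. A sketch (5-year strata as in the example; the arithmetic is done in integer tenths of a year so the boundaries stay exact):

```python
def person_years(entry_year, entry_age, exit_year, width=5.0):
    """Exact person-years by (calendar period, age group) stratum for one
    subject. Returns {(period_start, age_group_start): years}."""
    # work in integer tenths of a year to avoid floating-point drift
    t = round(entry_year * 10)
    age = round(entry_age * 10)
    end = round(exit_year * 10)
    w = round(width * 10)
    out = {}
    while t < end:
        # step to the next period boundary, age boundary, or exit,
        # whichever comes first
        step = min((t // w + 1) * w - t, (age // w + 1) * w - age, end - t)
        key = (t // w * w // 10, age // w * w // 10)
        out[key] = out.get(key, 0) + step
        t += step
        age += step
    return {k: v / 10 for k, v in out.items()}
```

For the worked example (entry in 1972.2 at age 24.6, exit in 1984.6), this reproduces the exact column: 0.4 years in (1970-75, 20-25), 2.4 in (1970-75, 25-30), and so on, totalling 12.4.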
strata are age categories, e.g., 0-4, 5-9, ..., 75-79, 80-84, 85+.

Notation

We will use the following notation in our discussion of standardized rates:

  Notation          Description
  λ̂                 Crude rate in the cohort
  λ̂_1, ..., λ̂_J     Stratum-specific rates
  d_1, ..., d_J     Number of cases in each stratum
  n_1, ..., n_J     Number of person-years in each stratum
  p_1, ..., p_J     Proportion of subjects in each stratum

A superscript s will be used to denote quantities that are based on a standard population.

2.2.1 Direct Standardization

Direct standardization is a method of combining the stratum-specific rates for the age groups so that the age distribution matches some standard population. Let p_j^s denote the proportion of people in the standard population that are in stratum j. Then the direct standardized rate (DSR) is

    DSR = Σ_{j=1}^{J} p_j^s λ̂_j

External Standard Population

One can use an external population as a standard population. For example, census counts are often used.

Table 2. Census Bureau 1950 US population (per 1,000,000).

  Age      Population     Age      Population
  0-4        107,258      45-49      60,190
  5-9         85,591      50-54      54,893
  10-14       73,785      55-59      48,011
  15-19       70,450      60-64      40,210
  20-24       76,191      65-69      33,199
  25-29       81,237      70-74      22,641
  30-34       76,425      75-79      14,725
  35-39       74,629      80-84       7,025
  40-44       67,712      85+         3,828

Or one could use some other census year, another country, a specific state, a gender, shortened age ranges, etc.

Internal Standard Population

If the cohort is large enough, one may calculate direct standardized rates for subcohorts using the entire cohort as the standard population. For example: stratification into two subcohorts, exposed and unexposed, where the standard population is the complete cohort.

Comparability of Direct Standardized Rates

Suppose we compute the direct standardized rates for two cohorts using the same standard population:

    DSR_1 = Σ_{j=1}^{J} p_j^s λ̂_{1j}
    DSR_2 = Σ_{j=1}^{J} p_j^s λ̂_{2j}

Scenario 1: If λ̂_{1j} = a·c_j and λ̂_{2j} = b·c_j for some constants a and b, then

    DSR_1 / DSR_2 = (a Σ_j p_j^s c_j) / (b Σ_j p_j^s c_j) = a / b

regardless of the standard population that is used.

Scenario 2: If λ̂_{1j} = a·c_j and λ̂_{2j} = b·d_j, then, depending on the choice of standard
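The definition DSR = Σ_j p_j^s λ̂_j is just a weighted average of the stratum-specific rates, which a few lines of code make concrete. This sketch uses the three-stratum rates and the two standard populations (s1 uniform, s2 = 1/2, 1/3, 1/6) from the worked example in this section:

```python
# Direct standardized rate: DSR = sum_j p_j^s * lambda_j,
# a weighted average of stratum-specific rates using
# standard-population proportions as weights.
def dsr(weights, rates):
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(p * lam for p, lam in zip(weights, rates))

rates1 = [0.10, 0.20, 0.40]   # cohort 1 stratum-specific rates
rates2 = [0.20, 0.25, 0.20]   # cohort 2 stratum-specific rates
s1 = [1/3, 1/3, 1/3]          # standard population s1 (uniform)
s2 = [1/2, 1/3, 1/6]          # standard population s2

print(dsr(s1, rates1), dsr(s1, rates2))  # approx. 0.233 vs 0.217
print(dsr(s2, rates1), dsr(s2, rates2))  # approx. 0.183 vs 0.217
```

Because rates1 and rates2 are not proportional across strata (Scenario 2), switching from s1 to s2 changes which cohort looks worse, exactly as the example illustrates.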
population, DSR_1 / DSR_2 may be equal to, less than, or greater than a/b.

Example: Consider the following data:

  stratum j   λ̂_{1j}   λ̂_{2j}   p_j^{s1}   p_j^{s2}
  1           0.10      0.20      1/3        1/2
  2           0.20      0.25      1/3        1/3
  3           0.40      0.20      1/3        1/6

The direct standardized rates will differ depending on whether standard population s1 or s2 is used. Using s1 gives

    DSR_1 = 0.10/3 + 0.20/3 + 0.40/3 = 0.233
    DSR_2 = 0.20/3 + 0.25/3 + 0.20/3 = 0.217

whereas s2 gives

    DSR_1 = 0.10/2 + 0.20/3 + 0.40/6 = 0.183
    DSR_2 = 0.20/2 + 0.25/3 + 0.20/6 = 0.217

Since the stratum-specific rates are not proportional across the two cohorts, the relative magnitude of the two DSRs depends on the choice of a standard population.

2.2.2 Standard Errors for the DSR

Standard errors are typically computed under the assumption that the number of incident cases follows a Poisson distribution,

    d_j ~ Poisson(n_j λ_j),

where the expected value and variance are

    E(d_j) = Var(d_j) = n_j λ_j.

Under this assumption, the estimated variance of the direct standardized rate is

    Var(DSR) = Var( Σ_j p_j^s λ̂_j )
             = Var( Σ_j p_j^s d_j / n_j )
             = Σ_j (p_j^s / n_j)^2 Var(d_j)
             ≈ Σ_j (p_j^s / n_j)^2 d_j

so that

    SE(DSR) = sqrt( Σ_j (p_j^s / n_j)^2 d_j ).

The distribution of the DSR is somewhat skewed. For the purposes of computing confidence intervals, it is better to use the log scale; by the delta method,

    SE(ln DSR) = SE(DSR) / DSR.

Consequently, a Wald 95% confidence interval could be constructed as

    exp( ln(DSR) ± 1.96 · SE(ln DSR) ).
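The variance formula and the log-scale Wald interval above fit together in one short function. This is a sketch under the section's Poisson assumption; the three-stratum cases, person-years, and weights at the bottom are hypothetical illustration values, not taken from the tables above.

```python
import math

def dsr_with_ci(d, n, w, z=1.96):
    """Direct standardized rate with a log-scale Wald confidence interval.

    d: cases per stratum (d_j); n: person-years per stratum (n_j);
    w: standard-population proportions p_j^s (must sum to 1).
    """
    rates = [dj / nj for dj, nj in zip(d, n)]          # lambda_j = d_j / n_j
    est = sum(wj * rj for wj, rj in zip(w, rates))     # DSR
    # Var(DSR) = sum_j (p_j^s / n_j)^2 * d_j  (Poisson: Var(d_j) ~ d_j)
    var = sum((wj / nj) ** 2 * dj for wj, dj, nj in zip(w, d, n))
    se = math.sqrt(var)
    se_log = se / est                                  # delta method
    lo = math.exp(math.log(est) - z * se_log)
    hi = math.exp(math.log(est) + z * se_log)
    return est, se, (lo, hi)

# Hypothetical three-stratum cohort
dsr, se, ci = dsr_with_ci(d=[5, 12, 20], n=[1000.0, 800.0, 500.0],
                          w=[0.5, 0.3, 0.2])
print(dsr, se, ci)
```

Exponentiating a symmetric interval built on the log scale keeps the lower confidence limit strictly positive, which a plain interval DSR ± 1.96·SE(DSR) does not guarantee for small counts.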

