Special Topics ECE 8833
These class notes for ECE 8833 at Georgia Institute of Technology (Main Campus), taught by Christopher Rozell in Fall, were uploaded by Cassidy Effertz on Monday, November 2, 2015.
ECE 8833/BME 8813 Special Topics: Information Processing Models in Neural Systems
Lecture 21 (03/31/2009)

Dynamic range as a resource. Our sensory systems deal with an enormous dynamic range.
Vision: moonlight is about 0.1 cd/m^2; sunlight is about 100,000 cd/m^2. Single photons can elicit a measurable rod response, yet we can still see in broad daylight.
Audition: a whisper is about 20 dB; the pain threshold is about 140 dB. Deflecting the hair cells of auditory receptors by the width of an atom can elicit a measurable response, but we can still hear in an airplane.

Gain control. Why is the large dynamic range an issue? Constraints: we need to maintain sensitivity (e.g., to contrast), but neurons have limited response ranges, so we cannot make their firing rates arbitrarily high. Gain control is a resource-optimization problem requiring nonlinear processing. What effect does gain control (normalization) have on our information processing?

Approach 1: Efficient coding. Schwartz & Simoncelli (2001) relate gain control to efficient coding. Recall the model from Olshausen & Field (1996):
I(x) = \sum_i a_i \phi_i(x), \quad p(a) \propto e^{-\sum_i C(a_i)}, \quad p(I|a) \propto e^{-\|I - \sum_i a_i \phi_i\|^2}
with the coefficients found through a MAP estimate. An alternate formulation uses ICA, which tries to find a basis such that projecting the signal onto it yields independent coefficients.

Independence. How do pairs of filters behave in this model? They seem to be active at the same times; are they really independent? Look at the data: conditional histograms for pairs of filter outputs show that the mean of L2 stays zero as |L1| grows, but the variance of L2 changes with L1.

What does it mean? Consider two RVs with zero mean.
Correlation: \rho = E[XY] - E[X]E[Y] = \int\int xy\, p(x,y)\, dx\, dy = \int y \left( \int x\, p(x|y)\, dx \right) p(y)\, dy
Independence: p(x|y) = p(x).
The coefficients are uncorrelated but not independent; a linear model is unable to produce independent coefficients in a nonlinear environment.

Where does dependence emerge? The same filter pair shows dependence for natural signals but not for white noise, and different filter pairs applied to the same signals show different amounts of dependence. Dependence requires natural signals, and dependence depends on the filter pairs.
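The "uncorrelated but not independent" point can be checked numerically. A quick sketch (my own, not from the lecture): let Y be X with an independent random sign flip, so the correlation vanishes but the magnitudes are perfectly coupled, much like the variance coupling seen in the filter-pair histograms:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Y = X * S with S = +/-1 independent of X: E[XY] = E[X^2] E[S] = 0,
# so X and Y are uncorrelated -- but |Y| = |X|, so they are not independent.
x = rng.standard_normal(n)
y = x * rng.choice([-1.0, 1.0], size=n)

corr = np.corrcoef(x, y)[0, 1]            # near zero: uncorrelated
mag_corr = np.corrcoef(x**2, y**2)[0, 1]  # equals one here: magnitudes coupled
```

Knowing |X| tells you |Y| exactly, so p(y|x) is clearly not p(y) even though the correlation is zero.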
Dependence occurs with orthogonal (non-overlapping) filters and with ICA filters.

What could be done? The variance of one filter's output changes depending on neighboring activity, and the variance of an RV can be rescaled by multiplication. So: normalize each response by dividing by the local neighbors' activity.

Model. Take filter outputs L1 and L2, with the variance of L1 modulated by its neighbor's output:
var(L1 | L2) = w L2^2 + \sigma^2
The response is normalized to unit variance by dividing:
R_1 = L_1 / \sqrt{w L_2^2 + \sigma^2}
Generalizing to multiple neighboring cells (using squared responses):
R_i = L_i^2 / (\sigma^2 + \sum_j w_{ij} L_j^2)
The weights and constant are learned over a natural-stimulus dataset to maximize independence.

Results. The normalized outputs are much more independent.

Comparison with physiology. Divisive normalization implements the automatic gain control necessary for large dynamic ranges. Does divisive normalization also account for some of the other nonlinear phenomena seen in neurons? Simulate physiology experiments on a single filter, using a set of population filters similar to measured RFs.

Nonlinear V1 response properties. Classical nonlinearities previously modeled as normalization:
(a) Orientation tuning changes with the contrast of a grating (Skottun et al., 1987).
(b) Orientation tuning for the optimal orientation masked with another grating (Bonds, 1989).
(c) The optimal-orientation response is suppressed by a mask grating at different contrasts (Bonds, 1989): masking suppression.
Other nonlinearities not previously explained by normalization: a stimulus in the RF is suppressed by a stimulus in the surround, with less suppression when the mask is orthogonal; the shape of the tuning curve remains, due to the presence of edges in the stimuli inducing collinear correlation. There is an analogous result in audition.

Nonlinear tuning-curve shapes. Some tuning curves change shape with stimulus level.
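The normalization model above can be sketched in a few lines; the weight matrix and constant here are illustrative placeholders, not learned from natural stimuli:

```python
import numpy as np

def divisive_normalization(L, W, sigma2):
    """R_i = L_i^2 / (sigma^2 + sum_j W_ij L_j^2): each squared filter
    output is divided by a weighted sum of its neighbors' squared outputs."""
    L = np.asarray(L, dtype=float)
    return L**2 / (sigma2 + W @ L**2)

# Gain control: doubling every filter output (e.g., doubling contrast)
# increases R far less than the 4x growth of the raw squared outputs.
L = np.array([1.0, 2.0])
W = np.array([[0.0, 0.5],
              [0.5, 0.0]])
R_low = divisive_normalization(L, W, sigma2=0.1)
R_high = divisive_normalization(2 * L, W, sigma2=0.1)
```

With zero weights the model reduces to a plain squared response scaled by 1/sigma^2, which shows where the saturation comes from: only the cross terms in the denominator produce gain control.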
The response to a grating patch changes with contrast (Cavanaugh et al., 2000). The response to a tone changes with loudness (Rose et al.). As the signal level varies, there is a tradeoff between the role of the constant \sigma^2 and the overall input.

ECE 8833/BME 8813 Special Topics: Information Processing Models in Neural Systems
Lecture 22 (04/02/2009)

Redundancy. One key to improving dynamic-range usage is to have a good predictor of the signal. Since natural signals have substantial redundancy, we should be able to make effective predictions. Srinivasan et al. (1982) view the center-surround inhibitory retinal structure as predictive coding.

Center-surround. Lateral inhibition (center-surround) has been postulated as efficient coding/whitening, the result of a resource optimization. Srinivasan instead conjectures predictive coding, to make the most of the limited resolution of individual cells.

Basic model. Basically the same linear model we have seen several times before. [figure]

Scene statistics. Recall that predictive coding requires the autocorrelation function. Even truly random scenes would induce correlated retinal images, due to the optics, etc.

Computational results. Assume values for the mean and standard deviation of the pixel values and for the noise standard deviation, and assume the autocorrelation has exponential decay. Calculate the weighting coefficients onto the interneuron; the result essentially compares the value at the center with a prediction formed from the surround. [figure: predicted weights at SNRs from about 10 down to about 1]

Shape of RFs vs. autocorrelation. The RF shape is insensitive to the autocorrelation decay rate and is mostly a function of SNR. This is desirable because the autocorrelation fluctuates substantially from image to image.
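A sketch of that weight calculation, under assumptions of my own choosing (a 1-D array of receptors, exponential autocorrelation C(d) = s^2 e^{-decay |d|}, and independent sensor noise; the function name and arguments are illustrative):

```python
import numpy as np

def surround_weights(offsets, decay, signal_var, noise_var):
    """Optimal linear weights for predicting the center value from noisy
    surround samples: solve the normal equations (R_ss + noise I) w = r_sc."""
    offsets = np.asarray(offsets, dtype=float)
    d = np.abs(offsets[:, None] - offsets[None, :])
    R = signal_var * np.exp(-decay * d) + noise_var * np.eye(len(offsets))
    r = signal_var * np.exp(-decay * np.abs(offsets))  # correlation with center
    return np.linalg.solve(R, r)

# High SNR: weight concentrates on the nearest neighbors (exponential
# correlation is Markov). Low SNR: smaller weights spread over the surround.
w_hi = surround_weights([-2, -1, 1, 2], 1.0, 1.0, 1e-3)
w_lo = surround_weights([-2, -1, 1, 2], 1.0, 1.0, 10.0)
```

This reproduces the qualitative conclusion above: the surround profile is governed mainly by the SNR, with sharp antagonistic surrounds at high SNR and broad, shallow ones at low SNR.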
This extends in the obvious way to 2D.

Temporal. The theory can also be extended in an analogous way to temporal prediction, resulting in an impulse response describing the neuron's sensitivity over time.

Comparison to experimental data. We do see the same general changes in RF shape with luminance level. In the fly eye we can measure SNRs, calculate the optimal RFs, and compare (e.g., SN around 1.45 vs. 0.58 in the data shown). The same approach works for temporal coding.

Extensions. Each image we see can have local statistics that vary quite a bit from the average (recall an earlier HW assignment). In this case, should the linear predictor adapt when the local statistics differ from the mean? Hosoya et al. (2005) examine experiments showing that some adaptation can happen within seconds, apparently to improve prediction under the current statistics.

Notes for ECE 8833 Lecture 10: Optimal estimation/inference (2/10/2009)

1. Why do we need to estimate/infer things?
What is the role of the visual system? To give us information about our environment: to determine the causes of the sensory information we collect. Why is this hard? The problems are ill-posed (e.g., projecting a 3-D world onto a 2-D plane). There will always be uncertainty, and we have to make the "best guess" given the information we have available. Sometimes we are wrong, and that produces illusions.

Standard model: a known hypothesis H that includes a generative model; this generative model has unknown parameters \theta, and it produces (generates) data d.

Basic example: d = \theta + n, where n ~ N(0, \sigma^2). What is the best estimate \hat{\theta}(d)?

How would a frequentist approach this problem? In many ways, this is the realm of traditional statistics. What do we know? Fixed \theta: what is the probability of d?
The sampling distribution is p(d|\theta, H). We can also view this as a function of \theta with d fixed, called the likelihood: \ell(\theta) = p(d|\theta, H).

Maximum likelihood estimate: \hat{\theta}(d) = arg max_\theta \ell(\theta) = arg max_\theta p(d|\theta, H).

Example: d is a vector of L i.i.d. samples, where d(i) = \theta + n(i) with n(i) ~ N(0, \sigma^2).
\ell(\theta) = \prod_{i=1}^{L} p(d(i)|\theta)
\hat{\theta}(d) = arg max_\theta p(d|\theta, H) = arg max_\theta \ln p(d|\theta, H)
\ln p(d|\theta, H) = -\frac{1}{2\sigma^2} \sum_{i=1}^{L} (d(i) - \theta)^2 + const
\hat{\theta}(d) = \frac{1}{L} \sum_{i=1}^{L} d(i)

This estimator is itself a random variable. How do we evaluate it?
1. Bias: is E[\hat{\theta}(d)] = \theta?
2. Consistency: does (\hat{\theta}(d) - \theta)^2 \to 0 as L \to \infty?
3. Efficiency: does E[(\hat{\theta}(d) - \theta)^2] achieve the best possible value?

Traditional statistics often chooses a property (e.g., unbiasedness) and then tries to develop an estimator meeting that property. (Example: the sample variance is a biased estimator of the variance.) The MLE is by far the most commonly used estimator in this camp: it is asymptotically unbiased, consistent, asymptotically efficient, and asymptotically Gaussian distributed. BUT why do this? The MLE is not optimal under any error criterion, and it boils everything down to a single number. What if you wanted to know how confident you should be in that estimate, e.g., Pr(\theta > 0)?

Bayesians view the world differently. Use a probability distribution p(\theta) to express your degree of belief/uncertainty about its value. Bayesians would generally oppose any notion that this means \theta is truly random, but this is really a semantic point, as the mechanics are the same as if it were a random variable. This also gets back to our root question: if the goal is to determine the likely causes of our sensory data, we are really more concerned with
p(\theta|d, H) = \frac{p(d|\theta, H)\, p(\theta|H)}{p(d|H)} = \frac{p(d|\theta, H)\, p(\theta|H)}{\int p(d|\theta, H)\, p(\theta|H)\, d\theta}

Notation: the likelihood, viewed as a function of \theta, is \ell(\theta) = p(d|\theta, H); the prior is p(\theta|H); the posterior is p(\theta|d, H); the evidence (marginal probability, or partition function) is p(d|H).

NOTE: As in information theory, the quantity of interest (the posterior) depends on the data only through the likelihood. This is known as the likelihood principle.

The main complaint about Bayesian techniques is that a lot of assumptions are wrapped up in the prior. This is true.
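A quick simulation (my own, illustrative) of the Gaussian-mean MLE from the example, showing it is unbiased and that its sampling distribution tightens as L grows:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma = 2.0, 1.0

def mle_mean(d):
    """MLE of theta for d(i) = theta + n(i) with Gaussian noise: the sample mean."""
    return np.mean(d)

# The estimator is itself a random variable: simulate its sampling distribution
# for a small sample (L = 5) and a large one (L = 500).
est_L5 = np.array([mle_mean(theta + sigma * rng.standard_normal(5))
                   for _ in range(20_000)])
est_L500 = np.array([mle_mean(theta + sigma * rng.standard_normal(500))
                     for _ in range(20_000)])
```

Both sampling distributions center on theta (unbiasedness), while the variance shrinks roughly as sigma^2 / L (consistency).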
But a Bayesian would argue that (1) there are a lot of assumptions wrapped up in H too; it is just a matter of what level of the model you are certain of; (2) we often do have information we can try to incorporate into the model; and (3) the relevance of the prior gets less and less as we get more data.

Priors are usually chosen to (1) reflect knowledge, (2) have as little impact as possible, or (3) be computationally tractable.

Computationally, p(d|H) can be VERY difficult to calculate. Many times we do not need it, since it is just a scalar normalization factor; when we do, we may need to resort to stochastic integration (MCMC).

How do we form an estimator from p(\theta|d, H)?

(a) Approach 1: It is natural to ask, what is the most likely value of \theta given d? Maximum a posteriori (MAP) estimate: \hat{\theta}(d) = arg max_\theta p(\theta|d, H). Revisit the earlier example, assuming \theta is uniformly distributed:
\hat{\theta}(d) = arg max_\theta p(d|\theta, H) = arg max_\theta \ln p(d|\theta, H) = \frac{1}{L} \sum_{i=1}^{L} d(i)
In this case the MAP and MLE are the same. This is not the case in general for finite samples; they have the same asymptotic properties, but the MAP can be motivated even in finite-sample cases.
NOTE: when you assume the prior is uniform, the posterior is proportional to the likelihood.
NOTE: in this case the prior is not well defined over an infinite interval; this is termed an improper prior.
NOTE: even though the MAP reduces things to a single number, we have a whole distribution at our disposal and can answer questions about how much confidence we should have in the estimate. Narrow distributions give high confidence, wide ones low confidence. To quantify this, we would need to actually calculate the partition function.

(b) Approach 2: Specify a loss function. Instead of demanding certain properties, what if we specify a loss function and optimize it?
If we want to minimize \int\int (\theta - \hat{\theta})^2 p(\theta, d|H)\, d\theta\, dd: take derivatives and set equal to zero; the optimal estimator is the mean of the posterior density, \hat{\theta} = \int \theta\, p(\theta|d)\, d\theta = E[\theta|d, H].
If we want to minimize \int\int |\theta - \hat{\theta}|\, p(\theta, d|H)\, d\theta\, dd: the optimal estimator is the median of the posterior density, \int_{-\infty}^{\hat{\theta}} p(\theta|d)\, d\theta = 1/2.
If we want to minimize the 0-1 loss \int\int 1_{|\theta - \hat{\theta}| > \delta}\, p(\theta, d|H)\, d\theta\, dd: the optimal estimator is the max of the posterior density, \hat{\theta} = arg max_\theta p(\theta|d, H).

So while the MAP does reduce things to one number, it is the solution to an optimal estimator for a reasonable loss function, and we explicitly have the entire density (non-asymptotically) to use in judging the quality of the estimate.

Original example from Bayes (1763). A coin has unknown probability f of coming up heads. If we toss the coin N times, we get N_H heads and N_T tails in a sequence s (for example, s = HHTHTT). Clearly
p(s|f, N) = f^{N_H} (1 - f)^{N_T}
Assume that f is uniformly distributed over [0, 1]: p(f) = 1 for f \in [0, 1]. Then
p(f|s, N) = \frac{p(s|f, N)\, p(f)}{p(s|N)} = \frac{f^{N_H} (1 - f)^{N_T}}{\int_0^1 f^{N_H} (1 - f)^{N_T}\, df}
where the denominator is a Beta integral.

Consider the received data s = HTH, so that N_H = 2 and N_T = 1. What is the best estimate of f?
MAP: p(f|s, N) \propto f^2 (1 - f). Maximize: d/df [f^2 - f^3] = 2f - 3f^2 = 0, so f = 0 or f = 2/3. The MAP estimate is \hat{f} = 2/3.
MEAN: we need the partition function, \int_0^1 f^2 (1 - f)\, df = \frac{N_H!\, N_T!}{(N + 1)!} = \frac{2!\, 1!}{4!} = \frac{1}{12}, and the posterior mean works out to (N_H + 1)/(N + 2) = 3/5.

ECE 8833/BME 8813 Special Topics: Information Processing Models in Neural Systems
Lecture 14 (02/24/2009)

Context and experience matter. We have looked at how Bayesian integration gives a way to optimally integrate observations with experience (shape from shading); this prior is adaptable with tactile feedback.

The importance of motor systems. Despite our focus on perception, doing things is critical to an organism. Example (courtesy of Dan Wolpert): the sea squirt's development pattern is to find a rock, permanently attach, and then eat its own brain.

The motor system can be thought of more generally as decision making. How do we make decisions? Can Bayesian theory be turned around to describe making optimal decisions under uncertainty? (There is a nice review of this area by Kording, 2007.) Decisions lead to outcomes whose goodness is measured through a utility U(outcome); outcomes are uncertain given decisions, p(outcome|decision). Motor errors introduce uncertainty, and optimal decisions maximize expected utility:
EU(decision) = \sum_{outcome} p(outcome|decision)\, U(outcome)
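Returning to the coin example: the three loss-function estimators (mode, mean, median) can all be read off the posterior p(f|s) \propto f^2(1 - f) numerically. A grid sketch of my own:

```python
import numpy as np

f = np.linspace(0.0, 1.0, 100_001)
df = f[1] - f[0]
post = f**2 * (1 - f)        # unnormalized posterior for s = HTH (NH=2, NT=1)
post /= post.sum() * df      # normalize numerically (the partition function)

mode = f[np.argmax(post)]            # 0-1 loss      -> MAP, f = 2/3
mean = (f * post).sum() * df         # squared loss  -> posterior mean, f = 3/5
cdf = np.cumsum(post) * df
median = f[np.searchsorted(cdf, 0.5)]  # absolute loss -> posterior median
```

The mode and mean match the closed-form answers 2/3 and 3/5; the median sits between them (about 0.614), as it must for this skewed posterior.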
Computing with utility. Seems easy: make decisions to maximize EU. But how do we estimate p(outcome|decision)? It depends on how the world changes between decision and outcome, and on how the world changes as a result of the decision. All of this is probabilistic because of imperfect perception. How do we combine all of this uncertainty to make an estimate in a dynamic environment?

Sensory integration. Where will the ball land? Old knowledge from past experience (prior): the ball falls near the lines. New knowledge from observations (likelihood): the ball appears to have a specific trajectory. The decision follows the posterior: posterior \propto likelihood \times prior.

Dynamic integration. What about changing environments? The posterior from the past becomes the prior at the next step, and new observations form a new likelihood. This essentially results in a Kalman filter: belief at t, plus dynamics, plus a new observation, yields the belief at t+1.

Internal models. Wolpert et al. (1995) propose an internal model to integrate sensory and motor information. Task: arm movement in the dark, with forces assisting or resisting, and estimate the hand location. The bias changes with force, but the variance does not. The proposed model uses a Kalman filter to combine a model of arm dynamics with sensory information.

Inverse decision theory. If we assume the Bayes-optimal framework, can we determine the prior, likelihood, or utility (cost function) from observed optimal decisions? Remember the Gaussian assumption about a prior on perceived visual speed.
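The dynamic-integration loop above (the posterior at one step becomes the prior at the next) is just a scalar Kalman filter under Gaussian assumptions; a minimal sketch:

```python
def kalman_step(mean, var, obs, obs_var, drift_var=0.0):
    """One scalar Kalman update: diffuse the old posterior by the world's
    random-walk drift to get today's prior, then fuse it with a new
    Gaussian observation of variance obs_var."""
    prior_var = var + drift_var
    k = prior_var / (prior_var + obs_var)   # Kalman gain: trust in the new data
    return mean + k * (obs - mean), (1 - k) * prior_var

# With no drift, repeated observations steadily sharpen the belief.
m, v = 0.0, 1.0
for obs in [1.0, 1.0]:
    m, v = kalman_step(m, v, obs, obs_var=1.0)
```

After two unit-variance observations of 1.0, the belief has moved to 2/3 with variance 1/3; with a nonzero drift_var the variance instead settles at a floor, matching the idea that a changing world keeps the estimator uncertain.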
Sensorimotor learning. "Bayesian integration in sensorimotor learning," Kording and Wolpert (2004). Subjects point at a target while the visual feedback of the finger is laterally shifted; the shift is drawn from a Gaussian prior (mean 1 cm), and the finger path itself is not visible. Feedback conditions range from clear, to blurred (increased visual uncertainty), to no feedback; a typical true shift is, e.g., 2 cm.

Models:
Model 1: use the feedback exactly, with no estimate of the prior or of the visual uncertainty; increasing visual uncertainty only changes the variance.
Model 2: use Bayesian integration (minimize MSE); more visual uncertainty means relying more on the prior.
Model 3: learn a mapping from the endpoint when visible (the endpoint is only visible in the unblurred condition); similar to Model 2, but with no accounting for visual uncertainty.

Estimating the shift. The posterior is a product of Gaussians, due to the prior (mean 1 cm, variance \sigma_p^2) and the likelihood around the visually sensed shift x_s (variance \sigma_s^2). The MAP estimate is
\hat{x}_{MAP} = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_s^2}\, x_s + \frac{\sigma_s^2}{\sigma_p^2 + \sigma_s^2} \cdot (1\ \text{cm})
Results: the subjects' deviations as a function of true shift and blur match the Bayesian-integration model (Model 2).

Complex priors. Can subjects learn a more complex prior, such as a bimodal mixture of Gaussians?

ECE 8833/BME 8813 Special Topics: Information Processing Models in Neural Systems
Lecture 15 (03/03/2009)

Bayesian inference. We have seen several examples of models based on Bayesian inference: sparse coding as MAP estimation in V1, cue integration, and sensorimotor learning. These computations involve representing whole probability distributions. What mechanisms could be used for that?

Probabilistic population codes. Ma et al. (2006) introduce the idea of probabilistic population codes to represent distributions. Setup: assume each neuron has a tuning curve, and its output sets the intensity of a Poisson random variable.
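A sketch of that setup with an assumed Gaussian tuning-curve shape (the curve, width, and gains here are my own illustrative choices): the gain scales the expected Poisson counts, and summing the expected rates of two populations gives the rates of a single population whose gain is the sum of the gains, since sums of Poisson variables are Poisson with summed rates.

```python
import numpy as np

rng = np.random.default_rng(2)

def tuning(pref, s, gain, width=20.0):
    """Expected spike count of a neuron with preferred stimulus `pref`:
    a Gaussian tuning curve whose height is set by the gain."""
    return gain * np.exp(-0.5 * ((s - pref) / width) ** 2)

prefs = np.linspace(0.0, 180.0, 61)   # preferred stimuli across the population
s_true = 90.0
counts = rng.poisson(tuning(prefs, s_true, gain=10.0))  # one population response

# Gains add: expected rates with gains 4 and 6 sum to a gain-10 population.
rates_sum = tuning(prefs, s_true, 4.0) + tuning(prefs, s_true, 6.0)
```

The peak of the noisy count profile carries the stimulus estimate, while the overall gain carries the certainty.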
Representing distributions. Denote the response to stimulus s as r = (r_1, ..., r_N). We want to represent the posterior p(s|r); the data are naturally encoded in the likelihood p(r|s), with p(s|r) \propto p(r|s)\, p(s). Assume the responses are independent and Poisson:
p(r|s) \propto \prod_i \frac{e^{-f_i(s)}\, f_i(s)^{r_i}}{r_i!}
This converges to a Gaussian as the number of neurons increases, if the prior is flat. Moments: the mean is near the peak activity, and the variance is encoded by the gain.

Cue integration. Beyond representing probabilities, we need ways to compute with them. Consider two cues c_1, c_2 representing stimulus s:
p(s|c_1, c_2) \propto p(c_1|s)\, p(c_2|s)\, p(s)
With Gaussian likelihoods, the Gaussian posterior has the standard product-of-Gaussians moments (precision-weighted mean; precisions add).

Adding Poissons. Assume the cues are encoded in responses r_1, r_2 representing the likelihoods, and consider what happens when we simply add the responses. Adding Poisson random variables produces another Poisson with a rate equal to the sum of the input rates, so the new gain is the sum of the old gains: g = g_1 + g_2. Since the gain encodes the certainty, simply summing spikes implements the cue-combination rule.

ECE 8833/BME 8813 Special Topics: Information Processing Models in Neural Systems
Lecture 9 (02/05/2009)

Uses of info theory. We have seen applications to modeling: visual spatiotemporal processing, visual color processing, and the properties and distribution of synapses. Today, a quick look in both directions: mechanisms for synaptic plasticity, and explaining perceptual laws from vision. Note how many levels of abstraction are being addressed with top-down models.
Synapses. Varshney et al. addressed general synaptic properties and distribution. How and why do synapses change?

Synaptic plasticity. Synaptic weights change in either direction: strengthening (long-term potentiation, LTP) and weakening (long-term depression, LTD). Mechanisms: change the sensitivity and number of receptors, or change the biochemistry of the neurotransmitters.

Learning. Why do synaptic weights change, and how do they know? The classical idea is Hebbian learning (1949): when cell A repeatedly excites cell B, the connection strength will be increased ("cells that fire together wire together"). Nature, unfortunately, is more subtle.

Timing matters. More recently it was discovered that the timing matters: spike-timing-dependent plasticity (STDP). If the presynaptic spike comes first, the connection strengthens; if the postsynaptic spike comes first, the connection weakens. The connection is strengthened due to causality and weakened due to simple correlation. This can be viewed as a type of competition between presynaptic cells.

STDP. The weight change depends on the relative spike timing \Delta t, with roughly exponential windows for potentiation (\Delta t > 0) and depression (\Delta t < 0) (Bi and Poo, 1998; Sprekeler et al., 2007).

Information theory. Chechik (2003) attempts to explain STDP through information-theoretic principles. Information theory generally talks about communication fidelity, but neural systems process information; Chechik proposes to maximize the extraction of behaviorally relevant information.

Setup. The neuron gets weighted presynaptic inputs. Presynaptic spikes: S_i(t) = \sum_s \delta(t - t_s). A synaptic transfer function F(t) gives the filtered inputs X_i(t) = \int F(\tau)\, S_i(t - \tau)\, d\tau, and the total EPSP is Y(t) = \sum_i W_i X_i(t).
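A sketch of this setup, assuming an exponential synaptic kernel F(t) = e^{-t/\tau} for t >= 0 (the kernel shape, \tau, and weights below are my choices, not from the paper):

```python
import numpy as np

def epsp_trace(spike_times, t, tau=10.0):
    """Filtered presynaptic drive X_i(t) = sum over spikes of F(t - t_s),
    with F a causal exponential kernel: spikes only affect later times."""
    t = np.asarray(t, dtype=float)
    x = np.zeros_like(t)
    for ts in spike_times:
        dt = t - ts
        x += np.where(dt >= 0.0, np.exp(-dt / tau), 0.0)
    return x

# Total EPSP Y(t) = sum_i W_i X_i(t) for two presynaptic cells.
t = np.linspace(0.0, 50.0, 501)
X = np.stack([epsp_trace([0.0, 20.0], t), epsp_trace([5.0], t)])
W = np.array([0.7, 0.3])
Y = W @ X
```

Each presynaptic spike contributes a decaying trace, and the postsynaptic variable Y is just the weighted superposition, which is what the learning rule then shapes.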
Classification task setup. Choose a stimulus pattern indexed by k from the discrete set {0, ..., M} with probability p_k; the stimulus determines the presynaptic spikes S_i(t), and the postsynaptic cell wants to determine which stimulus was presented. A background noise pattern (k = 0) is abundant, so that p_0 > \sum_{k \neq 0} p_k.

Goal: maximize the mutual information between k and Y.

Calculations. Invoke the CLT to assume Y_k is Gaussian, and use Taylor expansions to calculate the gradient \partial I / \partial W_i. The resulting weight update \Delta W_i combines covariance terms of the form \sum_k p_k Cov(Y, X_i^2 | k) and mean terms E[X_i^2 | k] with constant factors: a Hebbian and anti-Hebbian rule.

Biologically realistic? As stated, this is impractical for biology, because learning happens in batch (you need all the data). Fixes: implement online learning (online calculation of the expectations and covariances), and restrict the weights to be positive to eliminate trivially equivalent solutions. But how do you know when a pattern k \neq 0 has been presented rather than the background? This requires supervision to tell the system when to learn. Solution: use uniform depression and trigger learning on the postsynaptic spike.

Learning results. [figure]

Going up. Van Hateren (1993) goes in the other direction, to explain perceptual laws. He sets up a model similar to Atick et al.: assume 1/f^2 spatial-frequency statistics and develop a model of temporal frequency; assume the optics are a lowpass filter; assume system noise and output units with limited dynamic range. He derives optimal spatiotemporal encoding filters that depend on SNR. (Remember: for point processes, SNR is related to signal intensity.)

ECE 8833/BME 8813 Special Topics: Information Processing Models in Neural Systems
Lecture 13 (02/19/2009)

Rhombus motion. How do we combine various cues to determine what we are seeing? We have approximately 12 cues for depth; all are ambiguous to some degree, and some may conflict. High reliability should lead to more weight, i.e., a Bayesian approach: p(scene properties | cues).

Motion illusions. Weiss, Simoncelli, and Adelson (2002) use a simple model of visual motion with an optimal observer. Motion analysis begins with local measurements by directionally sensitive cells; the local measurements must then be combined to eliminate ambiguity and form a global motion percept.
Grating motion. Example: plaids combine multiple gratings. A single grating has an infinite number of possible motions, due to the lack of information in the tangent direction; the perceived motion is in the direction normal to the constraint line.

Plaid motion. Together, the gratings of a plaid give the percept of a single motion. There are multiple ways to estimate the motion of a plaid: IOC (intersection of constraints) and VA (vector average). Perception can be either, depending on the parameters.

Rhombus motion. Remember the rhombus: a narrow rhombus moves consistent with IOC or VA depending on contrast, while a fat rhombus moves consistent with IOC.

Optimal observer. An ad hoc rule about when to use IOC vs. VA would be unsatisfying. Model for the ideal observer:
(a) Global linear motion for the image (no rotations, shears, etc.).
(b) Image intensity is preserved, but local measurements are noisy:
I(x, y, t) = I(x + v_x \Delta t,\ y + v_y \Delta t,\ t + \Delta t) + n, with n ~ N(0, \sigma^2)
(c) Motion is slow: p(v) is Gaussian, centered at zero.

Observer likelihood. For local observations at an edge, the likelihood is a Gaussian ridge around the constraint line; for corner observations, the likelihood is a Gaussian around a point at the true velocity.

Observer posterior. The posterior distribution is
p(v|I) \propto p(v) \prod_i p(I_i|v)
and we can calculate the MAP estimate, assuming smooth images. Lower contrast gives more weight to the prior.

Changes with angle. The model's predictions across rhombus angles match the perceptual observations of motion direction.

Other phenomena. Low-contrast gratings are perceived as slower: low SNR biases the estimate toward the prior, which is centered at v = 0. This may explain our tendency to drive faster in fog. Plaid motion is biased toward the higher-contrast grating. Several other percepts are also demonstrated.
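A sketch of the MAP computation (my own implementation of the Gaussian constraint-line model): under these assumptions the posterior is Gaussian, so the MAP velocity is a regularized least-squares solve.

```python
import numpy as np

def map_velocity(grads, dts, sigma_obs, sigma_prior):
    """Each local measurement i contributes a noisy constraint
    grad_i . v + dt_i ~ N(0, sigma_obs^2); the slow-motion prior is
    v ~ N(0, sigma_prior^2 I). The MAP velocity solves
    (G^T G / so^2 + I / sp^2) v = -G^T dt / so^2."""
    G = np.asarray(grads, dtype=float)    # (k, 2) spatial gradients
    dts = np.asarray(dts, dtype=float)    # (k,) temporal derivatives
    A = G.T @ G / sigma_obs**2 + np.eye(2) / sigma_prior**2
    return np.linalg.solve(A, -G.T @ dts / sigma_obs**2)

# High contrast (reliable measurements): estimate near the true velocity (1, 0).
v_hi = map_velocity([[1.0, 0.0], [0.0, 1.0]], [-1.0, 0.0], 0.01, 100.0)
# Low contrast (noisy measurements): the slow-motion prior pulls v toward zero.
v_lo = map_velocity([[1.0, 0.0], [0.0, 1.0]], [-1.0, 0.0], 1.0, 0.1)
```

The contrast effect falls out directly: shrinking the measurement reliability relative to the prior shrinks the estimated speed, the "slower in fog" bias.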
Notes for ECE 8833 Lecture 16: Optimal detection/decision theory (3/5/2009)

1. Why detection? In the context of optimal inference, we have mostly been talking about estimating continuous parameters (e.g., the rotation angle of a visual stimulus) within a fixed model. But many decisions we make are actually more like deciding between competing hypotheses (e.g., did I just hear a noise?).

Standard model: two known hypotheses, H_0 and H_1, each of which includes a generative model with known parameters that produces the received data r.

Basic example: H_0: r = n; H_1: r = \mu + n, where n ~ N(0, \sigma^2) and the parameters \mu and \sigma^2 are both known. Given the received data r, which is the more likely model/hypothesis?

At the end of the day, we have to divide the input space into two regions R_0 and R_1, so that we decide H_0 if r \in R_0 and decide H_1 if r \in R_1.

What can happen? Four possible outcomes:
(a) Detection: P_D = Pr(decide H_1 | H_1 was true)
(b) Miss: P_M = Pr(decide H_0 | H_1 was true) = 1 - P_D
(c) False alarm: P_F = Pr(decide H_1 | H_0 was true)
(d) (Unnamed): Pr(decide H_0 | H_0 was true) = 1 - P_F
The total probability of error is P_e = p(H_0) P_F + p(H_1) P_M.

How do we calculate these probabilities?
P_F = \int_{R_1} p(r|H_0)\, dr, \qquad P_D = \int_{R_1} p(r|H_1)\, dr
Note that as R_1 shrinks, both probabilities tend toward zero: the more conservative you are about limiting false alarms, the more often you will miss. For a given detector, we often characterize performance by drawing the receiver operating characteristic (ROC) curve, which plots P_F vs. P_D.

What is the optimal way to choose R_0 and R_1, thereby defining the detector?

2. Optimal detection rules.
(a) MAP estimate of the hypothesis. Can we just extend the Bayesian inference setting we have been using? Posterior under each hypothesis:
p(H_0|r) \propto p(r|H_0)\, p(H_0), \qquad p(H_1|r) \propto p(r|H_1)\, p(H_1)
Maximize the posterior by comparing
p(r|H_1)\, p(H_1) \ \gtrless_{H_0}^{H_1}\ p(r|H_0)\, p(H_0)
The left-hand side of the rearranged comparison
\Lambda(r) = \frac{p(r|H_1)}{p(r|H_0)} \ \gtrless_{H_0}^{H_1}\ \frac{p(H_0)}{p(H_1)} = \eta
is called the likelihood ratio, the comparison is the likelihood ratio test (LRT), and the value \eta is called the threshold. As is typical when dealing with probabilities, it is often easier to work in the log domain; alternately, we can write
\log \Lambda(r) \ \gtrless_{H_0}^{H_1}\ \log \eta
Demonstrate graphically with Gaussian pdfs: the threshold location depends on the ratio of the priors. Note that as R_1 increases we can get P_D to increase, but only at the expense of increasing P_F. (Label all of the regions on the graph.) This is nice and easy, but is this detector optimal in any sense? There is no reason to think so a priori; we have not optimized any criterion here.

(b) Minimum probability of error (or maximum probability of correct):
P_e = \int_{R_1} p(H_0)\, p(r|H_0)\, dr + \int_{R_0} p(H_1)\, p(r|H_1)\, dr
How do we choose? For every point r, assign it to R_0 or R_1 depending on which integrand is smaller. This comparison devolves into the LRT:
\frac{p(r|H_1)}{p(r|H_0)} \ \gtrless_{H_0}^{H_1}\ \frac{p(H_0)}{p(H_1)}

(c) Minimum expected cost. Not all outcomes have the same cost. For example, in diagnostics it may be much more costly to miss an event than to have a false alarm and follow up with the patient. Denote the cost of saying hypothesis i when hypothesis j was true as C_{ij}. The expected cost is
C = C_{11}\, p(H_1)\, P_D + C_{01}\, p(H_1)(1 - P_D) + C_{10}\, p(H_0)\, P_F + C_{00}\, p(H_0)(1 - P_F)
A similar procedure as before (for each data point r, put it in the region where the integrand is smallest) again results in the LRT:
\frac{p(r|H_1)}{p(r|H_0)} \ \gtrless_{H_0}^{H_1}\ \frac{p(H_0)(C_{10} - C_{00})}{p(H_1)(C_{01} - C_{11})}
When the costs are symmetric, this devolves into the same threshold as the minimum-P_e detector.

(d) Neyman-Pearson criterion. A fundamental problem: all of the above rules depend on knowing the prior probabilities of each model, and unlike in inference, these probabilities do not "wash out" as we get more data. What can we do if we do not have them? We know that as we become more aggressive with our decision regions, we increase both the probability of catching something and the probability of signaling a false alarm. The Neyman-Pearson idea is to determine the decision rule with the best probability of detection while constraining the false-alarm probability:
\max P_D \quad \text{s.t.} \quad P_F \le \alpha
Using Lagrange optimization we can derive a decision rule, showing that R_1 corresponds to the region where p(r|H_1) > \lambda\, p(r|H_0).
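For the Gaussian example, the LRT is monotone in r and so reduces to comparing r against a scalar threshold; (P_F, P_D) then follow from Gaussian tail probabilities. A sketch of my own that traces out the ROC:

```python
from math import erf, sqrt

def q(x):
    """Gaussian tail probability Q(x) = Pr(N(0,1) > x)."""
    return 0.5 * (1.0 - erf(x / sqrt(2.0)))

def gaussian_detector(threshold, mu, sigma):
    """H0: r = n vs H1: r = mu + n, with n ~ N(0, sigma^2). The LRT is
    monotone in r, so it reduces to 'decide H1 if r > threshold'.
    Returns (P_F, P_D) for that threshold."""
    return q(threshold / sigma), q((threshold - mu) / sigma)

# Sliding the threshold traces the ROC curve: P_F and P_D rise and fall together.
roc = [gaussian_detector(th, 2.0, 1.0) for th in (0.0, 1.0, 2.0, 3.0)]
```

At the midpoint threshold mu/2, symmetry gives P_F = 1 - P_D, and raising the threshold (being more conservative) lowers both probabilities, exactly the tradeoff described above.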
This is, of course, equivalent to the LRT, with a threshold derived from the constraint on P_F. (Show the Gaussian illustration again, where the decision threshold is now set by the integral of the H_0 distribution over R_1.)

It is very interesting that the same decision rule (the LRT) keeps coming up no matter what criterion we use; recall that in the inference setting, the optimal approach depended on the criterion. This can all be extended beyond two models by doing pairwise comparisons. Using the Neyman-Pearson approach, we can also set up a null-hypothesis test, where we know only one model completely and want to see whether the data fit: we set one error probability and maximize the probability of detection without knowing the other model. (Perhaps show this graphically again.)

3. Unknowns. What if there are unknown parameters in the models? For example: H_0: r = n; H_1: r = \mu + n, where n ~ N(0, \sigma^2), the parameter \sigma^2 is known, but the parameter \mu is unknown. Given the received data r, which is the more likely model/hypothesis? If we have a reasonable prior on the parameter, we can take a Bayesian approach and integrate over it:
p(r|H_1) = \int p(r|H_1, \mu)\, p(\mu)\, d\mu
Another approach, which is perhaps more common, is the generalized LRT (GLRT): do parameter estimation and detection simultaneously. The steps are:
(a) Find the MLE of the unknown parameter from the data under each model.
(b) Use the parameter estimate in the model and then perform the LRT.
In the example above, we would find the mean of the data to estimate \mu, and then use this estimate: H_0: r = n; H_1: r = \hat{\mu} + n.

4. Performance. To determine the performance of the detector, we need to know the statistics of the likelihood ratio under each hypothesis.

Notes for ECE 8833 Lecture 2: Math review (1/08/2009)

1. Linear vector spaces. We will view vectors as collections of numbers x = (x_1, x_2, ..., x_n), so that x \in R^n. We need to define a way to measure length and distance, usually based on the 2-norm and inner product:
\langle x, y \rangle = \sum_{i=1}^n x_i y_i, \qquad \|x\|_2 = \sqrt{\langle x, x \rangle} = \left( \sum_{i=1}^n x_i^2 \right)^{1/2}
Vectors are orthogonal if \langle x, y \rangle = 0, and orthonormal if in addition they have unit norm.

We will often want to generalize this to the p-norms, where we define length by
\|x\|_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}
Special cases of interest:
\|x\|_1 = \sum_i |x_i|, \qquad \|x\|_\infty = \lim_{p \to \infty} \|x\|_p = \max_i |x_i|, \qquad \|x\|_0 = \lim_{p \to 0} \|x\|_p^p = \#\{i : x_i \neq 0\}
The p-norms are convex for 1 \le p < \infty.

2. Bases and frames. A set of n vectors \{b_i\}, where b_i \in R^n, is called a basis (or complete set) if for every x there exists a set of coefficients \{c_i\} such that we can write
x = \sum_{i=1}^n c_i b_i
Note that the \{b_i\} must be linearly independent, meaning that b_i \neq \sum_{j \neq i} c_j b_j for any set of coefficients \{c_j\}. Note also that \{b_i\} does not need to be orthonormal, or even orthogonal; but if it is, we call it an orthonormal basis (ONB), and c_i = \langle x, b_i \rangle is the unique set of coefficients representing x. In an ONB we have Parseval's theorem, which tells us that the energy in the coefficients is the same as the energy in the signal: \|x\|^2 = \sum_i c_i^2.

A set of m vectors \{\phi_i\}, where \phi_i \in R^n, is called a frame (an overcomplete set if m > n) if for every x
A \|x\|^2 \le \sum_{i=1}^m |\langle x, \phi_i \rangle|^2 \le B \|x\|^2
for some constants 0 < A \le B < \infty. For the finite cases we consider, the main concern is that A > 0.
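A concrete numerical check: three unit vectors spaced 120 degrees apart in R^2 (the "Mercedes-Benz" frame) form a tight frame with A = B = 3/2, so the coefficient energy is exactly (3/2) times the signal energy for every x:

```python
import numpy as np

# Three unit vectors in R^2 at 120-degree spacing: a tight frame with A = B = 3/2.
angles = np.pi / 2 + np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
phi = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # rows are phi_i

x = np.array([0.3, -1.7])
c = phi @ x                                  # frame coefficients <x, phi_i>
energy_ratio = np.sum(c**2) / np.sum(x**2)   # equals A = 3/2 for every x
x_rec = (2.0 / 3.0) * phi.T @ c              # dual vectors (1/A) phi_i recover x
```

Because the frame is tight, the canonical duals are just scaled copies of the frame vectors, and reconstruction is a single weighted sum, illustrating why tight frames are the easy special case.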
orthogonal if ⟨x, y⟩ = 0, and orthonormal if they also have ‖x‖ = 1. We will often want to generalize this to the ℓp norms, where we define length by

‖x‖_p = (Σᵢ |xᵢ|^p)^{1/p}.

Special cases of interest: ‖x‖₁ = Σᵢ |xᵢ|; ‖x‖_∞ = lim_{p→∞} ‖x‖_p = maxᵢ |xᵢ|; ‖x‖₀ = lim_{p→0} ‖x‖_p^p = Σᵢ 1(xᵢ ≠ 0). The ℓp norms are convex for 1 ≤ p < ∞.

2. Bases and frames. A set of n vectors {bᵢ} with bᵢ ∈ ℝⁿ is called a basis (or complete set) if for every x there exists a set of coefficients {cᵢ} such that we can write x = Σᵢ₌₁ⁿ cᵢ bᵢ. Note that {bᵢ} must be linearly independent, meaning that bᵢ ≠ Σ_{j≠i} cⱼ bⱼ for any set of coefficients {cⱼ}. Note that {bᵢ} does not need to be orthonormal, or even orthogonal, but if it is we call it an orthonormal basis (ONB), and cᵢ = ⟨x, bᵢ⟩ is the unique set of coefficients representing x. In an ONB we have Parseval's theorem, which tells us that the energy in the coefficients is the same as the energy in the signal: ‖x‖² = Σᵢ cᵢ². A set of m vectors {bᵢ} with bᵢ ∈ ℝⁿ is called a frame (an overcomplete set if m > n) if for every x

A‖x‖² ≤ Σᵢ₌₁ᵐ |⟨x, bᵢ⟩|² ≤ B‖x‖²

for some constants 0 < A ≤ B < ∞. For the finite cases we consider, the main concern is A > 0. A frame is a generalization of a basis (no longer a basis when m > n): we can still find coefficients such that x = Σᵢ cᵢ bᵢ, but the solution is no longer unique. There does exist a set of vectors called the canonical dual set {b̃ᵢ} such that x = Σᵢ ⟨x, b̃ᵢ⟩ bᵢ, but in general b̃ᵢ ≠ bᵢ. Special cases: A = B is called a tight frame, and in this case the dual set is easy to find, b̃ᵢ = (1/A) bᵢ; A = B = 1 implies that the frame is an ONB.

3. Matrices. Given a square n × n matrix A, an eigenvector v satisfies Av = λv for some scalar eigenvalue λ. The eigendecomposition of a matrix is A = PDP⁻¹, where D is a diagonal matrix of the eigenvalues and P is a matrix with the eigenvectors on its columns. When A is symmetric, the eigenvectors are orthogonal; this implies that P⁻¹ = Pᵀ, so we can write A = Σᵢ λᵢ vᵢ vᵢᵀ. A matrix A is called positive definite if sᵀAs > 0 for all s ≠ 0; this implies that all of its eigenvalues are positive.

4. What is probability? Frequentist view: the relative number of occurrences of an event if the number of trials
went to infinity. This makes us feel good by relating probability to experiments, but it is kind of a fallacy since we're relying on a limiting argument. Bayesian view: a measure of our belief or certainty about a potential outcome ("70% chance of rain today"). More abstract, and only related to the outcome of a single event, but a powerful tool to make decisions with. Axiomatic view: none of this matters; let's just define some functions and move on.

Sample space: consider an experiment with a set of possible outcomes Ω (could be a continuous or discrete set); Ω is called the sample space. Ex: Ω = {1, 2, 3, 4, 5, 6}, Ω = {red, blue}, or Ω = ℝᴺ. Probability: consider a subset of the possible events, A ⊆ Ω, and define Pr[A] as the probability that A occurs. Probability satisfies the following axioms: (a) Pr[A] ≥ 0; (b) Pr[Ω] = 1; (c) Pr[A ∪ B] = Pr[A] + Pr[B] if A ∩ B = ∅.

Random variable (RV): some sets are numeric, but some are not, so we need a way to map events to numbers in order to talk about them in the same terminology. A random variable X maps the sample space to a portion of the real line; an RV can therefore be thought of as a function. For our purposes, X can be thought of as the outcome of an experiment. Ex: X: Ω → ℝ with X(red) = 0, X(blue) = 1.

Distribution functions: we need to consider two cases, continuous and discrete. (a) Continuous RV: cumulative distribution function (CDF) P_X(x) = Pr[X ≤ x]; probability density function (PDF) p_X(x). Properties: p_X(x) ≥ 0 and ∫ p_X(x) dx = 1. Interpretation: p_X(x)Δ ≈ Pr[X ∈ (x, x + Δ]]. (b) Discrete RV: probability mass function (PMF) p_X(x) = Pr[X = x]; cumulative mass function (CMF) P_X(x) = Pr[X ≤ x] = Σ_{xᵢ≤x} p_X(xᵢ). Properties: p_X(x) ≥ 0 and Σᵢ p_X(xᵢ) = 1. In the continuous case, Pr[X ≤ x] = ∫_{−∞}^{x} p_X(α) dα.

Expectation and variance: given a function f of a random variable with density p_X, the expected value of the function is E_X[f(X)] = ∫ f(x) p_X(x) dx = μ (continuous) or E_X[f(X)] = Σᵢ f(xᵢ) p_X(xᵢ) = μ (discrete). E[αX] = αE[X]; E[α + X] = α + E[X]. The variance is defined by Var[X] = E_X[(X − E_X[X])²] = E_X[X²] − (E_X[X])² = σ². Var[αX] = α²Var[X]; Var[α + X] = Var[X].

Two or more RVs: in order to consider multiple RVs, we need joint probabilities, P_{XY}(x, y) = Pr[X ≤ x ∩ Y ≤ y] = ∫_{−∞}^{x} ∫_{−∞}^{y} p_{XY}(α, β) dα dβ. Marginal densities are given by integrating/summing out the other variables, e.g. p_X(x)
= ∫ p_{XY}(x, y′) dy′ (continuous), or p_X(xᵢ) = Σⱼ p_{XY}(xᵢ, yⱼ) (discrete). Expectations add: E[X + Y] = E[X] + E[Y]. Covariance is a measure of joint variation:

Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y].

Variances DO NOT ADD in general: Var[X + Y] = Var[X] + 2Cov[X, Y] + Var[Y]. Conditional probabilities: what is the probability distribution on X if we know that Y = y?

p_{X|Y}(x | Y = y) = p_{XY}(x, y) / p_Y(y).

Independence: two or more RVs are independent if p_{XY}(x, y) = p_X(x) p_Y(y). This also implies p_{X|Y}(x|y) = p_X(x); in other words, the outcome of Y does not affect X. When two RVs are independent, Var[X + Y] = Var[X] + Var[Y]. Correlation is related to covariance: Cor[X, Y] = E[XY]. If E[XY] = E[X]E[Y], then Cov[X, Y] = 0 and we call these RVs uncorrelated. Uncorrelatedness is a weaker notion than independence: independence implies uncorrelatedness, but not the other way around.

Bayes' theorem:

p(x|y) = p(y|x) p(x) / p(y).

This implies the product rule (chain rule): p(x, y) = p(y|x) p(x) = p(x|y) p(y). Think about this in the context of decision making: consider a system with unknown input X and observed output Y. Bayes' theorem says that we can reason about which input was most likely if we know the system behavior p(y|x) and the probability distribution on the input p(x).

Random vectors: we can extend the idea of multiple random variables to a random vector X = [X₁, …, X_N]ᵀ. Many of the notions we have already defined extend exactly as you would expect. PDF: p_X(x₁, …, x_N). Expectation: E[X]. Covariance matrix:

K_X = E[(X − E[X])(X − E[X])ᵀ] = E[XXᵀ] − E[X]E[X]ᵀ,

with [K_X]_{ij} = Cov[Xᵢ, Xⱼ] and [K_X]_{ii} = Var[Xᵢ]. K_X is a symmetric, positive-semidefinite matrix (more on this later). If M is a K × N matrix, then multiplying X by M affects the mean and covariance: E[MX] = M E[X] and Cov[MX] = M K_X Mᵀ.

Common continuous distributions: (a) Uniform (continuous), X ~ Uniform(a, b): p(x) = 1/(b − a) for a < x < b, 0 otherwise. (b) Exponential, X ~ Exponential(λ): p(x) = λe^{−λx} for x ≥ 0, 0 otherwise. (c) Laplacian, X ~ Laplacian(μ, σ²): p(x) ∝ e^{−λ|x−μ|}. (d) Gaussian/normal, X ~ N(μ, σ²):

p(x) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)}.

Central limit theorem (CLT). Kolmogorov: "the CLT is a dangerous tool in the hands of amateurs." Given iid RVs {Xᵢ} with E[Xᵢ] = 0 and Var[Xᵢ] = σ², the distribution of (1/√N) Σᵢ Xᵢ goes to N(0, σ²) as N → ∞.

Common discrete distributions: (a)
Bernoulli, X ~ Bernoulli(p): p_X(x) = 1 − p for x = 0, p for x = 1, 0 otherwise. (b) Uniform (discrete), X ~ Uniform(m): p_X(x) = 1/m for x ∈ {0, 1, …, m − 1}, 0 otherwise.

Notes for ECE8833, Lecture 17: Sequential decision making vs. Bayesian inference (3/10/2009)

1. Decision theory vs. Bayesian inference. To make decisions we require evidence, prior knowledge, and costs/rewards. Isn't this Bayesian? Bayesian inference requires the posterior distribution to form decisions, confidence intervals, expected costs, etc. Shadlen et al. (2008) argue that we don't really need the full posteriors, and that it is more plausible to think of using summary statistics to make decisions. For example, assume we get data from two sources. Two possibilities: combine them to make one optimal estimate, or make two guesses and combine those. The first possibility combines the whole distributions to form one optimal posterior and then makes a decision; the second makes individual decisions, which can be easily summarized and given confidence values, and combines those decisions. This is a subtle difference, but it would make a big difference in the implementation.

2. Sequential hypothesis testing. Previously we talked about making a decision about a hypothesis based on r, which was our total observation (scalar, vector, etc.). The main result was the LRT:

Λ(r) = p(r|H₁)/p(r|H₀) ≷_{H₀}^{H₁} η.

The threshold η is often chosen to set one of the error probabilities (e.g., P_F); this basically sets our detection probability P_D. But what if we get our data sequentially, so that r_L = [r₁, r₂, …, r_L]? Could we make a decision on the fly? There is a fundamental difference here: instead of just deciding between hypotheses, we also need to determine when we have enough information to make a good decision.

3. Sequential LRT. Introduce a third decision region representing "I don't know" if you haven't seen enough data. This requires two thresholds, A₀ and A₁:

Λ(r_L) < A₀: say H₀;  A₀ < Λ(r_L) < A₁: say "I don't know yet";  A₁ < Λ(r_L): say H₁.

ECE8833/BME8813 Special Topics: Information Processing Models in Neural Systems, Lecture 7 (1/27/2009)

Combining colors: parvocellular cells have single-opponent RFs.
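The sequential LRT above can be sketched numerically. A minimal sketch of Wald-style sequential testing for unit-variance Gaussian hypotheses H₀: rᵢ ~ N(0, 1) vs. H₁: rᵢ ~ N(μ, 1); the function name, drift μ, and log-thresholds below are illustrative choices, not values from the notes:

```python
def sprt(samples, mu=1.0, logA0=-2.2, logA1=2.2):
    """Accumulate the per-sample log-likelihood ratio for N(mu,1) vs N(0,1)
    and stop as soon as it crosses log(A0) or log(A1)."""
    llr = 0.0
    for k, r in enumerate(samples, 1):
        llr += mu * r - mu * mu / 2.0   # log LR of one unit-variance Gaussian sample
        if llr <= logA0:
            return "H0", k              # enough evidence for H0
        if llr >= logA1:
            return "H1", k              # enough evidence for H1
    return "undecided", len(samples)    # the "I don't know yet" region

# Steady evidence of strength 1.0 adds 0.5 to the log LR per sample,
# so the upper bound 2.2 is crossed on the 5th sample.
decision, n = sprt([1.0] * 10)
```

The third return value is exactly the "I don't know yet" region: the test keeps collecting data until one of the two thresholds is crossed.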
[Figure: parvocellular receptive fields; cone absorption spectra, normalized absorbance across violet, blue, green/yellow, orange, and red.]

Optimal color transformation: Buchsbaum and Gottschalk (1983) propose an information-theoretic approach using a simple model of the information transformation from phototransduction to the optic nerve. What form should the channels A (achromatic), P, and Q take to make communication most efficient? Efficient transforms: as before, we want to decorrelate variables to maximize entropy efficiency, and the eigendecomposition is used to decorrelate and compact energy (remember the distinction between white data, independent data, and correlated data). Calculations: compute the eigendecomposition of the cone-response covariance matrix

C = [ C_rr  C_rg  C_rb ;  C_rg  C_gg  C_gb ;  C_rb  C_gb  C_bb ],

writing C = EΛ_S Eᵀ, under the assumption of monochromatic images. Properties: the decorrelating transform maps (R, G, B) to (A, P, Q); it is unique; A, P, and Q are orthogonal; energy is concentrated in A, then P, then Q; one and only one eigenvector (the one producing A) is all positive; the next eigenvector has one zero crossing; the last has two zero crossings. Results: the eigenvectors show that color should be expressed as luminance plus opponency. Are we smarter? This is an interesting early study taking a system-design view, and it predicts a combination of luminance and opponent processing. But the monochromaticity assumption is weak; the model is hard to evaluate apart from space, time, and noise; and the opponent structure is true of ANY set of eigenvectors and is not specific to this system.

ECE8833/BME8813 Special Topics: Information Processing Models in Neural Systems, Lecture 12 (2/17/2009)

Prior beliefs p(a): causes are independent, p(a) = Πᵢ p(aᵢ); causes are sparse, p(aᵢ) ∝ e^{−C(aᵢ)}; noise is Gaussian, p(I|a) ∝ e^{−‖I − Φa‖²}. Infer the aᵢ via the maximum a posteriori (MAP) estimate:

p(a|I) ∝ p(I|a) p(a),
â = argmax_a p(a|I) = argmin_a [−log p(a|I)] = argmin_a ‖I − Φa‖² + λ Σᵢ C(aᵢ).

Algorithms for inference:

â = argmin_a E,  E = ‖I − Φa‖² + λ Σᵢ C(aᵢ).

We need to solve this optimization efficiently. The ideal cost function would simply count the number of nonzero coefficients, known as the ℓ0
norm; this results in an intractable combinatorial optimization. Sparse cost functions: how do we choose the cost function C? Choosing C to count nonzeros is intuitive but intractable; if C is convex, the minimization is tractable. ℓ2 norm: easy, but not sparse (Gaussian prior). ℓ1 norm: tractable and sparse (Laplacian prior). ℓp norm with p < 1: difficult, but very sparse.

Sparse approximation in DSP: there are two main approaches to finding sparse coefficients. The first uses heuristics in place of direct optimization; these are often very fast, but offer few performance guarantees. Greedy algorithms, e.g. Matching Pursuit (MP): at the kth iteration, find the dictionary element best matching the residual,

θ_k = argmax_m ⟨R_{k−1}, φ_m⟩,

and update the residual,

R_k = R_{k−1} − φ_{θ_k} ⟨R_{k−1}, φ_{θ_k}⟩

(Mallat & Zhang 1993). The second approach softens the cost function to be convex; fast algorithms here are a very active area of current research. Convex relaxation, e.g. Basis Pursuit Denoising (BPDN):

min_a ‖a‖₁  s.t.  ‖I − Σ_m a_m φ_m‖₂ ≤ ε

(Chen, Donoho & Saunders 1999). Computing coefficients: a convex objective means we could simply use gradient descent to infer the coefficients; up to constant factors,

−dE/daᵢ = bᵢ − Σⱼ G_{ij} aⱼ − λ C′(aᵢ),  where  bᵢ = Σ_{x,y} φᵢ(x, y) I(x, y),  G_{ij} = Σ_{x,y} φᵢ(x, y) φⱼ(x, y).

Current research: BPDN is a special class of optimization problem known as a quadratic program (QP), and there has been significant recent activity seeking specialized QP solvers for the BPDN problem: gradient-based algorithms, interior-point algorithms, homotopy algorithms, network solutions. Can a network of neurons using lateral and feedback connections compute sparse codes? Gradient descent will drive coefficients toward zero but never make them identically zero (solutions approach sparse support but cannot reach it), and it requires communication between all overlapping nodes. Proposal: locally competitive algorithms (LCAs; Rozell, Johnson, Baraniuk and Olshausen 2008). LCAs are dynamical systems built from simple computational primitives: leaky integration of an internal state u_m(t), a nonlinear thresholding a_m(t) = T_λ(u_m(t)), and inhibition/excitation between overlapping nodes.
The LCA dynamics are

τ u̇_m(t) = b_m(t) − u_m(t) − Σ_{n≠m} G_{m,n} a_n(t),

where b_m(t) = ⟨φ_m, s(t)⟩ is the driving input, G_{m,n} = ⟨φ_m, φ_n⟩ is the inhibition weight between overlapping nodes, and a_m(t) = T_λ(u_m(t)). Sparse approximation via LCAs: with the energy

E = ‖s − Σ_m a_m φ_m‖² + λ Σ_m C(a_m),

the dynamics satisfy u̇_m ∝ −∂E/∂a_m, with u_m = a_m + λ ∂C(a_m)/∂a_m. Interpretation: this is a "warped" gradient descent. It enables a thresholder T_λ(·) without plateaus in E near u_m = 0, and it favors coefficients that are already active. E(t) is nonincreasing if T_λ(·) is nondecreasing:

dE/dt = Σ_m (∂E/∂a_m)(∂a_m/∂u_m) u̇_m ∝ −Σ_m (∂a_m/∂u_m)(∂E/∂a_m)² ≤ 0.

Cost/threshold correspondence: the threshold λ controls the MSE/sparsity tradeoff, and each threshold function corresponds to a cost function through u_m = a_m + λ C′(a_m); e.g., the family

T_{(γ,α,λ)}(u_m) = (u_m − αλ) / (1 + e^{−γ(u_m − λ)})

(threshold λ, transition speed γ, additive offset α). Special cases: the hard-threshold LCA (HLCA, α = 0, γ → ∞) corresponds to an ideal, ℓ0-like cost, while the soft-threshold LCA (SLCA, α = 1, γ → ∞) corresponds to the ℓ1 (BPDN) cost. Sparsity: simulation results on 32×32 image patches compare the MSE/sparsity tradeoffs of HLCA, SLCA, thresholded BPDN, and the MAP solution.

Regularity: sparsity is a refinement of the bandlimited model that includes richer coefficient structure. How do we build on this hierarchy to develop still richer models? While images are often sparse, they also exhibit tremendous regularity: among coefficients, in space, and in time. For time-varying signals with dictionary Φ, one approach solves, frame by frame,

min ‖a(t_n)‖₀  s.t.  ‖y(t_n) − Σ_m φ_m a_m(t_n)‖ ≤ ε,

which yields erratic coefficient time courses; a second approach adds temporal memory so the coefficients vary smoothly. Model for time-varying images: I(x, y, t) = Σᵢ aᵢ(t) φᵢ(x, y) + noise. Learned dictionaries: what do space-time dictionary elements learn? LCA properties: LCAs are dynamical systems with state, and their solutions display an inertia property that reflects temporal regularity. Algorithms that just search for sparse coefficients at each frame introduce a brittleness during changes.
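The LCA dynamics above can be sketched with simple Euler integration. This is a minimal sketch of the soft-threshold (SLCA) case only; the dictionary, step sizes, and λ below are illustrative, and the trivial orthonormal dictionary is chosen so the fixed point is easy to verify:

```python
import numpy as np

def slca(s, Phi, lam=0.1, tau=10.0, dt=1.0, n_steps=500):
    """Integrate tau*du/dt = b - u - (G - I)a with a = T_lambda(u) (soft threshold)."""
    b = Phi.T @ s                      # driving inputs b_m = <phi_m, s>
    G = Phi.T @ Phi                    # inhibition weights G_mn = <phi_m, phi_n>
    u = np.zeros(Phi.shape[1])
    soft = lambda u: np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)
    for _ in range(n_steps):
        a = soft(u)
        u += (dt / tau) * (b - u - (G - np.eye(len(u))) @ a)
    return soft(u)

Phi = np.eye(3)                        # trivial orthonormal "dictionary"
s = np.array([1.0, 0.0, 0.5])
a = slca(s, Phi)
# With an orthonormal dictionary there is no lateral inhibition, u -> b,
# and the output converges to the soft-thresholded coefficients.
```

Note how the second coefficient is driven exactly to zero by the thresholder, which plain gradient descent on the ℓ1-penalized objective never achieves in finite time.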
Such irregularity and encoding artifacts confound our ability to understand dynamic image content. LCA regularity: in a simulation on 144×144 video [Figure: ratio of changing to active coefficients per frame for HLCA and MP], coefficients used: HLCA ≈ 36, MP ≈ 28; coefficients replaced: HLCA ≈ 28, MP ≈ 85; conditional entropy: HLCA 0.07 bits/state, MP 0.14 bits/state. LCA extensions: despite the utility of the sparse coding model, we know the coefficients are NOT independent. Is it possible to include explicit regularity terms in the cost function to capture more of this dependence? We have recently shown that more complex models can also be inferred in the LCA architecture, e.g. block-ℓ1, C(a) = Σᵢ (Σ_{n∈Nᵢ} a_n²)^{1/2}, and reweighted ℓ1, C(a) = Σ_k λ_k |a_k|.

ECE8833/BME8813 Special Topics: Information Processing Models in Neural Systems, Lecture 8 (2/3/2009)

Synaptic strength: how do some synapses get stronger and produce bigger EPSPs than others? [Figure: synapse schematic with presynaptic terminal, postsynaptic density, EPSP/IPSP traces; the presynaptic input is analog, the action-potential output digital.] Grow a backbone: dendrites have spines; increased surface area leads to more neurotransmitter receptors, and more receptors lead to bigger and more reliable EPSPs. Spines can change: synapses change their strength dynamically (known as plasticity), which is thought to enable memory and learning and to be due to changes in the spines (spine morphologies include simple, fenestrated, horseshoe, and segmented; Hering & Sheng 2001; Segura et al. 2007). Size does matter: the volume of a spiny projection is correlated with synaptic strength (Matsuzaki et al. 2001). Synaptic properties: Varshney et al. (2006) use
information theory to address synaptic properties. Connectivity is sparse, so most neurons are not connected; synapses are noisy, with high variability in EPSP generation; and the distribution of connection strengths has heavy tails. Viewpoint: by analogy with Shannon's model of communication (source → encoder → channel with noise → decoder → destination), Varshney et al. model memory storage as present → encoder → storage (with in-situ and retrieval noise) → decoder → future. Each synapse is a channel usage: increasing the capacity increases the number of memories we can store. Synaptic SNR: the synaptic weight A is the mean EPSP level, and the synaptic noise A_N is the EPSP standard deviation (Carandini et al. 2007); the SNR is taken to be related to volume through A²/A_N² = V/V_N. Optimal properties: volume is a scarce resource, so we want to maximize storage capacity per unit volume. Assume the noise is Gaussian and the synaptic weights independent, and let V₀ be the accessory (overhead) volume. Then

C_synapse = ½ ln(1 + A²/A_N²) = ½ ln(1 + V/V_N) nats/synapse,

and we maximize the capacity per volume, C_synapse/(⟨V⟩ + V₀). Small overhead yields the best capacities with noisy synapses; large overhead reduces capacity and yields higher-SNR synapses. The best delay properties arise when V₀ ≈ V ≈ V_N, implying V₀ small. [Figure: information storage capacity vs. normalized average synapse volume ⟨V⟩/V_N for several overheads.] Distribution properties: the previous results were only for the average ⟨V⟩; what is the optimal distribution on V? The previous results extend to other distributions pretty well, but will not give the detail necessary here; a Gaussian is probably not a good model because of positivity constraints, etc. Discrete-states model: a noiseless model that treats EPSPs as being binned into discrete states; the mutual information reduces to the input entropy because there is no noise. Optimize the
distribution: maximize the synapse-state entropy per unit average volume ⟨V⟩. Result: pᵢ ∝ e^{−βVᵢ}, where Vᵢ is the volume of state i; note that p₀ is the probability of NO synapse. Distribution statistics: we can calculate the average volume and the filling fraction; we probably want the average volume comparable to the wiring-volume overhead. [Figure: filling fraction f and normalized average synapse volume ⟨V⟩/V_N vs. accessory wiring volume V₀/V_N.] Deriving cost functions: given the parameters we can optimize storage capacity per volume, but we cannot optimize over the parameters. Reversing the problem, assuming an input and a channel, we can derive the optimal cost: for Gaussian input and noise, V ∝ A². We can also calculate the optimal cost from experimental data: calculated from data on synaptic weight distributions and noise levels, the cost is a power law in the mean EPSP amplitude with exponent ≈ 0.48.

ECE8833/BME8813 Special Topics: Information Processing Models in Neural Systems, Lecture 18 (3/12/2009)

Sequential decision theory (nice review by Gold & Shadlen 2007). (a) Sequential analysis framework: accumulate pieces of evidence e₀, e₁, e₂, … and form a decision variable from them. (b) Symmetric random walk: the accumulated evidence drifts until it hits an upper bound (choose H₁) or a lower bound (choose H₂); the mean of e, i.e. the mean drift rate, depends on the strength of the evidence. (c) Race model: two separate accumulators, one per hypothesis, race to their own bounds. Vibrotactile frequencies: determine whether the second stimulus was higher or lower in frequency than the first; this requires working memory. S1 neurons encode only the sensory evidence, not the decision; neurons in premotor cortex start to look like decisions. Random dot motion: determine the direction of dot motion; accuracy improves with time.
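The symmetric random walk in (b) can be sketched directly. This is a minimal sketch, not any model from the cited papers; the bound, drift, and noise values are illustrative, and the drift plays the role of the mean evidence strength:

```python
import random

def random_walk_decision(drift, bound=5.0, noise=1.0, max_steps=10000, rng=None):
    """Accumulate noisy evidence until it crosses +bound (H1) or -bound (H2)."""
    rng = rng or random.Random(0)
    x, t = 0.0, 0
    while abs(x) < bound and t < max_steps:
        x += drift + rng.gauss(0.0, noise)   # one noisy sample of evidence
        t += 1
    return ("H1" if x > 0 else "H2"), t

# With positive drift (evidence favoring H1), most walks hit the upper bound,
# and accuracy grows with drift strength, as in the accuracy-vs-coherence curves.
rng = random.Random(1)
trials = [random_walk_decision(0.5, rng=rng) for _ in range(200)]
accuracy = sum(c == "H1" for c, _ in trials) / len(trials)
```

Raising the bound trades longer decision times for higher accuracy, which is the speed/accuracy tradeoff seen in the reaction-time experiments below.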
Stimulating the frontal eye field to induce a saccade produces up/down deviations that reflect confidence in the decision, due to SNR and viewing time; the animal is not carrying out the central decision. [Figure: proportion correct and saccade deviation vs. viewing duration for several coherences.] Random dot reaction time: the subject chooses the decision time, and both accuracy and time improve with SNR. LIP neurons show evidence being integrated until the decision, and their inputs seem to come from a difference of MT cells. [Figure: MT and LIP firing rates aligned to motion onset and to the saccade; percent correct and mean RT vs. motion strength.] Reaction time under stimulation: stimulating MT appears to increase the rate of evidence accumulation (the stimulation adds cumulatively to the decision variable in LIP), while stimulating LIP appears to shift the evidence level (the stimulation adds a constant). Stimulating rightward MT neurons has a relatively large effect on choice and RT, equivalent to added rightward motion; stimulating right-choice LIP neurons has a small effect on choice and a modest effect on RT, not equivalent to added rightward motion. Detection of a vibrotactile stimulus: the stimulus is presented at an unknown delay and the monkey indicates its presence. When do you start integrating evidence? A leaky integrator. S1 cells respond more strongly for higher-amplitude stimuli; premotor cells respond strongly for "yes" decisions.
Visual categorization: Yang & Shadlen (2007) test monkeys on a sequential categorization task in which each shape contributes to the probability that the monkey should look to the red or the green target. [Figure: trial timeline (fixation, targets on, four shapes appearing at 500 ms intervals, saccade at 2400-2600 ms); assigned shape weights in log-likelihood-ratio units; performance of monkeys J and H as a function of the evidence for red (log LR), with fitted logistic coefficients tracking the subjective weight of evidence.] Data from LIP (averages): LIP responds more strongly over time for one decision, and average firing rates increase more with stronger evidence across the four shape epochs. Effect of individual shapes: if LIP is performing a sequential LRT, we should see a change in firing rate for each new piece of evidence, and the average change in firing rate per epoch indeed tracks the log LR of the newly added shape.

ECE8833/BME8813 Special Topics: Information Processing Models in Neural Systems, Lecture 19 (3/24/2009)

Projects: the website is updated. Things coming up: proposal; progress report; presentations (20 or 25 minutes depending on group size; Day 1: Balavoine, Kailas, and Millard; Day 2: Harper and Guvanasen); report (15 pages); personal evaluation. Is energy important? Attwell and Laughlin (2001) model the energy consumed by gray matter. The brain is 2% of body weight but uses 20% of the body's resting metabolism; energy availability could limit brain size, circuitry, activity patterns, etc. Energy needs:
postsynaptic currents, action potentials, and neurotransmitter uptake. Energy distribution: how much does each of these processes contribute to energy consumption? For the distribution of ATP consumption at a given mean action-potential rate: action-potential production ≈ 47%, synaptic signaling ≈ 40% (postsynaptic receptors ≈ 34%, glutamate recycling ≈ 3%, presynaptic Ca²⁺ ≈ 3%), and overhead (resting potentials) ≈ 13%. Distributed coding: how should a population of cells encode 100 different categories? Denote the resting energy by R and the active signaling energy by A. Candidate 1: 100 cells with one active; energy 100R + A. Candidate 2: 15 cells with two active; energy 15R + 2A. The optimal distribution of the code depends on the relationship between A and R, which depends on the firing rate; the firing rate in turn depends on the temporal resolution. This implies sparsity, but not to an extreme: distributed coding improves energy efficiency, and higher temporal resolution implies sparser coding. [Figure: energy usage vs. number of cells active to encode 1 of 100 conditions; optimal fraction active vs. action-potential frequency.] Synaptic weights: sparsity assumes a connection pattern and tries to minimize activity. Sparse coding results (Olshausen & Field 1996) try to adjust connections to minimize activity, but there is a metabolic cost to the connection itself. Vincent & Baddeley (2003) optimize retinal RFs while minimizing synaptic signaling costs. The energetic cost of a synapse is related to its strength (more neurotransmitter release and uptake, larger EPSPs); assume the cost is proportional to strength. Basic model: model retinal RFs as linear combinations of photoreceptor outputs, y = Wx; there may be convergence (i.e., y is smaller than x). With an error signal e between the image and its reconstruction from y, we want to minimize the MSE E[‖e‖²] while constraining the total synaptic strength on each target. Receptive fields: unsupervised learning on a database of images with the energy constraint yields center-surround organization; without the energy constraint it yields global RFs.
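The candidate-code comparison above (100 cells with one active vs. 15 cells with two active) can be checked by counting: N cells with k active can distinguish C(N, k) conditions. The helper names and the unit costs R = A = 1 below are illustrative:

```python
from math import comb

def min_cells(k, n_conditions=100):
    """Smallest N such that choosing k active cells out of N covers all conditions."""
    N = k
    while comb(N, k) < n_conditions:
        N += 1
    return N

def energy(N, k, R, A):
    """Resting cost for all N cells plus active-signaling cost for the k active ones."""
    return N * R + k * A

N1 = min_cells(1)                       # 100 cells, one active
N2 = min_cells(2)                       # 15 cells, two active: comb(15, 2) = 105 >= 100
e1 = energy(N1, 1, R=1.0, A=1.0)        # 100R + A
e2 = energy(N2, 2, R=1.0, A=1.0)        # 15R + 2A
```

When R and A are comparable, the distributed code wins easily; as A/R grows (high firing rates, fine temporal resolution), the balance shifts toward sparser codes, matching the lecture's conclusion.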
Effects of energy budget: RFs can do better at representing the image with higher energy budgets; center-surround may be an ideal compromise. Efficiency (MSE per total cost): the optimal number of output cells increases with the energy budget; a low budget favors convergence. Fits to retinal models: center-surround RFs in the retina are often modeled by a difference of Gaussians. Population properties: RFs show irregular organization with roughly uniform spacing, where the spacing ratio is the distance between centers divided by the sum of the radii; simulation results can fall within physiologic ranges, but this depends on the parameters. Retinal eccentricity: convergence increases with eccentricity; adjusting convergence to simulate eccentricity, the filters become broader and develop stronger surrounds, as seen in the retina. Population effects: Vincent et al. (2005) extend this idea to populations over the whole visual field; irregularly spaced photoreceptors lead to irregularly spaced center-surround RFs, and other population effects show the right qualitative behavior but significant quantitative differences.

ECE8833/BME8813 Special Topics: Information Processing Models in Neural Systems, Lecture 23 (4/6/2009)

Endstopping: visual cortex has cells that respond to line segments at a particular orientation, and these cells often respond preferentially to line segments of a certain length; extraclassical RF effects suppress longer lines. More endstopping: the effects are more pronounced when properties of the center (orientation, etc.) match the surround. What would explain this endstopping? Predictive coding (whitening) has explained spatial, temporal, and color processing in early vision; Rao and Ballard (1999) postulate that endstopping effects can also result from predictive coding, where prediction is generalized from our previous notions to include the Bayesian inference perspective.
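The difference-of-Gaussians (DoG) center-surround model mentioned above is easy to sketch. The amplitudes and widths below are illustrative, chosen only so the center excites and the surround inhibits:

```python
import math

def dog(r, a_c=1.0, s_c=1.0, a_s=0.5, s_s=3.0):
    """Difference of Gaussians at radius r: narrow excitatory center
    minus a broader, weaker inhibitory surround."""
    g = lambda s: math.exp(-r * r / (2.0 * s * s))
    return a_c * g(s_c) - a_s * g(s_s)

center = dog(0.0)   # positive response at the RF center
flank = dog(4.0)    # negative response out in the surround
```

The sign change with radius is the center-surround antagonism that the energy-constrained learning above recovers; varying the surround width and weight mimics the broader, stronger surrounds found at larger eccentricities.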
Basic model: hierarchical predictive coding. Feedback from higher cortical areas predicts the lower area, relying on an internal model based on scene statistics; feedforward signals from lower cortical areas carry the error in that prediction (error signal up, prediction down, with inhibition at each predictive-estimator stage). Statistical model: image I, synaptic weights U, neural responses r, and a sigmoidal nonlinearity f, with

I = f(U r) + noise,

and the lower-level responses r themselves predicted from the next level by a top-down estimate r^td. Area interactions: RFs increase in size at each higher level; lower coefficients r cover one local image patch (level 1), while higher coefficients r^h cover several patches (level 2). Learning: assume Gaussian noise models and a coefficient prior; find the MAP estimate of the coefficients given an image, and the MAP estimate of the basis given an image database. The second-level basis indicates co-activation patterns of the first-level coefficients. Error-signal qualities: in the model, the error signal is propagated on feedforward pathways to the higher level; in visual cortex, layers 2/3 send a large number of feedforward connections up the pathway, and layers 2/3 also show significant endstopping. Physiology experiments performed on the model: the error signal is smaller for a long stimulus. Why? As bar length increases, the top-down signal better matches the coefficients; natural images show increased autocorrelation in preferred directions.
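The error-signal idea above can be sketched in a linear toy version (dropping the sigmoid f for clarity); the function name and the toy matrices are illustrative, not the Rao & Ballard implementation:

```python
import numpy as np

def prediction_errors(I, U, r, U_h, r_h):
    """Feedforward residuals in a two-level linear predictive-coding hierarchy."""
    e0 = I - U @ r       # level-0 error: image minus the level-1 prediction
    e1 = r - U_h @ r_h   # level-1 error: responses minus the top-down prediction r^td
    return e0, e1

U, U_h = np.eye(2), np.eye(2)
I_img = np.array([1.0, 2.0])
r = np.array([1.0, 2.0])       # level-1 responses that explain the image exactly
r_h = np.array([1.0, 0.0])     # level-2 state that only partially predicts r
e0, e1 = prediction_errors(I_img, U, r, U_h, r_h)
# e0 vanishes; e1 is the nonzero residual sent feedforward to the higher area.
```

In this picture, a long bar is a stimulus the higher level predicts well, so e1 shrinks; a short bar is poorly predicted and e1 stays large, reproducing endstopping in the error-carrying units.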
The model matches the endstopping seen in data: model neurons behave like cortical neurons, and removing the predictive feedback in the model mimics layer-6 inactivation in cortex (with feedback intact, responses are suppressed for long bars; with feedback removed, the suppression largely disappears). [Figure: model vs. cortical neuron responses vs. bar length, with and without feedback; histograms of the degree of endstopping for the model with and without feedback.] Other extraclassical RF effects: a sparsity penalty is imposed to localize the RFs, and the responses show the same qualities seen in visual cortex (e.g., pop-out and texture effects: homogeneous vs. randomized surrounds, orientation contrast).

ECE8833/BME8813 Special Topics: Information Processing Models in Neural Systems, Lecture 6 (1/22/2009)

Facts of life in info theory: recall that distributions with higher entropy are considered maximally informative. The discrete distribution maximizing H(X) is the uniform; the continuous distribution maximizing h(X) depends on the constraints: scalar bounded in [a, b] → Uniform(a, b); scalar in [0, ∞) with mean μ → Exp(μ); scalar with variance σ² → N(0, σ²); vector with covariance K → N(0, K). Dept. of the Redundancy Dept.: remember, entropy is not bad; higher entropy means the signal is about as compressed as possible. AWGN channel: basic IT results concern digital communication, but digital symbols still get transmitted over an analog medium. Example: 0/1 gets mapped to a pulse signal with amplitudes in {−P, P}; 00/01/10/11 gets mapped to a pulse with amplitudes {−P, −P/2, P/2, P}. Let X be a continuous RV with power constraint P², and let the channel add Gaussian noise: Y = X + Z with Z ~ N(0, σ²). The capacity (bits/transmission) is

C = max_{p(x): E[X²] ≤ P²} I(X; Y) = ½ log₂(1 + P²/σ²).

IT and neuroscience: what
does any of this (Shannon's "A Mathematical Theory of Communication" diagram: information source → transmitter → signal plus noise source → receiver → destination) have to do with any of this (the brain: lobes, brainstem, cerebellum, neurons with dendrites, soma, axon)? What does it mean? Are the fields asking the same questions? What type of system do we see in neurons? Can we measure the quantities of interest? Are we measuring H(X) or h(X)? What are the effects of binning spikes? Entropy and mutual information depend on experimental choices of stimuli. We know the capacity of point-process channels; what does experiment tell us, and how do we interpret it? Information theory remains controversial in analyzing neural data. "The Bandwagon" (Shannon, IRE Transactions on Information Theory, 1956): "Seldom do more than a few of nature's secrets give way at one time. It will be all too easy for our somewhat artificial prosperity to collapse overnight when it is realized that the use of a few exciting words like information, entropy, redundancy, do not solve all our problems."

Top-down models: remember the design vs. analysis conflict. Rather than measure for analysis, what if we think about system design? Main principle: efficient systems should be operating near capacity for the signals they normally deal with; extra capacity amounts to wasted resources. Natural signals: what is there to do? Doesn't it take ≈8N bits to transmit an image with N pixels from our eye to our brain with good fidelity? Natural signals have structure, and structure reduces entropy. This means we can lower the channel capacity: we want to transform the stimulus into a more informative, higher-entropy encoding that uses fewer symbols. Problems: why not have the retina transform images into a code with a uniform distribution? The joint statistics are complicated and may be beyond the hardware constraints, and what about noise? Solution: divide the process into stages, taking care of simpler low-order structure first. What is the simplest structure? The power spectrum, with its 1/f² decay. Linear retinal processing: Atick & Redlich (1990) proposed a simple retinal model: photoreceptors x = s + v (transduction noise v), retinal processing w = Ax, ganglion cells y = w + δ (transmission noise δ).
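The AWGN capacity formula from earlier in this lecture is a one-liner worth sanity checking; the function name and numbers are illustrative:

```python
import math

def awgn_capacity(P2, sigma2):
    """Capacity in bits per transmission of Y = X + Z, Z ~ N(0, sigma2),
    under the power constraint E[X^2] <= P2."""
    return 0.5 * math.log2(1.0 + P2 / sigma2)

c = awgn_capacity(P2=3.0, sigma2=1.0)   # 0.5 * log2(4) = 1 bit/transmission
```

Doubling the signal power adds at most half a bit per transmission, which is why the efficient-coding arguments that follow focus on reshaping the signal rather than simply boosting power.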
taking care of the simpler, low-order structure first. What is the simplest structure? The power spectrum, which decays as 1/f^2.

Linear retinal processing
Atick & Redlich (1990) proposed a simple retinal model:
- Photoreceptors: x = s + v (transduction noise v)
- Retinal processing (transfer function A): w = Ax
- Ganglion cells: y = w + \delta (transmission noise \delta)

IT quantities
- Can we calculate I(x;s) or I(y;s)?
- Simplifying assumptions:
  - Signal and noise are independent.
  - Noise is white (uncorrelated): \langle v v^T \rangle = \sigma_v^2 I and \langle \delta \delta^T \rangle = \sigma_\delta^2 I.
  - Known signal correlation (i.e., power spectrum): \langle s s^T \rangle = R_0.
  - All distributions have maximum entropy; the max-entropy distribution under a covariance constraint is jointly Gaussian.

Calculations
  I(x;s) = \frac{1}{2} \log_2 \frac{\det(R_0 + \sigma_v^2 I)}{\det(\sigma_v^2 I)}
  I(y;s) = \frac{1}{2} \log_2 \frac{\det(A R_0 A^T + \sigma_v^2 A A^T + \sigma_\delta^2 I)}{\det(\sigma_v^2 A A^T + \sigma_\delta^2 I)}
- Mutual information decreases with increasing noise and with increasing signal correlations (the correlation bound holds with equality when R_0 is diagonal).

Channel capacity
Calculate the capacity of the output channel w -> y, constraining the output power \langle y^T y \rangle = \text{const}:
  C_{out}(y) = \max_{p(w):\, \langle y^T y \rangle = \text{const}} I(y;w) = \frac{1}{2} \log_2 \prod_i \frac{(A R_0 A^T + \sigma_v^2 A A^T)_{ii} + \sigma_\delta^2}{\sigma_\delta^2}

Redundancy
- How much capacity is being wasted? Define redundancy as the fraction of capacity being wasted:
  R = 1 - \frac{I(y;s)}{C_{out}(y)}
- Assume I(y;s) is constrained to the amount of information needed by the organism. How much is that? Assume I(y;s) = I(x;s).
- What choice of A lowers redundancy?

No noise
- What if there is no transduction noise (\sigma_v = 0)? The noise-free case was termed "redundancy reduction" by Barlow in 1961.
  C_{out}(y) = \frac{1}{2} \log_2 \prod_i \frac{(A R_0 A^T)_{ii} + \sigma_\delta^2}{\sigma_\delta^2}
- The optimal A makes A R_0 A^T diagonal: whitening. Then C_{out}(y) = I(y;s), so that R -> 0, with
  I(y;s) = \frac{1}{2} \log_2 \frac{\det(A R_0 A^T + \sigma_\delta^2 I)}{\det(\sigma_\delta^2 I)}

Noise
- But removing all the redundancy is bad when there is noise present: no robustness.
- Instead, minimize C_{out}(y) with a constraint on I(y;s), using a Lagrange multiplier:
  E(A) = C_{out}(y) - \lambda I(y;s)
- To solve: assume stationarity, R_0(m,n) = R_0(m-n); use the Fourier domain and numerical integration.
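The competition this optimization captures, whitening at high SNR versus smoothing at low SNR, can be sketched numerically. The Python toy below is an illustration under stated assumptions, not Atick & Redlich's actual variational solution: it models the filter gain as a whitening factor times a Wiener-style noise suppressor for an assumed 1/\omega^2 signal spectrum, and the function name `retinal_filter` is my own.

```python
import numpy as np

def retinal_filter(omega, sigma_n):
    """Toy retinal filter gain: a whitening factor (flattens an assumed
    1/omega^2 signal spectrum) times a Wiener-style factor that suppresses
    frequency bands where noise dominates the signal."""
    R = 1.0 / omega**2                 # assumed signal power spectrum
    whitening = 1.0 / np.sqrt(R)       # grows with frequency: high-pass
    wiener = R / (R + sigma_n**2)      # shrinks where noise dominates
    return whitening * wiener

omega = np.linspace(0.1, 10.0, 500)    # spatial frequency (arbitrary units)

high_snr = retinal_filter(omega, sigma_n=0.01)  # nearly noiseless input
low_snr = retinal_filter(omega, sigma_n=2.0)    # noisy input

# High SNR: gain rises monotonically with frequency (whitening, high-pass).
assert np.all(np.diff(high_snr) > 0)

# Low SNR: gain peaks at an intermediate frequency, then falls off
# (band-pass shading into low-pass smoothing).
peak_freq = omega[np.argmax(low_snr)]
assert 0.1 < peak_freq < 10.0
assert low_snr[-1] < low_snr.max() / 5
```

For this factorization the analytic gain is \omega / (1 + \sigma_n^2 \omega^2), so the low-SNR peak sits near \omega = 1/\sigma_n; the qualitative high-pass/band-pass/low-pass progression matches the SNR regimes described in the notes.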
Choose the unique solution that is local and rotationally invariant.

[Figure: predicted connections for a ganglion cell, a center-surround pattern; white is excitation and black is inhibition.]

Structure of A as SNR changes
- High SNR: high-pass filter; whitening of the spectrum.
- Low SNR: low-pass filter; smoothing.
- Medium SNR: band-pass filter; a balance between the competing goals.

Temporal properties
- Video is strongly correlated in time. The same procedure (Dong & Atick, 1995) shows analogous results for temporal whitening.
- Yes, RFs can have a temporal component.

LGN neurons
- Dan, Atick, and Reid (1996) examined LGN RFs in space and time.
- LGN responses temporally whiten natural scenes. [Figure: response power vs. temporal frequency is approximately flat for natural movies.]
- LGN responses DO NOT whiten white-noise stimuli. [Figure: response power vs. temporal frequency is not flat for white noise.]

ECE8833/BME8813 Special Topics: Information Processing Models in Neural Systems
Lecture 5, 01/20/2009

Information theory
- Shannon, C.E. (1948), "A Mathematical Theory of Communication," Bell System Technical Journal.
- "The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point."
- Source and sink can be separated in space or time.
- Enables basically every digital device and system.
- Information source -> encoder/transmitter -> channel -> receiver/decoder -> information sink.

The father of information theory
- Claude Shannon (1916-2001).
- MS thesis: Boolean algebra <-> electrical switches; the foundation of practical digital circuit design.
- Inventions: rocket-powered flying discs, a motorized pogo stick, a juggling machine, a Roman-numeral calculator, a mechanical mouse, a computer chess program, "the ultimate machine."

Fundamental questions
- How complex is a signal?
- How bad is a channel?
- What are the fundamental limits for information encoding and fidelity?

Signal complexity
- S is drawn from a finite set {s_1, ..., s_M}. Examples: an alphabet, {0,1}, experimental stimuli.
- S is probabilistic, drawn from p(s).
- Entropy, in bits:
  H(S) = -\sum_m p(s_m) \log_2 p(s_m)
- H(S) \ge 0.
- H(S) = 0 when p(s_m) = 1 for some m.
- H(S) = \log_2 M if p(s_m) = 1/M for all m; the uniform distribution maximizes entropy.
- Entropy depends only on the probabilities, not on the signals themselves.

Source coding
- Entropy measures information/surprise.
- Source coding theorem: the smallest number of bits B required to represent a discrete-valued stimulus S without decoding errors satisfies
  H(S) \le B \le H(S) + 1
- Bits give a uniform currency for talking about different types of signals.

Signal structure
- Statistical structure reduces entropy. Uniformly distributed signals cannot be compressed much, so they can be viewed as being most efficient.
- Joint entropy:
  H(X,Y) = -\sum_x \sum_y p(x,y) \log_2 p(x,y)
  Dependence reduces entropy: H(X,Y) \le H(X) + H(Y), with equality when X and Y are independent.
- Conditional entropy:
  H(Y|X) = -\sum_x \sum_y p(x,y) \log_2 p(y|x) \le H(Y)
  How does knowledge of X affect the entropy of Y? Equality when independent; H(Y|X) = 0 when X = Y.

Continuous signals
- What about signal families with a continuous index, e.g., indexed by s in [0,1]?
- Can define differential entropy:
  h(S) = -\int p(s) \log_2 p(s)\, ds
- It can be negative or positive. Gaussian: h(S) = \frac{1}{2} \log_2(2\pi e \sigma^2).
- The source coding theorem cannot apply: it would take infinitely many bits to represent S exactly, and h(S) is not the limit of H(S_d) for S_d a discrete approximation to S.

Error-free communication
- Can improve by blocking together multiple bits and adding joint error correction.
- How do we correct all errors? It seems like we have to drive our rate to 0. Shannon says no.

Characterizing channels
- Mutual information:
  I(X;Y) = \int\!\!\int p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}\, dx\, dy = h(Y) - h(Y|X) = h(X) - h(X|Y)
  The discrete version, with sums, is the same.
- It measures dependence between X and Y. Always nonnegative: I(X;Y) \ge 0, with equality when independent; maximal when X = Y (equal to H(X) in the discrete case).
- It depends on both the channel and the input.

Channel coding
- Remove the dependence on the input by optimizing. Channel capacity (bits/transmission):
  C = \max_{p(x)} I(X;Y)
- Noisy channel coding theorem: if the data rate R produced by a discrete-valued source is less than C, there exists a channel coding scheme which corrects essentially all errors. Furthermore, if R > C, no scheme can prevent all errors.
- The proof is not constructive.
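The discrete quantities above (entropy, mutual information, capacity) are easy to check numerically. Below is a minimal Python sketch, with helper names of my own choosing, that verifies the uniform-distribution entropy bound and the binary symmetric channel capacity C = 1 - H_b(\epsilon), a standard consequence of C = \max_{p(x)} I(X;Y).

```python
import numpy as np

def entropy_bits(p):
    """H(S) = -sum_m p(s_m) log2 p(s_m), with 0 log 0 taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_info_bits(joint):
    """I(X;Y) = sum_xy p(x,y) log2 [p(x,y) / (p(x) p(y))] from a joint table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x), column vector
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y), row vector
    mask = joint > 0                        # skip zero-probability cells
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask])))

# Uniform distribution maximizes entropy: H = log2 M.
assert abs(entropy_bits([0.25] * 4) - 2.0) < 1e-12
assert entropy_bits([0.7, 0.1, 0.1, 0.1]) < 2.0   # structure reduces entropy

# Binary symmetric channel with crossover probability eps, uniform input.
eps = 0.1
joint = 0.5 * np.array([[1 - eps, eps],
                        [eps, 1 - eps]])
# The uniform input achieves capacity for the BSC: C = 1 - H_b(eps).
C = 1.0 - entropy_bits([eps, 1 - eps])
assert abs(mutual_info_bits(joint) - C) < 1e-12
```

For the BSC the uniform input is optimal by symmetry, so I(X;Y) evaluated at the uniform input equals the capacity; for a general channel the maximization over p(x) must be carried out explicitly (e.g., with the Blahut-Arimoto algorithm).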