You are on page 1of 14
‘US 2014000: cu») United States cz) Patent Application Publication — co) Pub. No.: US 2014/0003710 A1 oy ow 2 on) SEOW et al. UNSUPERVISED LEARNING OF FEATURE ANOMALIES FOR A VIDEO SURVEILLANCE SYSTEM Applicant: BEHAVIORAL RECOGNITION SYSTEMS, Ine, Houston, TX (US) Inventors: Ming-Jung SEOW, Houston, TX (US); Wesley Kenneth COBB, The Woodlands, TX (US) Appl. No 13920,404 Filed: Jun, 27, 2013 Related US. Application Data Provisional application No. 61/666,359, filed on Jun. 2012. Publication Classification Int. € Gu0K 900 (2006.01) OA 371 (43) Pub, Date Jan, 2, 2014 606K 9/0071 (2013.01) ‘382/159 on ABSTRACT Techniques are disclosed for analyzing a scene depicted in an ‘input steam of video frames captured by a vdeo camera In fone embodiment, e, a machine learning engine may include statistical engines for yenerating topological feature ‘maps based on observations an a detection module for etocting feature anomalies. The statistical engines may include adaptive resonance theory (ART) networks whieh cluster observed position-feuture characteristics. The statis- tical engines may further reinforce. decay. merge, and remove clusters. Thedetection module may calculate rarenes valve relative to recurring observations and data in the ART net- works, Further, the sensitivity of detection may’ be adjusted according to the relative importanee of recently observed ‘anomalies ‘CONPUTER SYSTEM 5 _ 120 128 cru ‘STORAGE INPUT? | output toh DEVICES 198 40 MACHINELEARNING |_| al MEMORY COMPUTER VISION | 18 ENGINE ENGINE 00 Jan. 2,2014 Sheet 1 of 4 US 2014/0003710 AI Patent Application Publication aut S3OIA30 ANdLNO JLAANI ay b ‘Old oet 3NIONS SNINYW31 ANIHOWN, 3NIONS NOISIA YaLNAWOD SOWSIN ov set bee BOVYOLS Ado set ozt WALSAS H3LNdWOO on gounos ANNI O3GIA ( sot US 2014/0003710 AI Jan. 2,2014 Sheet 2 of 4 Patent Application Publication ~ [082 aTNGOW Ho19aLaa ANION ONINWY31 ANIHOWN osz ez _. kwon ( ae 3NION3 Twnd30u3d wez SHOWS 5 042 rWwousivis { dig0sid3 NOWWOrddY (g92_UBAV1 Z0NANDAS} a —H| 5 yet ton z Old ANaO | [08% waAvIUaISNTO|}=— fndLno —— eee — nd [sez _ walsissvi0 32 S1313000 Suna ONIN La ov @___sn@ 1NaAa SdO19aN aUNLWad oomsoimenn 1 __ 30mm IKSOLORPVEL 193°O SNOLLWOISISSYTO annowexove — ore asia ( are os AN3NOdNOD [FI 5 ~atg | nly p~ seh WOSS300¥d t — AXALNOO INaNOdWOO -—-— | oz\~S a) walsungor [td \Y coz om e — ANaNOdNOD | — aN ( ( 7 = 3ounos s . LN3NOaNOO 3NIONA S NL Og08 LndNi O3GIA NOISIA a waldo Patent Application Publication _Jan. 2, 2014. Sheet 3 of 4 US 2014/0003710 AI 300 START) x RECEIVE KINEMATIC AND FEATURE DATA FOR OBJECT |~ 310 IN VIDEO FRAME t —~ 320 FOR ONE OR MORE FEATURES: }-+——_ ee ee DETERMINE POSITION-FEATURE VECTOR ee ee RETRIEVE FEATURE MAP FOR THE FEATURE eee MODIFY FEATURE MAP BASED ON FREQUENCY a OF RECENTLY OBSERVED ANOMALIES — “en” DETERMINE DISTANCES TO CLUSTERS OF FEATURE MAP | ~~" ‘SELECT CLUSTER(S) SURROUNDING THE 365 POSITION-FEATURE VECTOR AND SELECT A CLUSTER HAVING THE SMALLEST DISTANCE DETERMINE RARENESS VALUE BASED ON PSEUDO-MAHALONOBIS DISTANCE TO THE CLOSEST |~ 370 CLUSTER AND THE STATISTICAL RELEVANCE OF ‘SURROUNDING CLUSTERS 390 REPORTING Bory CRITERION FEATURE ANOMALY UPDATE FEATURE MAP USING FEATURE-SPECIFIC STATISTICAL |-~————! ENGINE FEATURES TO ANALYZE? FIG. 3 Patent Application Publication Jan. 2,2014 Sheet 4 of 4 US 2014/0003710 AI fo 400 4107 n opt ato, i ra L ~ a \ sia] rn > | 4105 ym 2 us)" FIG. 4 US 2014/0003710 AI UNSUPERVISED LEARNING OF FEATURE, ANOMALIES FOR A VIDEO SURVEILLANCE ‘SYSTEM, (CROSS-REFERENCE TO RELATED "APPLICATIONS, [0001] This application claims priovty o U.S. provisional ‘application having Ser. No, 61/666,389, filed on Jun. 29, 2012, which is hereby incorporated by reference in its ‘entirety. BACKGROUND OF THE INVENTION 10002] 1, Field ofthe Invention [0003] | Fmbodiments of the invention provide techniques oranalyzing a sequence of video frames. More particulary, to analyzing and learming behavior based on streaming video data, inclading unsupervised leaning of feat ancmales 0004} 2. Description of the Related An 10005] Some curently available video surveillance sys- tems provide simple object recognition capabilities. Por ‘example, a video surveillance system may be configured t0 ‘classify a proup of pixels (referred to as a“blob”) ina given frame as being particular object (©... person or vehicle) ‘Once identified, a “blob” may be tracked from frame-o- same in order to follow the “blob moving through the scene ‘overtime, e.g. person walking across the field of vision of a vidoo surveillance camera. Further, such systems may be ‘configured to determine when an object has engaged in cer- in predefined behaviors. For example, the system may include definitions used to recognize the occurrence of a numberof predefined evens, e- the system may evaluate the appearance of an object classified as depicting a car (a vehicle-appear event) coming to a stop over a number of James (a vehiclestop even!) Thereafter, a new foreground ‘object may appear and be classified as 2 person (a person- ‘appear event) and the person then walks our of frame (3 person-isappene event). Further, the system may be able to Fecognize the combination of the firs two events asa "parke ing-ovent” 10006} However, such surveillance systems typically are luable 1 idently or update objects, events, behaviors, oF pattems (or classify such objects, events, behaviors, et. as being normal or anomalous) by observing what happens in the scene aver time; instead, such systems rely on static pattems defined in advance. For example, sich surveillance ‘ystems are unable to, without relying on pre-defined maps oF pattems, distinguish feature anomalies (eg usustal shini- fess ata particular location) ina scene from ordinary Features (eg, ondinary shininessat the same location) in the seene and report instances of feature anomalies tow user SUMMARY OF THE INVENTION 10007] | One embodiment provides method foranalyzing a scene observes! by a video camera. The method! includes receiving kinematic and feature data for an object inthe scene and determining, via one or more processors, 2 position- Feature vector from the received data, the position-festure vector representing a loetion and one or more feature values ‘atthe location, The method further includes recieving a fea ture map comesponding to the positon-feature vector, wherein the feature map includes one or more positon-fea- ture clusters. In addition, the method inchades determining & rareness value forthe objet based atleast on the posit Jan. 2, 2014 ‘eature vector and te Feature map, and reporting the objectas ‘anomalous if the rareness value meets given criteria, [0008] Other embodiments inckude a computer-readable ‘medium that includes instructions that enable a processing ‘unit implement one or more embodiments ofthe disclosed rmthod at well asa system configured to implement one oF ‘more embodiments of the disclosed method. BRIEF DESCRIPTION OP THE DRAWINGS. {0009} So that she manner in which the above recited fea tures, advantages, and cbjects ofthe present invention are attained and can be understood in detail, a more particular description ofthe invention, billy summarized above, may be had by reference to the embodiment illsteated inthe appended drawings {0010} Isto be noted however, that the appended dra ‘strate only typical embodiments ofthis vention sad anetherefore nto be considered limiting ofits scope othe invention mey admit ober equally effective embodiment {0011} FIG. 1 iystates componeats of a video analysis System, sccording to one embodiment oF te invention. {0012] FIG. 2 fare ilustaes component of the video analysis sytem shown in 1G. 1, aecoding to one embod ‘eal othe invention. {0013} FIG. ilutates a method for detecting an report fg Teature anomalies, according to one embodiment othe [0014] FIG. 4lsratesanexample featuremap according thane embovimeat of the invetion DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS, [0015] Embodiments of the present invention provide a ‘method and a system for analyzing and Tearing behavior based on an acquired stream of video frames. A machine- Jeaming video analyties system may be configured to use & computer vision engine to observea Scene, generate informa- ‘ion streams of observed activity, and wo pass the steams 0 a machine Teaming engine, In tur, the machine leaaing ‘engine may’ engage in an undirected and unsupervised learn- ing approach 10 leam patterns regarding the object behaviors inthat scene. Thereafter, whea unexpected (i.e, abnormal or ‘unusual) behavior is observed, alerts may be generated, [0016] In one embodiment, ez. the machine learning engine may include statistical engines for generating top logical feature maps based on observations, 35 discussed above, and a detection module for detecting feature anoma- Ties. The detection module may be configured to caleulate rarenessvaluos for abserved foreground objects using featare ‘maps generated by statistical engines. rareness value may indicate how anomalous or unusual an foreground object is siven the object's feature(s) and locations), as opposed to, ‘eg, the object’s kinematic properties. In one embodiment, the rareness value may be determined based at least on 3 pseudo Mahalanobis measurement of distance of postion- Teatue vector of the foreground object toa clister associated ‘witha smallest mean-squared error between the eluster and ‘the position-featue vector, andon statistical relevance of any clusters associated with menn-sguared rors Jess than a ‘eshold. Further, the sensitivity of detection may be adjusted according to the relative importance of recently jbserved anomalies. In particular, the detection module may US 2014/0003710 AI become less sensitive (© anomalies which have receatly ‘occurred frequently and vie versa, 10017] Inthe following, reference is made to embodiments ‘of the imention. However it should be understood thatthe invention isnot limited to any specifically described embodi- ‘meat. Instead, any combination ofthe following features and ‘elements, whether related to diferent embodimiens or no, is ‘contemplated to implement and pratice the invention, Fur- thermore, in various embodiments the invention provides numerous advantages over the prior art. However, although ‘emboviiments ofthe invention may achieve advantages over ‘other possible solutions and/or over the prior at, whether oF not particular advantage is achieved by given embodiment js no Fimiting ofthe invention, Ths, the Following aspects, ‘atures, embodiments and advantages are merely illustrative ‘and are not considered elements or limitaions of the appended claims except where explicitly recited inaelaim(). esse, reference to “the invention” shall not be constried as. generalization of any inventive subject matter disclosed hhrein aad shall not be considered to be an element obi tation ofthe appended claims except where explicitly recited ‘na claims). 10018} | Oncembodimentof the invention s implemented as program predut fr use with a computer system. The pro- ram(s) of the program product defines lunctions of the ‘embodiments (itloding the methods described herein) and ‘ean be contained on a variety of computer-teadable storage media, Examples of computer-readable storage media include () non-wriable stonge media (eg. read-only memory deviees within a computer such as CD-ROM or DVD-ROM disks readable by an optical media drive) on ‘which information permanently stored: (i) writablestorage media (e, floppy disks within a diskette drive or hard-disk dive) on which alteable information is stored. Such com- pulerreadable storage media, when carrying computer-read- ble instructions that direet the functions ofthe present inven- tion, are embodiments of the present inveation. Other ‘examples media include communications medi through hich information is conveyed to acomputer, suchas through ‘computer tclephone network, inchading wireless comm- nications networks. 10019] In general, the routines executed to implement the ‘embodiments of the invention may be part of an operating system ora specific aplication, component, program, mode tule, object, or sequence of instructions. The computer pro- fram of the present invention is comprisd typically of @ ‘multitude ofinstactions that willbe teanslated by the native ‘compute into « machine-readable format and hence execut- able instructions. Also, programs are comprised of variables, ‘ancl data structures tat ether reside locally tothe program or are found in memory of on storage devices. In addition, ‘various programs described herein may be identified based "upon the application for which they are implemented in 2 specific embodiment ofthe invention. However, itshould be appreciated that any particular program nomenclature that {allows is used merely for convenience, and thus the inven- tion shoud not be imited to use solely in any specifi appli ‘ation identified and/or implied by such nomenelatre [0020] FIG. 1 illustrates components of a video analysis and. behavior-recognition system 109, according to one ‘embodiment ofthe presen invention, As shown, the behav= ‘or-recognition system 100 includes video input source 108, network 110, a computer system 118, and input and output devices 118 (eg, a monitor, a keyboard, mouse printer, Jan. 2, 2014 ‘ecorded by the video input 108 t the computer sy Ilustratively, dhe computer system 115 includes a CPU 120, storage 125 (e.g..a disk deve, optical disk drive. floppy disk rive and the like), and a memory 130 which includes both a ‘computer vision engine 138 and mochine-leaming engine 140. As described in greater detail befow, the computer vision ‘engine 13Sandthe machinc-leamning engine 140 may provide soltware applications configured to analyze a sequence of video frames provided by the video input 108. [021] Network 110 receives video data (eg, video stream (3), ¥ideo images or the ike) from the video input source 105. The video input source 108 may be a video camera, a VCI DVR, DVD, computer, web-cam device, or the like, For example, he video input source 108 may be a stationary Video camera med ata certain area (ea subway station, ‘parking lot, «building entry/exit, et), which records ‘events taking place therein, Generally, the area visible to the ‘camera is reeired to a the “scene.” The video inpt source 105 may be configured to record the seene as a sequence of individual video frames at a specified frame-mte (eg. 24 {ames per second), where each frame includes a fxed aum- ber of pixels (e., 320x240). Fach pixel of each frame may specify a color value (an RGB Value) or grayseale value (ea, a radiance value between 0-255). Punter, the video stream may be formatted using known formats ineluding -MPEG2, MIPEG, MPEG, 1.263, 1.264, and the like. [0022] As noted above, the computer vision engine 135 ‘may be configured to analyze this a information to identify active objects in the video steam, identify a variety of appear fance and Kinematic Features used by & machine learaing engine 140 to derive object classifications, derive a variety of metadata revarding the actions and interaetions of such ‘objects, and supply tis information tothe machine-learaing engine 140. And in tum, the machine-leaming engine 140 ‘ay be configured to evaluat, observe, learn and remember elas regarding events (and types of events) that tanspire ‘within the seene overtime, [0023] In one embodiment, the machine-lesning engine 140 receives the video frames and the data generated by the computer vision engine 138. The machine-Ieaming engine 140 may be configured to analyze the received dat, lister ‘objects having similar visual andor kinematie feature, build semantic representations of events depicted in the video frames, Over time, the machine leaning engine 140 leas ‘expected pattems of behavior for objets that map toa given cluster, Thus, overtime, the machine leaning engine leaas Irom these observed patterns to identify normal uad/or abnor ‘mal events, That is, rather than having patterns, objocls, object types, or activites defined in advance, the machine Jeaming engine 140 builds its own model of what diferent object types have been observed (e-., based on clusters of kinematic and or appearance Features) as well as 8 model of expected behavior fora given object type. [0024] In genera, the computer vision engine 138 and the nachine-leaming engine 140 both process video data in eal= ‘ime. However, time scales for processing information by the ‘computer vision engine 138 and the machine-leaning engine 140 may difer. For example, in one embodiment, the com puter vision engine 135 processes the received video data ‘rame-by-frame, while the machine-learing engine 140 pro- cesses data every N-frames. In other words, while the com- puter vision engine 135 may analyze cach frame in real-time {o derive a se of appearance and kinematic data related 10 US 2014/0003710 AI ‘objects observed inthe frame, the machine-lerning engine 140s not constrained by the real-time frame rateof the Video input, 10025] Note, however, FIG. 1 illustrates merely one pos- sible arrangement of the behavior-recognition system 100. For example, although the video input source 108 is shown ‘connected tothe computer system 115 via the network 110, the network 110 is not always present or needed (e-., the video input source 108 may be directly connected tthe ‘computer system 118). Further, various components and moddles of the behaviorsecognition system 100 may be implemented in other systems. For example, in one embodi- ment, the computer vision engine 135 maybe implemented as ‘par of a video input device (ez. a8 a finmware component ‘wired directly into video camera) In such acase, the output ‘ofthe video camera may be provided to the machine-learuing ‘engine 140 for analysis. Smiley, the out from the com= puler vision engine 136 and machine-learaing engine 140 tay be supplied over computer network 110 to ater come puter systems, For example, the computer vision engine 138 ‘nd machine-lesning engine 140 may be installedon server system and configured to process video fy sources (ce. rom multiple cameras). Ins application 250 running on anther computer system may roguest (or receive) the results of over network 110. 10026) FIG. 2 further illustrates components of the com- puler vision engine 138 and the machine leaning engine 140 firsillustrated in FIG. 1, acconting to one embodiment ofthe invention, As shown, ‘the computer vision engine 138 des. a background foreground (BG/FG) component 208, ‘tracker component 210, an esimatorideniier component 215, and a context processor component 220. Collectively, the components 208, 210, 215, and 220 provide a pipeline for processing an incoming Sequence of video frames supplied by the video input source 108 (indicated by the solid arows Jinking thecomponents). Additionally, theoutpat of onecom- ponent may be provided to multiple stages of the component Pipeline (as indicated by the dashed ares) as well aso the ‘machine-learing engine 140. In one embodiment, the com- ponents 208, 210, 218, and 220 may each provide a software ‘module configured to provide the functions described herein. ‘Ofcourse one of ordinary skill in the art will recognize that the componeats 208, 210, 215, and 220 may be combined (oF further subdivided) t0 suit the nocd ofa particular ease and further that addtional components may be added (or some may be removed) 10027] In one embodiment, the BG/FG component 208 ‘ay be configured to separate cach frame of video provided by the video input source 105 into a statie part (the scene background) and collection of volatile prs (the scene orepround.) The frame itself may neludeatwo-timensional uray of pixel values for multiple chamels (eg, RGB chan- nels fr coor video or grayscale channel or radiance channel for black and white video). In one embodiment, the BG/FG ‘component 208 my model background states for each pixel using an adaptive resonance theory (ART) network, That i, ‘each pixel may be classified as depicting scene foreground oF scene background using an ART network modeling a given pixel. Of course, other approsches to distinguish between Scene foreground and background may be use 10028] Additionally, the BG/FG component 208 may be ‘configured to generate mask used to identity which pixels of the scene are classified as depicting foreground and, con- versely, which pinels are classified as depicting scene back= Jan. 2, 2014 round. The BG/FG component 205 then identifies regions of the seene that contain a portion of scene foreground (relered toasa foreground “blob” or “patch” and supplies this infor- ‘ation to subsequent stages of the pipeline. Additionally. pixels classified as depicting scene background maybe used to generate a background image modeling the seen. [0029] ‘The tracker component 210 may receive the fore- around patches produced by the BG/FG component 208 and ‘genemte computational models forthe patches. The tracker ‘component 210 may be configured to use this information, ‘and cach successive frame of raw-video, to attempt to trick the motion ofan object depicted bya given foreground patch as it moves about the scene. That i, the ticker compone: 210 provides continuity to other elements of the system by (cocking @ given object from frame-to-frame. [0030] The estimatoeidentitier component 218 may receive the output of the tracker component 210 (and the BF/FG component 208) and identity a variety of Kinematic andlor appearance feats ofa foreground object. Appeat ance features identified may include, but are not limited 10, area derivative (i.e, a change in bounding box size of & tracked object), shadow (e.g, a percentage of the foreground object covered by shadow pixels), shininess (ei, based on specular reflection ofan objec) internal energy (eg. based ‘on how dierent an object appears in consecutive frames asa resultof translations andor rotations), area e.g, an area ofan object, in pixels, divided hy an area ofits Bounding box). entropy e-., based on thecolorfulness of an object), gradient histogram (eg. basedon bow horizontalertical an objector its edges are), color variation (e., based on the chromatic appearance of the objec) ad hve ofthe objet, In genera such appearance features may characterize the appearance oF the foreground objec, as opposed to is kinematics. In some tembodinients, a numberof features may be used to provide diversity, and features may be chosen that are reliable, as ‘opposed to noisy. Further, the appearance features may be sed 0, eg, Jeam the appearance properties ofthe scene and identify feature anomalies given the earned appearance rop- erties, as discussed in greater detail below. [0031] The context processor component 220 may receive ‘the oulput From other stages ofthe pipeline (i.e. the tracked objects, the background and foreground models, and the eslts ofthe estimator/idemfier omponest 218). Using this information, the context processor 220 may be configured t0 zenemitea sream of context events regarding objects tracked (by wacker component 210) and evaluated (by estimator iden- tier component 218). For example, the context processor ‘component 220 may package a stream of micro-festure vee- torsind kinematie observations ofan object and ouipu this to themachine-eamingengine 140, ..,atarateof' tz. Inone embodiment, the context events are packaged asa trajectory. ‘Assad herin, a trajectory generally refers toa vector pack ‘aging the kinematic data of a particular foreground object in stecessive frames or samples. Fac element in the trajectory represents the Kinematic data captured for that object at particular point in time. Iypicaly, a complete tajector Jncludes the kinematie data abtained when an objet is fi ‘observed in a frame of video along with each successive ‘observation ofthat object up to when it leaves the scene (oF ‘becomes stationary tothe point of dissolving into the frame background). Accordingly, assuming computer vision engine 135s operating a rate ofS Ha trajectory fran abject is ‘updated every 200 milliseconds, until complete. US 2014/0003710 AI 10032] | Thecomputer vision engine 13Smay take the output from the components 208, 210, 218, and 220 describing the ‘motions and aetions of the tracked objects i the seene and supply this information tothe machine learning engine 140. Ilustatively, the machine-learaing engine 140 includes long-term memory 228, a perceptual memory 230, an epi- sogie memory 235, a workspace 240, codelets 248, a micro- eaure classifier 285, a cluster layer 260 and a sequence layer 265. Adltionaly, the machine-Ieaening engine 140 includes ‘client application 250, allowing the user to interact with the Video surveillance system 100 using a graphical user inter ‘ice. Purter sil, the machine-Leaming engine 140 includes fn event bus 222. In one embodiment, the components ofthe ‘computer vision engine 138 and mochine-leaming engine 140 onput data to the event bus 222. At the same time, the ‘components of the machine-leamning engine 140 may also subseribe 0 receive different event streams from the event bus 222, For example, the micro-festre classifier 258 nay subseribe to receive the miero-feature vectors output rom the ‘computer vision engine 135, 10033] | Generaly. the workspace 240 provides a computa- tional engine for the machine-learning engine 140. For ‘example, the workspace 240 may be conligured io copy infor- tation from the perceptual memory 230, retrieve relevant ‘memories from the episode memory 235 and the long-tem memory 225, select which codeleis 245 to execute. Fach ‘odelet 245 may bea software program configured to evaluate different sequences of events and to determine how one sequence may follow (or otherwise relate to) anotber (ee. 8 finite state machine). More generally, each codelet my pro- vide a software module configured o detect interesting pat- tems from the streams of data fed to the machine-learning ‘engine. In tur, the codelet 245 may ereate, retrieve, rein- {oree, or modify memories inthe episodic memory 238 and the long-term memory 228. By repeatedly scheduling code- Jets 248 for exccution, copying memories and percepts ‘oilrom the workspace 249, the machine-learning engine 140 performs a cognitive eyele used to observe, and learn, about patter of bebavior tht occur within the seen. 10034] Inone emboctiment, the perceptual memory 230, the ‘episodic memory 238, and the long-lerm memory 228 ure used (0 identify patterns of behavior, evaluate events that transpire in the seene, and encode and store observations. ‘Geoerally, the perceptual memory 230 receives the output of the computer vision engine 135 (eg, the context event stream). The episodic memory 235 stores data representing ‘observed events with details related lo a particular episode, ‘eg. information describing time and space details related (0 fan event. That is, the episodic memory 288 may encode specific details of a panicular event, i, “what and where” something occurred within a scene, sich as a particular vehicle (cat A) moved fo location believed to be a parking space (parking space 5) at 9:43 AM, 10035] In contrast, the long-term memory 228 may store data generalizing events abserved inthe scene. To continve with the example ofa vehicle parking, the long-term memory 225may encode information capturing observations and gen- ‘eralizations learied by an analysis of the behavior of objets inthe scene suchas “vehicles in certain arets ofthescene tend toe in motion,” “vehicles tend to stop in certain ates ofthe scene,” et, Thus the long-term memory 228 stores observa- tions about what happens within a scene with much of the particular episodic details stripped away. In this way. when 2 few event cveurs, memories from the episodic memory 238 Jan. 2, 2014 and the long-term memory 225 may be used t0 relate and ‘understand a cureent event, i, the dew event may be com- pared with post experience, leading 10 both reinforcement, cay, and adjustments tothe information stored in the long- term memory 225, overtime. In particular embodiment, the Jong-term memory 225 may be implemented as an ART net- work and a sparse-steibuted memory data stractre [0036] The micro-feature classifier 258 may sehedule « delet 248 to evaluate the miiero-feature vectors ouiput by the computer vision engine 138, As noted, the computer vision engine 138 may trck objects frame-o- frame and wen- erate micro-feature vectors for each foreground object at a rate of, eg, 5 Hz. In one embodiment, the miro-festure classifier 255 may be configured to create clusters from this Stream of miero-feature vectors. For example, cach micro- Teature veetor may be supplied to an inpot layer of the ART petwork (ora combination of a self organizing map (SOM) fd ART network used to cluster nodes in the SOM). ln response, the ART network maps the micro-feature vector to fa chister in the ART network and updates that chistr (oF creates new cluster if the input mieno-feature vector is sficently dissimilar tothe existing clusters). Each cluster is presumed to represent a distinct object Iype, and objects sharing similar miero-feature veetors (as determined using the choieeand vigilance parametersof be ART network) my ‘map tothe same cluster [0037] For example, the mier-features associated with observations of maay’ different vehicles may’ be similar ‘enough to map fo the same cluster (or soup of clusters) At the same time, observations of maay different people may snap to a different cluster (or group of clusters) than the vehicles cluster. Thus, each distinct cluster inthe art network generally representsa distinct ype of object acting within the Seene. And as new objects enter the scene, new objet types ‘may emerge in the ART network, [0038] Importantly, however, this approach does not requir the diferent object type classifications to he defined ‘navance; instead object types emerge overtime as distinct clusters inthe ART network, In one embodiment, the micro- ‘eature classifier 285 may assign an object type identler 10 cach cluster, providing a different objecttype foreach cluster inthe ART network [0039] In an altemative embodiment, rather than generate clusters from the miero-Fetures vector direlly, the miero- {eae classifier 255 may supply the micro-feare vectors to a selE-organizing map structure (SOM). In such a ease, the ART network may cluster nodes of the SOM-—and assign an ‘objecttype identifier each cluster. Insucha case, eaehSOM ‘ode mopping tothe same cluster is presumed to represent aa instanceof a common typeof object. [0040] As shown, the machine-learing engine 140 also includes a cluster layer 260 and a sequence layer 268. The cluster layer 260 may he configured to generat custers from the trajectories of objects classified by the miero-eature clas- siflor 255 as being an instance of a common object type. la ‘one embodiment, the luster layer 260 uses a combination of ‘aselFonganizing map (SOM) and an ART network to chaster thekinematie data inthe trajectories. Once the trajectories are clustered, the saquence layer 265 may be configured to gen- erate sequences encoding the observed patterns of behavior represented hy the trajectories, And once generated, the sequence layer may identify segments within a sequen US 2014/0003710 AI using a voting experts technique. Further, the sequence layer 265 may be conligured to identify anomalous segments and sequences [0041] As shown, the machine-leaming engine 140 furhee Jnchides statistical engines 270 and a detector module 280. ach statistical engine may be feaure-specti (nique to 1 given feature). In addition, each statistical engine may include an ART network which penerates and modifies elus- s based on observations. In such an ART network, each ‘luster may be characterized by a mean anda variance froma. prototype input representing the cluster. The protorype is enerted first asa eapy of the input Veetor used 0 create a ew cluster Subsequently as new inp vectors are mapped the cluster, the prototype input (and the mean and variance for the cluster) may be updated (ic, modified) by the ART net- ‘work using the new input vector. Initially, the ART network ‘ay be pitted to mature over a period of ime (es, days), and anomaly alerts pertaining to the ART network may be suppressed during this period [0042] In one embodiment, the mean and variance of elus- ters may be the actual mean and variance of inp vectors that have mapped ta the chistes. For example, when an input vectormaps to chisar, the cluster's mean X, may'be updated a follows ‘where n isthe number of feature vectors that have mapped to the cluster and, isthe previows mean (note, this differs from ‘ypieal ART networks, in which the mean is updated as e118, (-clnpat where esa constant in[0.1)). Funer the cluster’ variance.” may be update as follows: Where ois the previous variance 10043) "In another embodiment, the statistical engine may ‘weight the importance ofeach luster by statistical relevance. For example, the statistical engine may keep counts of how ‘many input vectors map to each cluster based on vigilance test. Insuch acase, clusters associated with higher ‘counts may be considered more relevant and viee versa. [0044] _Ingeneral, each ART network cluster may represent type of input vector, discussed in greater detail below, and ‘input vectors which map toa cluster may vary somewhat feature and/or location value(s). Fer example, car objects having approximately 4 given shininess feature value and appearing at approximately a given location inthe seene may map to the same ART network cluster. However. the cat ‘objects may aot have exaely the same shininess andor loca tion values. Such variations may affeet the mean andlor & variance ofthe cluster (e ., greater variation in featureand/or Jocation values may result in a greater cluster varianee). {0045} In one embodiment, the inpst and prototype input vectors are of the form (x, y H, where x and y indicate a location af a given abject (eg, acentrod ofthe object) and F js. a value of a feature of the object, as determined by the cstimatr/identifier component 215. Her, the location (x,y) may represent the centroid (ie, center of mass) ofthe fore- Jan. 2, 2014 sound object as it appears inthe video frame. And the value T may be & value of a feature of the object, which in one embodiment my lie in the ange [0, 1], where 0 may gener- ally represent the absence ofa feature ora feature character istic (ea, the absence of shinines, the abseace of vertical ‘edges te), | may represen the presence ofthe feature, and (0.5 may represent uncertainty about the presence ofthe fea- ture [0046] In another embodiment, the input and prototype ‘input vectors may include more than one feature value and be ‘of the form(s, Ina further embodiment, time(s) may also be ‘nclded in the input vector. In yet nother embodiment, the feature(s) used may include one or more of the following: area derivative, shadow, shininess, internal energy, area, entropy, gradient histogram, color variation, and hne [0047] In addition, each statistical engine 270 may be con- Figured w reinforce, decay, merge, and remove clusters gen- erated by the ART network to improve the robustness and ‘quality ofleaming. Forexample, two clastrs may be merged ifyasedon their means and variances, the clusters averlap in 3 (ie. (x,y. f)Jorhigherdimeasional space. Inoncembodi- ‘ment, the overlapping) may be required to reach a given threshold, which may be implemented by introducing a con- stant factor jc} and requiring thatthe mean of one cluster i be within Bo, *ofthemean of te other cluster u, (Le, within times the variance of the other cluster) forthe clusters to be merged. Reinforcement, decay, menging. and removing of clusters may also be performed according to approaches dis- cussed in US. Pat. No. 8,167,430, hereby incorporated by reference in its entirety [0048] In general, earning by the statistical engines 270 may be aetivity-driven, For example, luster reinforcement, decay, ete, may occur less quickly if fewer objects are ‘observed over a period of time, and vice versa, Further, the Statistical engines 270 may attempt to avoid over-leaaing ‘when there isan abundance of setvity by generalizing more in such eases [049] As discussed, the statistical engines 270 engage ‘unsupervised learning of the appearances and locations of objects ina scene to generate topological feature maps based ‘on ART network clusters. Each feature map may include one ‘or more position-feamre clusters deseribed by their moans And variances and ereatedimodified by the statistical engine, ‘and be un-biased inthe sense that environmental and tech logical influences arcleamed such that repetitive mistakes are orgiven (because they become normal) For example, tracking of an abject which produces incorrect (x,y) coordi- rates, weather changes, ete, may cause flse-positives in a ‘eaditional video analytics system where patterns and maps are manually defined. By contrast, the approach discussed herein Jeams such objectracking mistakes, weather changes, et, such tht they affect the video analytes system less. [0050] The detector module 280 assy be configured 1 elect and report feature anomalies, a discussed in greater ‘etal below. Thats, the detector module 230 may determine ‘whether one of more feature properies ata given location are ‘unusual or anomalous relative to previously observed feature properties and theit locations. If @ feature anomaly is etevted, the detector module may further repor the anomaly by, for example, issuing an alert to-a user interface of the GUvoutput cleat application 250, US 2014/0003710 AI Detweting and Reporting Feature Anomalies in a Machine-Leaming Video Analytics System 10051] _Asnoted above, a machine-learing video analties system may be configured to uses computer vision engine © ‘observe a scene, generate information streams of observed, ‘activity, and to pass the streams to a machine learning engine. Tinta, the machine leaning engine may engage in an andi rected and unsupervised learning approach to learn patterns regarding the object hhaviorsin hat scene. Thereafter, when ‘unexpected (Le, abnormal or unusual) behavior is observed, alerts may be generated, 10082] In one embodiment, eg, the machine learning ‘engine may include statistical engines for penerating top0- logical feature maps based on observations, as discussed above, and a detection module for detecting feature anon lies. The detection module may be configured to calulate rareness values for observed foreground objects using feature ‘maps generated by statistical engines. A arenes value may indicate how anomalous of unusual an foreground abject is sven the object's feature(s) and location(s), as opposed to ‘eg. the objet’s kinematic properties. In one embodiment, the rareness value may be detenmined based at least on 3 pseudo: Malalanobis messurement of distance of a postion- Tere vector of the foreground object tow cluster associated ‘with smallest mean-squared error between the chister and the position-feature vector, and on statistical relevance of any clusters associated with mean-squared errors less than & threshold. Funber, the sensitivity of detection may be adjusted according (0 the relative importance of recently ‘observed anomalies, In particular, the detection module may become less sensitive t© anomalies which have recently ‘occurs requently ad vive versa, 10083] FIG. 3 illustrates a method 300 for detecting and reporting feature anomalies, according to one embodiment. As shown, the method 300 begins at step 310, where a detec tion module receives kinematic and fentre data for a fore- ground object in a video frame. As discussed, 9 computer Vision engine may’analyze the video frame extract foreground ‘objects ina frame and to derive a set of feature and kinematic data related tothe foreground objects. The detection module may coveive such data for processing, {0084} —Atstep 320, the detection module loops through one ‘or more features. In genera, the detection module may not process every festure for which data is received, For example, ‘certain features may be more relevant than other features for ‘detecting anomalies within a scene. In one embodiment, the ‘eatue(s) processed may inelude one or more (ora combina tion) of the following: area derivative, shodow, shininess, ‘internal energy, area, entropy, gradient histogram, and hue. 0085] _Aé step 330, the detction module determines posi tion and feature values fora feature. Tn ane embodiment, the detection module may parse the kinematic and feature dats received at step 310 (0 determine a three-dimensional posi- tion-feature vector (x,y, Which represents the location and ‘eature values for processing. In one embodiment, the loca- tion (x,y) may represent the centroid (i.e. center of mass) oF the foreground object as it appears inthe video frame, while ‘may boa value ofa feature of theobject, which may, eg ie inthe range [0,1], with 0 generally representing the absence ‘ofa feature or a feature characteristic (e.g, the absence of shininess, the absence of vertical edges et.) I may represent the presence ofthe feature, and 5 may rpresent uncertainty about the presence of the Feature. Jan. 2, 2014 [0056] At step 340, the detection module retrieves (ie., Iakes lca copy of) featre map forthe fete. As sincose the etre map my be gered by a statin ine specific othe feature. Further, the ete apy include one or more cher described by thet meas and ‘rinocs and ees modified by the ttl engine [0057] Atstep 350, the detection module modifies the local feamie sap bined on the fequeney of obeved feaire noma In general the tection mole may my th Ice etre map to account forthe Tequeney of ety tered anomals. For example, more eqn observed nomics may bees tmportan and vie wert ects the tor frequently an "anomaly observed theless aoa ious it bovomes In ne embodiment, the detection module say modi the Festure map io account fo the repeney of ‘evel (eg within the pat 1O minutes) observedanomalies by inresing the varines (eg, by moltpying he variance byaconstant of cluster astoitd with retermumber of ‘Remy observed anomalies: Decay may farther be but into this proces sch tha the variance of eister associated with {ever ocnlly observed anomalics are nseased less fr not increased tal). [0058] "At stp 360th detection modal determines i fhnces tween the postonfetire vector anders of the Slit festre map. In one embodiment, the detection ‘dole may cxulte mean-aared rors bsevgente pos thse vector (x, ol the min ale Gy) OF cach of the chester [0059] At step 365, the detection module selects eluster(s) surrounding the position-festure vector and further selects luster closest tothe postion-feature veto. In one embod ‘ment, the detection modole may select cluster(s) associated ‘with mean-square error(s) less than thresholds) asthe sur- rounding eluster(s). Here, the threshold(s) may be cluster specific, and may be, eg, a predefined number for unmerzed clusters and the value of the cluster variance for merged clusters. If one of more calculated distances (eg, mean- squared error values) are less than the threshol(s), then detection module may select the chisters associated With those values. Furher, regardless of the thresholds), the detection module selecis the cluster associated with the smallest mean-squared error (also refered to herein as the “closest cluster"). OF course, persons skilled in the at will recognize that distance measurements other than mean- squared error may instead be used to select closter(s) for purposes of selecting the closest cluster andlor the surround ing eluster(). [0060] In another embodiment, the detection module may turer apply’ viilance test, suchas that discussed in US. Pat. No. 8.167.430, in mapping the postion-feature vector to the closest cluster andior the surrounding eluster(s). The vigi- Jnce fest may generally compute a similarity between an input vector anda cluster prototype (ea. she prototype vector {or the closest cluster) and determine whether the similarity cexcceds a vigilance parameter. If such isthe eas, the input ‘may be mapped to that cluster. If, however, the input does not ‘match any existing cluster under the vigilance test, anew cluster may be reated by storing. a prototype vector similarto US 2014/0003710 AI the input vector. The vigilance parameter as considerable influence on an ART network: higher vigilance produces ‘many, fne-geined clusters, while lower vigilance resulls ia ‘more-general clusters, [0061] Acstep 370, the detection module determines a rare- ness value based on a distance to the closest cluster and the ‘atistical relevance of the selected clusters) having distance (s) less than the threshold(s). In one embodiment, te detec- tion module may determine the raeness valve using the fl- owing forma Raraen t= [ae dpiore (t=) Apart Ya evans of chiens) rings ha esol) posterior is the psetdo-Mabslanobis distance of the pos tion-feature vector to the closest cluster, and «isa constant. normal (Le. not rare) observation and 1 indicating rare ‘observation. As discussed, the statistical relevance of a given ‘luster may’bea count ofthe numberof previous observations ‘which have mapped to that cluster (eg, via a choice andor Vigilance test). Note, Aprior may be 0 if, at step 368, the detection module determines that one ofthe calculated dis tance(s) is fess than the threshold(s) In such a casey the rreness value may nevertheless be high because posteriori js calculated based on the closest cluster, which may not ‘tually be a close cluster inthis case. [0062] Ac step 380, the detection module determines Whether a reporting criterion (or erteria) has been met. Por ‘example, the detection module may determine whether the rareness value exceeds (or is less than, depending on the implementation) a threshold (e.g. niney-ninth percentile threshold), Ifthe detection mode determines that reporting criterion/eriteria have been met, then at step 390 the detection ‘module reports. feature anomaly. For example, the deteti module may issue an alent to user interlace, thereby notify~ ing the user of the anomaly. Further, the detection module may throtle or normalize alerts so ast limit the number of alerts (oralets of each type) that are reported. Further yet, the detection module may modify the factor (eg. the mul ‘ative constant) by Which it increases the Variance of the ‘loses cluster at stop 380 to scootnt forthe extent fate ‘anomaly 10063] I the detection module determines st step $80 that reporting citrioncrteria have not been met, or after repot- ingan anomalous event at step 39, the method 300 continues at tep 392, where a feture-spevifi statistical engine updates the feature map based on the current feature and location data As discussed, the statistical engine may include an ART net~ ‘work which creates modifies clusters based onthe feature and location data. Inaddition, the statistical engine may reinforce, ‘decay, merge, or delete clusters ofthe ART network based on the feature and location data 10068] Ac step 395, the detection module determines ‘whether there are more features to analyze. If there are more ‘Features to analyze, the method 300 returns to step 320 10 process data associated with another feature. Otherwise the ‘method 300 ends thereafter. Jan. 2, 2014 [0065] F1G.4ithstratesanexamplefeanve map being sed to determine feature anomalies, according toan embodiment As shown, the feture maps associated witha scene 40 and ‘cludes chnters 410, , described by their mesos ad tariances 2 The feature map may be generated by 0 Sutistial engine specific 10 one or more Fenures, such as awa, atea derivative, shadow, shininss,iatemai enor, atopy, grident histogram, color variation, or ve. Aso esse the statistical engine may be configured to engage in ‘supervised learning ofthe appearances and locations of objets in scene to generate the fetire maps based on ART elwork clsters, none embodiment, the statistical engine Inay einfoee, day, merge, and remove chustrs genes bythe ARTaciwork. Theslatstical engine may also rack the Statistical significance ofthe chistes based on, e, counts 1, of vectors Which have mapped to those cluster. [0066] "Given an input positonfeture vet (x the detection male may determine a rrenes valde bssed on distance tothe closest closer and statistical relevance of the Selected chstrts) having dstance(s) less hana tresoldT, ‘according equation (4) Ihsratvely,pseodo- Mahalanobis stance totheclster410,, aswell asthe statistical relevance ngof the liste 410, may be used n uation (4), as ehser io, isthe closest cisterto the example inp vector ( ¥F fn the only closer within threshold distance Tt the apt vector (3. {0067} "As discussed, the detection module may report an onus event (eq by sing an alert oa user nterface) ithe rreness vals determin sing equation (4 exceeds (or's below, depending on the implementation) a threshold ‘al. Fut, he statistical engine may update the festre ‘nap which includes clusters 40, , based onthe input po tion-feature vector (x,y a8 discussed above {0068} Although discussed above with respect to video Frames, on-video data nay also beuse Fr example. amp ‘may be sed in evo «ideo feame, and he feature used may clude global positioning systens (GPS) coordinate, radio {resqueney identification (RFID) tas, andthe like {0069} "While the foregoing is dnected vo embodiment of the present invention, other and Firther embodiments of the javention may be devised without departing from the base scope thereof and the scope treo is determined by the lain that ola ‘What seamed is 1Acomputerimplementes| method for analyzing scene, the method comprising ‘receiving kinematic and feature data for an object i the

You might also like