FlexTk File Management Toolkit
http://www.flexense.com
Rule-Based Duplicate Files Detection and Removal
Detection and removal of duplicate files in enterprise environments is significantly more complicated and therefore requires more features and capabilities from a potential solution to be performed effectively and accurately. In general, enterprise storage pools may be divided into two broad categories: organized storage pools and unorganized (personal) storage pools. Organized storage pools are intended for well-defined purposes, and consequently their storage hierarchy and directory structures are strictly defined. Unorganized storage pools are typically used for storing personal user directories and other unmanaged data.

In an enterprise storage environment, duplicate files may be produced by people, applications and operating systems running on personal computers and corporate servers. Operating systems and enterprise applications operate according to their own hidden logic, so touching duplicate files located in operating system directories or application-specific directories may be very dangerous and should be avoided. On the other hand, duplicate files located in directories managed by people may be accurately detected and removed while preserving access to the original files at designated locations.

Detection of duplicate files is a relatively simple process: compare the contents of files that have the same size and you will know exactly which files are identical. The problem begins when you need to search for duplicate files among many thousands or even millions of files in an enterprise environment. Only a few duplicate file finders available today are capable of processing more than 100,000 files, which hardly makes it feasible to process the large numbers of files stored in a typical enterprise storage environment. For more information about the expected performance, refer to the duplicate files search benchmark.
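The basic detection idea described above (group files by size, then compare the contents of files within each group) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FlexTk's implementation: the function name find_duplicate_sets is hypothetical, and the pairwise byte comparison used here is exactly the kind of approach that becomes slow when a directory tree holds very many same-sized files.

```python
import filecmp
import os
from collections import defaultdict


def find_duplicate_sets(root):
    """Return lists of paths whose contents are byte-for-byte identical."""
    # Step 1: group files by size; files of different sizes cannot be identical.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Step 2: inside each size group, compare file contents.
    duplicate_sets = []
    for paths in by_size.values():
        groups = []  # each group holds files with identical content
        for path in sorted(paths):
            for group in groups:
                # shallow=False forces a comparison of the actual bytes.
                if filecmp.cmp(group[0], path, shallow=False):
                    group.append(path)
                    break
            else:
                groups.append([path])
        duplicate_sets.extend(g for g in groups if len(g) > 1)
    return duplicate_sets
```

Calling find_duplicate_sets on a directory returns one list per duplicate file set; files that merely share a size but differ in content end up in separate groups and are not reported.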
The large number of files to be processed in enterprise storage environments makes it impossible to manually review all the detected duplicate file sets and therefore requires some kind of automation that should be capable of:

1. Accurately distinguishing between one or more duplicate files and the original file in each duplicate file set.
2. Making an automatic selection of user-defined duplicate removal actions for each specific duplicate file set according to user-controllable rules and policies.
3. Automatically executing duplicate removal actions in duplicate file sets with accurately detected original files and user-defined removal actions.

Suppose you have two duplicate files located in the home directories of two different users. In this case, it is impossible to make any reliable assumption about which file is the original and which is the duplicate. Yes, it is possible to compare the files’ modification times and assume that the older file is the original, but in this specific situation it is better for a human being to make the final decision.

Another situation is when you have two or more duplicate files with one of them located in an organized storage pool. For example, suppose we have two documents, one located in a user’s home directory and the other located in a designated corporate directory intended for business-related documents. In this case, it may be assumed quite accurately that the file located in the designated directory is the original and the file located in the user’s home directory is a duplicate.

For additional accuracy, the original detection process may be performed using multiple rules such as the file type, location, size, owner, etc. Once we have detected the original file in each duplicate file set, we can assign specific duplicate file removal actions for each specific duplicate file type.
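The location-based reasoning above can be sketched as a small Python function. This is a minimal illustration, not FlexTk's rule engine: the names ORGANIZED_ROOTS, in_organized_pool and detect_original are hypothetical, and the K:\data directory is borrowed from the tutorial's own example inputs. When no single file sits in an organized pool, the sketch returns no original so that a person can decide.

```python
from pathlib import PureWindowsPath

# Hypothetical policy: directories that belong to an organized storage pool.
ORGANIZED_ROOTS = [PureWindowsPath(r"K:\data")]


def in_organized_pool(path):
    """True if the path lies at or below one of the organized roots."""
    p = PureWindowsPath(path)
    return any(root == p or root in p.parents for root in ORGANIZED_ROOTS)


def detect_original(duplicate_set):
    """Return (original, duplicates), or (None, all files) for manual review."""
    organized = [f for f in duplicate_set if in_organized_pool(f)]
    if len(organized) == 1:
        original = organized[0]
        return original, [f for f in duplicate_set if f != original]
    # Zero or several candidates: the location rule alone is not reliable.
    return None, list(duplicate_set)
```

A fuller rule set could also weigh file type, size or owner, as the text suggests; the sketch uses location only to keep the decision easy to follow.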
As an example of such removal actions, duplicate documents may be linked to the original, duplicate reports older than 1 year may be moved to an archive directory, and duplicate media files (music, videos and images) may be deleted.

The FlexTk file management toolkit allows one to search for duplicate files, accurately detect the original file in each duplicate file set and automatically execute user-defined duplicate removal actions (FlexTk Ultimate only). Now let’s define an example duplicate files search command showing how to use all the mentioned features and capabilities. To do that, start FlexTk’s main GUI application, select the user-defined commands tool pane and select the “Add New – Duplicates Search Command” menu item. On the “Inputs” dialog, add all the input directories that should be processed. For this tutorial we have prepared two directories: the first one (K:\home) contains all users’ personal directories and the second one (K:\data) contains an organized directory structure with purpose-specific directories. After adding the input directories, press the “Next” button.
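Before continuing with the dialog, the example removal policy mentioned earlier (link duplicate documents, archive duplicate reports older than one year, delete duplicate media files) can be expressed as a small decision function. The sketch below only selects an action and never touches a file. The extension lists and the name choose_action are assumptions made for illustration; in particular, how a "report" is recognized is site-specific and is not defined by the tutorial.

```python
from datetime import datetime, timedelta
from pathlib import PureWindowsPath

# Hypothetical extension lists; a real policy would be defined per site.
DOCUMENTS = {".doc", ".docx", ".pdf", ".txt"}
REPORTS = {".rpt", ".csv"}
MEDIA = {".mp3", ".avi", ".mp4", ".jpg", ".png"}


def choose_action(path, modified, now):
    """Pick a removal action for one duplicate file (never for the original)."""
    ext = PureWindowsPath(path).suffix.lower()
    if ext in DOCUMENTS:
        return "link"      # replace the duplicate with a link to the original
    if ext in REPORTS and now - modified > timedelta(days=365):
        return "archive"   # move old reports to an archive directory
    if ext in MEDIA:
        return "delete"
    return "review"        # no rule matched: leave the decision to a person
```

Returning "review" as the fallback keeps the policy conservative: a duplicate that no rule covers is left alone rather than removed.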
The “General” tab allows one to control the signature type, the file scanning mode, the maximum number of displayed duplicate file sets and the file scanning filter. The signature type parameter controls the type of the file signature algorithm used to detect duplicate files. The SHA256 algorithm is the most reliable one and it is used by default.

In the sequential file scanning mode, FlexTk scans all input directories one after another in the order in which they were specified on the inputs dialog. This is the most effective way to scan files located on a single physical disk. If you need to process multiple input directories located on multiple physical disks, an enterprise storage system or a disk array (RAID), use the parallel file scanning mode, which delivers better performance when processing a large number of files.

The maximum number of duplicate file sets controls the number of duplicate file sets displayed on the results dialog. After finishing the search process, FlexTk sorts all the detected duplicate file sets by the amount of wasted storage space and displays the top X file sets as specified by this parameter.

The file filter provides the user with the ability to limit the duplicates search process to a specific file type or a custom file set matching the specified file scanning filter. For example, in order to search for duplicate PDF documents only, set the file scanning filter to ‘*.pdf’. This filter will match all files with the extension PDF (PDF documents) and skip all other files.

The “Rules” tab allows one to specify multiple file matching rules that should be used during the duplicates search process. If there are no file matching rules defined in the “Rules” tab, FlexTk will process all file types. Otherwise, FlexTk will process only the files matching the specified rules. For detailed information about how to use file matching rules, refer to the advanced, rule-based search tutorial.
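To make the “General” tab settings more concrete, the following Python sketch imitates three of them: a SHA-256 file signature, a file scanning filter such as ‘*.pdf’, and a limit on the number of reported duplicate file sets, sorted by wasted space. It is an approximation under stated assumptions rather than a description of FlexTk internals: the function names are hypothetical, the directories are scanned sequentially, and wasted space is assumed to mean the file size multiplied by the number of redundant copies.

```python
import fnmatch
import hashlib
import os
from collections import defaultdict


def sha256_signature(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def top_duplicate_sets(input_dirs, file_filter="*", max_sets=10):
    """Return up to max_sets (wasted_bytes, paths) pairs, largest waste first."""
    by_signature = defaultdict(list)
    for root in input_dirs:  # sequential mode: one directory after another
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                # Case-insensitive match, so '*.pdf' also accepts 'X.PDF'.
                if fnmatch.fnmatch(name.lower(), file_filter.lower()):
                    path = os.path.join(dirpath, name)
                    by_signature[sha256_signature(path)].append(path)

    results = []
    for paths in by_signature.values():
        if len(paths) > 1:
            # Assumed definition: every copy beyond the first is wasted space.
            wasted = os.path.getsize(paths[0]) * (len(paths) - 1)
            results.append((wasted, sorted(paths)))
    results.sort(key=lambda item: item[0], reverse=True)
    return results[:max_sets]
```

For instance, top_duplicate_sets([r"K:\home", r"K:\data"], "*.pdf", max_sets=100) would report at most 100 sets of duplicate PDF documents found under the tutorial's two example directories. Hashing every matching file is simple but reads all of the data; grouping by size first would avoid hashing files that cannot have a duplicate.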