Given a list of directory info, including the directory path and all the files (with contents) in that directory, you need to find all the groups of duplicate files in the file system in terms of their paths.
A group of duplicate files consists of at least two files that have exactly the same content.
A single directory info string in the input list has the following format:
"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"
It means there are n files (f1.txt, f2.txt ... fn.txt with content f1_content, f2_content ... fn_content, respectively) in directory root/d1/d2/.../dm. Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.
The output is a list of groups of duplicate file paths. Each group contains all the file paths of the files that have the same content. A file path is a string in the following format:
"directory_path/file_name.txt"
Example 1:
Input: ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"] Output: [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]
Note:
- No order is required for the final output.
- You may assume the directory name, file name, and file content only have letters and digits, and that the length of the file content is in the range [1, 50].
- The number of files given is in the range of [1,20000].
- You may assume no files or directories share the same name in the same directory.
- You may assume each given directory info represents a unique directory. Directory path and file info are separated by a single blank space.
Follow-up beyond contest:
- Imagine you are given a real file system; how would you search for files, with DFS or BFS?
- If the file content is very large (GB level), how would you modify your solution?
- If you can only read the file 1 KB at a time, how would you modify your solution?
- What is the time complexity of your modified solution? What are its most time-consuming and memory-consuming parts, and how could you optimize them?
- How can you make sure the duplicate files you find are not false positives? (A sketch addressing these follow-ups appears after this list.)
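None of these follow-ups change the contest solution, but here is a minimal sketch of the usual approach for real files, assuming C++ and hypothetical helper names (chunkedHash, groupByHash) that are not from the original post. The idea: filter candidates cheaply by file size first, then hash each file incrementally in 1 KB chunks so a GB-level file never has to fit in memory, and finally compare any files that still collide byte by byte so the reported duplicates are not false positives. std::hash is used purely for illustration; a real deduplicator would prefer a cryptographic digest such as MD5 or SHA-256.

#include <fstream>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: hash a file by reading it 1 KB at a time, so the
// whole content never has to be held in memory (addresses the GB-level
// and 1-KB-read follow-ups). std::hash is illustrative only.
size_t chunkedHash(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(1024);
    size_t h = 0;
    std::hash<std::string> hasher;
    // Process full chunks, then the final partial chunk (gcount() > 0).
    while (in.read(buf.data(), buf.size()) || in.gcount() > 0) {
        h = h * 131 + hasher(std::string(buf.data(), in.gcount()));
    }
    return h;
}

// Hypothetical helper: group candidate paths by chunked hash. Files that
// still collide here should be compared byte by byte before being reported
// as duplicates; that final comparison is what rules out false positives.
std::unordered_map<size_t, std::vector<std::string>>
groupByHash(const std::vector<std::string>& candidates) {
    std::unordered_map<size_t, std::vector<std::string>> groups;
    for (const auto& p : candidates) groups[chunkedHash(p)].push_back(p);
    return groups;
}

As for the traversal itself, both DFS and BFS visit every file exactly once; BFS is often the safer choice on a real file system because a very deep directory tree cannot overflow the call stack. The dominant cost of the modified solution is the hashing I/O, which the size pre-filter reduces by skipping any file whose size is unique.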
LeetCode's home page has been redesigned yet again, with a bunch of colorful buttons added. Personally I don't think the style fits; I preferred the previous subdued, low-key look, though perhaps it will grow on me after a while. On to the problem. We are given an array of strings, each containing a directory path together with file names and file contents, and we need to find the duplicate files. Only the content needs to match, the file names don't matter, and the result must contain each file's full path. This is really a string-manipulation problem: the idea is not hard to come up with, the work is in taking each string apart into the path, the file names, and the file contents. We build a hash map from file content to an array of full paths (directory plus file name); an array is needed because multiple files can share the same content. We splice each extracted directory path and file name together, and in the end every map entry whose array contains more than one element marks a group of duplicate files, so we add that whole array to the result res. See the code below:
class Solution {
public:
    vector<vector<string>> findDuplicate(vector<string>& paths) {
        vector<vector<string>> res;
        unordered_map<string, vector<string>> m;  // content -> full file paths
        for (string path : paths) {
            istringstream is(path);
            string pre = "", t = "";
            is >> pre;            // first token is the directory path
            while (is >> t) {     // remaining tokens are "name.txt(content)"
                int idx = t.find_last_of('(');
                string dir = pre + "/" + t.substr(0, idx);              // splice path and file name
                string content = t.substr(idx + 1, t.size() - idx - 2); // strip the parentheses
                m[content].push_back(dir);
            }
        }
        for (auto& a : m) {
            if (a.second.size() > 1) res.push_back(a.second);  // keep only real duplicate groups
        }
        return res;
    }
};
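As a quick sanity check, here is a hypothetical driver for Example 1 (not part of the original post); it assumes the Solution class above is compiled together with the headers below and using namespace std:

#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

// ... Solution class from above ...

int main() {
    vector<string> paths = {"root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)",
                            "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"};
    Solution sol;
    // Expected groups, in any order:
    //   root/a/2.txt  root/c/d/4.txt  root/4.txt
    //   root/a/1.txt  root/c/3.txt
    for (const auto& group : sol.findDuplicate(paths)) {
        for (const auto& f : group) cout << f << "  ";
        cout << "\n";
    }
    return 0;
}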
References:
https://discuss.leetcode.com/topic/91430/c-clean-solution-answers-to-follow-up
https://discuss.leetcode.com/topic/91301/straight-forward-solution-with-a-tiny-bit-of-java8