如何对大文件进行排序

Last updated on 8 months ago

排序问题虽然有了解，很多只是基于理论上，实践比较少，刚好最近遇到个问题：

这个以前遇到过类似的问题，假如你有一个20GB的文件，对文件的内容排序，内存限制在 1G，你要怎么排序

以前有做过总结，就是分割文件，逐个文件先排好序，最后然后再合并，说起来很简单，但做起来就。。。。

这次题目有分割文件，我下载了这个压缩包，解压看有1000个 20M的文件

思路：将这个一千个文件，文件名读取出来，按照条件逐个抽出来，并且排序

void read_lryic(char *path)
{
    DIR *dir = opendir(path); //打开目录文件
    cout << path << endl;
    struct dirent *entry;
    int i = 0;
    while ((entry = readdir(dir)) != 0)
    {
        if (strstr(entry->d_name, ".txt"))
        {
            filename.push_back(entry->d_name);
            printf(" %d %s\n", i++, entry->d_name);
        }
    }
}

现在文件名有了，进行筛选排序，这里用map来存放临时的数据，key 为时间戳，value 为对应的数据，为什么用map，map是会自动自动排序，根据key的值，而这里的没条数据都是 json，用了一个jsoncpp 库（如果用 go 估计已经写完了），原本想直接解析字符串，发现有点麻烦，还是用库吧，提取出时间戳，作为key，提取完一个文件后，重新写入到新的文件里

void sortfile(){
    Json::Reader reader;
    Json::Value obj;
    //读文件名字 
    read_lryic("/home/hrp/Cpp/recore/");
    string file;
    for (auto str : filename){
        file = "/home/hrp/Cpp/recore/" + str;
        std::ifstream fIn(file);
        int i = 0;
        if (fIn)
        {
            std::string str;
            while (std::getline(fIn, str))
            {
                if(str.find("张三") != string::npos){
                    
                    //std::cout << i++ << endl;
                    reader.parse(str, obj);
                    long timestamp = obj["timestamp"].asInt64();   
                    ret[timestamp] = str;             
                }
            } 
            cout << file << endl;
        }
        else
        {
            std::cout << "Open file faild." << std::endl;
        }
        ofstream out1(file);
        for(auto i=ret.begin();i!=ret.end();i++){
            //cout << i->second;
            out1 << i->second ;
            out1 << "\n" ;
        }
        ret.clear();
        out1.close();
    }
}

现在完成按照条件完成每个文件的排序了

接下来就是归并，如何归并呢，现在有1000个文件，逐个遍历这1000个文件，每次都提取一行来，一次遍历提取了1000行数据，此时可以用大顶堆的特性，就是1000个数据放进这个大顶堆，堆顶肯定是最大数，按照这个特性，先读取一千条，放在大顶堆了，每次 pop出堆顶的（肯定最大），然后要补上一个，读哪个文件的呢，肯定是pop出去哪条数据所在的文件，所以现在就是要去那个文件里再读一条数据，以此类推，说是这样说，但是这题还有其他附加条件（根据时间分类写）

clock_t start,end;//定义clock_t变量
   start = clock();  //开始时间

   Json::Reader reader;
   Json::Value obj;

   map<string,ifstream*> filesmap;
   map<string,ofstream*> outfile;

   sortfile();
   time_t tt;
   struct tm *ttime;
//用 map 保存文件的读写流，这里是读
   for(auto file: filename){
       auto temp = file;
       file = "/home/hrp/Cpp/recore/" + temp;
       filesmap.emplace(file, new ifstream(file)); 
   }
//这里用c++ 的优先队列 来模拟堆，插入的数据  为 时间戳和文件的字符名字
  	priority_queue<pair<long, string>> topTime;
   map<long,string> Temp;
   string data;
//先读取 全部文件的 第一条数据，push到 优先队列中 
   for(auto iter = filesmap.begin(); iter != filesmap.end(); iter++) {
       std::getline(*iter->second, data);

       reader.parse(data, obj);
       long timestamp = obj["timestamp"].asInt64();
       pair<long,string> b(timestamp,iter->first);

       topTime.push(b);
       Temp[timestamp] = data;
   }
// 现在 对队列操作，此时队列是 出一个，进一个，维持平衡，没有进，就一直出，知道 队列为空位止
// Temp 维护 这1000个数据，动态增减
   while (!topTime.empty()){
       // cout  << Temp[topTime.top()] << endl;
       //时间戳转 日期
       struct tm *ttime;
       tt = topTime.top().first/1000;
       ttime = localtime(&tt);
       char formatime[64];
       strftime(formatime, 64, "%Y-%m-%d", ttime);
       string strtime(formatime);
	//这里是写文件，根据时间戳 转换的日期来 写文件
       // outfile   key 为文件名 value 对应的读写流
       // 为 0 ，说明第一次，创建文件
       if(outfile.count(strtime)  == 0){
           
           outfile.emplace(strtime,new ofstream(strtime));
           if (!outfile[strtime]->is_open()) 
               cout << strtime << "error" << endl;
           *outfile[strtime] << Temp[topTime.top().first]+ "\n";
           Temp.erase(topTime.top().first);

       }else{
           //根据 日期写入 数据
           *outfile[strtime] << Temp[topTime.top().first] + "\n";
           Temp.erase(topTime.top().first);

       }
	// top是最大的，此时pop出去，记录下 pop去的是哪个文件 
       auto TopTemp = topTime.top();
       topTime.pop();
	
	//如果 读取到了，更行 优先队列的数据
       if(std::getline(*filesmap[TopTemp.second], data)){
           reader.parse(data, obj);
           long timestamp = obj["timestamp"].asInt64();
           pair<long,string> b(timestamp,TopTemp.second);
           Temp[timestamp] = data;
           topTime.push(b);
       }
     

   }
//最后 要关闭文件，才会保存
   for(auto iter = outfile.begin(); iter != outfile.end(); iter++) {
       iter->second->close();
   }
   end = clock();   //结束时间
   cout<<"time = "<<double(end-start)/CLOCKS_PER_SEC<<"s"<<endl;  //输出时间（单位：ｓ）

运行的机器是：腾讯云服务器 4核 4g

耗时 117 .s

排序好的

总结：

认清了知道和做到的差距！！！！！
有些api 日常少使用，很不收悉，都是在查查查
要先有整体思路然后细化问题，逐个解决

代码整体设计不是很好，后续在改进，用多线程多路归并？

#有趣的问题

Linux 常用命令 Previous

模板文件 Next