Python 2 vs. 3: slow os.path.getmtime() with a huge file list - why?

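I am currently running a Python 2.7 script successfully. It walks a huge directory tree recursively, collects the paths of all files, and gets the mtime of each file as well as the mtime of the corresponding file with the same path and name but a pdf extension, so that the two can be compared. I use scandir.walk() in the Python 2.7 script and os.walk() in Python 3.7, which was recently updated to use the scandir algorithm as well (no additional stat() calls).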

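However, the Python 3 version of the script is still significantly slower! This is not caused by the scandir/walk part of the algorithm. It apparently comes either from the getmtime calls (which are the same call in Python 2 and 3) or from processing the huge list (we are talking about roughly 500,000 entries in this list).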

Any idea what might cause this and how to solve it?

#!/usr/bin/env python3
#
# Imports
#
import sys
import time
from datetime import datetime
import os
import re

#
# MAIN THREAD
#

if __name__ == '__main__':

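    source_dir = '/path_to_data/'
    # ignore_subdirs is used below but not defined in the posted excerpt;
    # an empty placeholder list keeps the script runnable
    ignore_subdirs = []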

    # Get file list
    files_list = []
    for root, directories, filenames in os.walk(source_dir):
        # Filter for extension
        for filename in filenames:
            if (filename.lower().endswith(('.msg', '.doc', '.docx', '.xls', '.xlsx'))) and (not filename.lower().startswith('~')):
                files_list.append(os.path.join(root, filename))

    # Sort list
    files_list.sort(reverse=True)

    # For each file, the printing routine is performed (including necessity check)
    all_documents_counter = len(files_list)
    for docfile_abs in files_list:

        print('\n' + docfile_abs)

        # Define files
        filepathname_abs, file_extension = os.path.splitext(docfile_abs)
        filepath_abs, filename = os.path.split(filepathname_abs)

        # If the filename does not have the format #######*.xxx (i.e. seven digits), check whether it is referenced in the database. If not, it is moved to a certain directory
        if (re.match(r'[0-9][0-9][0-9][0-9][0-9][0-9][0-9](([Aa][0-9][0-9]?)?|(_[0-9][0-9]?)?|([Aa][0-9][0-9]?_[0-9][0-9]?)?)\...?.?', filename + file_extension) is None):
            if any(expression in docfile_abs for expression in ignore_subdirs):
                pass
            else:
                print('Not in database')

        # DOC
        docfile_rel = docfile_abs.replace(source_dir, '')

        # Check pdf
        try:
            pdf_file_abs = filepathname_abs + '.pdf'
            pdf_file_timestamp = os.path.getmtime(pdf_file_abs)
            check_pdf = True
        except(FileNotFoundError):
            check_pdf = False
        # Check PDF
        try:
            PDF_file_abs = filepathname_abs + '.PDF'
            PDF_file_timestamp = os.path.getmtime(PDF_file_abs)
            check_PDF = True
        except(FileNotFoundError):
            check_PDF = False

        # Check whether there is a lowercase or an uppercase extension and decide what to do if none, just one, or both are present
        if (check_pdf is True) and (check_PDF is False):
            # Lower case case
            pdf_extension = '.pdf'
            pdffile_timestamp = pdf_file_timestamp
        elif (check_pdf is False) and (check_PDF is True):
            # Upper case case
            pdf_extension = '.PDF'
            pdffile_timestamp = PDF_file_timestamp
        elif (check_pdf is False) and (check_PDF is False):
            # None -> set timestamp to zero
            pdf_extension = '.pdf'
            pdffile_timestamp = 0
        elif (check_pdf is True) and (check_PDF is True):
            # Both are present, decide for the newest and move the other to a directory
            if (pdf_file_timestamp < PDF_file_timestamp):
                pdf_extension = '.PDF'
                pdf_file_rel = pdf_file_abs.replace(source_dir, '')
                pdffile_timestamp = PDF_file_timestamp
            else:
                # The lowercase .pdf is newer, or both have the same mtime
                # (without this branch, equal timestamps left pdffile_timestamp unset)
                pdf_extension = '.pdf'
                PDF_file_rel = PDF_file_abs.replace(source_dir, '')
                pdffile_timestamp = pdf_file_timestamp

        # Get timestamps of doc and pdf files
        try:
            docfile_timestamp = os.path.getmtime(docfile_abs)
        except OSError:
            docfile_timestamp = 0

        # Enable this to force a certain period to be printed
        DateBegin = time.mktime(time.strptime('01/02/2017', "%d/%m/%Y"))
        DateEnd = time.mktime(time.strptime('01/03/2017', "%d/%m/%Y"))

        # Compare timestamps and decide whether to print
        if (pdffile_timestamp < docfile_timestamp) or (pdffile_timestamp == 0):

            # Report that the PDF needs to be printed
            print('\tPDF should be printed.')

        else:
            # Inform that there was no need to print
            print('\tPDF is up to date.')


    # Exit
    sys.exit(0)

Solution:

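I don't know what explains the difference, but even though os.walk has been enhanced to use scandir, that does not extend to the later getmtime calls, which access the file attributes one more time.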

The ultimate goal is to not call os.path.getmtime at all.

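The speedup in os.walk comes from not calling stat twice to find out whether an entry is a directory or a file. But the internal DirEntry objects (the ones scandir yields) are never made public, so you cannot reuse them to check file times.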

If you don't need recursion, you can do it with os.scandir directly:

for dir_entry in os.scandir(r"D:\some_path"):
    print(dir_entry.is_dir())  # test for directory
    print(dir_entry.stat())    # returns stat object with date and all

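Those calls inside the loop are close to free, because the DirEntry object caches this information. is_dir() is normally answered from the directory listing itself. stat() needs no extra system call on Windows; on Unix it needs one system call the first time and is cached on the entry afterwards.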
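As a minimal sketch of reading modification times straight from the DirEntry objects of a single directory, without calling os.path.getmtime: the function name list_mtimes is made up for this illustration and is not part of the os module.

```python
import os


def list_mtimes(directory):
    """Return {file name: mtime} for the regular files directly in directory.

    list_mtimes is a hypothetical helper name used only for this sketch.
    """
    mtimes = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            # Skip subdirectories and symlinks; this sketch is not recursive.
            if entry.is_file(follow_symlinks=False):
                # The stat result is cached on the entry after the first call.
                mtimes[entry.name] = entry.stat(follow_symlinks=False).st_mtime
    return mtimes
```

The returned dictionary can then be queried as often as needed without touching the disk again.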

So, to save the getmtime call, you have to get hold of the DirEntry objects recursively.

There is no built-in method for that, but there are recipes, for example: How do I use os.scandir() to return DirEntry objects recursively on a directory tree?
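One common shape for such a recipe is a small recursive generator. The version below is a sketch, not the code from the linked question; the name scantree is an assumption chosen for illustration.

```python
import os


def scantree(path):
    """Yield a DirEntry for every non-directory entry below path.

    scantree is a hypothetical helper name; os has no function of that name.
    """
    with os.scandir(path) as entries:
        for entry in entries:
            # Do not follow symlinked directories, to avoid cycles.
            if entry.is_dir(follow_symlinks=False):
                yield from scantree(entry.path)
            else:
                yield entry
```

Each yielded entry offers entry.path and entry.stat().st_mtime, so the per-file getmtime call is no longer needed for files found this way.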

By doing this, your code will be faster in both Python 2 and Python 3, because there will be only one stat call per object instead of two.

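EDIT: after your edit to show the code, it appears that you are building the pdf names from other entries, so you cannot rely on the DirEntry structure to get the time, and you cannot even be sure that the file exists. (Also, if you are on Windows, there is no need to test both pdf and PDF, since filenames are not case-sensitive.)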

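The best strategy would be to build a big database of files with their associated times and everything else you need (using a dictionary), and then scan that. I have successfully used this method to find old/big files on a slow network drive holding 35 million files. In my own case, I scanned the files once and dumped the result into a big CSV file (this took several hours and produced 6 GB of CSV data). Further post-processing then loaded the database and performed various tasks, which was faster since no disk access was involved.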
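A rough sketch of that strategy applied to the script in the question, under stated assumptions: all function names (build_mtime_index, documents_needing_pdf, save_index, load_index) and the two-column CSV layout are invented for this illustration, and symlinks are recorded with their own mtime rather than followed.

```python
import csv
import os

# Same document extensions as in the question's script.
DOC_EXTENSIONS = ('.msg', '.doc', '.docx', '.xls', '.xlsx')


def build_mtime_index(root):
    """Scan root once and return {path: mtime} for every file below it."""
    index = {}
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                else:
                    # Assumption: symlinks are not followed, so a broken
                    # link cannot raise FileNotFoundError here.
                    index[entry.path] = entry.stat(follow_symlinks=False).st_mtime
    return index


def documents_needing_pdf(index):
    """Return documents whose PDF is missing or older than the document."""
    stale = []
    for path, doc_mtime in index.items():
        name = os.path.basename(path).lower()
        if not name.endswith(DOC_EXTENSIONS) or name.startswith('~'):
            continue
        base = os.path.splitext(path)[0]
        # Newest of the two spellings; 0 if neither is in the index.
        pdf_mtime = max(index.get(base + '.pdf', 0), index.get(base + '.PDF', 0))
        if pdf_mtime == 0 or pdf_mtime < doc_mtime:
            stale.append(path)
    return sorted(stale, reverse=True)


def save_index(index, csv_path):
    """Dump the index to a two-column CSV file (path, mtime)."""
    with open(csv_path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle)
        for path, mtime in index.items():
            # repr() keeps the full float precision for a later reload.
            writer.writerow([path, repr(mtime)])


def load_index(csv_path):
    """Load an index previously written by save_index."""
    with open(csv_path, newline='', encoding='utf-8') as handle:
        return {path: float(mtime) for path, mtime in csv.reader(handle)}
```

With this layout the tree is read from disk once by build_mtime_index(source_dir); every PDF existence check and timestamp comparison afterwards is a dictionary lookup, and the CSV dump lets later runs skip the scan entirely if a slightly stale snapshot is acceptable.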


