在python Apache Beam中打开一个gzip文件
内容导读
互联网集市收集整理的这篇技术教程文章主要介绍了在python Apache Beam中打开一个gzip文件,小编现在分享给大家,供广大互联网技能从业者学习和参考。文章包含3159字,纯文字阅读大概需要5分钟。
内容图文
目前是否可以使用Apache Beam在python中读取gzip文件?
我的管道是用这行代码从gcs中提取gzip文件:
beam.io.Read(beam.io.TextFileSource('gs://bucket/file.gz', compression_type='GZIP'))
但是我收到了这个错误:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte
我们在python beam源代码中注意到,在写入接收器时似乎处理了压缩文件.
https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/fileio.py#L445
更详细的追溯:
Traceback (most recent call last):
File "beam-playground.py", line 11, in <module>
p.run()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/apache_beam/pipeline.py", line 159, in run
return self.runner.run(self)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/apache_beam/runners/direct_runner.py", line 103, in run
super(DirectPipelineRunner, self).run(pipeline)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 98, in run
pipeline.visit(RunVisitor(self))
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/apache_beam/pipeline.py", line 182, in visit
self._root_transform().visit(visitor, self, visited)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/apache_beam/pipeline.py", line 419, in visit
part.visit(visitor, pipeline, visited)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/apache_beam/pipeline.py", line 422, in visit
visitor.visit_transform(self)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 93, in visit_transform
self.runner.run_transform(transform_node)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 168, in run_transform
return m(transform_node)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/apache_beam/runners/direct_runner.py", line 99, in func_wrapper
func(self, pvalue, *args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/apache_beam/runners/direct_runner.py", line 258, in run_Read
read_values(reader)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/apache_beam/runners/direct_runner.py", line 245, in read_values
read_result = [GlobalWindows.windowed_value(e) for e in reader]
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/apache_beam/io/fileio.py", line 807, in __iter__
yield self.source.coder.decode(line)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/apache_beam/coders/coders.py", line 187, in decode
return value.decode('utf-8')
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte
解决方法:
更新:Python SDK中的TextIO现在支持从压缩文件中读取.
今天,Python SDK中的TextIO实际上并不支持从压缩文件中读取.
内容总结
以上是互联网集市为您收集整理的在python Apache Beam中打开一个gzip文件全部内容,希望文章能够帮你解决在python Apache Beam中打开一个gzip文件所遇到的程序开发问题。 如果觉得互联网集市技术教程内容还不错,欢迎将互联网集市网站推荐给程序员好友。
内容备注
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 gblab@vip.qq.com 举报,一经查实,本站将立刻删除。
内容手机端
扫描二维码推送至手机访问。