python - unionAll causes a StackOverflowError
I have been working on my own earlier StackOverflow question (how to load a dataframe from a python requests stream that is downloading a csv file?) and have made some progress, but I am now getting a StackOverflowError:
    import requests
    import numpy as np
    import pandas as pd
    import sys

    if sys.version_info[0] < 3:
        from StringIO import StringIO
    else:
        from io import StringIO

    from pyspark.sql import SQLContext

    # sc, host, filepath, username and password are defined elsewhere
    sqlContext = SQLContext(sc)

    chunk_size = 1024
    url = "https://{0}:8443/gateway/default/webhdfs/v1/{1}?op=OPEN".format(host, filepath)
    r = requests.get(url, auth=(username, password),
                     verify=False, allow_redirects=True,
                     stream=True)

    df = None
    curr_line = 1
    remainder = ''
    for chunk in r.iter_content(chunk_size):
        # Prepend the partial line left over from the previous chunk
        txt = remainder + chunk
        [lines, remainder] = txt.rsplit('\n', 1)
        pdf = pd.read_csv(StringIO(lines), sep='|', header=None)
        if df is None:
            df = sqlContext.createDataFrame(pdf)
        else:
            # One unionAll per chunk: this is the call that fails
            df = df.unionAll(sqlContext.createDataFrame(pdf))

    print(df.count())
Here is the stack trace:
    ---------------------------------------------------------------------------
    Py4JJavaError                             Traceback (most recent call last)
    <ipython-input-4-b3a89df3c7d8> in <module>()
         36         df = sqlContext.createDataFrame(pdf)
         37     else:
    ---> 38         df = df.unionAll(sqlContext.createDataFrame(pdf))
         39
         40     #curr_line = curr_line + 1

    /usr/local/src/spark160master/spark/python/pyspark/sql/dataframe.py in unionAll(self, other)
        993         This is equivalent to `UNION ALL` in SQL.
        994         """
    --> 995         return DataFrame(self._jdf.unionAll(other._jdf), self.sql_ctx)
        996
        997     @since(1.3)

    /usr/local/src/spark160master/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
        811         answer = self.gateway_client.send_command(command)
        812         return_value = get_return_value(
    --> 813             answer, self.gateway_client, self.target_id, self.name)
        814
        815         for temp_arg in temp_args:

    /usr/local/src/spark160master/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
         43     def deco(*a, **kw):
         44         try:
    ---> 45             return f(*a, **kw)
         46         except py4j.protocol.Py4JJavaError as e:
         47             s = e.java_exception.toString()

    /usr/local/src/spark160master/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
        306                 raise Py4JJavaError(
        307                     "An error occurred while calling {0}{1}{2}.\n".
    --> 308                     format(target_id, ".", name), value)
        309             else:
        310                 raise Py4JError(

    Py4JJavaError: An error occurred while calling o19563.unionAll.
    : java.lang.StackOverflowError
I am not sure how to fix this. Any hints are appreciated.
Solution:
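You should not iteratively merge distributed data structures without controlling the number of partitions. Each unionAll call in the loop wraps the previous result in one more layer, so after many chunks the plan is nested deeply enough that processing it recursively can overflow the JVM stack. You will find a complete explanation of what is going on in "Stackoverflow due to long RDD Lineage", but unfortunately DataFrames are a bit trickier. One workaround is to union the underlying RDDs in a single call and rebuild the DataFrame from the result: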
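    dfs = ...  # A list of pyspark.sql.DataFrame

    def unionAll(*dfs):
        if not dfs:
            raise ValueError("unionAll() needs at least one DataFrame")
        first = dfs[0]
        # Union all underlying RDDs in one call, then rebuild a DataFrame
        # using the schema of the first input
        return first.sql_ctx.createDataFrame(
            first._sc.union([df.rdd for df in dfs]), first.schema
        )

    unionAll(*dfs)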