Combining PySpark DataFrames (union and unionByName)
2022-01-28
union
```python
from pyspark.sql import SparkSession

# Reuse the active session if there is one, otherwise start a new one
spark = SparkSession.builder.getOrCreate()

# Create DataFrame df1
data1 = [("Alice", 20), ("James", 25)]
df1 = spark.createDataFrame(data1, ["name", "age"])
df1.show()
# Output:
# +-----+---+
# | name|age|
# +-----+---+
# |Alice| 20|
# |James| 25|
# +-----+---+

# Create DataFrame df2 with the same columns in the same order
data2 = [("Maria", 30), ("Michael", 35)]
df2 = spark.createDataFrame(data2, ["name", "age"])
df2.show()
# Output:
# +-------+---+
# |   name|age|
# +-------+---+
# |  Maria| 30|
# |Michael| 35|
# +-------+---+

# Append the rows of df2 to df1 (columns are matched by position)
df = df1.union(df2)
df.show()
# Output:
# +-------+---+
# |   name|age|
# +-------+---+
# |  Alice| 20|
# |  James| 25|
# |  Maria| 30|
# |Michael| 35|
# +-------+---+
```
unionByName
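Unlike union, which matches columns by position, unionByName matches the columns of the two DataFrames by name. Passing allowMissingColumns=True (available since Spark 3.1) fills any column that exists in only one of the DataFrames with null.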
```python
from pyspark.sql import SparkSession

# Reuse the active session if there is one, otherwise start a new one
spark = SparkSession.builder.getOrCreate()

# Create DataFrame df1
data1 = [("Alice", 20), ("James", 25)]
df1 = spark.createDataFrame(data1, ["name", "age"])
df1.show()
# Output:
# +-----+---+
# | name|age|
# +-----+---+
# |Alice| 20|
# |James| 25|
# +-----+---+

# Create DataFrame df2, which has an extra "gender" column
data2 = [("Maria", 30, "F"), ("Michael", 35, "M")]
df2 = spark.createDataFrame(data2, ["name", "age", "gender"])
df2.show()
# Output:
# +-------+---+------+
# |   name|age|gender|
# +-------+---+------+
# |  Maria| 30|     F|
# |Michael| 35|     M|
# +-------+---+------+

# Combine by column name; "gender" is filled with null for the rows from df1
df = df1.unionByName(df2, allowMissingColumns=True)
df.show()
# Output:
# +-------+---+------+
# |   name|age|gender|
# +-------+---+------+
# |  Alice| 20|  null|
# |  James| 25|  null|
# |  Maria| 30|     F|
# |Michael| 35|     M|
# +-------+---+------+
```