Combining PySpark DataFrames (union and unionByName)

ロウ
2022-01-28

union

```
from pyspark.sql import SparkSession

# Get or create a SparkSession (already available as `spark` in the PySpark shell)
spark = SparkSession.builder.getOrCreate()

# Create DataFrame df1
data1 = [("Alice", 20), ("James", 25)]
df1 = spark.createDataFrame(data1, ["name", "age"])
df1.show()
# Result:
# +-----+---+
# | name|age|
# +-----+---+
# |Alice| 20|
# |James| 25|
# +-----+---+

# Create DataFrame df2
data2 = [("Maria", 30), ("Michael", 35)]
df2 = spark.createDataFrame(data2, ["name", "age"])
df2.show()
# Result:
# +-------+---+
# |   name|age|
# +-------+---+
# |  Maria| 30|
# |Michael| 35|
# +-------+---+

# Combine: union appends the rows of df2 to df1, matching columns by position
df = df1.union(df2)
df.show()
# Result:
# +-------+---+
# |   name|age|
# +-------+---+
# |  Alice| 20|
# |  James| 25|
# |  Maria| 30|
# |Michael| 35|
# +-------+---+
```

unionByName

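unionByName differs from union in that it matches columns by name rather than by position when combining DataFrames. In the example below, `allowMissingColumns=True` (available since Spark 3.1) lets the two DataFrames have different sets of columns; a column missing from one side is filled with null.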

```
# Assumes `spark` is an active SparkSession

# Create DataFrame df1
data1 = [("Alice", 20), ("James", 25)]
df1 = spark.createDataFrame(data1, ["name", "age"])
df1.show()
# Result:
# +-----+---+
# | name|age|
# +-----+---+
# |Alice| 20|
# |James| 25|
# +-----+---+

# Create DataFrame df2 (it has an extra "gender" column)
data2 = [("Maria", 30, "F"), ("Michael", 35, "M")]
df2 = spark.createDataFrame(data2, ["name", "age", "gender"])
df2.show()
# Result:
# +-------+---+------+
# |   name|age|gender|
# +-------+---+------+
# |  Maria| 30|     F|
# |Michael| 35|     M|
# +-------+---+------+

# Combine by column name; "gender" is missing from df1, so it is filled with null
df = df1.unionByName(df2, allowMissingColumns=True)
df.show()
# Result:
# +-------+---+------+
# |   name|age|gender|
# +-------+---+------+
# |  Alice| 20|  null|
# |  James| 25|  null|
# |  Maria| 30|     F|
# |Michael| 35|     M|
# +-------+---+------+
```