Summary of PySpark DataFrame Joins
2022-08-26
Creating the DataFrames
```
from pyspark.sql import SparkSession

# Get a SparkSession (skip this if `spark` already exists, e.g. in the pyspark shell)
spark = SparkSession.builder.getOrCreate()

# Create DataFrame df1
data1 = [("Alice", 20), ("James", 25), ("Maria", 30)]
df1 = spark.createDataFrame(data1, ["name", "age"])
df1.show()
# Result:
# +-----+---+
# | name|age|
# +-----+---+
# |Alice| 20|
# |James| 25|
# |Maria| 30|
# +-----+---+

# Create DataFrame df2
data2 = [("Alice", "F"), ("Michael", "M")]
df2 = spark.createDataFrame(data2, ["name", "gender"])
df2.show()
# Result:
# +-------+------+
# |   name|gender|
# +-------+------+
# |  Alice|     F|
# |Michael|     M|
# +-------+------+
```
Inner Join
```
# Keep only the rows whose name appears in both df1 and df2
df = df1.join(df2, on=[df1.name == df2.name], how="inner")
df.show()
# Result:
# +-----+---+-----+------+
# | name|age| name|gender|
# +-----+---+-----+------+
# |Alice| 20|Alice|     F|
# +-----+---+-----+------+
```
Outer Join
```
# Keep every row from both sides; columns of the missing side are null
df = df1.join(df2, on=[df1.name == df2.name], how="outer")
df.show()
# Result:
# +-----+----+-------+------+
# | name| age|   name|gender|
# +-----+----+-------+------+
# |James|  25|   null|  null|
# | null|null|Michael|     M|
# |Alice|  20|  Alice|     F|
# |Maria|  30|   null|  null|
# +-----+----+-------+------+
```
Left Join
```
# Keep every row of df1; df2's columns are null where there is no match
df = df1.join(df2, on=[df1.name == df2.name], how="left")
df.show()
# Result:
# +-----+---+-----+------+
# | name|age| name|gender|
# +-----+---+-----+------+
# |James| 25| null|  null|
# |Alice| 20|Alice|     F|
# |Maria| 30| null|  null|
# +-----+---+-----+------+
```
Leftsemi Join
```
# Keep the rows of df1 that have a match in df2; only df1's columns are returned
df = df1.join(df2, on=[df1.name == df2.name], how="leftsemi")
df.show()
# Result:
# +-----+---+
# | name|age|
# +-----+---+
# |Alice| 20|
# +-----+---+
```
Leftanti Join
```
# Keep the rows of df1 that have no match in df2; only df1's columns are returned
df = df1.join(df2, on=[df1.name == df2.name], how="leftanti")
df.show()
# Result:
# +-----+---+
# | name|age|
# +-----+---+
# |James| 25|
# |Maria| 30|
# +-----+---+
```
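The leftsemi and leftanti results are complements of each other: between them they cover every row of df1 exactly once. The following is a plain-Python sketch of that logic, not Spark code. The names `rows1`, `rows2`, and `right_names` are made up for illustration and hold the same data as df1 and df2.

```python
# Plain-Python model of left semi / left anti joins (illustration only, not Spark).
rows1 = [("Alice", 20), ("James", 25), ("Maria", 30)]  # same data as df1
rows2 = [("Alice", "F"), ("Michael", "M")]             # same data as df2

# Set of join keys that exist on the right side.
right_names = {name for name, _gender in rows2}

# Left semi: left rows that have a match; only the left columns are kept.
left_semi = [row for row in rows1 if row[0] in right_names]

# Left anti: left rows that have no match.
left_anti = [row for row in rows1 if row[0] not in right_names]

print(left_semi)  # [('Alice', 20)]
print(left_anti)  # [('James', 25), ('Maria', 30)]
```

Because membership is checked against a set, a left row appears at most once even if the right side contained several matching rows, which mirrors how a semi join behaves.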
Right Join
```
# Keep every row of df2; df1's columns are null where there is no match
df = df1.join(df2, on=[df1.name == df2.name], how="right")
df.show()
# Result:
# +-----+----+-------+------+
# | name| age|   name|gender|
# +-----+----+-------+------+
# | null|null|Michael|     M|
# |Alice|  20|  Alice|     F|
# +-----+----+-------+------+
```