A Summary of PySpark DataFrame Joins

ロウ
2022-08-26

Create the DataFrames

```
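from pyspark.sql import SparkSession

# Reuse the active session if one exists (e.g. in the pyspark shell or a notebook)
spark = SparkSession.builder.getOrCreate()

# Create DataFrame df1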
data1 = [("Alice", 20), ("James", 25), ("Maria", 30)]
df1 = spark.createDataFrame(data1, ["name", "age"])
df1.show()
# Result:
# +-----+---+
# | name|age|
# +-----+---+
# |Alice| 20|
# |James| 25|
# |Maria| 30|
# +-----+---+

# Create DataFrame df2
data2 = [("Alice", "F"), ("Michael", "M")]
df2 = spark.createDataFrame(data2, ["name", "gender"])
df2.show()
# Result:
# +-------+------+
# |   name|gender|
# +-------+------+
# |  Alice|     F|
# |Michael|     M|
# +-------+------+
```

Inner Join

```
df = df1.join(df2, on=[df1.name == df2.name], how="inner")
df.show()
# Result:
# +-----+---+-----+------+
# | name|age| name|gender|
# +-----+---+-----+------+
# |Alice| 20|Alice|     F|
# +-----+---+-----+------+
```

Outer Join

```
df = df1.join(df2, on=[df1.name == df2.name], how="outer")
df.show()
# Result:
# +-----+----+-------+------+
# | name| age|   name|gender|
# +-----+----+-------+------+
# |James|  25|   null|  null|
# | null|null|Michael|     M|
# |Alice|  20|  Alice|     F|
# |Maria|  30|   null|  null|
# +-----+----+-------+------+
```

Left Join

```
df = df1.join(df2, on=[df1.name == df2.name], how="left")
df.show()
# Result:
# +-----+---+-----+------+
# | name|age| name|gender|
# +-----+---+-----+------+
# |James| 25| null|  null|
# |Alice| 20|Alice|     F|
# |Maria| 30| null|  null|
# +-----+---+-----+------+
```

Leftsemi Join

```
df = df1.join(df2, on=[df1.name == df2.name], how="leftsemi")
df.show()
# Result:
# +-----+---+
# | name|age|
# +-----+---+
# |Alice| 20|
# +-----+---+
```

Leftanti Join

```
df = df1.join(df2, on=[df1.name == df2.name], how="leftanti")
df.show()
# Result:
# +-----+---+
# | name|age|
# +-----+---+
# |James| 25|
# |Maria| 30|
# +-----+---+
```

Right Join

```
df = df1.join(df2, on=[df1.name == df2.name], how="right")
df.show()
# Result:
# +-----+----+-------+------+
# | name| age|   name|gender|
# +-----+----+-------+------+
# | null|null|Michael|     M|
# |Alice|  20|  Alice|     F|
# +-----+----+-------+------+
```
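
As a recap, the row-selection rules behind the join types above can be modeled in plain Python. This is only a rough sketch for building intuition, not how Spark executes joins: `join_rows` is a hypothetical helper (not a PySpark API), it assumes the join key is unique on each side as in the sample data, and it makes no attempt to reproduce the row order Spark happens to print.

```python
# Same sample rows as df1 and df2 above.
left = [("Alice", 20), ("James", 25), ("Maria", 30)]  # (name, age)
right = [("Alice", "F"), ("Michael", "M")]            # (name, gender)

JOIN_TYPES = ("inner", "outer", "left", "right", "leftsemi", "leftanti")


def join_rows(left, right, how):
    """Model a join on the first tuple field (hypothetical helper).

    Assumes keys are unique on each side, as in the sample data.
    """
    if how not in JOIN_TYPES:
        raise ValueError(f"unsupported join type: {how}")

    right_by_key = dict(right)
    left_keys = {key for key, _ in left}

    # Semi and anti joins only filter the left side; no right columns appear.
    if how == "leftsemi":
        return [row for row in left if row[0] in right_by_key]
    if how == "leftanti":
        return [row for row in left if row[0] not in right_by_key]

    rows = []
    for key, age in left:
        if key in right_by_key:
            # Matched rows appear in every remaining join type.
            rows.append((key, age, key, right_by_key[key]))
        elif how in ("left", "outer"):
            # Unmatched left rows are padded with nulls on the right.
            rows.append((key, age, None, None))
    if how in ("right", "outer"):
        for key, gender in right:
            if key not in left_keys:
                # Unmatched right rows are padded with nulls on the left.
                rows.append((None, None, key, gender))
    return rows


for how in JOIN_TYPES:
    print(how, join_rows(left, right, how))
```

Reading the loop from top to bottom gives a compact rule of thumb: `leftsemi` and `leftanti` keep only columns from the left side, while `left`, `right`, and `outer` differ only in which unmatched rows they pad with nulls.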