How to join two RDDs in Spark with Python?
1 Answer (score 14):

You are just looking for a simple join, e.g.:

    rdd = sc.parallelize([("red", 20), ("red", 30), ("blue", 100)])
    rdd2 = sc.parallelize([("red", 40), ("red", 50), ("yellow", 10000)])

    rdd.join(rdd2).collect()
    # Gives [('red', (20, 40)), ('red', (20, 50)),
    #        ('red', (30, 40)), ('red', (30, 50))]

Note that `join` on a pair RDD is an inner join on the keys, so ("blue", 100) and ("yellow", 10000) are dropped because those keys appear in only one of the two RDDs.
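If you also need the keys that exist in only one RDD, the pair-RDD API offers `leftOuterJoin`, `rightOuterJoin`, and `fullOuterJoin`. Below is a minimal runnable sketch; unlike the answer's snippet, which assumes `sc` already exists (as in the PySpark shell), it creates its own local SparkContext, and the app name "join-example" is just an illustrative choice:

    from pyspark import SparkContext

    sc = SparkContext("local", "join-example")

    rdd = sc.parallelize([("red", 20), ("red", 30), ("blue", 100)])
    rdd2 = sc.parallelize([("red", 40), ("red", 50), ("yellow", 10000)])

    # Inner join: only keys present in both RDDs survive.
    print(rdd.join(rdd2).collect())
    # [('red', (20, 40)), ('red', (20, 50)),
    #  ('red', (30, 40)), ('red', (30, 50))]  (ordering may vary)

    # Left outer join: keeps every key from the left RDD;
    # a missing right-side value shows up as None.
    print(rdd.leftOuterJoin(rdd2).collect())
    # ..., ('blue', (100, None)) included

    # Full outer join: keeps keys from both sides.
    print(rdd.fullOuterJoin(rdd2).collect())
    # ..., ('blue', (100, None)) and ('yellow', (None, 10000)) included

    sc.stop()

As with the inner join, each matching key produces the cross product of its values from the two RDDs, so two "red" values on each side yield four output pairs.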