🎯PySpark Challenge:
Analyzing e-commerce sales data
Background:
You work for an e-commerce company, and they have provided you with a dataset containing information about their sales. Your task is to perform various data transformations using PySpark to generate insights.
Sample data:
| order_id | customer_id | order_date | product_id | quantity | price |
|----------|-------------|------------|------------|----------|-------|
| 1        | 101         | 2023-07-01 | A          | 2        | 10    |
| 2        | 102         | 2023-07-01 | B          | 3        | 15    |
| 3        | 101         | 2023-07-02 | A          | 1        | 10    |
| 4        | 103         | 2023-07-02 | C          | 2        | 20    |
| 5        | 102         | 2023-07-03 | A          | 1        | 10    |
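Before diving into the tasks, here is a minimal sketch of how the sample table above could be materialized as a typed PySpark DataFrame. The explicit StructType is an assumption for illustration; the solution at the end of this post uses the same column names with inferred types instead:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("ECommerceAnalysis").getOrCreate()

# Explicit schema: an illustrative assumption, not required by the challenge.
schema = StructType([
    StructField("order_id", IntegerType(), False),
    StructField("customer_id", IntegerType(), False),
    StructField("order_date", StringType(), False),  # ISO strings; cast with to_date() if DateType is needed
    StructField("product_id", StringType(), False),
    StructField("quantity", IntegerType(), False),
    StructField("price", IntegerType(), False),
])

rows = [
    (1, 101, "2023-07-01", "A", 2, 10),
    (2, 102, "2023-07-01", "B", 3, 15),
    (3, 101, "2023-07-02", "A", 1, 10),
    (4, 103, "2023-07-02", "C", 2, 20),
    (5, 102, "2023-07-03", "A", 1, 10),
]
df = spark.createDataFrame(rows, schema)
df.printSchema()
```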
🎯Challenge Tasks:
1. Load the dataset into a PySpark DataFrame.
2. Calculate the total revenue for each order.
3. Find the top-selling products (by total quantity sold) in the dataset.
4. Calculate the average quantity and price per order.
5. Determine the total revenue for each customer.
6. Identify the date with the highest total revenue.
This challenge covers various aspects of data transformation using PySpark's DataFrame API.
✍️Write your solution in the comments
Follow for more: Sandeep Suthrame
#pyspark #databricks #dataengineering #dataanalytics #datascience
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg

# Create a Spark session
spark = SparkSession.builder.appName("ECommerceAnalysis").getOrCreate()

# Task 1: Load the dataset into a DataFrame
data = [
    (1, 101, "2023-07-01", "A", 2, 10),
    (2, 102, "2023-07-01", "B", 3, 15),
    (3, 101, "2023-07-02", "A", 1, 10),
    (4, 103, "2023-07-02", "C", 2, 20),
    (5, 102, "2023-07-03", "A", 1, 10),
]
columns = ["order_id", "customer_id", "order_date", "product_id", "quantity", "price"]
df = spark.createDataFrame(data, columns)

# Task 2: Calculate total revenue for each order
df = df.withColumn("revenue", col("quantity") * col("price"))

# Task 3: Top-selling products by total quantity sold
top_products = (
    df.groupBy("product_id")
    .agg(sum("quantity").alias("total_quantity_sold"))
    .orderBy(col("total_quantity_sold").desc())
    .limit(3)
)

# Task 4: Average quantity and price per order. Each row is one order,
# so the overall column averages are the per-order averages; grouping by
# order_id would just echo each order's own values back.
avg_quantity_price = df.agg(
    avg("quantity").alias("avg_quantity"),
    avg("price").alias("avg_price"),
)

# Task 5: Total revenue per customer
revenue_per_customer = df.groupBy("customer_id").agg(sum("revenue").alias("total_revenue"))

# Task 6: Date with the highest total revenue
highest_revenue_date = (
    df.groupBy("order_date")
    .agg(sum("revenue").alias("total_revenue"))
    .orderBy(col("total_revenue").desc())
    .limit(1)
)

# Show results
df.show()  # per-order revenue from Task 2
top_products.show()
avg_quantity_price.show()
revenue_per_customer.show()
highest_revenue_date.show()

# Stop the Spark session
spark.stop()
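As a sanity check, the aggregations can be worked out by hand from the sample data: product A sells 4 units in total (B: 3, C: 2), so A is the top seller; the average order has quantity 9/5 = 1.8 and price 65/5 = 13; total revenue per customer is 30 for customer 101, 55 for 102, and 40 for 103; and 2023-07-01 is the highest-revenue date with 65.

The same analysis can also be expressed in Spark SQL. A self-contained sketch for Task 6 (the view name `orders` is my choice for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ECommerceAnalysisSQL").getOrCreate()

data = [
    (1, 101, "2023-07-01", "A", 2, 10),
    (2, 102, "2023-07-01", "B", 3, 15),
    (3, 101, "2023-07-02", "A", 1, 10),
    (4, 103, "2023-07-02", "C", 2, 20),
    (5, 102, "2023-07-03", "A", 1, 10),
]
columns = ["order_id", "customer_id", "order_date", "product_id", "quantity", "price"]

# Register the sample data as a temporary view so it can be queried with SQL.
spark.createDataFrame(data, columns).createOrReplaceTempView("orders")

# Task 6 in SQL: the date with the highest total revenue (expected: 2023-07-01, 65).
spark.sql("""
    SELECT order_date, SUM(quantity * price) AS total_revenue
    FROM orders
    GROUP BY order_date
    ORDER BY total_revenue DESC
    LIMIT 1
""").show()

spark.stop()
```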