🎯PySpark Challenge:
Analyzing E-Commerce Sales Data

πŸ”ŽΠ˜ΡΡ‚ΠΎΡ€ΠΈΡ:
Π’ΠΈΠ΅ Ρ€Π°Π±ΠΎΡ‚ΠΈΡ‚Π΅ Π·Π° компания Π·Π° Π΅Π»Π΅ΠΊΡ‚Ρ€ΠΎΠ½Π½Π° Ρ‚ΡŠΡ€Π³ΠΎΠ²ΠΈΡ ΠΈ Ρ‚Π΅ са Π²ΠΈ прСдоставили Π½Π°Π±ΠΎΡ€ ΠΎΡ‚ Π΄Π°Π½Π½ΠΈ, ΡΡŠΠ΄ΡŠΡ€ΠΆΠ°Ρ‰ информация Π·Π° Ρ‚Π΅Ρ…Π½ΠΈΡ‚Π΅ ΠΏΡ€ΠΎΠ΄Π°ΠΆΠ±ΠΈ. Π’Π°ΡˆΠ°Ρ‚Π° Π·Π°Π΄Π°Ρ‡Π° Π΅ Π΄Π° ΠΈΠ·Π²ΡŠΡ€ΡˆΠ²Π°Ρ‚Π΅ Ρ€Π°Π·Π»ΠΈΡ‡Π½ΠΈ трансформации Π½Π° Π΄Π°Π½Π½ΠΈ с ΠΏΠΎΠΌΠΎΡ‰Ρ‚Π° Π½Π° PySpark, Π·Π° Π΄Π° Π³Π΅Π½Π΅Ρ€ΠΈΡ€Π°Ρ‚Π΅ прозрСния.

πŸ“ŠSample Data:

| order_id | customer_id | order_date | product_id | quantity | price |
|----------|-------------|------------|------------|----------|-------|
| 1        | 101         | 2023-07-01 | A          | 2        | 10    |
| 2        | 102         | 2023-07-01 | B          | 3        | 15    |
| 3        | 101         | 2023-07-02 | A          | 1        | 10    |
| 4        | 103         | 2023-07-02 | C          | 2        | 20    |
| 5        | 102         | 2023-07-03 | A          | 1        | 10    |

🎯Challenge Tasks:
1. Load the dataset into a PySpark DataFrame (see the loading sketch after this list).
2. Calculate the total revenue for each order.
3. Find the top-selling products (by total quantity sold) in the dataset.
4. Calculate the average quantity and price per order.
5. Determine the total revenue for each customer.
6. Identify the date with the highest total revenue.

This challenge covers various aspects of data transformation using PySpark's DataFrame API.
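
The solution at the end of this post builds the DataFrame from an inline list of rows. In practice, the sample table above would more likely arrive as a file; here is a minimal loading sketch for Task 1, assuming the data sits in a CSV file named sales.csv with a header row (both the file name and the explicit schema are assumptions, not part of the challenge):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("ECommerceAnalysis").getOrCreate()

# Declaring the schema up front avoids relying on schema inference
schema = StructType([
    StructField("order_id", IntegerType()),
    StructField("customer_id", IntegerType()),
    StructField("order_date", StringType()),
    StructField("product_id", StringType()),
    StructField("quantity", IntegerType()),
    StructField("price", IntegerType()),
])

df = spark.read.csv("sales.csv", header=True, schema=schema)
df.show()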

βœοΈΠΠ°ΠΏΠΈΡˆΠ΅Ρ‚Π΅ своСто Ρ€Π΅ΡˆΠ΅Π½ΠΈΠ΅ Π² ΠΏΠΎΠ»Π΅Ρ‚ΠΎ Π·Π° ΠΊΠΎΠΌΠ΅Π½Ρ‚Π°Ρ€ΠΈ

πŸ‘‰Follow for more: Sandeep Suthrame

#pyspark #databricks #dataengineering #dataanalytics #datascience

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg

# Create a Spark session
spark = SparkSession.builder.appName("ECommerceAnalysis").getOrCreate()

# Task 1: Load the dataset into a PySpark DataFrame
data = [
    (1, 101, "2023-07-01", "A", 2, 10),
    (2, 102, "2023-07-01", "B", 3, 15),
    (3, 101, "2023-07-02", "A", 1, 10),
    (4, 103, "2023-07-02", "C", 2, 20),
    (5, 102, "2023-07-03", "A", 1, 10)
]

columns = ["order_id", "customer_id", "order_date", "product_id", "quantity", "price"]
df = spark.createDataFrame(data, columns)

# Task 2: Calculate total revenue for each order
df = df.withColumn("revenue", col("quantity") * col("price"))

# Task 3: Top-selling products
top_products = df.groupBy("product_id").agg(sum("quantity").alias("total_quantity_sold"))
top_products = top_products.orderBy(col("total_quantity_sold").desc()).limit(3)

# Task 4: Calculate the average quantity and price per order (averaged across all orders)
avg_quantity_price = df.agg(avg("quantity").alias("avg_quantity"), avg("price").alias("avg_price"))

# Task 5: Total revenue per customer
revenue_per_customer = df.groupBy("customer_id").agg(sum("revenue").alias("total_revenue"))

# Task 6: Date with highest total revenue
highest_revenue_date = df.groupBy("order_date").agg(sum("revenue").alias("total_revenue"))
highest_revenue_date = highest_revenue_date.orderBy(col("total_revenue").desc()).limit(1)
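
# Optional alternative (not part of the original task list): the same per-customer
# aggregation can also be expressed in Spark SQL by registering the DataFrame as a
# temporary view. The view name "orders" is an arbitrary choice.
df.createOrReplaceTempView("orders")
spark.sql(
    "SELECT customer_id, SUM(revenue) AS total_revenue "
    "FROM orders GROUP BY customer_id ORDER BY total_revenue DESC"
).show()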

# Show results
df.select("order_id", "revenue").show()   # Task 2: revenue per order
top_products.show()                       # Task 3: top-selling products
avg_quantity_price.show()                 # Task 4: average quantity and price
revenue_per_customer.show()               # Task 5: revenue per customer
highest_revenue_date.show()               # Task 6: date with highest revenue

# Stop the Spark session
spark.stop()
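
A small design note on the solution above: importing sum and avg directly from pyspark.sql.functions shadows Python's built-in functions of the same name. A common alternative (purely stylistic, not required by the challenge) is to import the functions module under an alias; the Task 3 aggregation would then read:

from pyspark.sql import functions as F

top_products = df.groupBy("product_id").agg(F.sum("quantity").alias("total_quantity_sold"))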