В този проект:
1) Работихме, за да напишем оптимална заявка за нашия аналитичен модел в услугата BigQuery.
2) Обучение на нашия модел на машинно обучение.
Първо вземам публичен набор от данни, наречен Пътувания с жълто такси в Ню Йорк.
Първо ще започнем с изследване на нашия набор от данни.
#standardSQL SELECT TIMESTAMP_TRUNC(pickup_datetime, MONTH) month, COUNT(*) trips FROM `bigquery-public-data.new_york.tlc_yellow_trips_2015` GROUP BY 1 ORDER BY 1
Тази заявка ще ви даде резултат с колони като месец и брой общи пътувания, извършени през този месец.
Сега ще изберем функции (колони) за обучение на нашия модел.
- Сума на пътните такси
- Сума на билета
- Час на деня
- Адрес за получаване
- Адрес за предаване
- Брой пътници
#standardSQL WITH params AS ( SELECT 1 AS TRAIN, 2 AS EVAL ), daynames AS (SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek), taxitrips AS ( SELECT (tolls_amount + fare_amount) AS total_fare, daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek, EXTRACT(HOUR FROM pickup_datetime) AS hourofday, pickup_longitude AS pickuplon, pickup_latitude AS pickuplat, dropoff_longitude AS dropofflon, dropoff_latitude AS dropofflat, passenger_count AS passengers FROM `nyc-tlc.yellow.trips`, daynames, params WHERE trip_distance > 0 AND fare_amount > 0 AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),1000) = params.TRAIN ) SELECT * FROM taxitrips
За да създадем BigQuery ML модел, първо трябва да създадем набор от данни в нашия gcp проект.
— Кръщавам ID на набора си от данни на такси.
Сега нека създадем и обучим нашето машинно обучение.
Този модел трябва да бъде тип линейна регресия, защото трябва да предвидим определена стойност.
CREATE or REPLACE MODEL taxi.taxifare_model OPTIONS (model_type='linear_reg', labels=['total_fare']) AS WITH params AS ( SELECT 1 AS TRAIN, 2 AS EVAL ), daynames AS (SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek), taxitrips AS ( SELECT (tolls_amount + fare_amount) AS total_fare, daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek, EXTRACT(HOUR FROM pickup_datetime) AS hourofday, pickup_longitude AS pickuplon, pickup_latitude AS pickuplat, dropoff_longitude AS dropofflon, dropoff_latitude AS dropofflat, passenger_count AS passengers FROM `nyc-tlc.yellow.trips`, daynames, params WHERE trip_distance > 0 AND fare_amount > 0 AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),1000) = params.TRAIN ) SELECT * FROM taxitrips
Сега ще оценим нашия ML модел.
#standardSQL SELECT SQRT(mean_squared_error) AS rmse FROM ML.EVALUATE(MODEL taxi.taxifare_model, ( WITH params AS ( SELECT 1 AS TRAIN, 2 AS EVAL ), daynames AS (SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek), taxitrips AS ( SELECT (tolls_amount + fare_amount) AS total_fare, daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek, EXTRACT(HOUR FROM pickup_datetime) AS hourofday, pickup_longitude AS pickuplon, pickup_latitude AS pickuplat, dropoff_longitude AS dropofflon, dropoff_latitude AS dropofflat, passenger_count AS passengers FROM `nyc-tlc.yellow.trips`, daynames, params WHERE trip_distance > 0 AND fare_amount > 0 AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),1000) = params.EVAL ) SELECT * FROM taxitrips ))
След като оцените нашия модел, вие получавате RMSE от $9,47...
Накрая ще прогнозираме тарифа за пътуване с такси с този модел за анализ на данни.
#standardSQL SELECT * FROM ml.PREDICT(MODEL `taxi.taxifare_model`, ( WITH params AS ( SELECT 1 AS TRAIN, 2 AS EVAL ), daynames AS (SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek), taxitrips AS ( SELECT (tolls_amount + fare_amount) AS total_fare, daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek, EXTRACT(HOUR FROM pickup_datetime) AS hourofday, pickup_longitude AS pickuplon, pickup_latitude AS pickuplat, dropoff_longitude AS dropofflon, dropoff_latitude AS dropofflat, passenger_count AS passengers FROM `nyc-tlc.yellow.trips`, daynames, params WHERE trip_distance > 0 AND fare_amount > 0 AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),1000) = params.EVAL ) SELECT * FROM taxitrips ));
И накрая, можем лесно да видим в изходната таблица, в която има предвидена стойност на това конкретно пътуване, т.е. predicted_total_fare въз основа на параметрите, които сме предоставили, с действителната тарифа, т.е. total_fare за сравнение на прогнозите.
Яййй!! Успешно създадохме модел и предвидихме стойност за таксата за такси, всичко това в Google Cloud Platform!✌