Datenanalyse- Verkaufszahlen

Verkaufszahlen Analyse – Python

kaggle data source:

https://www.kaggle.com/datasets/sinjoysaha/sales-analysis-dataset

Content

This is a transactions data from an Electronics store chain in the US. The data contains 12 CSV files for each month of 2019.
The naming convention is as follows: Sales_[MONTH_NAME]_2019
Each file contains anywhere from around 9000 to 26000 rows and 6 columns. The columns are as follows:
Order ID, Product, Quantity Ordered, Price Each, Order Date, Purchase Address
There are around 186851 data points combining all the 12-month files. There may be null values in some rows.

Repo1 Sales Analysis

In [ ]:

import pandas as pd
import os
import matplotlib.pyplot as plt

Zusammenführen von 12 Monaten Verkaufsdaten in einer einzigen CSV-Datei.

In [ ]:

files = [file for file in os.listdir("D:/py/sales_data/input_csvs/")]
all_months_data = pd.DataFrame() 

for file in files:
    df = pd.read_csv("D:/py/sales_data/input_csvs/"+file)
    all_months_data = pd.concat([all_months_data, df])

all_months_data.to_csv("D:/py/sales_data/all_data.csv", index=False)

Einlesen der Datei und anschauen der ersten Zeilen

In [ ]:

all_data=pd.read_csv("D:/py/sales_data/all_data.csv")
all_data.head()

Out[ ]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address
0	176558	USB-C Charging Cable	2	11.95	04/19/19 08:46	917 1st St, Dallas, TX 75001
1	NaN	NaN	NaN	NaN	NaN	NaN
2	176559	Bose SoundSport Headphones	1	99.99	04/07/19 22:30	682 Chestnut St, Boston, MA 02215
3	176560	Google Phone	1	600	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001
4	176560	Wired Headphones	1	11.99	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001

In [ ]:

all_data=all_data.dropna(how='all')
all_data.head()

Out[ ]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address
0	176558	USB-C Charging Cable	2	11.95	04/19/19 08:46	917 1st St, Dallas, TX 75001
2	176559	Bose SoundSport Headphones	1	99.99	04/07/19 22:30	682 Chestnut St, Boston, MA 02215
3	176560	Google Phone	1	600	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001
4	176560	Wired Headphones	1	11.99	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001
5	176561	Wired Headphones	1	11.99	04/30/19 09:27	333 8th St, Los Angeles, CA 90001

In [ ]:

#Removing rows with "Or" in "Order Date" column ?
all_data = all_data[all_data['Order Date'].str[0:2]!='Or']

all_data['Month'] = all_data['Order Date'].str[0:2]
all_data['Month'] = all_data['Month'].astype('int32')
all_data.head()

Out[ ]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Month
0	176558	USB-C Charging Cable	2	11.95	04/19/19 08:46	917 1st St, Dallas, TX 75001	4
2	176559	Bose SoundSport Headphones	1	99.99	04/07/19 22:30	682 Chestnut St, Boston, MA 02215	4
3	176560	Google Phone	1	600	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	4
4	176560	Wired Headphones	1	11.99	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	4
5	176561	Wired Headphones	1	11.99	04/30/19 09:27	333 8th St, Los Angeles, CA 90001	4

In [ ]:

all_data['Quantity Ordered'] = pd.to_numeric(all_data['Quantity Ordered'])
all_data['Price Each'] = pd.to_numeric(all_data['Price Each'])

all_data['Sales'] = all_data['Quantity Ordered']*all_data['Price Each']
all_data.head()

Out[ ]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Month	Sales
0	176558	USB-C Charging Cable	2	11.95	04/19/19 08:46	917 1st St, Dallas, TX 75001	4	23.90
2	176559	Bose SoundSport Headphones	1	99.99	04/07/19 22:30	682 Chestnut St, Boston, MA 02215	4	99.99
3	176560	Google Phone	1	600.00	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	4	600.00
4	176560	Wired Headphones	1	11.99	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	4	11.99
5	176561	Wired Headphones	1	11.99	04/30/19 09:27	333 8th St, Los Angeles, CA 90001	4	11.99

Welcher Monat verzeichnete die höchsten Verkaufszahlen und wie hoch war der Umsatz in diesem Monat?"¶

In [ ]:

all_data.groupby('Month').sum()

all_data.groupby('Month').sum().to_csv("D:/py/sales_data/all_data_month.csv", index=False)

C:\Users\gov\AppData\Local\Temp\ipykernel_3096\1744586910.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  all_data.groupby('Month').sum()
C:\Users\gov\AppData\Local\Temp\ipykernel_3096\1744586910.py:3: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  all_data.groupby('Month').sum().to_csv("D:/py/sales_data/all_data_month.csv", index=False)

In [ ]:

months = range(1,13)
results = all_data.groupby('Month').sum()

plt.bar(months, results['Sales'])
plt.xticks(months)
labels, location = plt.yticks()
plt.yticks(labels, (labels).astype(int))
plt.ylabel('Verkäufe in USD')
plt.xlabel('Monat')
plt.show()

C:\Users\gov\AppData\Local\Temp\ipykernel_3096\2101255051.py:2: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  results = all_data.groupby('Month').sum()

Welche Stadt hat die höchsten Verkaufszahlen?¶

In [ ]:

def get_city(address):
    return address.split(',')[1]

all_data['City'] = all_data['Purchase Address'].apply(lambda x: get_city(x) + ' ')

all_data.head()

Out[ ]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Month	Sales	City
0	176558	USB-C Charging Cable	2	11.95	04/19/19 08:46	917 1st St, Dallas, TX 75001	4	23.90	Dallas
2	176559	Bose SoundSport Headphones	1	99.99	04/07/19 22:30	682 Chestnut St, Boston, MA 02215	4	99.99	Boston
3	176560	Google Phone	1	600.00	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	4	600.00	Los Angeles
4	176560	Wired Headphones	1	11.99	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	4	11.99	Los Angeles
5	176561	Wired Headphones	1	11.99	04/30/19 09:27	333 8th St, Los Angeles, CA 90001	4	11.99	Los Angeles

In [ ]:

results2 = all_data.groupby('City').sum()
results2

C:\Users\gov\AppData\Local\Temp\ipykernel_3096\470953172.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  results2 = all_data.groupby('City').sum()

Out[ ]:

	Quantity Ordered	Price Each	Month	Sales
City
Atlanta	33204	5559816.40	209588	5590997.16
Austin	22306	3619747.22	139658	3639163.50
Boston	45056	7274819.54	282224	7323284.02
Dallas	33460	5505255.64	209240	5535950.80
Los Angeles	66578	10842870.46	416650	10905141.60
New York City	55864	9270741.66	351482	9328634.86
Portland	28106	4615494.94	175530	4640981.22
San Francisco	100478	16422923.48	631040	16524407.82
Seattle	33106	5466592.02	209882	5495510.96

In [ ]:

cities = [city for city, df in all_data.groupby('City')]

plt.bar(cities, results2['Sales'])
plt.xticks(cities, rotation='vertical', size = 8)
labels, location = plt.yticks()
plt.yticks(labels, (labels).astype(int))
plt.ylabel('Verkäufe in USD')
plt.xlabel('Stadt')
plt.show()

Zu welcher Zeit sollten wir Anzeigen schalten, um die Wahrscheinlichkeit zu maximieren, dass Kunden ein Produkt kaufen?¶

In [ ]:

all_data['Order_Date_DTO'] = pd.to_datetime(all_data['Order Date'])

all_data['Hour'] = all_data['Order_Date_DTO'].dt.hour
all_data.head()

Out[ ]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Month	Sales	City	Order_Date_DTO	Hour
0	176558	USB-C Charging Cable	2	11.95	04/19/19 08:46	917 1st St, Dallas, TX 75001	4	23.90	Dallas	2019-04-19 08:46:00	8
2	176559	Bose SoundSport Headphones	1	99.99	04/07/19 22:30	682 Chestnut St, Boston, MA 02215	4	99.99	Boston	2019-04-07 22:30:00	22
3	176560	Google Phone	1	600.00	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	4	600.00	Los Angeles	2019-04-12 14:38:00	14
4	176560	Wired Headphones	1	11.99	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	4	11.99	Los Angeles	2019-04-12 14:38:00	14
5	176561	Wired Headphones	1	11.99	04/30/19 09:27	333 8th St, Los Angeles, CA 90001	4	11.99	Los Angeles	2019-04-30 09:27:00	9

In [ ]:

results3 = all_data.groupby(['Hour'])['Quantity Ordered'].count()
hours = [hour for hour, df in all_data.groupby('Hour')]

plt.plot(hours, results3)
plt.xticks(hours)
plt.xlabel('Stunde')
plt.ylabel('Bestellungen Anzahl')
plt.grid()
plt.show()

Welcher Bundle Artikel würde sich empfehlen?¶

In [ ]:

new_all = all_data[all_data['Order ID'].duplicated(keep=False)]
new_all['Product_Bundle'] = new_all.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
new_all = new_all[['Order ID','Product_Bundle']].drop_duplicates()
new_all.head()

Out[ ]:

	Order ID	Product_Bundle
0	176558	USB-C Charging Cable,USB-C Charging Cable
2	176559	Bose SoundSport Headphones,Bose SoundSport Hea...
3	176560	Google Phone,Wired Headphones,Google Phone,Wir...
5	176561	Wired Headphones,Wired Headphones
6	176562	USB-C Charging Cable,USB-C Charging Cable

Welches Produkt wurde am meisten verkauft?¶

In [ ]:

product_group = all_data.groupby('Product')
quantity_ordered = product_group.sum()['Quantity Ordered']
products = [product for product, df in product_group]

plt.bar(products, quantity_ordered, color='b')
plt.ylabel('Bestellaufkommen', color='b')
plt.xlabel('Produktname')
plt.xticks(products, rotation='vertical', size=8)
plt.show()

C:\Users\gov\AppData\Local\Temp\ipykernel_3096\3496259732.py:4: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  quantity_ordered = product_group.sum()['Quantity Ordered']

Gegenüberstellung: Preise der Produkte und Bestellaufkommen¶

In [ ]:

prices = all_data.groupby('Product').mean()['Price Each'] 
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.bar(products, quantity_ordered, color='b')
ax2.plot(products, prices, color='r')

ax1.set_xlabel('Produktname')
ax1.set_ylabel('Bestellaufkommen', color='b')
ax2.set_ylabel('Preis in USD', color = 'r')
ax1.set_xticklabels(products, rotation='vertical', size=8)

plt.show()

C:\Users\gov\AppData\Local\Temp\ipykernel_3096\4049133573.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  prices = all_data.groupby('Product').mean()['Price Each']
C:\Users\gov\AppData\Local\Temp\ipykernel_3096\4049133573.py:10: UserWarning: FixedFormatter should only be used together with FixedLocator
  ax1.set_xticklabels(products, rotation='vertical', size=8)

Grischa Overbeck

Persönlicher Blog

Datenanalyse- Verkaufszahlen

Verkaufszahlen Analyse – Python

kaggle data source:

https://www.kaggle.com/datasets/sinjoysaha/sales-analysis-dataset

Content

Welcher Monat verzeichnete die höchsten Verkaufszahlen und wie hoch war der Umsatz in diesem Monat?"¶

Welche Stadt hat die höchsten Verkaufszahlen?¶

Zu welcher Zeit sollten wir Anzeigen schalten, um die Wahrscheinlichkeit zu maximieren, dass Kunden ein Produkt kaufen?¶

Welcher Bundle Artikel würde sich empfehlen?¶

Welches Produkt wurde am meisten verkauft?¶

Gegenüberstellung: Preise der Produkte und Bestellaufkommen¶