Neo4j: Analyzing the supplier's list of Apple and Samsung
In this article, we will review Neo4j basics by getting data about Apple and Samsung supplier lists. We are analyzing the supplier list of Apple and Samsung to compare what common suppliers they have and where they are located.
You can follow the tutorial here in the ipynb notebook
We started with the Apple supplier list from here and for Samsung, we scraped data from here. Here we are comparing the supplier location for both of these electronics companies and where are they located. We have different fields like the address name, country name and supplier name. We had to manually clean the data for some integrity.
DATA COLLECTION PIPELINE
The Apple supplier list was a PDF, so to scrape the relevant fields, we used Pymudf and with below code
from unstructured.partition.auto import partition
elements = partition("Apple-Supplier-List.pdf",include_page_breaks=True)
companies = []
places = []
countries = []
apple_suppliers_dict = {}
num_places = 1
num_countries = 1
for idx,elem in enumerate(elements):
elem_dict = elem.to_dict()
if elem_dict['type']=='PageBreak':continue
top_left = elem_dict['metadata']['coordinates']['points'][0]
top_left_x = top_left[0]
#Supplier name in first column
if 35<top_left_x<37:
company_text = elem_dict['text']
if company_text=="SUPPLIER NAME":continue
companies.append(company_text)
apple_suppliers_dict.update({company_text:{"places":[],"countries":[]}})
num_places = 1
num_countries = 1
# Location name in second column
elif 469<top_left_x<471:
place_text = elem_dict['text']
if place_text=='PRIMARY LOCATIONS WHERE MANUFACTURING FOR APPLE OCCURS': continue
places.append(place_text)
apple_suppliers_dict[company_text]['places'].append((place_text,num_places))
num_places+=1
# Country name in third column
elif 889<top_left_x<891:
country_text = elem_dict['text']
countries.append(country_text)
apple_suppliers_dict[company_text]['countries'].append((country_text,num_countries))
num_countries+=1
There was some missing data on the country name, so I had to do a manual adjustment
apple_suppliers_dict['Murata Manufacturing Company Limited'] = {'places': [('Guangdong, Jiangsu', 1),
('Fukui, Fukushima, Ishikawa, Miyagi, Miyazaki, Nagano, Okayama,Shiga, Shimane, Toyama', 2),
('Perak', 3),
('Singapore', 4),
('Taiwan', 5),
('Lamphun', 6),
('Da Nang, Tien Giang', 7)],
'countries': [('China mainland', 1),
('Japan', 2),
('Malaysia', 3),
('Singapore', 4),
('Taiwan', 5),
('Thailand', 6),
('Vietnam', 7)]}
faulty_dict = {}
for comp, place in apple_suppliers_dict.items():
if len(place['places'])!=len(place['countries']):
print(comp)
faulty_dict = apple_suppliers_dict[comp]
#No faulty dict found now
Preparing the final data for Apple
import pandas as pd
apple_df = pd.DataFrame(columns=['Supplier','Address','Countries'])
apple_comp = []
apple_places = []
apple_countries = []
for comp, place in apple_suppliers_dict.items():
for pl,vals in place.items():
if pl=="places":
for val in vals:
apple_comp.append(comp)
apple_places.append(",".join(val[0].split(",")))
elif pl=="countries":
for val in vals:
apple_countries.append(val[0])
apple_df['Supplier'] = apple_comp
apple_df['Address'] = apple_places
apple_df['Countries'] = apple_countries
For Samsung, we had a pandas data frame
To separate the country names and Addresses, we use the following helper functions
import re
def get_country(addr:str):
addr = addr.split(",")[-1]
addr = re.sub("\n","",addr)
return addr.strip()
def remove_country(addr:str):
addr = ",".join(addr.split(",")[:-1])
addr = re.sub("\n"," ",addr)
return addr.strip()
sam_df['Countries'] = sam_df['Address'].apply(get_country)
sam_df['Address'] = sam_df['Address'].apply(remove_country)
THE TWO CSV FILES USED FOR THIS HOMEWORK ARE BELOW (Google Drive links)
CREATING THE DATABASE
FROM ARROW TOOL
THE ABOVE SCHEMA IS A ROUGH SKETCH AND IN THE DATABASE, WE HAVE JUST CHANGED THE PROPERTY NAME LIKE `Company Name` BECAME name
A sample data insertion to Neo4j would look like the following
CREATE (apple:Company {name:'Apple
Inc.',headquarter:'Cupertino
California',sector:'Technology'}),(samsung:Company{name:'
Samsung Electronics Inc.',headquarter:'Suwon-si South
Korea',sector:’Technology’})
We have three primary nodes Apple Inc., Samsung Electronics Inc. and Suppliers company name, which has multiple nodes. The relationship between the supplier and company is named supplier and each supplier node has properties on address and country
THE DATA INSERTION CODE IS VERY LONG, FOR BREVITY IN THE REPORT, PLEASE REFER TO THESE GOOGLE DRIVE LINKS FOR THE CODE USED
- Apple Only data: (Example: CREATE (:Company {name:’Apple Inc.’})<-[:supplier]-(:Supplier {name:’3M’,short_address:’Alabama, Indiana, Iowa, Minnesota, Ohio, South Carolina, Wisconsin’,country:’United States’,full_address:’’}))
- Samsung only data: (Example:CREATE (:Company {name:’Samsung Electronics Inc.’})<-[:supplier]-(:Supplier {name:’Adeka Corporation’,short_address:’Jeollabuk-Do’,country:’Korea’,full_address:’Yuhang Rd, Shuyang Economic Develop Suqian City, Jiangsu Province’}))
- Both Apple and Samsung Supplier Data: (Example: CREATE (:Company {name:’Apple Inc.’})<-[:supplier]-(:Supplier {name:’AAC Acoustic Technologies Holdings Incorporated’,short_address:’Jiangsu’,country:’China’,full_address:’213167 №3 Changcao Road, Hi-Tech Industrial Zone, Wujin, Changzhou City, Jiangsu Province’})-[:supplier]->(:Company{name:’Samsung Electronics Inc.’}))
You can see in the common Apple and Samsung supplier are separated with a relation syntax. You don’t have to manually create these text files, you can use the following helper functions
APPLE ONLY SUPPLIERS
apple_file = open('apple.txt','a')
for idx,row in apple_df.iterrows():
supplier_name = row['Supplier']
if supplier_name in sam_suppliers: continue
short_address_name = row['Address']
country_name = row['Countries']
full_address_name = ""
# params = {"companyName":'Samsung Electronics Inc.',"supplierName":supplier_name,"short_address":short_address_name,"country":country_name,"full_address":{full_address_name}}
cypher_query = f"CREATE (:Company {{name:'Apple Inc.'}})<-[:supplier]-(:Supplier {{name:'{supplier_name}',short_address:'{short_address_name}',country:'{country_name}',full_address:'{full_address_name}'}})\n"
apple_file.write(cypher_query)
SAMSUNG ONLY SUPPLIERS
samsung_file = open('samsung.txt','a')
for idx,row in sam_procesed.iterrows():
supplier_name = row['Supplier']
if supplier_name in apple_suppliers: continue
short_address_name = row['Address']
country_name = row['Countries']
full_address_name = row['Full_Address']
prop_1 = "name"
# params = {"companyName":'Samsung Electronics Inc.',"supplierName":supplier_name,"short_address":short_address_name,"country":country_name,"full_address":{full_address_name}}
cypher_query = f"CREATE (:Company {{name:'Samsung Electronics Inc.'}})<-[:supplier]-(:Supplier {{name:'{supplier_name}',short_address:'{short_address_name}',country:'{country_name}',full_address:'{full_address_name}'}})\n"
samsung_file.write(cypher_query)
COMBINED
comb_file = open('comb.txt','a')
for idx,row in sam_procesed.iterrows():
supplier_name = row['Supplier']
if supplier_name not in apple_suppliers: continue
short_address_name = row['Address']
country_name = row['Countries']
full_address_name = row['Full_Address']
cypher_query = f"CREATE (:Company {{name:'Apple Inc.'}})<-[:supplier]-(:Supplier {{name:'{supplier_name}',short_address:'{short_address_name}',country:'{country_name}',full_address:'{full_address_name}'}})-[:supplier]->(:Company{{name:'Samsung Electronics Inc.'}})\n"
comb_file.write(cypher_query)
Now, you can copy paste these to Neo4j console and create the database
call db.schema.visualization;
DATA EXPLORATION
- Suppliers that are located in China who supply Apple
MATCH (c:Company{name:'Apple Inc.'})<-[sr:supplier]-
(s:Supplier {country:'China'}) RETURN c,s,sr limit 3;
- India-based suppliers to Samsung Electronics
MATCH (c:Company{name:'Samsung Electronics Inc.'})<-
[sr:supplier]-(s:Supplier {country:'India'}) RETURN c,s,sr;
- Common suppliers of Apple and Samsung based in China
MATCH (c:Company{name:'Samsung Electronics Inc.'})<-
[sr:supplier]-(s:Supplier {country:'China'})-[:supplier]-
>(c1:Company {name:'Apple Inc.'}) RETURN c1,c,s,sr limit 3;
- What does Samsung supply to Apple?
MATCH (c:Company{name:'Apple Inc.'})<-[:supplier]-
(s:Supplier) where s.name CONTAINS 'Samsung' return
c,s,s.country;
- REGIONS WHERE SAMSUNG SUPPLIES APPLE
MATCH (c:Company{name:'Apple Inc.'})<-[:supplier]-
(s:Supplier) where s.name CONTAINS 'Samsung' return
s.country,s.short_address;
- WHO ARE THE DISPLAY PROVIDERS FOR APPLE AND SAMSUNG?
APPLE
MATCH (c:Company{name:'Apple Inc.'})<-[:supplier]-
(s:Supplier) where s.name CONTAINS 'Display' return
s.country,s.short_address,s.name;
SAMSUNG
MATCH (c:Company{name:'Samsung Electronics Inc.'})<-
[:supplier]-(s:Supplier) where s.name CONTAINS 'Display'
return s.country,s.short_address,s.name;
- TSMC SUPPLY REGIONS
MATCH (c:Company{name:'Apple Inc.'})<-[:supplier]-
(s:Supplier) where s.name CONTAINS 'Taiwan Semiconductor'
return s.country,s.short_address;
- Which country has the most suppliers for Apple and Samsung respectively?
APPLE
MATCH (c:Company{name:'Apple Inc.'})<-[:supplier]-
(s:Supplier) return COUNT(*) AS `Suppliers Count`,s.country
order by `Suppliers Count` desc;
SAMSUNG
MATCH (c:Company{name:'Samsung Electronics Inc.'})<-
[:supplier]-(s:Supplier) return COUNT(*) AS `Suppliers
Count`,s.country order by `Suppliers Count` desc;
This brings us to the end of the tutorial on Neo4j, where we demonstrated an end-to-end pipeline for data exploration of the Apple and Samsung suppliers list. If you have any queries, I am one comment away.