Getting Facts data
In this chapter, we will read, write and understand typical Finance, Supply chain datasets in Julia language. These datasets will layout foundations for Finance, Supply Chain Anaytics.
Target Audience: This chapter, is meant for Julia Language consultants, ERP Analysts, IT Developers, Finance, Supply chain, HR & CRM managers, executive leaders or anyone curious to implement data science concepts in ERP space.
Once we understand these datasets, in following chapters, we will create Graphs, Vertices and Edges for data analysis, load actual data into Graphs.P2P.jl package supports these ERP systems data structures.
most of the time, Organizations store these datasets in RDBMS tables, document databases and in some case, actual PDFs, images serve as document data itself.
In previous chapter, we saw business process operational workflow diagram and also learned high level physical ERDs.
In this chapter, we will use Julia language and Julia packages to mimic Finance & Supply chain data. Majority of this data is sample data and DOES NOT bear any resemblance to real life data.
at the same time, I am using lots of actual real life data like UNSPSC codes, GUDID, Vendor, Item master data. Also, These datasets are not specific to any particular ERP systems like SAP, Oracle etc.
However, these sample dataset below are very close to real life data sets and are great assets to learn ERP systems architecture.
let's get started.
Background: Most of Enterprise ERP providers like SAP, Oracle, Microsoft build HCM, Finance, Supply Chain, CRM like systems, which store data in highly structured RDBMS tables. Recent advancements in ERP systems also support authoring non-structured data like digital invoices, receipt or hand-held OCR readers.
All of these ERP systems are great OLTP systems, but depend on Analytic systems for creating dashboards, ad-hoc analysis, operational reporting or live predictive analytics.
Further, ERP systems depend on ELT/ELT or 3rd party tools for data mining, analysis and visualizations.
While data engineers use Java, Scala, SPARK based big data solutions to move data, they depend on 3rd party BI Reporting tools for creating dashboards, use Data Mining tools for data cleansing and AI Languages for advance predictive analytics.
When I started learning more about Julia Language, I thought of using Julia Language to solve ERP Analytics multiple languages problem. Why not just use Julia Language to move, clean massive data set as Big data reporting solution, as Julia support multi-threading, distributing parallel computing. Julia language and associated packages has first class support for large arrays, which can be used for data analysis.
and Julia has great visualization packages to publish interactive dashboards, live data reporting.
best of all, Julia is great in numerical computing, advance data science machine learning.
This blog, I am sharing my notes specific to perform typical ERP data analysis using Julia Language.
- About ERP Systems, General Ledger & Supply chain
A typical ERP system consists of many modules based on business domain, functions and operations. GL is core of Finance and Supply chain domains and Buy to Pay, Order to Cash deal with different aspects of business operations in an Organization. Many organization, use ERPs in different ways and may chose to implement all or some of the modules. You can find examples of module specific business operations/processes diagram here.
- General Ledger process flow
- Account Payable process flow
- Tax Analytics
- Sample GL ERD - Entity Relaton Diagram
A typical ERP modules list looks like below diagram.
A typical ERP business process flow looks like below diagram.
A typical GL Balance sheet, Cash-flow or Income Statement looks like this
In this notebook, I will do my best to cite examples from real world data like above mentioned GL Financial statement.
start with Julia
It literally takes < 1 min to install Julia environments on almost any machine.
Here is link to my tutorial, which discuss Julia installation on different machines (including remote and mobile tablets).
adding Packages
using Pkg
Pkg.add("DataFrames")
Pkg.add("Dates")
Pkg.add("CategoricalArrays")
using DataFrames, Dates, CategoricalArrays
Pkg.status()
Resolving package versions...
No Changes to `~/.julia/environments/v1.7/Project.toml`
No Changes to `~/.julia/environments/v1.7/Manifest.toml`
Resolving package versions...
No Changes to `~/.julia/environments/v1.7/Project.toml`
No Changes to `~/.julia/environments/v1.7/Manifest.toml`
Resolving package versions...
No Changes to `~/.julia/environments/v1.7/Project.toml`
No Changes to `~/.julia/environments/v1.7/Manifest.toml`
Status `~/.julia/environments/v1.7/Project.toml`
[336ed68f] CSV v0.10.4
[54eefc05] Cascadia v1.0.2
[324d7699] CategoricalArrays v0.10.6
[8f4d0f93] Conda v1.7.0
[a93c6f00] DataFrames v1.3.4
[31c24e10] Distributions v0.25.64
[e30172f5] Documenter v0.27.19
[8f5d6c58] EzXML v1.1.0
[708ec375] Gumbo v0.8.1
[cd3eb016] HTTP v0.9.17
[7073ff75] IJulia v1.23.3
[c601a237] Interact v0.10.4
[0f8b85d8] JSON3 v1.9.5
[b9914132] JSONTables v1.0.3
[4d0d745f] PDFIO v0.1.13
[c3e4b0f8] Pluto v0.19.9
[2dfb63ee] PooledArrays v1.4.2
[438e738f] PyCall v1.93.1
[88034a9c] StringDistances v0.11.2
[a2db99b7] TextAnalysis v0.7.3
[05625dda] WebDriver v0.1.2
[0f1e0344] WebIO v0.8.18
[fdbf4ff8] XLSX v0.7.10
[ade2ca70] Dates
[8bb1440f] DelimitedFiles
rest of this blog, I will assume, all packages are added and imported in current namespace/notebook scope.
# run one command at a time
repeat(["AMIT","SHUKLA"], inner=5) # repeat list/string number of times
fill("34", 4) # repeat list/string number of times
range(1.0, stop=9.0, length=100) # generate n number of equal values between start and stop values
11000:1000:45000 # genarate a range of # from start to finish with set intervals
collect(1:4) # collect funtion collect all values in list
rand([1,2,3,4]) # random value from a list of values
rand(11000:1000:45000) # random value from a list of values
randn() # random # from a list of float values (+ or -)
0.6806598133405635
using DataFrames, CategoricalArrays
# run one command at a time
# basic dataframe is constructed by passing column vectors (think of adding one excel column at a time)
org = "Apple Inc" # is a simple string
_ap = [1,2] # this is a vector
ap = categorical(_ap) # this is a vector
fy = categorical(repeat([2022], inner=2)) # this is a vector
actuals, budget = (98.40, 100) # this is a tuple
amount = (actuals = 98.54, budget = 100) # this is a named tuple
df_Ledger = DataFrame(Entity=fill(org), FiscalYear=fy, AccountingPeriod = ap, Actuals = actuals, Budget = budget)
# fill(org) or org will produce same results
# run one command at a time
# adding one row at a time, can be done, but is not very efficient
push!(df_Ledger, Dict(:Entity => "Google", :FiscalYear => 2022,
:AccountingPeriod => 1, :Actuals => 95.42, :Budget => 101))
push!(df_Ledger, Dict(:Entity => "Google", :FiscalYear => 2022,
:AccountingPeriod => 2, :Actuals => 91.42, :Budget => 99))
4 rows × 5 columns
Entity | FiscalYear | AccountingPeriod | Actuals | Budget | |
---|---|---|---|---|---|
String | Cat… | Cat… | Float64 | Int64 | |
1 | Apple Inc | 2022 | 1 | 98.4 | 100 |
2 | Apple Inc | 2022 | 2 | 98.4 | 100 |
3 | 2022 | 1 | 95.42 | 101 | |
4 | 2022 | 2 | 91.42 | 99 |
create Finance Data model
Chart of accounts (organized hierarchy of account groups in tree form), Location/Department or Product based hierarchy allows businesses to group and report organization activities based on business processes.
These hierarchical grouping help capture monetary and statistical values of organization in finance statements.
To create Finance Data model and Ledger Cash-flow or Balance Sheet like statements, We need associated dimensions (chartfields like chart of accounts).
We will discuss how to load actual data from CSV or RDBMS later. We will also learn how to group and create chartfield hierarchies later.
But for now, first Let's start with creating fake ACCOUNT, department and location chartfields.
using DataFrames, Dates
# create dummy data
accountsDF = DataFrame(
ENTITY = "Apple Inc.",
AS_OF_DATE=Date("1900-01-01", dateformat"y-m-d"),
ID = 11000:1000:45000,
CLASSIFICATION=repeat([
"OPERATING_EXPENSES","NON-OPERATING_EXPENSES", "ASSETS","LIABILITIES","NET_WORTH","STATISTICS","REVENUE"
], inner=5),
CATEGORY=[
"Travel","Payroll","non-Payroll","Allowance","Cash",
"Facility","Supply","Services","Investment","Misc.",
"Depreciation","Gain","Service","Retired","Fault.",
"Receipt","Accrual","Return","Credit","ROI",
"Cash","Funds","Invest","Transfer","Roll-over",
"FTE","Members","Non_Members","Temp","Contractors",
"Sales","Merchant","Service","Consulting","Subscriptions"],
STATUS="A",
DESCR=repeat([
"operating expenses","non-operating expenses","assets","liability","net-worth","stats","revenue"], inner=5),
ACCOUNT_TYPE=repeat(["E","E","A","L","N","S","R"],inner=5));
show("Accounts DIM size is: "), show(size(accountsDF)), show("Accounts Dim sample: "), accountsDF[collect(1:5:35),:]
(nothing, nothing, nothing, 7×8 DataFrame
Row │ ENTITY AS_OF_DATE ID CLASSIFICATION CATEGORY ST ⋯
│ String Dates.Date Int64 String String St ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ Apple Inc. 1900-01-01 11000 OPERATING_EXPENSES Travel A ⋯
2 │ Apple Inc. 1900-01-01 16000 NON-OPERATING_EXPENSES Facility A
3 │ Apple Inc. 1900-01-01 21000 ASSETS Depreciation A
4 │ Apple Inc. 1900-01-01 26000 LIABILITIES Receipt A
5 │ Apple Inc. 1900-01-01 31000 NET_WORTH Cash A ⋯
6 │ Apple Inc. 1900-01-01 36000 STATISTICS FTE A
7 │ Apple Inc. 1900-01-01 41000 REVENUE Sales A
3 columns omitted)
There is lot to unpack here in above Julia code and lot is wrong (not best practice for sure).
First, what is a dataframe anyway, think of Julia DataFrame as tabular representation of data arranged in rows and columns. Unlike SQL, you should get into habit of reading and writing one column at a time (not because of reason, you can't read/write rows) for faster performance. Each column is an Array or a list of values, referred as vector.
Above Julia code creates accounts dataframe with columns name as ASOFDATE, DESCR, CATEGORY, ACCOUNT_TYPE, CLASSIFICATION, STATUS.
There are 35 rows, with same ASOFDATE, IDs starting from 11000-45000 in 1000 incremental values, all with STATUS = A (Active), 7 distinct Descriptions and account types (E=Expense, L=Liability, A= Assets, N=Net worth, S=Stats, R=Revenue) repeating 5 times per category.
For 35 rows, it's fine to store data like this, but now is a good time to learn about Categorical and Pooled Arrays, in case when dataframe has millions of rows.
Accounts chartfield
using Pkg, DataFrames, CategoricalArrays, PooledArrays, Dates
# here CLASSIFICATION column vector stores 3500 distinct values in an array
CLASSIFICATION=repeat(["OPERATING_EXPENSES","NON-OPERATING_EXPENSES", "ASSETS","LIABILITIES","NET_WORTH","STATISTICS","REVENUE"
], inner=500)
cl = categorical(CLASSIFICATION)
levels(cl)
# using PooledArrays
pl = categorical(CLASSIFICATION)
levels(pl)
# show values in tabular format
# run one command at a time
df = DataFrame(Dict("Descr" => "CLASSIFICATION...ARR...", "Value" => size(CLASSIFICATION)[1]))
push!(df,("CAT...ARR...",size(cl)[1]))
push!(df,("CAT...ARR..COMPRESS.",size(compress(cl))[1]))
push!(df,("POOL...ARR...",size(pl)[1]))
push!(df,("POOL...ARR..COMPRESS.",size(compress(pl))[1]))
push!(df,("CAT...LEVELs...",size(levels(cl))[1]))
push!(df,("POOL...LEVELs...",size(levels(pl))[1]))
push!(df,("CLASSIFICATION...MEMSIZE", Base.summarysize(CLASSIFICATION)))
push!(df,("CAT...ARR...MEMSIZE", Base.summarysize(cl)))
push!(df,("POOL...ARR...MEMSIZE", Base.summarysize(pl)))
push!(df,("CAT...ARR..COMPRESS...MEMSIZE", Base.summarysize(compress(cl))))
push!(df,("POOL...ARR..COMPRESS...MEMSIZE", Base.summarysize(compress(pl))))
first(df,5)
accountsDF = DataFrame(
ENTITY = "Apple Inc.",
AS_OF_DATE=Date("1900-01-01", dateformat"y-m-d"),
ID = 11000:1000:45000,
CLASSIFICATION=repeat([
"OPERATING_EXPENSES","NON-OPERATING_EXPENSES", "ASSETS","LIABILITIES","NET_WORTH","STATISTICS","REVENUE"
], inner=5),
CATEGORY=[
"Travel","Payroll","non-Payroll","Allowance","Cash",
"Facility","Supply","Services","Investment","Misc.",
"Depreciation","Gain","Service","Retired","Fault.",
"Receipt","Accrual","Return","Credit","ROI",
"Cash","Funds","Invest","Transfer","Roll-over",
"FTE","Members","Non_Members","Temp","Contractors",
"Sales","Merchant","Service","Consulting","Subscriptions"],
STATUS="A",
DESCR=repeat([
"operating expenses","non-operating expenses","assets","liability","net-worth","stats","revenue"], inner=5),
ACCOUNT_TYPE=repeat(["E","E","A","L","N","S","R"],inner=5))
35 rows × 8 columns
ENTITY | AS_OF_DATE | ID | CLASSIFICATION | CATEGORY | STATUS | DESCR | ACCOUNT_TYPE | |
---|---|---|---|---|---|---|---|---|
String | Date | Int64 | String | String | String | String | String | |
1 | Apple Inc. | 1900-01-01 | 11000 | OPERATING_EXPENSES | Travel | A | operating expenses | E |
2 | Apple Inc. | 1900-01-01 | 12000 | OPERATING_EXPENSES | Payroll | A | operating expenses | E |
3 | Apple Inc. | 1900-01-01 | 13000 | OPERATING_EXPENSES | non-Payroll | A | operating expenses | E |
4 | Apple Inc. | 1900-01-01 | 14000 | OPERATING_EXPENSES | Allowance | A | operating expenses | E |
5 | Apple Inc. | 1900-01-01 | 15000 | OPERATING_EXPENSES | Cash | A | operating expenses | E |
6 | Apple Inc. | 1900-01-01 | 16000 | NON-OPERATING_EXPENSES | Facility | A | non-operating expenses | E |
7 | Apple Inc. | 1900-01-01 | 17000 | NON-OPERATING_EXPENSES | Supply | A | non-operating expenses | E |
8 | Apple Inc. | 1900-01-01 | 18000 | NON-OPERATING_EXPENSES | Services | A | non-operating expenses | E |
9 | Apple Inc. | 1900-01-01 | 19000 | NON-OPERATING_EXPENSES | Investment | A | non-operating expenses | E |
10 | Apple Inc. | 1900-01-01 | 20000 | NON-OPERATING_EXPENSES | Misc. | A | non-operating expenses | E |
11 | Apple Inc. | 1900-01-01 | 21000 | ASSETS | Depreciation | A | assets | A |
12 | Apple Inc. | 1900-01-01 | 22000 | ASSETS | Gain | A | assets | A |
13 | Apple Inc. | 1900-01-01 | 23000 | ASSETS | Service | A | assets | A |
14 | Apple Inc. | 1900-01-01 | 24000 | ASSETS | Retired | A | assets | A |
15 | Apple Inc. | 1900-01-01 | 25000 | ASSETS | Fault. | A | assets | A |
16 | Apple Inc. | 1900-01-01 | 26000 | LIABILITIES | Receipt | A | liability | L |
17 | Apple Inc. | 1900-01-01 | 27000 | LIABILITIES | Accrual | A | liability | L |
18 | Apple Inc. | 1900-01-01 | 28000 | LIABILITIES | Return | A | liability | L |
19 | Apple Inc. | 1900-01-01 | 29000 | LIABILITIES | Credit | A | liability | L |
20 | Apple Inc. | 1900-01-01 | 30000 | LIABILITIES | ROI | A | liability | L |
21 | Apple Inc. | 1900-01-01 | 31000 | NET_WORTH | Cash | A | net-worth | N |
22 | Apple Inc. | 1900-01-01 | 32000 | NET_WORTH | Funds | A | net-worth | N |
23 | Apple Inc. | 1900-01-01 | 33000 | NET_WORTH | Invest | A | net-worth | N |
24 | Apple Inc. | 1900-01-01 | 34000 | NET_WORTH | Transfer | A | net-worth | N |
25 | Apple Inc. | 1900-01-01 | 35000 | NET_WORTH | Roll-over | A | net-worth | N |
26 | Apple Inc. | 1900-01-01 | 36000 | STATISTICS | FTE | A | stats | S |
27 | Apple Inc. | 1900-01-01 | 37000 | STATISTICS | Members | A | stats | S |
28 | Apple Inc. | 1900-01-01 | 38000 | STATISTICS | Non_Members | A | stats | S |
29 | Apple Inc. | 1900-01-01 | 39000 | STATISTICS | Temp | A | stats | S |
30 | Apple Inc. | 1900-01-01 | 40000 | STATISTICS | Contractors | A | stats | S |
31 | Apple Inc. | 1900-01-01 | 41000 | REVENUE | Sales | A | revenue | R |
32 | Apple Inc. | 1900-01-01 | 42000 | REVENUE | Merchant | A | revenue | R |
33 | Apple Inc. | 1900-01-01 | 43000 | REVENUE | Service | A | revenue | R |
34 | Apple Inc. | 1900-01-01 | 44000 | REVENUE | Consulting | A | revenue | R |
35 | Apple Inc. | 1900-01-01 | 45000 | REVENUE | Subscriptions | A | revenue | R |
Categorical and Pooled Arrays as name suggests, are data structure to store voluminous data efficiently,specially when a column in a data frame has small number of distinct values (aka levels), repeated across entire column vector.
as an example, Finance Ledger may have millions of transactions and every row has one of these seven type of accounts. It's not recommended to store repeating value of entire string in every row. Instead, using a Categorical or PooledArray data type, memory/data size can be significantly reduced with out losing any data quality. (size(..) stays same for original, Categorical and PooledArray data type.
as you can see in above example, size of categorical / pooled array data type matches with original column vector but significantly reduces size/memory of data. (Base.summarysize(...)) is reduced 50% and is further reduced by 85% if used with compress(...))
Using Categorical Array type over PooledArray is recommended when there are fewer unique values, user need meaningful ordering and grouping. On the other hand, PoolArray is preferred when small memory usage is needed.
Department chartfield
using DataFrames
## create Accounts chartfield
# DEPARTMENT Chartfield
deptDF = DataFrame(
AS_OF_DATE=Date("2000-01-01", dateformat"y-m-d"),
ID = 1100:100:1500,
CLASSIFICATION=["SALES","HR", "IT","BUSINESS","OTHERS"],
CATEGORY=["sales","human_resource","IT_Staff","business","others"],
STATUS="A",
DESCR=[
"Sales & Marketing","Human Resource","Infomration Technology","Business leaders","other temp"
],
DEPT_TYPE=["S","H","I","B","O"]);
size(deptDF),deptDF[collect(1:5),:]
Location chartfield
using DataFrames
locationDF = DataFrame(
AS_OF_DATE=Date("2000-01-01", dateformat"y-m-d"),
ID = 11:1:22,
CLASSIFICATION=repeat([
"Region A","Region B", "Region C"], inner=4),
CATEGORY=repeat([
"Region A","Region B", "Region C"], inner=4),
STATUS="A",
DESCR=[
"Boston","New York","Philadelphia","Cleveland","Richmond",
"Atlanta","Chicago","St. Louis","Minneapolis","Kansas City",
"Dallas","San Francisco"],
LOC_TYPE="Physical");
locationDF[:,:]
Ledger FACT
## pu
using DataFrames, Dates
accountsDF = DataFrame(
ENTITY = "Apple Inc.",
AS_OF_DATE=Date("1900-01-01", dateformat"y-m-d"),
ID = 11000:1000:45000,
CLASSIFICATION=repeat([
"OPERATING_EXPENSES","NON-OPERATING_EXPENSES", "ASSETS","LIABILITIES","NET_WORTH","STATISTICS","REVENUE"
], inner=5),
CATEGORY=[
"Travel","Payroll","non-Payroll","Allowance","Cash",
"Facility","Supply","Services","Investment","Misc.",
"Depreciation","Gain","Service","Retired","Fault.",
"Receipt","Accrual","Return","Credit","ROI",
"Cash","Funds","Invest","Transfer","Roll-over",
"FTE","Members","Non_Members","Temp","Contractors",
"Sales","Merchant","Service","Consulting","Subscriptions"],
STATUS="A",
DESCR=repeat([
"operating expenses","non-operating expenses","assets","liability","net-worth","stats","revenue"], inner=5),
ACCOUNT_TYPE=repeat(["E","E","A","L","N","S","R"],inner=5))
# DEPARTMENT Chartfield
deptDF = DataFrame(
AS_OF_DATE=Date("2000-01-01", dateformat"y-m-d"),
ID = 1100:100:1500,
CLASSIFICATION=["SALES","HR", "IT","BUSINESS","OTHERS"],
CATEGORY=["sales","human_resource","IT_Staff","business","others"],
STATUS="A",
DESCR=[
"Sales & Marketing","Human Resource","Infomration Technology","Business leaders","other temp"
],
DEPT_TYPE=["S","H","I","B","O"]);
size(deptDF),deptDF[collect(1:5),:]
locationDF = DataFrame(
AS_OF_DATE=Date("2000-01-01", dateformat"y-m-d"),
ID = 11:1:22,
CLASSIFICATION=repeat([
"Region A","Region B", "Region C"], inner=4),
CATEGORY=repeat([
"Region A","Region B", "Region C"], inner=4),
STATUS="A",
DESCR=[
"Boston","New York","Philadelphia","Cleveland","Richmond",
"Atlanta","Chicago","St. Louis","Minneapolis","Kansas City",
"Dallas","San Francisco"],
LOC_TYPE="Physical");
locationDF[:,:]
# creating Ledger
ledgerDF = DataFrame(
LEDGER = String[], FISCAL_YEAR = Int[], PERIOD = Int[], ORGID = String[],
OPER_UNIT = String[], ACCOUNT = Int[], DEPT = Int[], LOCATION = Int[],
POSTED_TOTAL = Float64[]
);
# create 2020 Period 1-12 Actuals Ledger
l = "Actuals";
fy = 2020;
for p = 1:12
for i = 1:10^5
push!(ledgerDF, (l, fy, p, "ABC Inc.", rand(locationDF.CATEGORY),
rand(accountsDF.ID), rand(deptDF.ID), rand(locationDF.ID), rand()*10^8))
end
end
# create 2021 Period 1-4 Actuals Ledger
l = "Actuals";
fy = 2021;
for p = 1:4
for i = 1:10^5
push!(ledgerDF, (l, fy, p, "ABC Inc.", rand(locationDF.CATEGORY),
rand(accountsDF.ID), rand(deptDF.ID), rand(locationDF.ID), rand()*10^8))
end
end
# create 2021 Period 1-4 Budget Ledger
l = "Budget";
fy = 2021;
for p = 1:12
for i = 1:10^5
push!(ledgerDF, (l, fy, p, "ABC Inc.", rand(locationDF.CATEGORY),
rand(accountsDF.ID), rand(deptDF.ID), rand(locationDF.ID), rand()*10^8))
end
end
# here is ~3 million rows ledger dataframe
first(ledgerDF,5)
5 rows × 9 columns
LEDGER | FISCAL_YEAR | PERIOD | ORGID | OPER_UNIT | ACCOUNT | DEPT | LOCATION | POSTED_TOTAL | |
---|---|---|---|---|---|---|---|---|---|
String | Int64 | Int64 | String | String | Int64 | Int64 | Int64 | Float64 | |
1 | Actuals | 2020 | 1 | ABC Inc. | Region A | 38000 | 1200 | 22 | 7.85632e7 |
2 | Actuals | 2020 | 1 | ABC Inc. | Region A | 33000 | 1500 | 15 | 5.4957e7 |
3 | Actuals | 2020 | 1 | ABC Inc. | Region B | 18000 | 1300 | 14 | 3.32866e7 |
4 | Actuals | 2020 | 1 | ABC Inc. | Region A | 43000 | 1200 | 16 | 9.49366e7 |
5 | Actuals | 2020 | 1 | ABC Inc. | Region A | 41000 | 1300 | 22 | 4.79162e7 |
Supply chain data model
you will need following packages.
using Pkg
Pkg.add("DataFrames")
Pkg.add("Dates")
Pkg.add("CategoricalArrays")
Pkg.add("Interact")
Pkg.add("WebIO")
Pkg.add("CSV")
Pkg.add("XLSX")
Pkg.add("DelimitedFiles")
Pkg.add("Distributions")
Pkg.build("WebIO")
Pkg.status();
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
Resolving package versions...
No Changes to `~/.julia/environments/v1.7/Project.toml`
No Changes to `~/.julia/environments/v1.7/Manifest.toml`
Resolving package versions...
No Changes to `~/.julia/environments/v1.7/Project.toml`
No Changes to `~/.julia/environments/v1.7/Manifest.toml`
Resolving package versions...
No Changes to `~/.julia/environments/v1.7/Project.toml`
No Changes to `~/.julia/environments/v1.7/Manifest.toml`
Resolving package versions...
No Changes to `~/.julia/environments/v1.7/Project.toml`
No Changes to `~/.julia/environments/v1.7/Manifest.toml`
Resolving package versions...
No Changes to `~/.julia/environments/v1.7/Project.toml`
No Changes to `~/.julia/environments/v1.7/Manifest.toml`
Resolving package versions...
No Changes to `~/.julia/environments/v1.7/Project.toml`
No Changes to `~/.julia/environments/v1.7/Manifest.toml`
Resolving package versions...
No Changes to `~/.julia/environments/v1.7/Project.toml`
No Changes to `~/.julia/environments/v1.7/Manifest.toml`
Resolving package versions...
No Changes to `~/.julia/environments/v1.7/Project.toml`
No Changes to `~/.julia/environments/v1.7/Manifest.toml`
Resolving package versions...
No Changes to `~/.julia/environments/v1.7/Project.toml`
No Changes to `~/.julia/environments/v1.7/Manifest.toml`
Building WebIO → `~/.julia/scratchspaces/44cfe95a-1eb2-52ea-b672-e2afdf69b78f/a8bbcd0b08061bba794c56fb78426e96e114ae7f/build.log`
Status `~/.julia/environments/v1.7/Project.toml`
[336ed68f] CSV v0.10.4
[54eefc05] Cascadia v1.0.2
[324d7699] CategoricalArrays v0.10.6
[8f4d0f93] Conda v1.7.0
[a93c6f00] DataFrames v1.3.4
[31c24e10] Distributions v0.25.64
[e30172f5] Documenter v0.27.19
[8f5d6c58] EzXML v1.1.0
[708ec375] Gumbo v0.8.1
[cd3eb016] HTTP v0.9.17
[7073ff75] IJulia v1.23.3
[c601a237] Interact v0.10.4
[0f8b85d8] JSON3 v1.9.5
[b9914132] JSONTables v1.0.3
[4d0d745f] PDFIO v0.1.13
[c3e4b0f8] Pluto v0.19.9
[2dfb63ee] PooledArrays v1.4.2
[438e738f] PyCall v1.93.1
[88034a9c] StringDistances v0.11.2
[a2db99b7] TextAnalysis v0.7.3
[05625dda] WebDriver v0.1.2
[0f1e0344] WebIO v0.8.18
[fdbf4ff8] XLSX v0.7.10
[ade2ca70] Dates
[8bb1440f] DelimitedFiles
rest of this blog, I will assume, you have added all packages and imported in current namespace/notebook scope.
Supply Chain Data
We already covered DataFrames and ERP Finance data model in previous sections. in below section, let's recreate all Supply Chain DataFrames to continue advance analytics and visualization.
All of Finance and supply chain data discussed here is also uploaded in GitHub repo under sampleData folder. This same script can be used to produce more voluminous data.
Dimensions
Item master, Item Attribs, Item Costing
UNSPSC: The United Nations Standard Products and Services Code® (UNSPSC®) is a global classification system of products and services. These codes are used to classify products and services.
GUDID: The Global Unique Device Identification Database (GUDID) is a database administered by the FDA that will serve as a reference catalog for every device with a unique device identifier (UDI).
GTIN: Global Trade Item Number (GTIN) can be used by a company to uniquely identify all of its trade items. GS1 defines trade items as products or services that are priced, ordered or invoiced at any point in the supply chain.
GMDN: The Global Medical Device Nomenclature (GMDN) is a comprehensive set of terms, within a structured category hierarchy, which name and group ALL medical device products including implantables, medical equipment, consumables, and diagnostic devices.
Vendor master, Vendor Attribs, Vendor Costing Customer/Buyer/Procurement Officer Attribs shipto, warehouse, storage & inventory locations
Transactions
- PurchaseOrder
- MSR - Material Service
- Voucher
- Invoice
- Receipt
- Shipment
- Sales, Revenue
- Travel, Expense, TimeCard
- Accounting Lines
Item Master
import Pkg
Pkg.add("XLSX")
Pkg.add("CSV")
using XLSX, CSV, DataFrames
###############################
## create SUPPLY CHAIN DATA ###
###############################
# Item master, Item Attribs, Item Costing ##
# UNSPSC, GUDID, GTIN, GMDN
############################################
##########
# UNSPSC #
##########
# UNSPSC file can be downloaded from this link https://www.ungm.org/Public/UNSPSC
# xf = XLSX.readxlsx("assets/sampleData/UNGM_UNSPSC_09-Apr-2022..xlsx")
# xf will display names of sheets and rows with data
# let's read this data in to a DataFrame
# using below command will read xlsx data into DataFrame but will not render column labels
# df = DataFrame(XLSX.readdata("assets/sampleData/UNGM_UNSPSC_09-Apr-2022..xlsx", "UNSPSC", "A1:D12988"), :auto)
# dfUNSPSC = DataFrame(XLSX.readtable("assets/sampleData/UNGM_UNSPSC_09-Apr-2022..xlsx", "UNSPSC")...)
# ... operator will splat the tuple (data, column_labels) into the constructor of DataFrame
# replace missing values with an integer 99999
# replace!(dfUNSPSC."Parent key", missing => 99999)
# let's export this clean csv, we'll load this into database
# CSV.write("UNSPSC.csv", dfUNSPSC)
# # remember to empty dataFrame after usage
# # Julia will flush it out automatically after session,
# # but often ERP data gets bulky during session
# Base.summarysize(dfUNSPSC)
# empty!(dfUNSPSC)
# Base.summarysize(dfUNSPSC)
# first(dfUNSPSC, 5)
Resolving package versions...
No Changes to `~/.julia/environments/v1.7/Project.toml`
No Changes to `~/.julia/environments/v1.7/Manifest.toml`
Resolving package versions...
No Changes to `~/.julia/environments/v1.7/Project.toml`
No Changes to `~/.julia/environments/v1.7/Manifest.toml`
GUDID Database
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
##########
# GUDID ##
##########
# The complete list of GUDID Data Elements and descriptions can be found at this link.
# https://www.fda.gov/media/120974/download
# The complete GUDID Database (delimited version) download (250+MB)
# https://accessgudid.nlm.nih.gov/release_files/download/AccessGUDID_Delimited_Full_Release_20220401.zip
# let's extract all GUDID files in a folder
# readdir(pwd())
# readdir("assets/sampleData/GUDID")
# since these files are in txt (delimited) format, we'll use delimited pkg
########################
## large txt files #####
## read one at a time ##
########################
# data, header = readdlm("assets/sampleData/GUDID/contacts.txt", '|', header=true)
# dfGUDIDcontacts = DataFrame(data, vec(header))
# data, header = readdlm("assets/sampleData/GUDID/identifiers.txt", '|', header=true)
# dfGUDIDidentifiers = DataFrame(data, vec(header))
# data, header = readdlm("assets/sampleData/GUDID/device.txt", '|', header=true)
# dfGUDIDdevice = DataFrame(data, vec(header))
# # remember to empty dataFrame after usage
# # Julia will flush it out automatically after session,
# # but often ERP data gets bulky during session
# Base.summarysize(dfGUDIDcontacts),Base.summarysize(dfGUDIDidentifiers),Base.summarysize(dfGUDIDdevice)
# empty!(dfGUDIDcontacts)
# empty!(dfGUDIDidentifiers)
# empty!(dfGUDIDdevice)
# Base.summarysize(dfGUDIDcontacts),Base.summarysize(dfGUDIDidentifiers),Base.summarysize(dfGUDIDdevice)
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
##########
# GTIN ###
##########
# xf = XLSX.readxlsx("assets/sampleData/DS_GTIN_ALL.xlsx")
# xf will display names of sheets and rows with data
# let's read this data in to a DataFrame
# using below command will read xlsx data into DataFrame but will not render column labels
# df = DataFrame(XLSX.readdata("assets/sampleData/DS_GTIN_ALL.xlsx", "Worksheet", "A14:E143403 "), :auto)
# dfGTIN = DataFrame(XLSX.readtable("assets/sampleData/DS_GTIN_ALL.xlsx", "Worksheet";first_row=14)...)
# ... operator will splat the tuple (data, column_labels) into the constructor of DataFrame
# replace missing values with an integer 99999
# replace!(dfUNSPSC."Parent key", missing => 99999)
# size(dfUNSPSC)
# let's export this clean csv, we'll load this into database
# CSV.write("UNSPSC.csv", dfUNSPSC)
# readdir(pwd())
# # remember to empty dataFrame after usage
# # Julia will flush it out automatically after session,
# # but often ERP data gets bulky during session
# Base.summarysize(dfGTIN)
# empty!(dfGTIN)
# Base.summarysize(dfGTIN)
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
##########
# GMDN ###
##########
## GMDN data is not available
# # remember to empty dataFrame after usage
# # Julia will flush it out automatically after session,
# # but often ERP data gets bulky during session
# Base.summarysize(dfGMDN)
# empty!(dfGMDN)
# Base.summarysize(dfGMDN)
Vendor Master
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
#################
# Vendor master #
#################
# data, header = readdlm("assets/sampleData/GUDID/device.txt", '|', header=true)
# dfGUDIDdevice = DataFrame(data, vec(header))
# create Vendor Master from GUDID dataset
# show(first(dfGUDIDdevice,5), allcols=true)
# show(first(dfGUDIDdevice[:,[:brandName, :catalogNumber, :dunsNumber, :companyName, :rx, :otc]],5), allcols=true)
# names(dfGUDIDdevice)
# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :catalogNumber, :dunsNumber, :companyName, :rx, :otc]])
# dfVendor = unique(dfGUDIDdevice[:,[:companyName]]) # 7574 unique vendors
# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :dunsNumber, :companyName, :rx, :otc]])
# dfVendor is a good dataset, have 216k rows for 7574 unique vendors
# # remember to empty dataFrame after usage
# # Julia will flush it out automatically after session,
# # but often ERP data gets bulky during session
# Base.summarysize(dfVendor)
# empty!(dfVendor)
# Base.summarysize(dfVendor)
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
#### Location Master
# data, header = readdlm("assets/sampleData/uscities.csv", ',', header=true)
# dfLocation = DataFrame(data, vec(header))
# # remember to empty dataFrame after usage
# # Julia will flush it out automatically after session,
# # but often ERP data gets bulky during session
# Base.summarysize(dfLocation)
# empty!(dfLocation)
# Base.summarysize(dfLocation)
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
dfOrgMaster = DataFrame(
ENTITY=repeat(["HeadOffice"], inner=8),
GROUP=repeat(["Operations"], inner=8),
DEPARTMENT=["Procurement","Procurement","Procurement","Procurement","Procurement","HR","HR","MFG"],
UNIT=["Sourcing","Sourcing","Maintenance","Support","Services","Helpdesk","ServiceCall","IT"])
8 rows × 4 columns
ENTITY | GROUP | DEPARTMENT | UNIT | |
---|---|---|---|---|
String | String | String | String | |
1 | HeadOffice | Operations | Procurement | Sourcing |
2 | HeadOffice | Operations | Procurement | Sourcing |
3 | HeadOffice | Operations | Procurement | Maintenance |
4 | HeadOffice | Operations | Procurement | Support |
5 | HeadOffice | Operations | Procurement | Services |
6 | HeadOffice | Operations | HR | Helpdesk |
7 | HeadOffice | Operations | HR | ServiceCall |
8 | HeadOffice | Operations | MFG | IT |
creating complete Supply Chain Data Model DataFrames
now since we created Supply chain attribute, chartfields / dimensions
- item master
- vendor master
- location master
- org Hierarchy
using above chartfields, let's create following Supply Chain Transactions
- MSR - Material Service request
- PurchaseOrder
- Voucher
- Invoice
- Receipt
- Shipment
- Sales, Revenue
- Travel, Expense, TimeCard
- Accounting Lines
MSR - Material Service request
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
sampleSize = 1000 # number of rows, scale as needed
# data, header = readdlm("assets/sampleData/GUDID/device.txt", '|', header=true)
# df GUDIDdevice = DataFrame(data, vec(header))
# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :dunsNumber, :companyName, :rx, :otc]])
# data, header = readdlm("assets/sampleData/uscities.csv", ',', header=true)
# dfLocation = DataFrame(data, vec(header))
# dfOrgMaster = DataFrame(
# ENTITY=repeat(["HeadOffice"], inner=8),
# GROUP=repeat(["Operations"], inner=8),
# DEPARTMENT=["Procurement","Procurement","Procurement","Procurement","Procurement","HR","HR","MFG"],
# UNIT=["Sourcing","Sourcing","Maintenance","Support","Services","Helpdesk","ServiceCall","IT"])
# dfMSR = DataFrame(
# UNIT = rand(dfOrgMaster.UNIT, sampleSize),
# MSR_DATE=rand(collect(Date(2020,1,1):Day(1):Date(2022,5,1)), sampleSize),
# FROM_UNIT = rand(dfOrgMaster.UNIT, sampleSize),
# TO_UNIT = rand(dfOrgMaster.UNIT, sampleSize),
# GUDID = rand(dfGUDIDdevice.PrimaryDI, sampleSize),
# QTY = rand(dfOrgMaster.UNIT, sampleSize));
# first(dfMSR, 5)
1000
PO - Purchase Order
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
sampleSize = 1000 # number of rows, scale as needed
# data, header = readdlm("assets/sampleData/GUDID/device.txt", '|', header=true)
# dfGUDIDdevice = DataFrame(data, vec(header))
# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :dunsNumber, :companyName, :rx, :otc]])
# data, header = readdlm("assets/sampleData/uscities.csv", ',', header=true)
# dfLocation = DataFrame(data, vec(header))
dfOrgMaster = DataFrame(
ENTITY=repeat(["HeadOffice"], inner=8),
GROUP=repeat(["Operations"], inner=8),
DEPARTMENT=["Procurement","Procurement","Procurement","Procurement","Procurement","HR","HR","MFG"],
UNIT=["Sourcing","Sourcing","Maintenance","Support","Services","Helpdesk","ServiceCall","IT"]);
# dfPO = DataFrame(
# UNIT = rand(dfOrgMaster.UNIT, sampleSize),
# PO_DATE=rand(collect(Date(2020,1,1):Day(1):Date(2022,5,1)), sampleSize),
# VENDOR=rand(unique(dfVendor.companyName), sampleSize),
# GUDID = rand(dfGUDIDdevice.PrimaryDI, sampleSize),
# QTY = rand(1:150, sampleSize),
# UNIT_PRICE = rand(Normal(100, 2), sampleSize)
# );
# show(first(dfPO, 5),allcols=true)
8 rows × 4 columns
ENTITY | GROUP | DEPARTMENT | UNIT | |
---|---|---|---|---|
String | String | String | String | |
1 | HeadOffice | Operations | Procurement | Sourcing |
2 | HeadOffice | Operations | Procurement | Sourcing |
3 | HeadOffice | Operations | Procurement | Maintenance |
4 | HeadOffice | Operations | Procurement | Support |
5 | HeadOffice | Operations | Procurement | Services |
6 | HeadOffice | Operations | HR | Helpdesk |
7 | HeadOffice | Operations | HR | ServiceCall |
8 | HeadOffice | Operations | MFG | IT |
Invoice - Voucher Invoice
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
sampleSize = 1000 # number of rows, scale as needed
# data, header = readdlm("assets/sampleData/GUDID/device.txt", '|', header=true)
# dfGUDIDdevice = DataFrame(data, vec(header))
# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :dunsNumber, :companyName, :rx, :otc]])
# data, header = readdlm("assets/sampleData/uscities.csv", ',', header=true)
# dfLocation = DataFrame(data, vec(header))
dfOrgMaster = DataFrame(
ENTITY=repeat(["HeadOffice"], inner=8),
GROUP=repeat(["Operations"], inner=8),
DEPARTMENT=["Procurement","Procurement","Procurement","Procurement","Procurement","HR","HR","MFG"],
UNIT=["Sourcing","Sourcing","Maintenance","Support","Services","Helpdesk","ServiceCall","IT"]);
# dfVCHR = DataFrame(
# UNIT = rand(dfOrgMaster.UNIT, sampleSize),
# VCHR_DATE=rand(collect(Date(2020,1,1):Day(1):Date(2022,5,1)), sampleSize),
# STATUS=rand(["Closed","Paid","Open","Cancelled","Exception"], sampleSize),
# VENDOR_INVOICE_NUM = rand(10001:9999999, sampleSize),
# VENDOR=rand(unique(dfVendor.companyName), sampleSize),
# GUDID = rand(dfGUDIDdevice.PrimaryDI, sampleSize),
# QTY = rand(1:150, sampleSize),
# UNIT_PRICE = rand(Normal(100, 2), sampleSize)
# );
# show(first(dfVCHR, 5),allcols=true)
8 rows × 4 columns
ENTITY | GROUP | DEPARTMENT | UNIT | |
---|---|---|---|---|
String | String | String | String | |
1 | HeadOffice | Operations | Procurement | Sourcing |
2 | HeadOffice | Operations | Procurement | Sourcing |
3 | HeadOffice | Operations | Procurement | Maintenance |
4 | HeadOffice | Operations | Procurement | Support |
5 | HeadOffice | Operations | Procurement | Services |
6 | HeadOffice | Operations | HR | Helpdesk |
7 | HeadOffice | Operations | HR | ServiceCall |
8 | HeadOffice | Operations | MFG | IT |
Sales - Revenue Register
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
sampleSize = 1000 # number of rows, scale as needed
# data, header = readdlm("assets/sampleData/GUDID/device.txt", '|', header=true)
# dfGUDIDdevice = DataFrame(data, vec(header))
# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :dunsNumber, :companyName, :rx, :otc]])
# data, header = readdlm("assets/sampleData/uscities.csv", ',', header=true)
# dfLocation = DataFrame(data, vec(header))
dfOrgMaster = DataFrame(
ENTITY=repeat(["HeadOffice"], inner=8),
GROUP=repeat(["Operations"], inner=8),
DEPARTMENT=["Procurement","Procurement","Procurement","Procurement","Procurement","HR","HR","MFG"],
UNIT=["Sourcing","Sourcing","Maintenance","Support","Services","Helpdesk","ServiceCall","IT"]);
# dfREVENUE = DataFrame(
# UNIT = rand(dfOrgMaster.UNIT, sampleSize),
# SALES_DATE=rand(collect(Date(2020,1,1):Day(1):Date(2022,5,1)), sampleSize),
# STATUS=rand(["Sold","Pending","Hold","Cancelled","Exception"], sampleSize),
# SALES_RECEIPT_NUM = rand(10001:9999999, sampleSize),
# CUSTOMER=rand(unique(dfVendor.companyName), sampleSize),
# GUDID = rand(dfGUDIDdevice.PrimaryDI, sampleSize),
# QTY = rand(1:150, sampleSize),
# UNIT_PRICE = rand(Normal(100, 2), sampleSize)
# );
# show(first(dfREVENUE, 5),allcols=true)
8 rows × 4 columns
ENTITY | GROUP | DEPARTMENT | UNIT | |
---|---|---|---|---|
String | String | String | String | |
1 | HeadOffice | Operations | Procurement | Sourcing |
2 | HeadOffice | Operations | Procurement | Sourcing |
3 | HeadOffice | Operations | Procurement | Maintenance |
4 | HeadOffice | Operations | Procurement | Support |
5 | HeadOffice | Operations | Procurement | Services |
6 | HeadOffice | Operations | HR | Helpdesk |
7 | HeadOffice | Operations | HR | ServiceCall |
8 | HeadOffice | Operations | MFG | IT |
Shipment - Receipt
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
sampleSize = 1000 # number of rows, scale as needed
# data, header = readdlm("assets/sampleData/GUDID/device.txt", '|', header=true)
# dfGUDIDdevice = DataFrame(data, vec(header))
# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :dunsNumber, :companyName, :rx, :otc]])
# data, header = readdlm("assets/sampleData/uscities.csv", ',', header=true)
# dfLocation = DataFrame(data, vec(header))
dfOrgMaster = DataFrame(
ENTITY=repeat(["HeadOffice"], inner=8),
GROUP=repeat(["Operations"], inner=8),
DEPARTMENT=["Procurement","Procurement","Procurement","Procurement","Procurement","HR","HR","MFG"],
UNIT=["Sourcing","Sourcing","Maintenance","Support","Services","Helpdesk","ServiceCall","IT"])
# dfSHIPRECEIPT = DataFrame(
# UNIT = rand(dfOrgMaster.UNIT, sampleSize),
# SHIP_DATE=rand(collect(Date(2020,1,1):Day(1):Date(2022,5,1)), sampleSize),
# STATUS=rand(["Shippped","Returned","In process","Cancelled","Exception"], sampleSize),
# SHIPMENT_NUM = rand(10001:9999999, sampleSize),
# CUSTOMER=rand(unique(dfVendor.companyName), sampleSize),
# GUDID = rand(dfGUDIDdevice.PrimaryDI, sampleSize),
# QTY = rand(1:150, sampleSize),
# UNIT_PRICE = rand(Normal(100, 2), sampleSize)
# );
# show(first(dfSHIPRECEIPT, 5),allcols=true)
8 rows × 4 columns
ENTITY | GROUP | DEPARTMENT | UNIT | |
---|---|---|---|---|
String | String | String | String | |
1 | HeadOffice | Operations | Procurement | Sourcing |
2 | HeadOffice | Operations | Procurement | Sourcing |
3 | HeadOffice | Operations | Procurement | Maintenance |
4 | HeadOffice | Operations | Procurement | Support |
5 | HeadOffice | Operations | Procurement | Services |
6 | HeadOffice | Operations | HR | Helpdesk |
7 | HeadOffice | Operations | HR | ServiceCall |
8 | HeadOffice | Operations | MFG | IT |