Working with Data
In this section, we will review and create the ERP datasets required for system analytics.
Note that most of the datasets discussed below adhere to the ERD discussed earlier. These datasets are not real, but they are very close to real-life datasets.
These scripts do NOT copy any real organization's data; any resemblance to a real organization's working data is purely coincidental.
The eCommerce.ai package supports the data structures of the following ERP systems: Oracle, PeopleSoft, SAP, Tally, Intuit QuickBooks, etc. I will cover examples from ERP domains such as GL (General Ledger), AP (Accounts Payable), AR (Accounts Receivable), B2P (Buy to Pay), Expense, Travel & Time, HCM (Human Capital Management), CRM, etc.
GitHub does not allow loading GBs of data into a repository.
While most of the scripts used in this project/package were tested on 20+ GB of data, you will find sample datasets in the ../../assets/sampleData directory.
You can also generate more data by changing the sampleSize variable in the scripts below.
Let's get started and create the ERP commerce dataset.
SUPPLY CHAIN
You will need the following packages.
using Pkg
Pkg.add("DataFrames")
Pkg.add("Dates")
Pkg.add("CategoricalArrays")
Pkg.add("Interact")
Pkg.add("WebIO")
Pkg.add("CSV")
Pkg.add("XLSX")
Pkg.add("DelimitedFiles")
Pkg.add("Distributions")
Pkg.build("WebIO")
Pkg.status();
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
Updating registry at `~/.julia/registries/General`
Updating git-repo `https://github.com/JuliaRegistries/General.git`
Resolving package versions...
No Changes to `~/.julia/environments/v1.7/Project.toml`
No Changes to `~/.julia/environments/v1.7/Manifest.toml`
Building WebIO → `~/.julia/scratchspaces/44cfe95a-1eb2-52ea-b672-e2afdf69b78f/c9529be473e97fa0b3b2642cdafcd0896b4c9494/build.log`
Status `~/.julia/environments/v1.7/Project.toml`
[336ed68f] CSV v0.10.3
[54eefc05] Cascadia v1.0.1
[324d7699] CategoricalArrays v0.10.5
[8f4d0f93] Conda v1.7.0
[a93c6f00] DataFrames v1.3.2
[31c24e10] Distributions v0.25.53
[e30172f5] Documenter v0.27.15
[8f5d6c58] EzXML v1.1.0
[708ec375] Gumbo v0.8.0
[cd3eb016] HTTP v0.9.17
[7073ff75] IJulia v1.23.2
[c601a237] Interact v0.10.4
[0f8b85d8] JSON3 v1.9.4
[b9914132] JSONTables v1.0.3
[4d0d745f] PDFIO v0.1.13
[c3e4b0f8] Pluto v0.18.4
[2dfb63ee] PooledArrays v1.4.0
[438e738f] PyCall v1.93.1
[88034a9c] StringDistances v0.11.2
[a2db99b7] TextAnalysis v0.7.3
[05625dda] WebDriver v0.1.2
[0f1e0344] WebIO v0.8.17
[fdbf4ff8] XLSX v0.7.9
[ade2ca70] Dates
[8bb1440f] DelimitedFiles
For the rest of this blog, I will assume you have added all the packages and imported them into the current namespace/notebook scope.
All of the finance and supply chain data discussed here is also uploaded to the GitHub repo under the sampleData folder. The same scripts can be used to produce more voluminous data.
Chartfields
Item master, Item Attribs, Item Costing
UNSPSC: The United Nations Standard Products and Services Code® (UNSPSC®) is a global classification system used to classify products and services.
GTIN: A Global Trade Item Number (GTIN) can be used by a company to uniquely identify all of its trade items. GS1 defines trade items as products or services that are priced, ordered, or invoiced at any point in the supply chain.
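As a quick aside, the last digit of a GTIN is a GS1 mod-10 check digit, so validity is easy to check. Here is a minimal sketch (the helper name `gtin_check_digit` is mine; the GTIN value is one from the sample dataset used later):

```julia
# GS1 mod-10 check digit: weight the payload digits 3,1,3,1,... from the
# right (excluding the check digit), sum, and take the distance to the next 10.
function gtin_check_digit(payload::AbstractString)
    digits = reverse([c - '0' for c in payload])
    s = sum(d * w for (d, w) in zip(digits, Iterators.cycle((3, 1))))
    return (10 - s % 10) % 10
end

gtin = "10603295507444"              # a GTIN-14 from the sample dataset
gtin_check_digit(gtin[1:end-1])      # → 4, matching the last digit
```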
Vendor master, Vendor Attribs, Vendor Costing
Customer/Buyer/Procurement Officer Attribs
Ship-to, warehouse, storage & inventory locations
Transactions
- PurchaseOrder
- MSR - Material Service Request
- Voucher
- Invoice
- Receipt
- Shipment
- Sales, Revenue
- Travel, Expense, TimeCard
- Accounting Lines
Item Master from UNSPSC
import Pkg
Pkg.add("XLSX")
Pkg.add("CSV")
using XLSX, CSV, DataFrames
###############################
## create SUPPLY CHAIN DATA ###
###############################
# Item master, Item Attribs, Item Costing ##
# UNSPSC, GTIN
############################################
##########
# UNSPSC #
##########
# UNSPSC file can be downloaded from this link https://www.ungm.org/Public/UNSPSC
# xf = XLSX.readxlsx("../../assets/sampleData/UNGM_UNSPSC_09-Apr-2022..xlsx")
# xf will display the names of sheets and rows with data
# let's read this data into a DataFrame
# the command below reads the xlsx data into a DataFrame but does not preserve column labels
# df = DataFrame(XLSX.readdata("../../assets/sampleData/UNGM_UNSPSC_09-Apr-2022..xlsx", "UNSPSC", "A1:D12988"), :auto)
dfUNSPSC = DataFrame(XLSX.readtable("../../assets/sampleData/UNGM_UNSPSC_09-Apr-2022..xlsx", "UNSPSC")...)
# ... operator will splat the tuple (data, column_labels) into the constructor of DataFrame
# replace missing values with an integer 99999
replace!(dfUNSPSC."Parent key", missing => 99999)
# let's export this clean csv, we'll load this into database
# CSV.write("UNSPSC.csv", dfUNSPSC)
# # remember to empty dataFrame after usage
# # Julia will flush it out automatically after session,
# # but often ERP data gets bulky during session
# Base.summarysize(dfUNSPSC)
# empty!(dfUNSPSC)
# Base.summarysize(dfUNSPSC)
first(dfUNSPSC, 5)
5 rows × 4 columns

| | Key | Parent key | Code | Title |
|---|---|---|---|---|
| | Any | Any | Any | Any |
| 1 | 100 | 99999 | A | Raw Materials, Chemicals, Paper, Fuel |
| 2 | 101 | 99999 | B | Industrial Equipment & Tools |
| 3 | 102 | 99999 | C | Components & Supplies |
| 4 | 103 | 99999 | D | Construction, Transportation & Facility Equipment & Supplies |
| 5 | 104 | 99999 | E | Medical, Laboratory & Test Equipment & Supplies & Pharmaceuticals |
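The splatting pattern used with XLSX.readtable can be illustrated with a small, self-contained sketch (toy column data standing in for the real UNSPSC file):

```julia
using DataFrames

# a (columns, labels) tuple, shaped like what XLSX.readtable returns
tbl = ([[100, 101], [missing, 99999]], [:Key, Symbol("Parent key")])

# ... splats the tuple into the constructor: DataFrame(columns, labels)
df = DataFrame(tbl...)

# replace missing values in place, as done for the real "Parent key" column
replace!(df[!, "Parent key"], missing => 99999)
df
```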
GTIN data
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
##########
# GTIN ###
##########
# xf = XLSX.readxlsx("../../assets/sampleData/DS_GTIN_ALL.xlsx")
# xf will display the names of sheets and rows with data
# let's read this data into a DataFrame
# the command below reads the xlsx data into a DataFrame but does not preserve column labels
# df = DataFrame(XLSX.readdata("../../assets/sampleData/DS_GTIN_ALL.xlsx", "Worksheet", "A14:E143403"), :auto)
dfGTIN = DataFrame(XLSX.readtable("../../assets/sampleData/DS_GTIN_ALL.xlsx", "Worksheet";first_row=14)...)
first(dfGTIN,5)
# ... operator will splat the tuple (data, column_labels) into the constructor of DataFrame
# replace missing values with an integer 99999
# replace!(dfUNSPSC."Parent key", missing => 99999)
# size(dfUNSPSC)
# let's export this clean csv, we'll load this into database
# CSV.write("UNSPSC.csv", dfUNSPSC)
# readdir(pwd())
# # remember to empty dataFrame after usage
# # Julia will flush it out automatically after session,
# # but often ERP data gets bulky during session
# Base.summarysize(dfGTIN)
# empty!(dfGTIN)
# Base.summarysize(dfGTIN)
5 rows × 5 columns

| | 150621010 | 10603295507444 | ATTUNE FB TIB BASE SZ 10 POR | Orthopaedics | EA |
|---|---|---|---|---|---|
| | Any | Any | Any | Any | Any |
| 1 | 254505412-12 | 10603295480709 | ATTUNE RP PS TRL SZ 9 7MM | Orthopaedics | EA |
| 2 | L20409-13 | 10603295258117 | BROACH CORAIL AMT 9 | Orthopaedics | EA |
| 3 | 257004050-12 | 10603295490449 | C-STEM AMT SZ0-1 HI NECK | Orthopaedics | EA |
| 4 | 254505411-12 | 10603295480693 | ATTUNE RP PS TRL SZ 9 6MM | Orthopaedics | EA |
| 5 | L20416-13 | 10603295258186 | BROACH CORAIL AMT 16 | Orthopaedics | EA |
Vendor Master
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
#################
# Vendor master #
#################
#####
## This is a big dataset (<5 GB); uncomment and run only once ##
#####
# data, header = readdlm("../../assets/sampleData/device.txt", '|', header=true)
# dfGUDIDdevice = DataFrame(data, vec(header))
# show(first(dfGUDIDdevice[:,[:brandName, :catalogNumber, :dunsNumber, :companyName, :rx, :otc]],5), allcols=true)
# create Vendor Master from GUDID dataset
# show(first(dfGUDIDdevice,5), allcols=true)
# names(dfGUDIDdevice)
# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :catalogNumber, :dunsNumber, :companyName, :rx, :otc]])
# dfVendor = unique(dfGUDIDdevice[:,[:companyName]]) # 7574 unique vendors
# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :dunsNumber, :companyName, :rx, :otc]])
# dfVendor is a good dataset; it has 216k rows covering 7574 unique vendors
# # remember to empty dataFrame after usage
# # Julia will flush it out automatically after session,
# # but often ERP data gets bulky during session
# Base.summarysize(dfVendor)
# empty!(dfVendor)
# Base.summarysize(dfVendor)
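The unique-based deduplication used above can be sketched on a toy frame (illustrative vendor names, not the GUDID columns):

```julia
using DataFrames

df = DataFrame(
    companyName = ["Acme Med", "Acme Med", "Zen Devices"],
    brandName   = ["A1", "A1", "Z9"])

# unique over selected columns keeps one row per distinct combination,
# preserving first-occurrence order
dfVendor = unique(df[:, [:companyName]])
nrow(dfVendor)   # → 2
```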
Locations
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
#### Location Master
data, header = readdlm("../../assets/sampleData/uscities.csv", ',', header=true)
dfLocation = DataFrame(data, vec(header))
first(dfLocation[:,Not(:zips)],5)
# # remember to empty dataFrame after usage
# # Julia will flush it out automatically after session,
# # but often ERP data gets bulky during session
# Base.summarysize(dfLocation)
# empty!(dfLocation)
# Base.summarysize(dfLocation)
5 rows × 16 columns

| | city | city_ascii | state_id | state_name | county_fips | county_name | lat | lng | population | density | source | military | incorporated | timezone | ranking | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Any | Any | Any | Any | Any | Any | Any | Any | Any | Any | Any | Any | Any | Any | Any | Any |
| 1 | New York | New York | NY | New York | 36061 | New York | 40.6943 | -73.9249 | 18713220 | 10715 | polygon | FALSE | TRUE | America/New_York | 1 | 1840034016 |
| 2 | Los Angeles | Los Angeles | CA | California | 6037 | Los Angeles | 34.1139 | -118.407 | 12750807 | 3276 | polygon | FALSE | TRUE | America/Los_Angeles | 1 | 1840020491 |
| 3 | Chicago | Chicago | IL | Illinois | 17031 | Cook | 41.8373 | -87.6862 | 8604203 | 4574 | polygon | FALSE | TRUE | America/Chicago | 1 | 1840000494 |
| 4 | Miami | Miami | FL | Florida | 12086 | Miami-Dade | 25.7839 | -80.2102 | 6445545 | 5019 | polygon | FALSE | TRUE | America/New_York | 1 | 1840015149 |
| 5 | Dallas | Dallas | TX | Texas | 48113 | Dallas | 32.7936 | -96.7662 | 5743938 | 1526 | polygon | FALSE | TRUE | America/Chicago | 1 | 1840019440 |
ORG
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
dfOrgMaster = DataFrame(
ORG=repeat(["Galp Inc"], inner=8),
ENTITY=repeat(["HeadOffice"], inner=8),
GROUP=repeat(["Operations"], inner=8),
DEPARTMENT=["Procurement","Procurement","Procurement","Procurement","Procurement","HR","HR","MFG"],
UNIT=["Sourcing","Sourcing","Maintenance","Support","Services","Helpdesk","ServiceCall","IT"])
first(dfOrgMaster,5)
5 rows × 5 columns

| | ORG | ENTITY | GROUP | DEPARTMENT | UNIT |
|---|---|---|---|---|---|
| | String | String | String | String | String |
| 1 | Galp Inc | HeadOffice | Operations | Procurement | Sourcing |
| 2 | Galp Inc | HeadOffice | Operations | Procurement | Sourcing |
| 3 | Galp Inc | HeadOffice | Operations | Procurement | Maintenance |
| 4 | Galp Inc | HeadOffice | Operations | Procurement | Support |
| 5 | Galp Inc | HeadOffice | Operations | Procurement | Services |
Now that we have created the supply chain attributes (chartfields / dimensions):
- item master
- vendor master
- location master
- org Hierarchy
Using the above chartfields, let's create the following supply chain transactions:
- MSR - Material Service request
- PurchaseOrder
- Voucher
- Invoice
- Receipt
- Shipment
- Sales, Revenue
- Travel, Expense, TimeCard
- Accounting Lines
MSR - Material Service request
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
sampleSize = 1000 # number of rows, scale as needed
# data, header = readdlm("../../assets/sampleData/device.txt", '|', header=true)
# dfGUDIDdevice = DataFrame(data, vec(header))
# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :dunsNumber, :companyName, :rx, :otc]])
# data, header = readdlm("../../assets/sampleData/uscities.csv", ',', header=true)
# dfLocation = DataFrame(data, vec(header))
# dfOrgMaster = DataFrame(
# ENTITY=repeat(["HeadOffice"], inner=8),
# GROUP=repeat(["Operations"], inner=8),
# DEPARTMENT=["Procurement","Procurement","Procurement","Procurement","Procurement","HR","HR","MFG"],
# UNIT=["Sourcing","Sourcing","Maintenance","Support","Services","Helpdesk","ServiceCall","IT"])
# dfMSR = DataFrame(
# UNIT = rand(dfOrgMaster.UNIT, sampleSize),
# MSR_DATE=rand(collect(Date(2020,1,1):Day(1):Date(2022,5,1)), sampleSize),
# FROM_UNIT = rand(dfOrgMaster.UNIT, sampleSize),
# TO_UNIT = rand(dfOrgMaster.UNIT, sampleSize),
# GUDID = rand(dfGUDIDdevice.PrimaryDI, sampleSize),
# QTY = rand(1:150, sampleSize));
# first(dfMSR, 5)
1000
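Since the cell above is commented out (it depends on the large GUDID file), here is a self-contained sketch of the same MSR generation pattern, with tiny inline stand-ins for the unit and item masters (the `units` and `items` vectors are illustrative, not from the real masters):

```julia
using DataFrames, Dates

sampleSize = 5
units = ["Sourcing", "Maintenance", "Support", "IT"]   # stand-in for dfOrgMaster.UNIT
items = ["10603295480709", "10603295258117"]           # stand-in for dfGUDIDdevice.PrimaryDI

dfMSR = DataFrame(
    UNIT      = rand(units, sampleSize),
    MSR_DATE  = rand(collect(Date(2020,1,1):Day(1):Date(2022,5,1)), sampleSize),
    FROM_UNIT = rand(units, sampleSize),
    TO_UNIT   = rand(units, sampleSize),
    GUDID     = rand(items, sampleSize),
    QTY       = rand(1:150, sampleSize))

first(dfMSR, 5)
```

Scaling up is just a matter of raising sampleSize and pointing the rand calls at the real master DataFrames.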
PO - Purchase Order
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
sampleSize = 1000 # number of rows, scale as needed
# data, header = readdlm("../../assets/sampleData/device.txt", '|', header=true)
# dfGUDIDdevice = DataFrame(data, vec(header))
# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :dunsNumber, :companyName, :rx, :otc]])
# data, header = readdlm("../../assets/sampleData/uscities.csv", ',', header=true)
# dfLocation = DataFrame(data, vec(header))
dfOrgMaster = DataFrame(
ENTITY=repeat(["HeadOffice"], inner=8),
GROUP=repeat(["Operations"], inner=8),
DEPARTMENT=["Procurement","Procurement","Procurement","Procurement","Procurement","HR","HR","MFG"],
UNIT=["Sourcing","Sourcing","Maintenance","Support","Services","Helpdesk","ServiceCall","IT"]);
# dfPO = DataFrame(
# UNIT = rand(dfOrgMaster.UNIT, sampleSize),
# PO_DATE=rand(collect(Date(2020,1,1):Day(1):Date(2022,5,1)), sampleSize),
# VENDOR=rand(unique(dfVendor.companyName), sampleSize),
# GUDID = rand(dfGUDIDdevice.PrimaryDI, sampleSize),
# QTY = rand(1:150, sampleSize),
# UNIT_PRICE = rand(Normal(100, 2), sampleSize)
# );
# show(first(dfPO, 5),allcols=true)
8 rows × 4 columns

| | ENTITY | GROUP | DEPARTMENT | UNIT |
|---|---|---|---|---|
| | String | String | String | String |
| 1 | HeadOffice | Operations | Procurement | Sourcing |
| 2 | HeadOffice | Operations | Procurement | Sourcing |
| 3 | HeadOffice | Operations | Procurement | Maintenance |
| 4 | HeadOffice | Operations | Procurement | Support |
| 5 | HeadOffice | Operations | Procurement | Services |
| 6 | HeadOffice | Operations | HR | Helpdesk |
| 7 | HeadOffice | Operations | HR | ServiceCall |
| 8 | HeadOffice | Operations | MFG | IT |
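The UNIT_PRICE column in the PO sketch draws from Distributions.jl's Normal(100, 2); a quick self-contained look at what that produces:

```julia
using Distributions, Statistics

# unit prices clustered around 100 with standard deviation 2
prices = rand(Normal(100, 2), 10_000)

mean(prices)   # close to 100
std(prices)    # close to 2
```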
Invoice - Voucher Invoice
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
sampleSize = 1000 # number of rows, scale as needed
# data, header = readdlm("../../assets/sampleData/device.txt", '|', header=true)
# dfGUDIDdevice = DataFrame(data, vec(header))
# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :dunsNumber, :companyName, :rx, :otc]])
# data, header = readdlm("../../assets/sampleData/uscities.csv", ',', header=true)
# dfLocation = DataFrame(data, vec(header))
dfOrgMaster = DataFrame(
ENTITY=repeat(["HeadOffice"], inner=8),
GROUP=repeat(["Operations"], inner=8),
DEPARTMENT=["Procurement","Procurement","Procurement","Procurement","Procurement","HR","HR","MFG"],
UNIT=["Sourcing","Sourcing","Maintenance","Support","Services","Helpdesk","ServiceCall","IT"]);
# dfVCHR = DataFrame(
# UNIT = rand(dfOrgMaster.UNIT, sampleSize),
# VCHR_DATE=rand(collect(Date(2020,1,1):Day(1):Date(2022,5,1)), sampleSize),
# STATUS=rand(["Closed","Paid","Open","Cancelled","Exception"], sampleSize),
# VENDOR_INVOICE_NUM = rand(10001:9999999, sampleSize),
# VENDOR=rand(unique(dfVendor.companyName), sampleSize),
# GUDID = rand(dfGUDIDdevice.PrimaryDI, sampleSize),
# QTY = rand(1:150, sampleSize),
# UNIT_PRICE = rand(Normal(100, 2), sampleSize)
# );
# show(first(dfVCHR, 5),allcols=true)
8 rows × 4 columns

| | ENTITY | GROUP | DEPARTMENT | UNIT |
|---|---|---|---|---|
| | String | String | String | String |
| 1 | HeadOffice | Operations | Procurement | Sourcing |
| 2 | HeadOffice | Operations | Procurement | Sourcing |
| 3 | HeadOffice | Operations | Procurement | Maintenance |
| 4 | HeadOffice | Operations | Procurement | Support |
| 5 | HeadOffice | Operations | Procurement | Services |
| 6 | HeadOffice | Operations | HR | Helpdesk |
| 7 | HeadOffice | Operations | HR | ServiceCall |
| 8 | HeadOffice | Operations | MFG | IT |
Sales - Revenue Register
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
sampleSize = 1000 # number of rows, scale as needed
# data, header = readdlm("../../assets/sampleData/device.txt", '|', header=true)
# dfGUDIDdevice = DataFrame(data, vec(header))
# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :dunsNumber, :companyName, :rx, :otc]])
# data, header = readdlm("../../assets/sampleData/uscities.csv", ',', header=true)
# dfLocation = DataFrame(data, vec(header))
dfOrgMaster = DataFrame(
ENTITY=repeat(["HeadOffice"], inner=8),
GROUP=repeat(["Operations"], inner=8),
DEPARTMENT=["Procurement","Procurement","Procurement","Procurement","Procurement","HR","HR","MFG"],
UNIT=["Sourcing","Sourcing","Maintenance","Support","Services","Helpdesk","ServiceCall","IT"]);
# dfREVENUE = DataFrame(
# UNIT = rand(dfOrgMaster.UNIT, sampleSize),
# SALES_DATE=rand(collect(Date(2020,1,1):Day(1):Date(2022,5,1)), sampleSize),
# STATUS=rand(["Sold","Pending","Hold","Cancelled","Exception"], sampleSize),
# SALES_RECEIPT_NUM = rand(10001:9999999, sampleSize),
# CUSTOMER=rand(unique(dfVendor.companyName), sampleSize),
# GUDID = rand(dfGUDIDdevice.PrimaryDI, sampleSize),
# QTY = rand(1:150, sampleSize),
# UNIT_PRICE = rand(Normal(100, 2), sampleSize)
# );
# show(first(dfREVENUE, 5),allcols=true)
8 rows × 4 columns

| | ENTITY | GROUP | DEPARTMENT | UNIT |
|---|---|---|---|---|
| | String | String | String | String |
| 1 | HeadOffice | Operations | Procurement | Sourcing |
| 2 | HeadOffice | Operations | Procurement | Sourcing |
| 3 | HeadOffice | Operations | Procurement | Maintenance |
| 4 | HeadOffice | Operations | Procurement | Support |
| 5 | HeadOffice | Operations | Procurement | Services |
| 6 | HeadOffice | Operations | HR | Helpdesk |
| 7 | HeadOffice | Operations | HR | ServiceCall |
| 8 | HeadOffice | Operations | MFG | IT |
Shipment - Receipt
# using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions
# sampleSize = 1000 # number of rows, scale as needed
# data, header = readdlm("../../assets/sampleData/device.txt", '|', header=true)
# dfGUDIDdevice = DataFrame(data, vec(header))
# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :dunsNumber, :companyName, :rx, :otc]])
# data, header = readdlm("../../assets/sampleData/uscities.csv", ',', header=true)
# dfLocation = DataFrame(data, vec(header))
# dfOrgMaster = DataFrame(
# ENTITY=repeat(["HeadOffice"], inner=8),
# GROUP=repeat(["Operations"], inner=8),
# DEPARTMENT=["Procurement","Procurement","Procurement","Procurement","Procurement","HR","HR","MFG"],
# UNIT=["Sourcing","Sourcing","Maintenance","Support","Services","Helpdesk","ServiceCall","IT"])
# dfSHIPRECEIPT = DataFrame(
# UNIT = rand(dfOrgMaster.UNIT, sampleSize),
# SHIP_DATE=rand(collect(Date(2020,1,1):Day(1):Date(2022,5,1)), sampleSize),
# STATUS=rand(["Shipped","Returned","In process","Cancelled","Exception"], sampleSize),
# SHIPMENT_NUM = rand(10001:9999999, sampleSize),
# CUSTOMER=rand(unique(dfVendor.companyName), sampleSize),
# GUDID = rand(dfGUDIDdevice.PrimaryDI, sampleSize),
# QTY = rand(1:150, sampleSize),
# UNIT_PRICE = rand(Normal(100, 2), sampleSize)
# );
# show(first(dfSHIPRECEIPT, 5),allcols=true)
FINANCE Data model
A chart of accounts (an organized hierarchy of account groups in tree form), combined with location-, department-, or product-based hierarchies, allows businesses to group and report organizational activities along business processes.
These hierarchical groupings help capture the monetary and statistical values of the organization in financial statements.
To create the finance data model and ledger statements such as the cash flow statement or balance sheet, we need the associated dimensions (chartfields, like the chart of accounts).
Accounts chartfield
using Pkg, DataFrames, CategoricalArrays, PooledArrays, Dates
# here the CLASSIFICATION column vector stores 3500 values (7 distinct levels) in an array
CLASSIFICATION=repeat(["OPERATING_EXPENSES","NON-OPERATING_EXPENSES", "ASSETS","LIABILITIES","NET_WORTH","STATISTICS","REVENUE"
], inner=500)
cl = categorical(CLASSIFICATION)
levels(cl)
# a PooledArray provides similar pooling, for comparison
pl = PooledArray(CLASSIFICATION)
levels(pl)
# show values in tabular format
# run one command at a time
df = DataFrame(Dict("Descr" => "CLASSIFICATION...ARR...", "Value" => size(CLASSIFICATION)[1]))
push!(df,("CAT...ARR...",size(cl)[1]))
push!(df,("CAT...ARR..COMPRESS.",size(compress(cl))[1]))
push!(df,("POOL...ARR...",size(pl)[1]))
push!(df,("POOL...ARR..COMPRESS.",size(PooledArray(CLASSIFICATION, UInt8))[1]))
push!(df,("CAT...LEVELs...",size(levels(cl))[1]))
push!(df,("POOL...LEVELs...",size(levels(pl))[1]))
push!(df,("CLASSIFICATION...MEMSIZE", Base.summarysize(CLASSIFICATION)))
push!(df,("CAT...ARR...MEMSIZE", Base.summarysize(cl)))
push!(df,("POOL...ARR...MEMSIZE", Base.summarysize(pl)))
push!(df,("CAT...ARR..COMPRESS...MEMSIZE", Base.summarysize(compress(cl))))
push!(df,("POOL...ARR..COMPRESS...MEMSIZE", Base.summarysize(PooledArray(CLASSIFICATION, UInt8))))
first(df,5)
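To see the point of the comparison concretely, here is a small self-contained sketch of the memory saved by a compressed categorical vector over a plain String vector (toy values, sizes will vary by Julia version):

```julia
using CategoricalArrays

# 30,000 strings but only 3 distinct values
v  = repeat(["ASSETS", "LIABILITIES", "REVENUE"], inner=10_000)
cv = compress(categorical(v))   # UInt8 reference codes suffice for 3 levels

# the compressed categorical vector is much smaller than the String vector
Base.summarysize(v), Base.summarysize(cv)
```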
accountsDF = DataFrame(
ENTITY = "Galp Retail Inc.",
AS_OF_DATE=Date("1900-01-01", dateformat"y-m-d"),
ID = 11000:1000:45000,
CLASSIFICATION=repeat([
"OPERATING_EXPENSES","NON-OPERATING_EXPENSES", "ASSETS","LIABILITIES","NET_WORTH","STATISTICS","REVENUE"
], inner=5),
CATEGORY=[
"Travel","Payroll","non-Payroll","Allowance","Cash",
"Facility","Supply","Services","Investment","Misc.",
"Depreciation","Gain","Service","Retired","Fault.",
"Receipt","Accrual","Return","Credit","ROI",
"Cash","Funds","Invest","Transfer","Roll-over",
"FTE","Members","Non_Members","Temp","Contractors",
"Sales","Merchant","Service","Consulting","Subscriptions"],
STATUS="A",
DESCR=repeat([
"operating expenses","non-operating expenses","ASSETS","liability","net-worth","stats","revenue"], inner=5),
ACCOUNT_TYPE=repeat(["E","E","A","L","N","S","R"],inner=5))
accountsDF[collect(1:5),:]
5 rows × 8 columns

| | ENTITY | AS_OF_DATE | ID | CLASSIFICATION | CATEGORY | STATUS | DESCR | ACCOUNT_TYPE |
|---|---|---|---|---|---|---|---|---|
| | String | Date | Int64 | String | String | String | String | String |
| 1 | Galp Retail Inc. | 1900-01-01 | 11000 | OPERATING_EXPENSES | Travel | A | operating expenses | E |
| 2 | Galp Retail Inc. | 1900-01-01 | 12000 | OPERATING_EXPENSES | Payroll | A | operating expenses | E |
| 3 | Galp Retail Inc. | 1900-01-01 | 13000 | OPERATING_EXPENSES | non-Payroll | A | operating expenses | E |
| 4 | Galp Retail Inc. | 1900-01-01 | 14000 | OPERATING_EXPENSES | Allowance | A | operating expenses | E |
| 5 | Galp Retail Inc. | 1900-01-01 | 15000 | OPERATING_EXPENSES | Cash | A | operating expenses | E |
Department chartfield
using DataFrames, Dates
## create DEPARTMENT chartfield
deptDF = DataFrame(
AS_OF_DATE=Date("2000-01-01", dateformat"y-m-d"),
ID = 1100:100:1500,
CLASSIFICATION=["SALES","HR", "IT","BUSINESS","OTHERS"],
CATEGORY=["sales","human_resource","IT_Staff","business","others"],
STATUS="A",
DESCR=[
"Sales & Marketing","Human Resource","Information Technology","Business leaders","other temp"
],
DEPT_TYPE=["S","H","I","B","O"]);
deptDF[collect(1:5),:]
5 rows × 7 columns

| | AS_OF_DATE | ID | CLASSIFICATION | CATEGORY | STATUS | DESCR | DEPT_TYPE |
|---|---|---|---|---|---|---|---|
| | Date | Int64 | String | String | String | String | String |
| 1 | 2000-01-01 | 1100 | SALES | sales | A | Sales & Marketing | S |
| 2 | 2000-01-01 | 1200 | HR | human_resource | A | Human Resource | H |
| 3 | 2000-01-01 | 1300 | IT | IT_Staff | A | Information Technology | I |
| 4 | 2000-01-01 | 1400 | BUSINESS | business | A | Business leaders | B |
| 5 | 2000-01-01 | 1500 | OTHERS | others | A | other temp | O |
Location chartfield
using DataFrames, Dates
locationDF = DataFrame(
AS_OF_DATE=Date("2000-01-01", dateformat"y-m-d"),
ID = 11:1:22,
CLASSIFICATION=repeat([
"Galp Region A","Galp Region B", "Galp Region C"], inner=4),
CATEGORY=repeat([
"Galp Region A","Galp Region B", "Galp Region C"], inner=4),
STATUS="A",
DESCR=[
"Boston","New York","Philadelphia","Cleveland","Richmond",
"Atlanta","Chicago","St. Louis","Minneapolis","Kansas City",
"Dallas","San Francisco"],
LOC_TYPE="Physical");
locationDF[:,:]
12 rows × 7 columns

| | AS_OF_DATE | ID | CLASSIFICATION | CATEGORY | STATUS | DESCR | LOC_TYPE |
|---|---|---|---|---|---|---|---|
| | Date | Int64 | String | String | String | String | String |
| 1 | 2000-01-01 | 11 | Galp Region A | Galp Region A | A | Boston | Physical |
| 2 | 2000-01-01 | 12 | Galp Region A | Galp Region A | A | New York | Physical |
| 3 | 2000-01-01 | 13 | Galp Region A | Galp Region A | A | Philadelphia | Physical |
| 4 | 2000-01-01 | 14 | Galp Region A | Galp Region A | A | Cleveland | Physical |
| 5 | 2000-01-01 | 15 | Galp Region B | Galp Region B | A | Richmond | Physical |
| 6 | 2000-01-01 | 16 | Galp Region B | Galp Region B | A | Atlanta | Physical |
| 7 | 2000-01-01 | 17 | Galp Region B | Galp Region B | A | Chicago | Physical |
| 8 | 2000-01-01 | 18 | Galp Region B | Galp Region B | A | St. Louis | Physical |
| 9 | 2000-01-01 | 19 | Galp Region C | Galp Region C | A | Minneapolis | Physical |
| 10 | 2000-01-01 | 20 | Galp Region C | Galp Region C | A | Kansas City | Physical |
| 11 | 2000-01-01 | 21 | Galp Region C | Galp Region C | A | Dallas | Physical |
| 12 | 2000-01-01 | 22 | Galp Region C | Galp Region C | A | San Francisco | Physical |
Ledger RECORD
using DataFrames, Dates
accountsDF = DataFrame(
ENTITY = "Galp Retail Inc.",
AS_OF_DATE=Date("1900-01-01", dateformat"y-m-d"),
ID = 11000:1000:45000,
CLASSIFICATION=repeat([
"OPERATING_EXPENSES","NON-OPERATING_EXPENSES", "ASSETS","LIABILITIES","NET_WORTH","STATISTICS","REVENUE"
], inner=5),
CATEGORY=[
"Travel","Payroll","non-Payroll","Allowance","Cash",
"Facility","Supply","Services","Investment","Misc.",
"Depreciation","Gain","Service","Retired","Fault.",
"Receipt","Accrual","Return","Credit","ROI",
"Cash","Funds","Invest","Transfer","Roll-over",
"FTE","Members","Non_Members","Temp","Contractors",
"Sales","Merchant","Service","Consulting","Subscriptions"],
STATUS="A",
DESCR=repeat([
"operating expenses","non-operating expenses","ASSETS","liability","net-worth","stats","revenue"], inner=5),
ACCOUNT_TYPE=repeat(["E","E","A","L","N","S","R"],inner=5))
# DEPARTMENT Chartfield
deptDF = DataFrame(
AS_OF_DATE=Date("2000-01-01", dateformat"y-m-d"),
ID = 1100:100:1500,
CLASSIFICATION=["SALES","HR", "IT","BUSINESS","OTHERS"],
CATEGORY=["sales","human_resource","IT_Staff","business","others"],
STATUS="A",
DESCR=[
"Sales & Marketing","Human Resource","Information Technology","Business leaders","other temp"
],
DEPT_TYPE=["S","H","I","B","O"]);
size(deptDF),deptDF[collect(1:5),:]
locationDF = DataFrame(
AS_OF_DATE=Date("2000-01-01", dateformat"y-m-d"),
ID = 11:1:22,
CLASSIFICATION=repeat([
"Galp Region A","Galp Region B", "Galp Region C"], inner=4),
CATEGORY=repeat([
"Galp Region A","Galp Region B", "Galp Region C"], inner=4),
STATUS="A",
DESCR=[
"Boston","New York","Philadelphia","Cleveland","Richmond",
"Atlanta","Chicago","St. Louis","Minneapolis","Kansas City",
"Dallas","San Francisco"],
LOC_TYPE="Physical");
locationDF[:,:]
# creating Ledger
ledgerDF = DataFrame(
LEDGER = String[], FISCAL_YEAR = Int[], PERIOD = Int[], ORGID = String[],
OPER_UNIT = String[], ACCOUNT = Int[], DEPT = Int[], LOCATION = Int[],
POSTED_TOTAL = Float64[]
);
# create 2020 Period 1-12 Actuals Ledger
l = "Actuals";
fy = 2020;
for p = 1:12
for i = 1:10^5
push!(ledgerDF, (l, fy, p, "Galp Inc.", rand(locationDF.CATEGORY),
rand(accountsDF.ID), rand(deptDF.ID), rand(locationDF.ID), rand()*10^8))
end
end
# create 2021 Period 1-4 Actuals Ledger
l = "Actuals";
fy = 2021;
for p = 1:4
for i = 1:10^5
push!(ledgerDF, (l, fy, p, "Galp Inc.", rand(locationDF.CATEGORY),
rand(accountsDF.ID), rand(deptDF.ID), rand(locationDF.ID), rand()*10^8))
end
end
# create 2021 Period 1-12 Budget Ledger
l = "Budget";
fy = 2021;
for p = 1:12
for i = 1:10^5
push!(ledgerDF, (l, fy, p, "Galp Inc.", rand(locationDF.CATEGORY),
rand(accountsDF.ID), rand(deptDF.ID), rand(locationDF.ID), rand()*10^8))
end
end
# the result is a ~2.8 million row ledger DataFrame
first(ledgerDF,5)
5 rows × 9 columns

| | LEDGER | FISCAL_YEAR | PERIOD | ORGID | OPER_UNIT | ACCOUNT | DEPT | LOCATION | POSTED_TOTAL |
|---|---|---|---|---|---|---|---|---|---|
| | String | Int64 | Int64 | String | String | Int64 | Int64 | Int64 | Float64 |
| 1 | Actuals | 2020 | 1 | Galp Inc. | Galp Region B | 26000 | 1100 | 19 | 3.23507e7 |
| 2 | Actuals | 2020 | 1 | Galp Inc. | Galp Region B | 22000 | 1400 | 14 | 3.8979e7 |
| 3 | Actuals | 2020 | 1 | Galp Inc. | Galp Region B | 11000 | 1300 | 13 | 8.63023e7 |
| 4 | Actuals | 2020 | 1 | Galp Inc. | Galp Region C | 43000 | 1300 | 15 | 7.56502e7 |
| 5 | Actuals | 2020 | 1 | Galp Inc. | Galp Region A | 25000 | 1100 | 16 | 7.85415e7 |
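With the ledger built, period- or ledger-level totals fall out of a groupby/combine. A minimal sketch on a toy ledger (same column names as above, tiny made-up amounts):

```julia
using DataFrames

ledger = DataFrame(
    LEDGER       = ["Actuals", "Actuals", "Budget"],
    FISCAL_YEAR  = [2020, 2020, 2021],
    PERIOD       = [1, 2, 1],
    POSTED_TOTAL = [100.0, 50.0, 75.0])

# total posted amount per ledger and fiscal year
totals = combine(groupby(ledger, [:LEDGER, :FISCAL_YEAR]),
                 :POSTED_TOTAL => sum => :TOTAL)
```

The same call against the full ~2.8 million row ledgerDF yields the Actuals/Budget totals used in later analytics.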