Loading JSON … it looks simple … Part 4

This is part 4 and the final post of this JSON series. We are going to play with Smartsheet data because it adds to our examples of varying JSON structure. Note that Power BI already has a connector for Smartsheet. There is also a Python SDK for Smartsheet, but we are going to use the Python requests library so that we can start from scratch and learn how to navigate the JSON structure.

We will follow the same approach as the previous post i.e. focus on understanding the structure, draw from our existing patterns and add one or two new techniques to our arsenal.

Below is the screenshot of the actual Smartsheet we will be working with.


Accessing the Smartsheet API

API access is available during the free trial period. If you want to play, have a look at how to generate the token so that you can authenticate your API access. If you have been following my posts, you would know by now that I store my secure information (such as API tokens) in the user environment variables on my development computer.

To access the API we need to create the endpoint from the Smartsheet API URL and the sheet ID for the sheet you want to access. We also need the token which we pass to the request header. The code to do this is as follows:

import os, requests, json

auth = os.getenv('SMARTSHEET_TOKEN')
smartsheet = 'https://api.smartsheet.com/2.0/sheets/'
sheetid = '3220009594972036'
uri = smartsheet + sheetid
header = {'Authorization': "Bearer " + auth,
          'Content-Type': 'application/json'}

req = requests.get(uri, headers=header)
data = json.loads(req.text)

In the above script, the last line converts the request text into a Python dictionary structure.

Understanding the data structures

This is the critical step. Understand the problem … and in this case that means reading the Smartsheet documentation and comparing it to the physical structure. This time round we are going to use the pretty print (pprint) library for the latter.

import pprint
pprint.pprint(data)

The documentation tells us that there is a sheet, which is the main object. The sheet contains the columns (metadata) object, a rows object and a cells object. It also tells us that a collection of cells forms a row. When we explore the physical structure we see that this is indeed the case.


So the column metadata and rows are referenced directly from the sheet object, and the cells are referenced from the row object. We can thus access the columns and rows objects directly.

pprint.pprint(data['columns'])
pprint.pprint(data['rows'])

Building the dataframe

The process we are going to follow is to create an empty dataframe from the column metadata (using the ‘title’ key: value pair). We are then going to iterate over the rows and cells and append the values to the new dataframe.

We are firstly going to create a list of columns. This has a dual purpose. We will use it to create the empty dataframe, then we will re-use it to create row dictionaries that we will append to the dataframe.

cols = []
for col in data['columns']:
    cols.append(col['title'])


Creating the empty dataframe from this is straightforward.

import pandas
df = pandas.DataFrame(columns=cols)

To append the rows, we iterate over the rows with a loop and we use a nested loop to iterate over the cells. The cells in each row are added to a list, but first we trap empty values in the VAT field.

for row in data['rows']:
    values = [] #re-initialise the values list for each row
    for cell in row['cells']:
        if cell.get('value'): #handle the empty values
            values.append( cell['value'])
        else:
            values.append('')
    print(values)


Now we need a new trick to add these lists to the dataframe. We are going to use a function called zip. Zip must be used in conjunction with iterables such as lists or dictionaries.

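As a quick illustration of zip (the column names and values below are made up, not taken from the sheet), zip pairs up the elements of two iterables by position, and dict turns those pairs into key: value pairs:

sample_cols = ['Invoice', 'Amount', 'VAT']
sample_values = ['INV-001', 100.0, 15.0]

print(list(zip(sample_cols, sample_values))) #[('Invoice', 'INV-001'), ('Amount', 100.0), ('VAT', 15.0)]
print(dict(zip(sample_cols, sample_values))) #{'Invoice': 'INV-001', 'Amount': 100.0, 'VAT': 15.0}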

We are going to use this to combine our columns and rows into a dictionary that we append to the dataframe. The complete script is below.

import os, requests, json, pandas

auth = os.getenv('SMARTSHEET_TOKEN')
smartsheet = 'https://api.smartsheet.com/2.0/sheets/'
sheetid = '3220009594972036'
uri = smartsheet + sheetid
header = {'Authorization': "Bearer " + auth,
          'Content-Type': 'application/json'}

req = requests.get(uri, headers=header)
data = json.loads(req.text)

cols = []
for col in data['columns']:
    cols.append(col['title'])

df = pandas.DataFrame(columns=cols)

for row in data['rows']:
    values = [] #re-initialise the values list for each row
    for cell in row['cells']:
        if cell.get('value'): #handle the empty values
            values.append( cell['value'])
        else:
            values.append('')
    df = df.append(dict(zip(cols, values)), ignore_index=True)

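One caveat: DataFrame.append was removed in pandas 2.0. If the script above fails on a newer pandas, a minimal variation (a sketch, not the original demo code) is to collect the row dictionaries in a list and build the dataframe once at the end:

row_dicts = []
for row in data['rows']:
    values = []
    for cell in row['cells']:
        values.append(cell['value'] if cell.get('value') else '') #handle the empty values
    row_dicts.append(dict(zip(cols, values)))

df = pandas.DataFrame(row_dicts, columns=cols)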

The moral of the story

While JSON provides a great data interchange standard, implementation of the file format will vary. This means we will always need to make sure we understand the structure and select the required techniques to create an analysis ready format.

This post concludes the JSON series, but I will revisit the topic when new scenarios pop up.

 

Posted in Power BI, Python, Python4PowerBI

Loading JSON … it looks simple … Part 3

This is part 3 of the introduction to handling JSON with Python. This post covers a real world example that accesses data from a public API from New York City Open Data. For fun we are going to look at the most popular baby names by sex and ethnicity.

The key idea is that JSON structures vary, and the first step in dealing with JSON is understanding the structure. Once we understand the structure, we select the appropriate techniques to handle the JSON data. Most of these techniques were covered in part 1 and part 2 of this series.

Browsing the data

You can browse the data from the API URL endpoint. The displayed format is not really going to help us understand the data. Copy the text over to an online JSON formatter such as jsonformatter. Here we see the pretty JSON format on the left:


Click on the JSON Viewer button to see a tree view of the JSON data on the right:


The very first thing you notice right at the top of the tree view is that there are 2 objects at the top level and the JSON consists of  nested dictionaries. By collapsing the tree structure to only the top level, we see the following:


Here we see the meta dictionary has 1 sub-dictionary and the data dictionary is a list of 11345 items. Further exploration shows that the data dictionary items are also lists, i.e. the data dictionary is a list of 11345 lists (rows). Each of these lists is a list of 14 items. Each item represents a value (column).


The data we have only shows a list index and a list value. We don’t know what those values represent … yet.

Loading the data

It is reasonably easy to get to the data by referencing the data dictionary directly as shown in the script below:

import requests, pandas

url = 'https://data.cityofnewyork.us/api/views/25th-nujf/rows.json?accessType=DOWNLOAD'
req = requests.get(url)
mydict = req.json()

df = pandas.DataFrame.from_dict(mydict['data'])

The results show that the only problem is that the column names are just index values.


Loading the column meta data

We revisit our JSON tree view and find that the meta dictionary contains a view sub-dictionary, which in turn contains a list of dictionaries called columns. Each item in the columns list is a dictionary with 8 column properties.


Accessing the column names is not that tricky. We will use the same techniques from the previous posts. Perhaps the only “trick” is to reference the nested columns dictionary. The rest of the code accesses the key: value pairs nested inside the list, and adds the column name value to our columns list. The final step replaces the column headers with the newly acquired column names.

cols_lst = []

for obj in mydict['meta']['view']['columns']:
    for key, value in obj.items():
        if key == 'name': cols_lst.append(value)

df.columns = cols_lst
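
A more compact way to build the same list is a list comprehension; this sketch is equivalent to the loop above, assuming every column dictionary carries a 'name' key:

cols_lst = [col['name'] for col in mydict['meta']['view']['columns']]
df.columns = cols_lst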


You can clean up by removing the columns you don’t need (see previous blog posts using the drop method for dataframes).

When the problem is understood …

… the solution is (hopefully) obvious. While this post barely introduced anything new from a Python perspective (except referencing sub-dictionaries), the focus was on understanding the problem and cementing techniques for handling JSON.

In the next blog post I want to conclude this JSON series with another real world example.

Posted in Power BI, Python, Python4PowerBI

Loading JSON … it looks simple … Part 2

The previous post might have left you feeling that it’s a bit too complicated and a bit too abstract. Don’t worry, the concepts will become clearer after a bit of a visual recap, then exploring how we would navigate the JSON/dictionary structures, and finally how we reconstruct a “flattened” dictionary from the original dictionary.

Visual recap

We discussed that the JSON format looks the same as a Python dictionary, i.e. it consists of key: value pairs separated by commas and enclosed inside braces. The keys are just names and the values may be of type string, number, boolean, list or dictionary (object).

The image below is the JSON file we created in the previous post. It is a list of 3 dictionaries. Each dictionary has the same keys, but the values may differ. We access the values by referencing the keys.


Navigating the JSON file

We can iterate through the JSON file using the for loop structure. Let’s start by loading the JSON file and then looping through the outer list of dictionary items. Our image above shows that there are 3 dictionary items. We will observe the same from the code below.

import json

with open('d:\\data\\json\\data.json') as json_file:
    mydict = json.load(json_file) 

#looping through each object in the list
for obj in mydict:
    print(obj)


In order to traverse the key: value pairs inside each dictionary, the for loop structure is slightly different. We will nest the dictionary for loop inside the list for loop so that we can access the key: value pairs for each of the 3 dictionaries.

for obj in mydict:
    for key, value in obj.items():
        print(mydict.index(obj), key, value)

For good measure, we have added in the index of the list item so that we know which dictionary the key: value pairs are associated with.

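A slightly more idiomatic variation (a sketch of the same loop) uses enumerate, which yields the index directly instead of looking it up with mydict.index:

for i, obj in enumerate(mydict):
    for key, value in obj.items():
        print(i, key, value)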

We know from the file structure that we still have list and dictionary values to deal with. We will choose our for loop structure according to the value type (list or dictionary). In order to get the values out of the “versions” list, we will use the list for loop structure nested inside our existing nested loop structure.

for obj in mydict:
    for key, value in obj.items():
        if type(value) == list:
            for item in value:
                print(mydict.index(obj), obj['product'], key, item)

Note how we test if the value type is a list. We are excluding all other key values.


To keep the data meaningful we have retained the dictionary index and product value that are associated with the list items.

We can use the same technique with the nested dictionary, by using the dictionary for loop structure.

for obj in mydict:
    for key, value in obj.items():
        if type(value) == dict:
            for key2, value2 in value.items():
                print(mydict.index(obj), obj['product'], key2, value2)


At this stage we have everything we need in terms of navigating the dictionary. We know the difference in for loop structure for lists vs dictionaries. We also know how to test the data type of the dictionary value and we know how to reference the keys, values and indexes.

Reconstructing  the dictionary

To create a flattened dictionary (no list or dictionary values), we will use a combination of the navigation techniques just shown. The only additional techniques required are a bit of list and dictionary manipulation, i.e. how to create and update a dictionary, and how to append items to a list.

We create empty lists and dictionaries in the same way, but we append to these data structures in a different way.

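A minimal sketch of that difference (the values here are placeholders):

my_list = [] #create an empty list
my_dict = {} #create an empty dictionary

my_list.append('Power BI')              #a list grows with append
my_dict.update({'product': 'Power BI'}) #a dictionary grows with update (or my_dict['product'] = 'Power BI')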

We will now use the above in conjunction with our nested looping structures to create a flattened dictionary.

The principle of the technique is to isolate every key: value pair, including unpacking list or dictionary values. Each unpacked key: value pair is then added to a new dictionary that no longer has list or dictionary values. Finally we append each dictionary to create a new list of dictionaries, and the resulting dictionary list is added to a dataframe.


import json, pandas

with open('d:\\data\\json\\data.json') as json_file:
    mydict = json.load(json_file) 

dict_lst = [] #initialise an empty list to hold the final result

#outer loop through each dictionary
for obj in mydict:
    flat_dict = {} #initialise a target dictionary
    #loop through each key: value pair in dictionary
    for key, value in obj.items():
        #test the data type of each value
        #handle list and dictionary types in the respective ways
        if type(value) == dict:
            for key2, value2 in value.items():
                #use both the original key and value as values in new dict
                flat_dict.update({'cheapest':key2, 'price':value2})
        elif type(value) == list:
            for item in value:
                #create a key of values version0, version1, version2, etc
                flat_dict.update({'version' + str(value.index(item)): item})
        else:
            flat_dict.update({key: value})
    #append the flattened dictionary to our new dictionary list
    dict_lst.append(flat_dict)

#add the dictionary list to a dataframe
df = pandas.DataFrame.from_dict(dict_lst)


This technique has greater flexibility in terms of allowing other ways of manipulating the dictionary. Ironically it also appears easier than the technique shown in the previous post where we used dataframes. Best of all, this technique forces a level of understanding of dictionary data structures.

In part 3 of the JSON series I want to explore a real world example that includes JSON with a metadata section and retrieving data from an API.

Posted in Power BI, Python, Python4PowerBI

Loading JSON … it looks simple … Part 1

In modern application environments we will encounter JSON (JavaScript Object Notation) as the preferred file format for data-interchange. Many APIs and files are delivered in JSON format. Working with JSON format could potentially be challenging, especially if you are new to it. It is easy to get thrown by values that are variable length lists or values that are dictionaries (nested objects).  So the three key things I want to cover in this post are:

  • The basics of JSON and how it compares to a Python dictionary
  • Dealing with values that are lists (including variable length lists)
  • Dealing with nested objects (including where keys differ)

There will potentially be many opportunities to improve the code. Please let me know or comment so that I can learn in the process.

The JSON format is best understood by comparing it to a Python dictionary.  Be aware though that the JSON format (rules for data representation in text format) only appears similar to the Python dictionary (a data type that supports add, delete, sort, etc).

Python dictionaries

Dictionaries  consist of key: value pairs separated by a comma and enclosed inside braces. The value may be of type string, number, boolean, list or object (dictionary).

string_dict   = {"product": "Power BI"}
number_dict   = {"year": 2015}
boolean_dict  = {"free": True}
array_dict    = {"paid": ["pro", "premium"]}
object_dict   = {"price": {"pro": "$10"} }

mydict = {"product": "Power BI",
          "year": 2015,
          "free": True,
          "paid": ["pro", "premium"],
          "cheapest": {"pro": "$10"}}

To access the values we reference the keys:

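For example, using mydict defined above:

print(mydict['product'])         # Power BI
print(mydict['paid'][1])         # premium
print(mydict['cheapest']['pro']) # $10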

Note how the second example combines a key with a list index, and the third example combines the key for the outer object with the key of the nested object.

List of dictionaries

We can also create an array (list) of dictionaries:

dict_array    = [
                {"product": "Power BI",
                 "year": 2015,
                 "free": True,
                 "versions": ["free",
                              "pro",
                              "premium"],
                 "cheapestpaid": {"pro": "$10"}},
                {"product": "Tableau",
                 "year": 2004,
                 "free": False,
                 "versions": ["creator",
                              "explorer",
                              "viewer"],
                 "cheapestpaid": {"creator": "$70"}},
                {"product": "QlikSense",
                 "year": 2015,
                 "free": True,
                 "versions": ["cloudbasic",
                              "cloudbusiness",
                              "desktop",
                              "enterprise"],
                 "cheapestpaid": {"cloudbusiness": "$15"}}
                ]

We can reference a specific array item by its index, then the dictionary key as follows:

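For example, against dict_array above:

print(dict_array[0]['product'])                 # Power BI
print(dict_array[1]['cheapestpaid']['creator']) # $70
print(dict_array[2]['versions'][0])             # cloudbasic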

Here the list index precedes the dictionary key. Remember that the list index offset is 0.

Creating a JSON file from a dictionary

First we convert the dictionary to a JSON string using the built in json package. The json.dumps function converts a dictionary into JSON encoded format. Add the following to the code above:

import json
print(dict_array)
print(json.dumps(dict_array))

Compare the output of the dictionary and the JSON encoded string. The appearance is almost identical. The dictionary displays the key/value pairs in single quotes, while the JSON string displays the key/value pairs in double quotes and the entire string in single quotes.


Now we can write the JSON string to a file. Here we use json.dump (without the s), which writes to a file rather than returning a string.

with open('d:\\data\\json\\data.json', 'w') as outfile:
    json.dump(dict_array, outfile, indent=4)

The indent=4 parameter produces a pretty json layout i.e. human readable format.

Load the JSON file

In the same manner that we wrote the JSON file, we can read the file using the json.load function. At this point the file content is loaded as a list of dictionaries.

with open('d:\\data\\json\\data.json') as json_file:
    dict_lst = json.load(json_file)

Adding the dictionary to a dataframe

import pandas
df = pandas.DataFrame.from_dict(dict_lst)

From the output we can see that we still need to unpack the list and dictionary columns.


One of the methods provided by Pandas is json_normalize. This works well for nested columns with the same keys … but not so well for our case where the keys differ.

The dictionary you wish you got

Here is a simple example where the keys are the same:

mydict = [{"company": "ABC",
           "location": {"province": "Gauteng",
                        "city": "Sandton" }},
          {"company": "XYZ",
           "location": {"province": "Western Cape",
                      "city": "Cape Town" }}]

import pandas
pandas.DataFrame.from_dict(mydict)

from pandas.io.json import json_normalize # in newer pandas: from pandas import json_normalize
json_normalize(mydict)


Back to the real world

In our example, our keys don’t match across dictionaries, so the result is not great.

from pandas.io.json import json_normalize
json_normalize(dict_lst)


Because the keys differ they are split across different columns. Also the list column is not handled. Let’s use a different technique to handle the list column. Pandas apply allows us to apply operations across dataframes (rows/columns). Here we use Pandas Series to create a column for each list item.

import pandas
df = pandas.DataFrame.from_dict(dict_lst)
df2 = df['versions'].apply(pandas.Series)


The resulting dataframe can be concatenated with the existing one as follows:

 
df3 = pandas.concat([df[:], df2[:]], axis=1)


So we still have to deal with the dictionary column. We will use dataframe methods (melt, merge, drop) that were covered in previous posts, along with some new methods. I’ve taken a different approach here by providing all the steps with comments and then a visual that shows the output at each step.

import pandas
df = pandas.DataFrame.from_dict(dict_lst)
#split the versions list column into individual columns
df1 = df['versions'].apply(pandas.Series)
#concatenate the original dataframe with the individual version columns
df2 = pandas.concat([df[:], df1[:]], axis=1)
#split the dictionary column into individual columns
df3 = pandas.DataFrame(df2['cheapestpaid'].values.tolist(), index=df2.index)
#add the product column (to create a lookup column)
df4 = pandas.concat([df['product'], df3[:]], axis=1)
#melt (unpivot) the value columns
df5 = pandas.melt(df4, id_vars=['product'], value_vars=['cloudbusiness','creator','pro'], var_name='cheapest')
#drop NaN values to finalise the lookup table
df6 = df5.dropna()
#merge with the lookup table
df7 = pandas.merge(df2, df6, on='product', how='outer')
#drop unwanted columns
df8 = df7.drop(['cheapestpaid','versions'], axis=1)


So the result is not the simplest of code; however, it is not that complex either. We solved the entire problem using Pandas. What I have not tested is how we would solve this without Python in Power BI (visual Query Editor or M). It would also be interesting to compare solution complexity. In my case, I already had these Python patterns, so it was a matter of script re-use.

In the next post I want to explore methods of traversing and reshaping the data without Pandas and dataframes. It will provide better understanding of the JSON format and expand on the possibilities with Python script.

 

 

 

Posted in Power BI, Python, Python4PowerBI

Reshaping ugly CSV with Python

This is another post based on a demo from my SQLSaturday session about Python integration in Power BI. The data file is also available from this link or download your own file from the Reserve Bank link below.

It is a pattern I use fairly often because flat files lend themselves to the artistic license of the data provider i.e. they can do just about anything they want with the flat file.

Let’s take a look at a real world example. The South African Reserve Bank provides daily exchange rate data (ZAR per USD) through an embedded Reporting Services environment. When you download the data as Comma Separated Value (CSV), this is what you get:


The data represents 3 days of data. We only require the section indicated as “useful data”. If you look at the file structure carefully, you will notice that it could still be treated as a CSV file. When you open the file with the Query Editor, you will see that Power BI already treats the file like a CSV (or an Excel file that does not have all cells populated).


I am going to illustrate two methods of dealing with this in Python. You may encounter scenarios where either one could be useful.

Method 1: Treat the file as CSV

This method will essentially mimic what the Power BI Query Editor does with the file. We will use the Pandas read_csv method, but we will use some parameters to handle our requirements. Without indicating that there are three columns, read_csv will throw an exception when it encounters a third field further down the file. To indicate three columns we simply specify the columns.

import pandas
df = pandas.read_csv('D:\\data\\dirty\\ExchangeRateDetail.csv',
                     skiprows=4,
                     names =['Date','Value','Extra'])

You will notice that we have ignored the first four rows that we don’t need.


From this point on its a simple process that we have encountered in previous posts. We are going to do three things to get the file into a ready to use state:

  • Drop the extra column
  • Remove (filter) unwanted rows
  • Explicitly cast the data types
import pandas
df = pandas.read_csv('D:\\data\\dirty\\ExchangeRateDetail.csv',
                     skiprows=4,
                     names =['Date','Value','Extra'])
df = df.drop(['Extra'], axis=1) # drop "Extra" column
df = df[df.Date != 'The Value'] # filter rows
df = df[df.Date != 'The Value_label'] # filter rows
df['Date'] = pandas.to_datetime(df['Date']) # explicit data type cast
df['Value'] = pandas.to_numeric(df['Value']) # explicit data type cast

The final output can now be used as a Python data source in Power BI


Method 2: Read the file line by line

In this example we will read the data line by line. Each line will get added to a list, i.e. we will end up with a list of lines. We will then use some list comprehension techniques to filter the data before adding it to a dataframe.

with open('D:\\data\\dirty\\ExchangeRateDetail.csv') as f:
    lines = [line for line in f]

The newline characters form part of the line, as shown below in our list where each item is a line in the file.


There are also special characters at the start of the file, but we don’t need to worry about this as we will not use this item in the list.

To strip off the “\n” we will use the rstrip method which removes characters from the right hand side of a string. By not specifying a character, whitespace characters are removed. This includes “\n”.

with open('D:\\data\\dirty\\ExchangeRateDetail.csv') as f:
    lines = [line.rstrip() for line in f]


With the newline removed, we can now start filtering the list of items (lines). We start by filtering out the empty items (blank lines).

lines = list(filter(None, lines))


Next we remove the first 3 items, but keep the rest of the list.

lines = lines[3:]


Our final filter condition is to remove all items that start with “The”

lines = [t for t in lines if not t.startswith('The')]


Now our list is ready to convert to a dataframe.

import pandas
df = pandas.DataFrame(lines)


Note that we end up with a single column (0) dataframe. Our final steps to get the dataframe into a ready to use state are:

  • Split the column on the comma separator
  • Drop the original column
  • Explicitly cast the data types

Let’s put a complete final script together that includes the final steps.

import pandas

with open('D:\\data\\dirty\\ExchangeRateDetail.csv') as f:
    lines = [line.rstrip() for line in f]

lines = list(filter(None, lines)) # drop blank lines
lines = lines[3:] # remove top rows
lines = [t for t in lines if not t.startswith('The')] # remove rows starting with "The"

df = pandas.DataFrame(lines) # add remaining rows to a dataframe
df['Date'] = df[0].str.split(',').str.get(0) # split off the Date column
df['Value'] = df[0].str.split(',').str.get(1) # split off the value column
df = df.drop([0], axis=1) # drop the original column (0)
df['Date'] = pandas.to_datetime(df['Date']) # explicit data type cast
df['Value'] = pandas.to_numeric(df['Value']) # explicit data type cast

The final result is the same as method 1.

Hopefully that provides some ideas on how to deal with flat files in general. As mentioned at the start of this post, it is often challenging to deal with flat files as it is up to the data provider to produce a well structured file.

Posted in Power BI, Python, Python4PowerBI

Unpivot Excel data with Python

This post is based on another demo from my SQLSaturday session on Python integration in Power BI. It’s a handy real world scenario that could easily be re-applied with minor changes to the script. The primary pattern is unpivot functionality which is achieved through the melt method in pandas. A few additional useful dataframe techniques are illustrated, including pivot_table.

I am using the Tourist Accommodation data from the Statistics South Africa website. The file (tourism.xlsx) is also available on the SQLSaturday website. Once you open the file observe the following:


  • the first 8 columns (starting with the letter H) are the header columns
  • the remaining columns are a series of months i.e. MO012007 is month 1 (Jan), 2007
  • the numeric values have mixed scale including thousands, millions & percentages
  • column H04 is a mix of metric and accommodation type

The Python script will provide a way to transform the data even if the columns increase as months get added.

Step 1: Load Excel data into a dataframe

import pandas
df = pandas.read_excel('D:\\data\\excel\\tourism.xlsx')

The read_excel method accepts a host of parameters, including specifying the sheet_name. Our Excel file has only 1 sheet, so the sheet_name is not required.

Step 2: Split columns into actual headers and months

We start by adding the dataframe columns to a list. Then we loop through the list and split the month columns from the actual headers.

#create a list of all the columns
columns = list(df)
#create lists to hold headers & months
headers = []
months = []
#split columns list into headers and months
for col in columns:
    if col.startswith('MO'):
        months.append(col)
    else:
        headers.append(col)

Here we have the best case scenario because all month columns start with “MO”. If no pattern exists and the columns are sequenced, you could split on position, i.e. headers = columns[:8] and months = columns[8:] will produce the same result. The worst scenario is when the order of the columns is random. In this case you would manually specify the list elements.

Step 3: Melt (unpivot) the month columns

Unpivoting will transpose all month columns to rows. For each month column a new row is created using the same header columns.

df2 = pandas.melt(df,
                  id_vars=headers,
                  value_vars=months,
                  var_name='Date',
                  value_name='Val')

All month columns will be dropped and a new column ‘Date’ will contain the original month column header. The values will be added to a new column called ‘Val’.


Now reconstruct and convert the Date column to an actual date. We get the year and month parts from the column and just assume the day is always 1. This is explicitly converted to a date using the to_datetime method.

df2['Date'] = pandas.to_datetime(df2['Date'].str[4:9] + '-' + df2['Date'].str[2:4] + '-01')

Step 4: Standardize the values 

We noted at the start that the values were a mixture of numeric scales which is indicated in column H17. To determine the possible value scales we use df2.H17.unique() which returns the following list of unique values: [‘Thousand’, ‘Percentage’, ‘Rands’, ‘R million’]. We create a function to apply the scale to the Val column.


def noscale(row):
    if row['H17'] == 'Thousand':
        return row['Val'] * 1000
    elif row['H17'] =='R million':
        return row['Val'] * 1000000
    elif row['H17'] =='Percentage':
        return row['Val'] / 100
    else:
        return row['Val']

df2['FullValue'] = df2.apply(noscale, axis=1)

“Rands” are just the local currency (ZAR), so no scale is required.


Step 5: Additional clean up 

The next 3 steps in our script splits the composite column and removes unwanted rows and columns.


#split composite column
df2['Metric'] = df2['H04'].str.split(' - ').str.get(0)
df2['Category'] = df2['H04'].str.split(' - ').str.get(1)
#remove aggregate rows
df2 = df2[df2.Category != 'Total industry']
#remove unwanted columns
df2 = df2.drop(['H01','H02','H03','H04','H15','H17','H25','Val'], axis=1)

Note how rows are filtered. We set the dataframe equal to the existing dataframe where the Category column is not equal to ‘Total industry’.


At this point the script could be used as is, but in Power BI we would need to understand how the measure(s) would be created. For example, we could have a generic measure [Value] = SUM('Tourism'[FullValue]). We would then use the Metric column as a filter/slicer. Most likely you would rather want a measure such as [Units Available] = CALCULATE ( SUM ( 'Tourism'[FullValue] ), 'Tourism'[Metric] = "Stay units available" ). In this case it is cumbersome to create the measures, so we rather reshape the table to make things easier.

Step 6: Pivot the Metrics (optional)

If the metrics provided are fixed (consistent), then pivoting the metrics  is a good idea. This creates a column for each metric. This is possible with the Pandas pivot_table method.

df3 = pandas.pivot_table(df2,
                         values='FullValue',
                         columns=['Metric'],
                         index=['H16','Date','Category'])

From the results below, this appears to be correct.


The side effect of the pivot_table method is that the dataframe is re-indexed based on the index parameter passed to the method. This is not really a problem, but for Power BI it is. You may have noted from earlier posts that the Python Script in Power BI does not provide access to the dataframe index. So we would in effect lose the first three columns from our result above.

The fix is as simple as re-indexing the dataframe. We can also drop the ‘Total income’ column as we can let Power BI handle the aggregations.

df3.reset_index(inplace=True)
df3 = df3.drop(['Total income'], axis=1)

Our final script below will produce 3 dataframes which will be available for selection in Power BI.

import pandas

#read Excel into dataframe
df = pandas.read_excel('D:\\data\\excel\\tourism.xlsx')

#create place holder lists for actual headers vs months
headers = []
months = []

#split columns into headers and months
for col in list(df):
    if col.startswith('MO'):
        months.append(col)
    else:
        headers.append(col)

#unpivot months
df2 = pandas.melt(df,
                  id_vars=headers,
                  value_vars=months,
                  var_name='Date',
                  value_name='Val')

#create a date field from the month field
df2['Date'] = pandas.to_datetime(df2['Date'].str[4:] + '-' + df2['Date'].str[2:4] + '-01')

#create a function to resolve mixed numeric scale in the data
def noscale(row):
    if row['H17'] == 'Thousand':
        return row['Val'] * 1000
    elif row['H17'] =='R million':
        return row['Val'] * 1000000
    elif row['H17'] =='Percentage':
        return row['Val'] / 100
    else:
        return row['Val']

#apply the noscale function to the dataframe
df2['FullValue'] = df2.apply(noscale, axis=1)

#split composite column
df2['Metric'] = df2['H04'].str.split(' - ').str.get(0)
df2['Category'] = df2['H04'].str.split(' - ').str.get(1)
#remove aggregate rows
df2 = df2[df2.Category != 'Total industry']
#remove unwanted columns
df2 = df2.drop(['H01','H02','H03','H04','H15','H17','H25','Val'], axis=1)

#unpivot metrics
df3 = pandas.pivot_table(df2,
                         values='FullValue',
                         columns=['Metric'],
                         index=['H16','Date','Category'])
#reset index
df3.reset_index(inplace=True)
#drop aggregate column
df3 = df3.drop(['Total income'], axis=1)

All that is left is to paste your script into Power BI and select the dataframe that works best for you.


Of course go and experiment with Python visualization 😊

 

 

Posted in Power BI, Python, Python4PowerBI

Power BI date table with Python

I spoke about Python integration in Power BI at SQLSaturday #793 in Cape Town this past weekend. You can download the deck (which includes the notes) and the Python scripts (which include sample data) from the SQLSaturday site. One of the demos included the creation of a date table to support time intelligence DAX functions in Power BI. I thought the example would be interesting for a number of reasons:

  • Time intelligence is one of my favorite features of Power BI (Power Pivot & SSAS)
  • A few examples of M script to create date tables already exist (it would be interesting to compare)
  • Illustrate some handy dataframe capability (including manual creation of utility dataframes, and treating dataframes like tables to perform join operations)

Before we start, let’s quickly recap the basic criteria for a date table in Power BI:

  • A separate date table is required (and joined to the table to calculate against)
  • The date table must be at a day level (1 row equals 1 day)
  • The days must be contiguous (no gaps even if your data is not daily)
  • The days must be available for full years (1 Jan to 31 Dec)
  • A column of data type ‘Date’ must exist on the date table

Step 1: Create a list of sequential dates

For this we use the Pandas date_range method.

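A minimal sketch of this step (the start and end dates below are assumptions, not the values from the demo):

import pandas

dates = pandas.date_range(start='2015-01-01', end='2020-12-31', freq='D')
print(dates)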

The output shows the first 10 and last 10 elements of the list, the data type and the number of elements (length).

Step 2: Create a dataframe from the list

A list can easily be converted into a dataframe. The result is a dataframe with one column.

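A sketch of those two lines, building on the dates list above:

df = pandas.DataFrame(dates)        # a single column with the default name 0
df = df.rename(columns={0: 'Date'}) # rename the column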

The column is assigned a default name of 0. The second line renames the column.

Step 3: Derive additional columns

For this step we make use of the string format time (strftime) method to derive required columns from the Date column. Where a numeric data type column is required, we explicitly convert the column using the dataframe apply function.

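A hedged sketch of the kind of columns this step derives (the exact column set in the demo may differ; Year and MonthNumber are kept here because the later steps calculate and join on them):

df['Year'] = df['Date'].dt.strftime('%Y').apply(int)        # numeric columns are explicitly converted
df['MonthNumber'] = df['Date'].dt.strftime('%m').apply(int)
df['MonthName'] = df['Date'].dt.strftime('%B')
df['DayOfWeek'] = df['Date'].dt.strftime('%A')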

At this step you already have a basic date table. In many cases financial calendars do not end in December. In this example the financial year ends in March.

Step 4: Create a financial calendar lookup dataframe

I took a different approach with the financial calendar columns because I could easily do this inside the script with Python. Instead of using conditional evaluation (if, case, switch, etc), I decided to create a lookup “table” that a could join to. The lookup table would either provide the column I need, or help me create the column I need.


In the creation of this fy lookup dataframe, each column is provided as a list input i.e. each row in the code adds a column to the dataframe. All the lookup dataframe does is remap the financial periods to start in April (calendar month 4). The FinYearDif column will help us calculate the financial year in the next step.
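
A hedged sketch of such a lookup dataframe (MonthNumber and FinYearDif are named in the text; the other column names and the quarter mapping are assumptions):

fy = pandas.DataFrame({
    'MonthNumber':    list(range(1, 13)),                      # calendar months 1-12
    'FinMonthNumber': [10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9], # April becomes financial month 1
    'FinQuarter':     [4, 4, 4, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'FinYearDif':     [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]     # helper to calculate the financial year
    })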

Step 5: Add the financial year periods

We do this by doing a left outer join to the lookup dataframe. This is achieved through the merge method. We specify MonthNumber as the join column and outer as the join type. Finally we derive the financial year and then drop the helper column (axis=1 indicates the column axis).

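A sketch of that step, using the column names assumed above:

df = pandas.merge(df, fy, on='MonthNumber', how='outer') # join to the lookup dataframe
df['FinYear'] = df['Year'] + df['FinYearDif']            # derive the financial year
df = df.drop(['FinYearDif'], axis=1)                     # drop the helper column (axis=1 = columns)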

Add to Power BI with the Python Script …

Bonus step: Persist result to a SQL Server database

I hinted previously that it’s easy to persist the data to a SQL Server database using the dataframe.to_sql method. Add this after your code:

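A hedged example of that bonus step (the table name, schema and the if_exists/index choices are assumptions; engine is the SQLAlchemy engine created in the SQL Server connectivity post):

df.to_sql('DimDate', engine, schema='dbo', if_exists='replace', index=False)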

You may want to revisit the post on SQL Server connectivity. The last line of code is all that’s new.

A thought

I added the to_sql method and used this in my Python script for Power BI. If this will be supported in the Power BI Service, I may consider writing to SQL Server in the Python script, then reading the result from SQL Server. I could even run stored procedures that only load deltas to SQL Server. Should something like this be allowed in the Power BI Service, it could mean that a Power BI refresh also runs my (ETL) process to persist data to SQL Server.

Download the script

As mentioned at the start of this post, this was part of a SQL Saturday demo. The complete script (and more) is available from the SQL Saturday website for download.

I’ll unpack the other examples in future posts.

 

Posted in Power BI, Python, Self-Service BI

Python & Power BI: Load from SQL Server

This is part 4 of the Python integration in Power BI series. We are going to explore sourcing data from SQL Server (including Azure SQL DB). If you have been following the series, you would be familiar with the basics of Pandas dataframes. You saw how easy it is to load data from a CSV file into a dataframe. Loading data from SQL Server is just as easy … but the hard work is setting up database connection strings.

This post will cover how to connect to SQL Server with a library called SQLAlchemy, and how to load data from SQL Server into a Pandas dataframe. We cover a few other things along the way i.e. using Windows environment variables, multi-line strings and working with string parameters.

Protecting sensitive information

In the previous post we worked with the os (operating system library) to access files. We will use the os library again, but this time to protect sensitive information. During the development process I like to keep keys, passwords, server names, etc. outside of the script in a place that is not too easy to access. I store this info in User Environment Variables so that it is only accessible to me. (Any other ideas to protect sensitive info are welcome.)

The os library has a getenv (get environment variable) method. We retrieve the value from the environment variable and assign it to a Python variable as follows:

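A sketch of that step (the environment variable names are assumptions; sqldb and the ‘Sandbox’ value match the description below):

import os

sqlserver = os.getenv('SQL_SERVER')
sqldb = os.getenv('SQL_DB')
print(sqldb) # 'Sandbox'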

Here you can see that two values are read from the environment variables and we see that sqldb has a value ‘Sandbox’. In this way we ensure that our Python scripts don’t store sensitive information, but will still run successfully in our development environment. If you have problems with the environment variables, just use plain text for now.

SQLAlchemy connection strings

SQLAlchemy is quite sophisticated, but we will using it in a fairly limited way. The “engine” or connection is covered in the documentation, but rarely will we have a standard or basic connection. Most of the time we will have a non-standard connection, in which case we have to pass the connection parameters in a connection string. SQLAlchemy expects this string in URL format.

We will start by connecting to a SQL instance on a local machine. First we add the sqlalchemy library to our script. Then we store the connection string in a variable.
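
A sketch of that starting point (the server name and driver are assumptions; the database matches the value read from the environment variables above):

import sqlalchemy

conn_str = 'DRIVER={SQL Server};SERVER=MYSERVER\\SQL2017;DATABASE=Sandbox;Trusted_Connection=yes'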

To convert the connection string to URL format, we make use of the urllib library. Here we use the quote_plus method to convert spaces and special characters to URL format.


Maintaining connection strings in the above format is error prone and cumbersome. Python allows multi-line strings through the use of triple quotes. This can be either single quotes or double quotes.


Now we want to include the variables we created earlier. We are not going to construct the string from parts using concatenation. We are rather going to use the % placeholder for parameters in the string. We are only working with strings, so we will use %s as the placeholder. Note that they are interpreted in sequence.

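Pulling those three ideas together (quote_plus to URL-encode the string, a triple-quoted multi-line string, and %s placeholders), a sketch of the final connection string might look like this; the driver and trusted connection settings are assumptions:

import urllib.parse

conn_str = """
    DRIVER={SQL Server};
    SERVER=%s;
    DATABASE=%s;
    Trusted_Connection=yes;
""" % (sqlserver, sqldb)

params = urllib.parse.quote_plus(conn_str) # convert spaces and special characters to URL format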

 

Establishing a connection to SQL Server

We use the SQLAlchemy create_engine method to create an engine object that allows us to interact with the database e.g. the interaction could be as basic as establishing a query connection.

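A sketch of the engine creation, assuming the pyodbc driver is installed and using the URL-encoded parameters from above:

engine = sqlalchemy.create_engine('mssql+pyodbc:///?odbc_connect=%s' % params)
# engine = sqlalchemy.create_engine('mssql+pyodbc:///?odbc_connect=%s' % params, echo=True) # verbose output while troubleshooting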

By adding an echo parameter to the create_engine method, we get more connection info.


Only turn echo on if you are troubleshooting, otherwise its just a lot of noise.

Loading SQL results to a dataframe

This is the easy part. We have a choice between read_sql, read_sql_query and read_sql_table. Let’s start with the table.

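A sketch of read_sql_table (the table and column names are made up):

import pandas

df = pandas.read_sql_table('FactSales', engine, schema='dbo', columns=['SaleDate', 'Amount'])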

The schema (defaults to dbo) and columns (defaults to all columns) parameters are optional. There are quite a few other parameters. Have a look at the documentation.

To load results from a SQL query is only slightly more complex. We will make use of the multi-line string again for readability and maintenance.

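A sketch of read_sql_query with a multi-line query string (the table and column names are made up):

query = """
    SELECT SaleDate, SUM(Amount) AS Amount
    FROM dbo.FactSales
    GROUP BY SaleDate
"""

df = pandas.read_sql_query(query, engine)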

Connecting to Azure SQL DB

Hopefully by now you realize the benefits of using Azure SQL DB. Our company uses Azure SQL DB as a data management solution for Power BI. Yes, there is huge benefit in persisting and managing your data outside of Power BI! Perhaps I’ll cover this Power BI solution architecture in a future post. (hint … have a look at dataframe.to_sql)

Azure SQL DB makes use of the latest SQL drivers. SQLAlchemy will moan if you try to adapt the connection string we have above, which uses the standard SQL driver. Also note the additional settings such as encryption and trust certificates.

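A hedged sketch of an Azure SQL DB connection string (the driver version, server name, credential variables and security settings are all assumptions):

sqluser = os.getenv('SQL_USER') # hypothetical environment variables for the credentials
sqlpwd = os.getenv('SQL_PWD')

azure_conn_str = """
    DRIVER={ODBC Driver 17 for SQL Server};
    SERVER=yourserver.database.windows.net;
    DATABASE=%s;
    UID=%s;
    PWD=%s;
    Encrypt=yes;
    TrustServerCertificate=no;
""" % (sqldb, sqluser, sqlpwd)

params = urllib.parse.quote_plus(azure_conn_str)
engine = sqlalchemy.create_engine('mssql+pyodbc:///?odbc_connect=%s' % params)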

The rest works the same as the local SQL instance once you have created the engine.

Conclusion

By now you know that once your data is in a dataframe, the Python script in the Power Query editor will recognize your data.

Of course it is much easier to do this in Power Query without a line of code. But you may have scenarios where you perform data cleansing, transformations, combining, transposing, etc on the dataframe before loading your data into Power BI.

The key idea that drew my interest in the first place was the ability to re-use code across platforms and applications.

SQLSaturday #793 Cape Town

I’ll be speaking about Python integration in Power BI at SQLSaturday #793 Cape Town this coming Saturday, 8 September. If you are in Cape Town, come and join us. There are many great sessions to pick from.

 

Posted in Power BI, Python

Load files in a folder using Python

This is part 3 of the Basics of Python in Power BI series and builds on the previous post about Python dataframes. In this post we show you how to import files from a folder using Python. Combining files from a folder is currently an easy process in Power Query, but I mentioned in my first post that I am interested in the re-use of existing Python scripts.

For readability, I am switching between Spyder and JupyterLab, depending on what I need to show. I’ll also cover some foundation concepts so that an understanding is gained, rather than just an answer. Three concepts are covered in this post: working with files using the operating system (os) package, working with lists and a bit more dataframe functionality.

Working with files in Python

Python has a built-in os (operating system) library. This gives us access to methods such as getcwd (get current working directory), chdir (change directory) and listdir (list directory).

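A short sketch of those three methods (the folder path is an assumption):

import os

print(os.getcwd())                 # current working directory
os.chdir('d:\\data\\csv')          # change the current working directory
print(os.listdir('d:\\data\\csv')) # list the files in a folder without changing into it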

While getcwd returns the current working directory (or current path), chdir changes the path to the newly specified one and this becomes the current working directory. When we use listdir with a path, it only returns the list of files in that path, but does not change the current working directory.

Lists

The listdir method returns the result as a Python list data structure. The outputs above can be identified as lists because they are enclosed in square brackets and the items are comma separated. The items in these lists are all of data type string. When the items in a list have the same data type, the list is referred to as homogeneous. A list can even hold items of a mix of data types (heterogeneous).

The items in a list are indexed starting at 0. We can either find the content by providing an index, or we can find the index of the first occurrence of  an item.

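A minimal sketch of those operations (the file names are made up):

files = ['jan.csv', 'feb.csv', 'mar.csv']

print(files[1])               # 'feb.csv' - access an item by its index
print(files.index('mar.csv')) # 2 - index of the first occurrence of an item
files[0] = 'january.csv'      # lists are mutable: replace the item at an existing index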

We are even able to change the content (mutable) of the list item by specifying the index, as seen above (that index position must already exist).

We can iterate over the items in a list using a for loop construct.

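For example:

numbers = [1, 2, 3, 4]
for i in numbers:
    print(i * 2) # multiply each item by 2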

The syntax is equivalent to saying “for i = 0 to 3 multiply each item by 2”.

Since listdir returns a list,  we may loop through the list to display each file name.

 

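For example (same assumed folder as above):

for file in os.listdir('d:\\data\\csv'):
    print(file)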

More dataframes

We are going to loop through the files as above and add each file to a dataframe. We use the dataframe head method to display the first n rows.

 

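A sketch of that first attempt (the folder path is an assumption):

import pandas

folder = 'd:\\data\\csv'
for file in os.listdir(folder):
    df = pandas.read_csv(os.path.join(folder, file)) # each iteration overwrites df

df.head(5) # only the last file remains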

Our first attempt unfortunately keeps overwriting the previous loop iteration so we end up with the last file in the list only. To solve the problem, we are going to create an empty list, then append each dataframe as a list item. Note that the items are of type dataframe.

 

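A hedged sketch of what that final script might look like (the list name and the ValueDate/TheValue column names are assumptions; df_one and df_all match the dataframe names that show up in Power BI below):

import os, pandas

folder = 'd:\\data\\csv'
df_lst = [] # an empty list to hold one dataframe per file

for file in os.listdir(folder):
    df_one = pandas.read_csv(os.path.join(folder, file),
                             parse_dates=['ValueDate'],  # parse dates
                             dtype={'TheValue': float})  # set data types
    df_one['SourceFile'] = file # add a source file column
    df_lst.append(df_one)       # append each dataframe as a list item

df_all = pandas.concat(df_lst, ignore_index=True) # concatenate into a single dataframe
df_all.head(5)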

Our final script above has one critical step and a few desired additions. In the second last line we concatenate all the dataframe items in the list into a single dataframe. Pandas ensures that headers are not duplicated. In addition, we parsed dates & data types and added a source file column.

 

Power BI Query Editor

For the curious folk, let’s take a look at the M code. I stripped comments from the final script to eliminate distractions.


 

 

The first thing you will notice is that df_one and df_all are available for selection.


Instead of loading the query, select Edit to open the Power Query Editor window.

 


Now open the Advanced Editor to have a look at the M script. In the Source step we can see the Python.Execute command which allows us to run Python script.

 


I am still deciding if I am happy or disappointed to discover that Python.Execute accepts a single string parameter which contains the entire Python script. At least I was happy that it was easy to edit. I made a dummy (but valid) change to see what happens.

 

At this point I was prompted about execute permissions which I suppose makes sense as there is really no control over the content of the Python script.


 

When you click on Run, the Python script is executed again. Clicking on Close & Apply will take you back to Power BI as usual.

That is it for this post. I’ll continue with Python script for the Query Editor as I see that there is more focus on the Py visuals currently.

 

 

 

Posted in Power BI, Python

Basics of Python in Power BI: Part 2

This is a follow-on post to the Basics of Python in Power BI post. The idea behind the first post was to illustrate the most basic of Python integration in Power BI by keeping the Python code to a bare minimum. In this post I want to try and cover the basics of Pandas dataframes as this appears to be the primary way for Power BI to recognize data delivered via a Python script. Keep in mind that when I say “basics”, I am referring to dataframes in relation to Power BI. Also the results are from a discovery process with the preview, not from any Power BI reference material.

As a data science tool, the Python fundamentals to really master are data structures. DataCamp has a great post on the basics of data structures. In my brief journey with Python thus far, the built-in data structures I have needed to get to grips with are lists and dictionaries. I’ll cover the basics of lists and dictionaries in future posts. For now we will focus on Pandas dataframes, not just because they are supported in Power BI, but because the data science universe says so!

Dataframes

Pandas is a very popular Python package for data science because of its ability to manipulate and analyze data. This is possible because of data structures such as dataframes. The good news for Microsoft data platform folk is that Pandas shares many similarities with SQL, so much so that the Pandas documentation includes a comparison with SQL.

DataCamp defines a dataframe as “two-dimensional labeled data structures with columns of potentially different types” and “you could say that the Pandas DataFrame consists of three main components: the data, the index, and the columns”. For now, in order to get to grips with dataframes, lets think of it as a table with an index.

Recall how we read the CSV file into a Pandas dataframe in the previous post:

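A sketch of that step (the file path is an assumption; the file has the ValueDate and TheValue columns used later in this post):

import pandas

df = pandas.read_csv('d:\\data\\csv\\data.csv')
df.head(5)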

The original file had 2 columns. When we used the read_csv method to add the file content to the dataframe, the index was automatically added.

Note: throughout the post I am using the head(5) method to display the top 5 records. You don’t need this in Power BI. In fact Power BI ignores this as it does not modify the dataframe.

Mind the index

When we use the Python script in Power BI, it does not display the index.


If we have our own index column in the data e.g. ValueDate, we could specify the index column as a parameter in the read_csv method as follows:

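A sketch with the index column specified (same assumed file as above):

df = pandas.read_csv('d:\\data\\csv\\data.csv', index_col='ValueDate')
df.head(5)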

As we have noted, Power BI does not display the index … even if it is one of the data columns!


It may be of interest that the to_sql method includes the index column when loading to SQL. I’ll blog about this in a future post as a reference for those who want to persist their data to SQL.

Watch the data types

While I am unable to recreate the scenario, I had a case where Load vs Edit displayed different behavior in the Query Editor. In this scenario, Edit did not show the Change Type step. In this case I manually changed the data types to date and decimal.

I tried to overcome this by forcing the data types to be set during the read_csv method by parsing the date and setting the data type for TheValue as float. Data types for read_csv can only be set to string, float or int (integer). If we compare the raw date vs parsed date, we will see that the raw date is treated as a string.

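A sketch of the comparison (same assumed file as above):

df_raw = pandas.read_csv('d:\\data\\csv\\data.csv')
df_parsed = pandas.read_csv('d:\\data\\csv\\data.csv', parse_dates=['ValueDate'])

print(df_raw.dtypes)    # ValueDate shows as object (string)
print(df_parsed.dtypes) # ValueDate shows as datetime64[ns]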

Although TheValue is correctly detected as float, we can ensure that the column is cast as float by adding the dtype parameter to the read_csv method.

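For example:

df = pandas.read_csv('d:\\data\\csv\\data.csv',
                     parse_dates=['ValueDate'],
                     dtype={'TheValue': float})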

It appears that Power BI still applies a Change Type step regardless of this.


This is not really an issue, but it is useful to know that the read_csv method can handle data type changes including decimals with a comma or dates in other formats. Perhaps this will change in Power BI in the future.

Sort for compression

I am not the expert on the technicalities of compression, but I read a blog post by Matt Allington discussing the impact of unsorted data on large data models. Since it’s easy to sort data with a dataframe, we may as well apply this to our code to avoid potential pitfalls.

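A sketch of the sort, building on the read_csv call above (sorting on the date column is an assumption about the demo):

df = pandas.read_csv('d:\\data\\csv\\data.csv',
                     parse_dates=['ValueDate'],
                     dtype={'TheValue': float})
df = df.sort_values(by='ValueDate')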

Or we could write it like this:

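A sketch of the chained version, under the same assumptions:

df = pandas.read_csv('d:\\data\\csv\\data.csv',
                     parse_dates=['ValueDate'],
                     dtype={'TheValue': float}).sort_values(by='ValueDate')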

Summary

Dataframes are powerful data structures. We have scratched the surface by touching on specifics that may affect Power BI. This includes dataframe indexes, dealing with data types and sorting data.

For more info on handling CSV files (headers, skipping rows, missing values, etc) I’ll have to send you to DataCamp.

Disclaimer: I am in no way affiliated to DataCamp, but find the content incredibly useful for getting started with Python.

 

 

 

 

 

Posted in Power BI, Python