Python Across Disciplines
by John Gordon © 2023

Datasets

Overview

Datasets are structured collections of data, commonly used in various fields such as science, finance, and information technology for analysis, machine learning, and statistical modeling. Typically organized in a tabular format with rows and columns, datasets can range from simple spreadsheets to complex databases, where each row represents an individual record or observation and each column represents a specific variable or attribute of that record. For example, a dataset for a weather study might include columns for date, temperature, humidity, and wind speed, with each row containing the measurements for a specific day. Datasets can be sourced from various origins, including publicly available datasets on the internet, proprietary data collected by organizations, or generated through simulations.

Concept: Dataset

In programming, a dataset is a collection of data that is typically organized and structured in some manner. It serves as the raw material from which information can be extracted, processed, analyzed, and interpreted. Datasets can come in various forms and sizes, ranging from simple arrays of numbers to complex, multi-dimensional structures containing diverse types of data. They can be sourced from various places like databases, file systems, APIs, or generated through simulations and experiments. In the context of data analysis, machine learning, or scientific research, a dataset is crucial as it forms the basis upon which algorithms operate and insights are drawn. The quality, relevance, and integrity of a dataset greatly influence the outcomes of any computational process applied to it. Good dataset management practices involve cleaning, organizing, and sometimes transforming data to make it suitable for specific tasks. The representation and handling of datasets are often facilitated by programming libraries and tools tailored to the specific needs of the data and the objectives of the project.

Datasets can range from small, simple collections used for basic tasks to vast, complex aggregations of information employed in advanced computational analyses. The significance of datasets extends beyond mere data storage; they are invaluable for training algorithms, uncovering insights through data analysis, and driving decision-making processes in various fields.

Many datasets are publicly available, offering a rich resource for programmers and data scientists. They can be found on platforms like Kaggle, which hosts datasets ranging from historical election results to the latest trends in video gaming, and the UCI Machine Learning Repository, which offers datasets for machine learning experiments, such as the famous Iris dataset and the Wine Quality dataset. Government websites also provide access to public data, such as demographic information and economic indicators.

With its robust libraries and tools, Python is well suited to importing, processing, and analyzing datasets. This section walks through the fundamentals of working with datasets in Python: how to access the data, manipulate it, and extract useful information from it.

Example Dataset

Let's look at an example dataset and walk through how we might process it. The example dataset below contains the high and low temperatures for each day of November 2023 in Salt Lake City, Utah. Notice that there is a header row that labels each attribute (column), followed by one record (row) for each day of the month. Each record contains the date, the high temperature, and the low temperature for that day.

Date,Hi,Lo
2023-11-01,58,32
2023-11-02,64,35
2023-11-03,67,44
2023-11-04,67,41
2023-11-05,65,45
2023-11-06,71,48
2023-11-07,52,38
2023-11-08,49,34
2023-11-09,49,31
2023-11-10,53,31
2023-11-11,54,31
2023-11-12,63,33
2023-11-13,66,42
2023-11-14,66,38
2023-11-15,68,44
2023-11-16,59,41
2023-11-17,58,36
2023-11-18,52,39
2023-11-19,49,35
2023-11-20,46,34
2023-11-21,48,30
2023-11-22,49,30
2023-11-23,42,32
2023-11-24,36,32
2023-11-25,38,29
2023-11-26,36,29
2023-11-27,40,25
2023-11-28,41,24
2023-11-29,36,25
2023-11-30,34,30

Data Details

What kinds of information, summaries, or aggregations could we create using this dataset?

  1. Calculate the average temperature for each day.
  2. Determine the date that had the highest temperature of the month.
  3. Determine the date that had the lowest temperature of the month.
  4. Calculate the average temperature of the entire month.
  5. Create a graph that visually depicts the temperature of all of the days of the month.

Let's walk through a solution that handles all five of these tasks. First, let's set up the data in a Python data structure. There are several options, but for this example I will format the data as a list of lists directly in my code file. In a production system, the data would usually be stored in a file or a database table rather than in the code itself, but for demonstration it's easier to see everything (data and code) together. For now, I'll keep it all in my code file (I'll show a CSV file-based example for comparison below).

Do you remember how to create a list of lists? Here's the general form we need:

list_of_lists = [[List 1], [List 2], [List 3], ..., [List n]]

In the case of our dataset above, we would form a list of each record (row). So, for example, we would reformat the first record of the dataset:

2023-11-01,58,32

as

["2023-11-01", 58, 32]
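This conversion can also be done in code rather than by hand. The following is a minimal sketch that splits one line on its commas and converts the two temperature fields to integers; the function name parse_record is an illustrative choice and is not used elsewhere in this section.

```python
def parse_record(line):
    """Convert one CSV-style line such as '2023-11-01,58,32' into a list."""
    date, hi, lo = line.strip().split(",")
    # The date stays a string; the two temperatures become integers.
    return [date, int(hi), int(lo)]


record = parse_record("2023-11-01,58,32")
print(record)
```

Note that Python displays the date with single quotes when it prints the list; the value is the same string either way.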

If we then format all 30 records in the dataset as lists, separate the lists with commas, and place the entire set of lists inside the square brackets of an outer list, it looks like this:

# Dataset formatted as a list of lists
data = [
  ["2023-11-01", 58, 32], ["2023-11-02", 64, 35], ["2023-11-03", 67, 44],
  ["2023-11-04", 67, 41], ["2023-11-05", 65, 45], ["2023-11-06", 71, 48],
  ["2023-11-07", 52, 38], ["2023-11-08", 49, 34], ["2023-11-09", 49, 31],
  ["2023-11-10", 53, 31], ["2023-11-11", 54, 31], ["2023-11-12", 63, 33],
  ["2023-11-13", 66, 42], ["2023-11-14", 66, 38], ["2023-11-15", 68, 44],
  ["2023-11-16", 59, 41], ["2023-11-17", 58, 36], ["2023-11-18", 52, 39],
  ["2023-11-19", 49, 35], ["2023-11-20", 46, 34], ["2023-11-21", 48, 30],
  ["2023-11-22", 49, 30], ["2023-11-23", 42, 32], ["2023-11-24", 36, 32],
  ["2023-11-25", 38, 29], ["2023-11-26", 36, 29], ["2023-11-27", 40, 25],
  ["2023-11-28", 41, 24], ["2023-11-29", 36, 25], ["2023-11-30", 34, 30]
]

A list of lists like this will make it reasonably easy to loop through the dataset to process each record.
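As one possible illustration, tasks 2 and 3 from the list above (the dates with the highest and lowest temperatures) can be handled in a single pass over the list of lists. This is a sketch, not the only approach; the variable names hottest and coldest are my own choices for this example.

```python
data = [
    ["2023-11-01", 58, 32], ["2023-11-02", 64, 35], ["2023-11-03", 67, 44],
    ["2023-11-04", 67, 41], ["2023-11-05", 65, 45], ["2023-11-06", 71, 48],
    ["2023-11-07", 52, 38], ["2023-11-08", 49, 34], ["2023-11-09", 49, 31],
    ["2023-11-10", 53, 31], ["2023-11-11", 54, 31], ["2023-11-12", 63, 33],
    ["2023-11-13", 66, 42], ["2023-11-14", 66, 38], ["2023-11-15", 68, 44],
    ["2023-11-16", 59, 41], ["2023-11-17", 58, 36], ["2023-11-18", 52, 39],
    ["2023-11-19", 49, 35], ["2023-11-20", 46, 34], ["2023-11-21", 48, 30],
    ["2023-11-22", 49, 30], ["2023-11-23", 42, 32], ["2023-11-24", 36, 32],
    ["2023-11-25", 38, 29], ["2023-11-26", 36, 29], ["2023-11-27", 40, 25],
    ["2023-11-28", 41, 24], ["2023-11-29", 36, 25], ["2023-11-30", 34, 30]
]

# Start with the first record, then replace it whenever a more extreme one appears.
hottest = data[0]
coldest = data[0]
for record in data:
    date, hi, lo = record
    if hi > hottest[1]:   # index 1 holds the high temperature
        hottest = record
    if lo < coldest[2]:   # index 2 holds the low temperature
        coldest = record

print("Highest temperature:", hottest[1], "on", hottest[0])
print("Lowest temperature:", coldest[2], "on", coldest[0])
```

Because the comparisons are strict (> and <), a tie would keep the earliest date. In this dataset both extremes are unique: 71 on 2023-11-06 and 24 on 2023-11-28.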

Next, I'll write the code to create a few variables to use during processing, loop through the list of lists to calculate each day's average temperature, and append each day's values to a processed_data list. The loop also keeps a running total temperature and a record count that I'll use at the end to calculate the average temperature for the month. After the calculation loop, I'll print the results.

data = [
    ["2023-11-01", 58, 32], ["2023-11-02", 64, 35], ["2023-11-03", 67, 44],
    ["2023-11-04", 67, 41], ["2023-11-05", 65, 45], ["2023-11-06", 71, 48],
    ["2023-11-07", 52, 38], ["2023-11-08", 49, 34], ["2023-11-09", 49, 31],
    ["2023-11-10", 53, 31], ["2023-11-11", 54, 31], ["2023-11-12", 63, 33],
    ["2023-11-13", 66, 42], ["2023-11-14", 66, 38], ["2023-11-15", 68, 44],
    ["2023-11-16", 59, 41], ["2023-11-17", 58, 36], ["2023-11-18", 52, 39],
    ["2023-11-19", 49, 35], ["2023-11-20", 46, 34], ["2023-11-21", 48, 30],
    ["2023-11-22", 49, 30], ["2023-11-23", 42, 32], ["2023-11-24", 36, 32],
    ["2023-11-25", 38, 29], ["2023-11-26", 36, 29], ["2023-11-27", 40, 25],
    ["2023-11-28", 41, 24], ["2023-11-29", 36, 25], ["2023-11-30", 34, 30]
]

processed_data = []
total_temp = 0
count = 0
for record in data:
    date, hi, lo = record
    avg_temp = (hi + lo) / 2
    processed_data.append([date, hi, lo, avg_temp])  # Append as list
    total_temp += avg_temp
    count += 1
print()
print("Salt Lake City\nNovember 2023 Temperature Chart")
print("-" * 40)
print("Date".ljust(15) + "High".ljust(10) + "Low".ljust(10))
print("-" * 40)
for row in data:
    # row[1] is the high temperature and row[2] is the low temperature
    print(row[0].ljust(15), str(row[1]).ljust(10), str(row[2]).ljust(10))
print("-" * 40)
monthly_avg_temp = total_temp / count
print("Average Temperature: ", "{:.2f}".format(monthly_avg_temp), "Degrees")
print("-" * 40)

Output

Salt Lake City
November 2023 Temperature Chart
----------------------------------------
Date           High      Low       
----------------------------------------
2023-11-01      58         32        
2023-11-02      64         35        
2023-11-03      67         44        
2023-11-04      67         41        
2023-11-05      65         45        
2023-11-06      71         48        
2023-11-07      52         38        
2023-11-08      49         34        
2023-11-09      49         31        
2023-11-10      53         31        
2023-11-11      54         31        
2023-11-12      63         33        
2023-11-13      66         42        
2023-11-14      66         38        
2023-11-15      68         44        
2023-11-16      59         41        
2023-11-17      58         36        
2023-11-18      52         39        
2023-11-19      49         35        
2023-11-20      46         34        
2023-11-21      48         30        
2023-11-22      49         30        
2023-11-23      42         32        
2023-11-24      36         32        
2023-11-25      38         29        
2023-11-26      36         29        
2023-11-27      40         25        
2023-11-28      41         24        
2023-11-29      36         25        
2023-11-30      34         30        
----------------------------------------
Average Temperature:  43.57 Degrees
----------------------------------------
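The running total and counter in the loop above can be cross-checked with Python's built-in sum() and len() functions. The sketch below is an alternative way to reach the same monthly average, not a replacement for the loop; the name daily_averages is introduced here only for illustration.

```python
data = [
    ["2023-11-01", 58, 32], ["2023-11-02", 64, 35], ["2023-11-03", 67, 44],
    ["2023-11-04", 67, 41], ["2023-11-05", 65, 45], ["2023-11-06", 71, 48],
    ["2023-11-07", 52, 38], ["2023-11-08", 49, 34], ["2023-11-09", 49, 31],
    ["2023-11-10", 53, 31], ["2023-11-11", 54, 31], ["2023-11-12", 63, 33],
    ["2023-11-13", 66, 42], ["2023-11-14", 66, 38], ["2023-11-15", 68, 44],
    ["2023-11-16", 59, 41], ["2023-11-17", 58, 36], ["2023-11-18", 52, 39],
    ["2023-11-19", 49, 35], ["2023-11-20", 46, 34], ["2023-11-21", 48, 30],
    ["2023-11-22", 49, 30], ["2023-11-23", 42, 32], ["2023-11-24", 36, 32],
    ["2023-11-25", 38, 29], ["2023-11-26", 36, 29], ["2023-11-27", 40, 25],
    ["2023-11-28", 41, 24], ["2023-11-29", 36, 25], ["2023-11-30", 34, 30]
]

# Build a list of daily averages; the date is not needed, so it is ignored with _.
daily_averages = [(hi + lo) / 2 for _, hi, lo in data]

# sum() and len() take the place of the running total and the counter.
monthly_avg_temp = sum(daily_averages) / len(daily_averages)
print(f"Average Temperature: {monthly_avg_temp:.2f} Degrees")
```

This prints an average of 43.57, which matches the result of the loop-based version.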

Code & Output Details


Improving the Output of the Processed Dataset

The table above lists only the original high and low values; the daily averages stored in processed_data are never shown. Printing that list directly with print(processed_data) would produce a single long, unformatted line that is unsuitable for users to read. Since it is a list of lists, how might we improve the output?

One approach is to iterate through the processed_data list and print one line per day that displays the date and its average temperature, followed by a row of text characters whose length visually represents that average. Study the following code, which is a copy of the code above with the table-printing section replaced by a loop over processed_data that builds this text graph.

data = [
    ["2023-11-01", 58, 32], ["2023-11-02", 64, 35], ["2023-11-03", 67, 44],
    ["2023-11-04", 67, 41], ["2023-11-05", 65, 45], ["2023-11-06", 71, 48],
    ["2023-11-07", 52, 38], ["2023-11-08", 49, 34], ["2023-11-09", 49, 31],
    ["2023-11-10", 53, 31], ["2023-11-11", 54, 31], ["2023-11-12", 63, 33],
    ["2023-11-13", 66, 42], ["2023-11-14", 66, 38], ["2023-11-15", 68, 44],
    ["2023-11-16", 59, 41], ["2023-11-17", 58, 36], ["2023-11-18", 52, 39],
    ["2023-11-19", 49, 35], ["2023-11-20", 46, 34], ["2023-11-21", 48, 30],
    ["2023-11-22", 49, 30], ["2023-11-23", 42, 32], ["2023-11-24", 36, 32],
    ["2023-11-25", 38, 29], ["2023-11-26", 36, 29], ["2023-11-27", 40, 25],
    ["2023-11-28", 41, 24], ["2023-11-29", 36, 25], ["2023-11-30", 34, 30]
]
processed_data = []
total_temp = 0
count = 0

for record in data:
    date, hi, lo = record
    avg_temp = (hi + lo) / 2
    processed_data.append([date, hi, lo, avg_temp])
    total_temp += avg_temp
    count += 1
monthly_avg_temp = total_temp / count
print("-" * 90)
print("Salt Lake City - November 2023 - Temperature Chart")
print("-" * 90)
for record in processed_data:
    date, _, _, avg_temp = record  # Unpack each list
    # One block character per whole degree; int() truncates 42.5 to 42
    hash_graph = '\u2584' * int(avg_temp)
    print(f"{date}: {avg_temp}°F - {hash_graph}")
print("-" * 90)
print("Average Temperature during November 2023: ", "{:.2f}".format(monthly_avg_temp))
print("-" * 90)

Output

------------------------------------------------------------------------------------------
Salt Lake City - November 2023 - Temperature Chart
------------------------------------------------------------------------------------------
2023-11-01: 45.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-02: 49.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-03: 55.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-04: 54.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-05: 55.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-06: 59.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-07: 45.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-08: 41.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-09: 40.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-10: 42.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-11: 42.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-12: 48.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-13: 54.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-14: 52.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-15: 56.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-16: 50.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-17: 47.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-18: 45.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-19: 42.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-20: 40.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-21: 39.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-22: 39.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-23: 37.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-24: 34.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-25: 33.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-26: 32.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-27: 32.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-28: 32.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-29: 30.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-30: 32.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
------------------------------------------------------------------------------------------
Average Temperature during November 2023:  43.57
------------------------------------------------------------------------------------------


Note: While the above textual graph is an improvement over simply printing the Python list of lists, data visualizations are more commonly produced graphically, with colors, shapes, and images. We will use this same example dataset to create a more visual version of similar data when we look at Data Visualization later in the book.
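The examples so far typed the data directly into the code file. As a sketch of the CSV file-based alternative mentioned earlier, the code below reads the same header and the first five records with the csv module from Python's standard library. To keep the sketch runnable without a file on disk, the CSV text is held in a string and wrapped in io.StringIO; with a real file, a call to open() on your own file name would take its place.

```python
import csv
import io

# First five rows of the example dataset, stored in a string for demonstration.
csv_text = """Date,Hi,Lo
2023-11-01,58,32
2023-11-02,64,35
2023-11-03,67,44
2023-11-04,67,41
2023-11-05,65,45
"""

data = []
# DictReader uses the header row, so each field can be looked up by column name.
reader = csv.DictReader(io.StringIO(csv_text))
for row in reader:
    # csv values arrive as strings, so the temperatures are converted to integers.
    data.append([row["Date"], int(row["Hi"]), int(row["Lo"])])

for date, hi, lo in data:
    avg_temp = (hi + lo) / 2
    print(f"{date}: {avg_temp}")
```

Once the records are loaded, data has the same list-of-lists shape used throughout this section, so the processing code shown earlier works on it unchanged.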

Publicly Available Datasets

Many organizations and websites make datasets available to the public. Here is a short list of a few of them to get you started exploring different data types available in datasets.

Public datasets are excellent resources for learning, research, and analysis. Most of the datasets available at the above locations are free for educational and non-commercial use. Some are small and others are huge. I recommend that you visit some of these sites and explore the types of data available to the public.



 





© 2023 John Gordon
Cascade Street Publishing, LLC