Python Across Disciplines
by John Gordon © 2023


Parsing Data




Overview

In Chapter 2, I introduced numerous concepts about data and ways in which we might acquire data to process in our programs. String data in the form of characters, words, phrases, sentences, paragraphs, and so on is very common in computing and often needs to be processed in many different ways. Examples of processing string data include counting characters or words, locating instances of particular words or phrases, extracting key words, names, places, and dates (often called entities), and validating important information in textual data. On this page we will explore parsing strings of several types.

What is Parsing?

While we know how to read the entire contents of a file into a variable and print it, we often need to examine individual parts of that content rather than treat it as a single block. To examine long strings more closely, we use a technique called parsing.

Concept: Parsing

Parsing strings in Python involves analyzing a string's structure and extracting specific data from it according to a predefined pattern or structure. This process is essential in various applications, such as data analysis, web scraping, and configuration file management. Python offers multiple methods and libraries to facilitate string parsing, including built-in functions like split() for dividing a string into a list of substrings (called tokens) based on a delimiter, and strip() for trimming whitespace. For more complex parsing needs, Python provides the re module, which supports regular expressions allowing for sophisticated searching, matching, and manipulation of string patterns. This capability enables developers to extract specific pieces of information from strings, validate string formats, or transform strings in powerful ways, adapting to the diverse needs of different programming scenarios.
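As a minimal sketch of the three tools named above, the following uses strip(), split(), and the re module on one string. The sample record and its field layout are assumptions made up for illustration.

```python
import re

# Hypothetical sample record; the field layout is assumed for illustration.
raw = "  name=Ada Lovelace; born=1815-12-10; field=mathematics  "

# strip() trims leading and trailing whitespace.
clean = raw.strip()

# split() divides the string into tokens at each "; " delimiter.
fields = clean.split("; ")
print(fields)

# re.findall() returns every substring that matches a pattern; here, a date
# written as four digits, two digits, and two digits separated by hyphens.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", clean)
print(dates)
```

The split() call is enough when the delimiter is fixed; the regular expression is useful when we are looking for a shape of text (such as a date) rather than a separator.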

Here is a visual representation of parsing a string for the purposes of counting the number of words in the string:



Concept: Token

Tokens in the context of string data refer to the individual components or pieces that result from dividing a string based on specific criteria, such as delimiters. Tokenization is the process of splitting a string into these smaller parts or tokens, which can be words, numbers, or symbols, depending on the content of the string and the rules applied for splitting. This concept is fundamental in text processing, parsing, and natural language processing (NLP) tasks, where analyzing and understanding the textual data at a granular level is essential. For example, when processing a sentence, tokenization might involve splitting the sentence into individual words or punctuation marks. Python's split() method on strings is a straightforward way to tokenize a string based on whitespace or other specified delimiters. Additionally, libraries like NLTK (Natural Language Toolkit) provide more sophisticated tools for tokenization, capable of handling complex patterns, such as separating contractions or distinguishing punctuation. Tokenization is a crucial first step in preparing text for further analysis, such as counting word frequencies, performing sentiment analysis, or building machine learning models for text classification.
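To make the difference between whitespace tokenization and punctuation-aware tokenization concrete, here is a small sketch. The regular expression is a simple illustrative assumption and is far less capable than the tokenizers in a library such as NLTK (for example, it would split a contraction like "don't" into three pieces).

```python
import re

sentence = "To be or not to be, that is the question."

# split() with no arguments tokenizes on whitespace only, so punctuation
# stays attached to the neighboring word ("be," and "question.").
word_tokens = sentence.split()
print(word_tokens)

# Illustrative pattern: a run of word characters, OR a single character
# that is neither a word character nor whitespace (that is, punctuation).
all_tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(all_tokens)
```

With split(), the sentence yields 10 tokens; with the pattern above, the comma and the period become tokens of their own, for a total of 12.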



Dealing with Punctuation

One of the first decisions we need to make when parsing string data is what to do about punctuation. In some cases, we'll want to preserve punctuation during parsing, but most commonly we'll remove it so that our parsed tokens are independent of the punctuation that existed in the original string. There are a number of approaches we can use to remove punctuation from a string. For now, we'll use repetition to do this, and then later we'll learn more efficient approaches.

Let's use the string "To be or not to be, that is the question." as an example. In this string, there are two punctuation characters: the comma and the ending period. We could simply use the string replace() method to remove the two punctuation characters, like this:

s = "To be or not to be, that is the question."
print(s)
s = s.replace(",","")
print(s)
s = s.replace(".","")
print(s)

Output

To be or not to be, that is the question.
To be or not to be that is the question.
To be or not to be that is the question

While this works, it's really restricted to the one string example we have right now. What about other strings with different punctuation in them?

A more general approach is to build a list of the punctuation characters we want to eliminate, and then use comparison and repetition (a loop) to copy only the characters that are not in that list into a new string. Here's an example:

shake_string = "To be or not to be, that is the question."
out_string = ""
punctuation = ["!", "\"", "#", "$", "%", "&", "\'",
               "(", ")", "*", "+", ",", "-", ".",
               "/", ":", ";", "<", "=", ">", "?",
               "@", "[", "\\", "]", "^", "_", "{",
               "|", "}", "~", "`"]
for s in shake_string:
  if s not in punctuation:
    out_string += s
print(shake_string)
print(out_string)

Output

To be or not to be, that is the question.
To be or not to be that is the question
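As a sketch of one alternative, the standard library's string module provides a punctuation constant containing the ASCII punctuation characters, so the list does not have to be typed by hand. The variable names follow the example above; the translate() variant at the end is one possible more compact approach, and not necessarily the one used later in this book.

```python
import string

shake_string = "To be or not to be, that is the question."

# string.punctuation is a ready-made string of ASCII punctuation characters.
out_string = ""
for s in shake_string:
    if s not in string.punctuation:
        out_string += s
print(out_string)

# A loop-free variant: build a translation table that deletes every
# punctuation character, then apply it to the whole string at once.
table = str.maketrans("", "", string.punctuation)
print(shake_string.translate(table))
```

Both approaches produce the same result for this string. The loop version keeps the logic visible, which suits our focus on repetition.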


Resolving Case Issues

Depending on the reason we are parsing strings, in addition to dealing with punctuation, we may also need to consider whether to maintain the case of the objects in the original string or to make them all the same. In some circumstances, like the need to preserve capitalization of nouns, it is important to maintain the original case. In those circumstances, we would not need to do anything to the string after removing punctuation. However, in other circumstances, like the need to easily compare the objects in the original string with other objects, it may be best to set the entire string to the same case (upper or lower).

For our purposes here, we will set the case to all lower case as we will be comparing objects from our original string with search words and it will be simpler to do so if the cases are all the same. The following code is identical to the code in the punctuation removal section above, other than the addition of the lower() method to the variable s on Code Line 10.

shake_string = "To be or not to be, that is the question."
out_string = ""
punctuation = ["!", "\"", "#", "$", "%", "&", "\'",
               "(", ")", "*", "+", ",", "-", ".",
               "/", ":", ";", "<", "=", ">", "?",
               "@", "[", "\\", "]", "^", "_", "{",
               "|", "}", "~", "`"]
for s in shake_string:
  if s not in punctuation:
    out_string += s.lower()    # << This is the change to lower case the output string.
print(shake_string)
print(out_string)

The decision to change case or not is often indicated in data processing requirements or becomes obvious as we develop our solutions.

Parsing Strings

Now that we have learned the fundamentals of dealing with punctuation, we'll learn to parse strings. The primary string method we use to parse strings is the split() method. This method divides a string into tokens (substrings) based on a specified delimiter (separator). By default, the delimiter is any run of whitespace (spaces, newlines, tabs, etc.); otherwise, split() will divide the string using the separator that we specify. The result of using the split() method is a Python list containing the parsed parts (tokens) of the original string.

Concept: Delimiter

A delimiter is a character or sequence of characters used to specify the boundary between separate, independent regions in text or data streams. Delimiters play a crucial role in string data processing, particularly in parsing, where they help to split a string into components or tokens based on specific boundaries. Common examples of delimiters include commas (,), semicolons (;), spaces ( ), and newlines (\n). Python provides various methods to work with delimited string data, such as the split() method of string objects, which splits a string into a list of tokens based on a specified delimiter. For instance, using a comma as a delimiter, a CSV (Comma-Separated Values) string can be parsed into individual components. The choice of delimiter is crucial for accurately interpreting the intended structure of the data, and it often depends on the format and specifications of the input data being processed.
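The following sketch shows that a delimiter can be a single character, a newline, or a sequence of several characters. The sample strings are hypothetical and chosen only to illustrate each case.

```python
# Hypothetical sample strings, chosen only to illustrate different delimiters.
semi = "red;green;blue"
lines = "first line\nsecond line\nthird line"
arrows = "start -> middle -> end"

semi_tokens = semi.split(";")        # single-character delimiter
line_tokens = lines.split("\n")      # newline as the delimiter
arrow_tokens = arrows.split(" -> ")  # multi-character delimiter

print(semi_tokens)
print(line_tokens)
print(arrow_tokens)
```

In each case the delimiter itself is discarded, and only the text between delimiters is kept as tokens.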

Here are a few examples of using split() with the default delimiter and using a specified delimiter:

Example 1

# Using split() with the default delimiter (that is, none specified)
txt = "This is a string."
lst = txt.split()
print(lst)

Output

['This', 'is', 'a', 'string.']
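One detail worth seeing in code: calling split() with no argument is not the same as calling split(" "). The sketch below uses a made-up string that contains two adjacent spaces, a tab, and a newline.

```python
# Hypothetical string with two spaces, a tab (\t), and a newline (\n).
txt = "one  two\tthree\nfour"

# No argument: any run of whitespace counts as one delimiter.
default_tokens = txt.split()
print(default_tokens)

# Explicit single space: only the space character is a delimiter,
# and every individual space marks a boundary.
space_tokens = txt.split(" ")
print(space_tokens)
```

The default call returns the four words. The second call produces an empty token between the two adjacent spaces and leaves the tab and newline inside the last token, which is rarely what we want when counting words.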


Example 2

If we use the same code again, but this time include a specific delimiter, such as the letter s, we can see different results:

# Using split() and specifying a delimiter
txt = "This is a string."
lst = txt.split('s')
print(lst)

Output

['Thi', ' i', ' a ', 'tring.']


Example 3

Another (more common) example of using split() with a delimiter is comma-separated data where there are words or strings separated by commas, like this:

# Using split() and specifying comma as the delimiter
txt = "Bob Smith, 1234 Nowhere Lane, Someplace, UT, 84999"
lst = txt.split(',')
print(lst)

Output

['Bob Smith', ' 1234 Nowhere Lane', ' Someplace', ' UT', ' 84999']
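Notice that every token after the first begins with a space, because only the comma was treated as the delimiter. A minimal sketch of one way to clean this up is to loop through the tokens and apply strip() to each one (the name clean_lst is an assumption for illustration):

```python
txt = "Bob Smith, 1234 Nowhere Lane, Someplace, UT, 84999"

clean_lst = []
for tok in txt.split(','):
    # strip() removes leading and trailing whitespace from each token.
    clean_lst.append(tok.strip())

print(clean_lst)
```

Splitting on ", " (comma followed by a space) would also work for this string, but stripping each token is more tolerant of data where the spacing after commas is inconsistent.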


Use Cases

Once we have parsed a string, and have a list of the tokens, we can use those tokens for various purposes. Here are some examples:

Use Case: Counting Words

A common task to perform with a parsed string is to count the number of words in the string. As with most things in programming, there's more than one way to count words in a string. Since we are focused on repetition structures in this chapter, we'll look at an approach using repetition for now, and will learn more approaches as we proceed. Since the tokens of a string are stored in a list as a result of using the split() method, counting the number of words in a string is actually counting tokens in a list. Given this, we can simply iterate through the list using the for in construct and use an accumulator variable to count the number of iterations through the list.

Here's an example:

# For the sake of simplicity and focus, we will assume that punctuation
# has already been removed from the string as described above.
txt = "to be or not to be that is the question"
accumulator = 0
lst = txt.split()
for tok in lst:
    accumulator += 1
print("Number of words (tokens) in the string:", accumulator)

Output

Number of words (tokens) in the string: 10


Alternative Approach Using a Python Function

Because the split() method stores the string's tokens in a list, there is a much simpler approach of counting the number of tokens in the list without the need for looping at all, that is, using the len() function. Here's an example:

txt = "To be or not to be that is the question"
lst = txt.split()
print("Number of words:", len(lst))

Note that in this alternative code, there is no need for the accumulator or the loop; in the print statement, we can simply use the len() function instead. Since we are focused on repetition, however, we will use repetition in the following use cases as well.


Use Case: Counting Specific Words

In this use case, we want to count the number of times a particular word (token) is in a string. The code to do this is very similar to the code we wrote in the previous example use case for counting words. The key difference is that in the code block inside of the loop, we'll need a conditional decision to determine if each word in the list matches the specified search word. When it matches, we'll increment our accumulator. When the token does not match, then we'll skip to the next token. Here's the code to accomplish this:

txt = "To be or not to be that is the question"
search_word = "be"
accumulator = 0
lst = txt.split()
for tok in lst:
  if tok == search_word:
    accumulator += 1
print("Number of occurrences of the word '" + search_word + "' is", accumulator, end=".")

Output

Number of occurrences of the word 'be' is 2.
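The comparison tok == search_word is case sensitive, which matters for this string because it begins with a capitalized "To". The sketch below counts the word "to" both ways; the variable names exact and ignoring_case are assumptions for illustration.

```python
txt = "To be or not to be that is the question"
search_word = "to"

exact = 0          # counts tokens that match the search word exactly
ignoring_case = 0  # counts tokens that match once both are lower-cased

for tok in txt.split():
    if tok == search_word:
        exact += 1
    if tok.lower() == search_word.lower():
        ignoring_case += 1

print("Exact matches:", exact)
print("Matches ignoring case:", ignoring_case)
```

The exact comparison finds only the lower-case "to", while the lower-cased comparison also finds the "To" at the start of the string. This is why setting a consistent case before comparing is often the simpler choice.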



Use Case: Reversing a String by Word

Another use case that will add to your understanding of how to process strings that have been split() is reversing the string by word. Reversing a string usually means reversing all of its characters, so that the last character comes first and the first character comes last. For example, if we start with the string "to be or not to be" and reverse it, the result is "eb ot ton ro eb ot". That reverses the order of all characters; however, we want to reverse the order of the words in the string so that it reads "be to not or be to" instead. Since the split() method loads a list with the parsed tokens of the original string, there are a couple of ways in which we can reverse the word order.

Since we are focused on repetition in this chapter, we'll use a loop to reverse the order of the tokens in a list. Here's an example:

txt = "to be or not to be that is the question"
lst = txt.split()
rev_lst = []
for tok in lst:
    rev_lst.insert(0, tok)
print("Original List: ", lst)
print("Reversed List: ", rev_lst)

Output

Original List:  ['to', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']
Reversed List:  ['question', 'the', 'is', 'that', 'be', 'to', 'not', 'or', 'be', 'to']


Alternative Approach Using a List Method

In Python, the list data structure includes a method for reversing the list. Here's an example:

txt = "To be or not to be that is the question"
lst = txt.split()    # parse the string into a list of tokens first
lst.reverse()        # reverses the list in place
print("Reversed List:", lst)

Note that in this alternative code, there is no need for a loop; we can simply use the list reverse() method. One detail that may be important, depending on the application: the reverse() method reverses the original list in place. So, if the original order needs to be preserved, calling reverse() on the original list may not be the best choice.
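When the original order does need to be preserved, one possible approach (a sketch, not the only option) is to build a reversed copy with slicing and leave the original list untouched. The join() string method can then reassemble the tokens into a single string.

```python
txt = "to be or not to be that is the question"
lst = txt.split()

# Slicing with a step of -1 creates a new, reversed list;
# the original list keeps its order.
rev_lst = lst[::-1]

# join() glues the tokens back together with a space between each pair.
rev_txt = " ".join(rev_lst)

print("Original List:  ", lst)
print("Reversed String:", rev_txt)
```

Here lst still begins with 'to', and rev_txt holds the words in reverse order as one string.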

Using Repetition & Parsing with Data Files

As introduced in Chapter 2, data files are common resources we use to store and transmit data. There are many types of files we could engage with as programmers. For now, we will work with plain text files (files with a file extension of .txt) and comma-separated values files (files with a file extension of .csv). These two types of files are very common, and learning to work with them programmatically is an important skill.

TXT Files

Text files (often referred to as plain text files) usually contain unstructured data. In Chapter 2 we saw that it is easy to create, write, and read the contents of a text file. In the example found there, we used the following syntax to create a file, write a word to it, and then read and print its contents:

# Writing to a text file
with open('example.txt', 'w') as file:
    file.write("Hello!\n")

# Reading from a text file
with open('example.txt', 'r') as file:
    content = file.read()
    print(content)

Writing larger amounts of text to a file is just as simple:

txt = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip
ex ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur
sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt
mollit anim id est laborum."""

with open('example.txt', 'w') as file:
    file.write(txt)

with open('example.txt', 'r') as file:
    content = file.read()
    print(content)
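Once the contents of a text file are in a string, the parsing techniques from earlier on this page apply directly. The following is a minimal sketch that writes a shortened sample to a file, reads it back, and counts its words with a loop. The use of a temporary folder (via the standard tempfile module) is an assumption made so the example cleans up after itself; a plain file name such as 'example.txt' would work the same way.

```python
import os
import tempfile

# Shortened sample text (first two lines only, for illustration).
txt = """Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."""

# Write the text to a file in a temporary folder, then read it back.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")
    with open(path, "w") as file:
        file.write(txt)
    with open(path, "r") as file:
        content = file.read()

# split() with no argument treats spaces and newlines alike,
# so words on different lines are all counted.
word_count = 0
for tok in content.split():
    word_count += 1

print("Number of words in the file:", word_count)
```

Note that punctuation is still attached to some tokens here (for example, "amet,"), which does not affect the count but would matter if we were comparing tokens with search words.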


CSV Files
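A minimal sketch of how the split() technique from earlier on this page might apply to a CSV file follows. The file contents (names, cities, and states) are hypothetical, and the temporary folder is an assumption made so the example cleans up after itself.

```python
import os
import tempfile

# Hypothetical sample data; the column names and values are made up.
rows = "name,city,state\nBob Smith,Someplace,UT\nAnn Jones,Elsewhere,CO\n"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.csv")

    # Write the sample data to a CSV file.
    with open(path, "w") as file:
        file.write(rows)

    # Read the file one line at a time and parse each line into tokens.
    records = []
    with open(path, "r") as file:
        for line in file:
            # strip() removes the trailing newline before splitting.
            records.append(line.strip().split(","))

for record in records:
    print(record)
```

Each line of the file becomes a list of tokens, with the first list holding the column names. Splitting on commas is only safe when no field contains a comma itself; Python's standard csv module handles quoted fields and is usually the better choice for real data.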



 





© 2023 John Gordon
Cascade Street Publishing, LLC