Text cleaning

I was asked to clean up a really messy text file. It’s a Word file with names, addresses, phone numbers, schools, spouses, dates, and random notes in no consistent order, with a lot of missing information and duplicates!

Initialization and read-in code

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re
from commonregex import CommonRegex

parser = CommonRegex()
names = []  # one [region, entry] row per person (start empty so there is no blank first row)
with open('../input/regionalliststxt.txt') as fp:
    text = fp.read(12000)  # reads only the first 12,000 characters
print('** READ in TEXT: **\n')
for region in text.split('!!Region'):
    # The region name sits between '!!' markers in the first 50 characters
    regname = region[0:50].split('!!')[1]
    print(regname)
    pplinReg = region[50:]
    # Person entries within a region are separated by '%'
    split = pplinReg.split('%')
    print(len(split), type(split), split[3])
    for i in split:
        names.append([regname, i])
df = pd.DataFrame(names, columns=['Region', 'PersonEntry'])

Now let’s clean the entries:


# Strip newlines inside each entry (pandas 2.0+ needs regex=True to treat the pattern as a regex)
df['PersonEntry'] = df['PersonEntry'].str.replace(r'\n+', '', regex=True)
df['phone'] = [parser.phones(x) for x in df.PersonEntry]
df['email'] = [parser.emails(x) for x in df.PersonEntry]
# street_addresses looks for postal addresses; btc_addresses would look for Bitcoin wallet addresses
df['address'] = [parser.street_addresses(x) for x in df.PersonEntry]
# NAME
#df['Name'] = df['PersonEntry'].str.split('(',1,expand=True)
# ADDRESS # + street address + city +wa (2 digit) +zip
#df['ADDRESS'] = df['PersonEntry'].str.findall("([0-9]+[a-zA-Z0-9_.+-]+\s[a-zA-Z]+\,[a-zA-Z{2}]+[0-9-]+)")
#df['ADDRESS'] = df['PersonEntry'].str.findall(/\d+(\s+\w+){1,}\s+(?:st(?:\.|reet)?|dr(?:\.|ive)?|pl(?:\.|ace)?|ave(?:\.|nue)?|rd|road|lane|drive|way|court|plaza|square|run|parkway|point|pike|square|driveway|trace|park|terrace|blvd)/)
# MEMBER SINCE (raw string so that \d reaches the regex engine unchanged)
df['since'] = df['PersonEntry'].str.findall(r"Member of OVERSEAS BRATS since+'\d+")
# BRAT TYPE
df['AFB'] = df['PersonEntry'].str.findall("Air Force Brat")
df['Army'] = df['PersonEntry'].str.findall("Army Brat")
df['Navy'] = df['PersonEntry'].str.findall("Navy Brat")
df.head(10)
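
The two commented-out ADDRESS attempts never quite worked, so here is a rough sketch of how that pattern might look using only the standard `re` module. It assumes the layout from the comment above (house number, street, city, two-letter state, ZIP). The names `ADDRESS_RE` and `find_addresses` and the sample entry are made up for illustration; the real file almost certainly has addresses this will miss.

```python
import re

# Assumed layout: "<number> <street words> <suffix>, <city>, <ST> <zip>"
ADDRESS_RE = re.compile(
    r"""
    \b\d+                               # house number
    (?:\s+[A-Za-z0-9.]+)+?              # one or more street-name words
    \s+(?i:street|st|drive|dr|place|pl|avenue|ave|road|rd|lane|way|court|blvd)\b\.?
    ,\s*[A-Za-z][A-Za-z .]*?            # city (may contain spaces)
    ,\s*[A-Z]{2}                        # two-letter state code
    \s+\d{5}(?:-\d{4})?\b               # ZIP or ZIP+4
    """,
    re.VERBOSE,
)


def find_addresses(entry):
    """Return every substring of `entry` that looks like a street address."""
    return ADDRESS_RE.findall(entry)


# Made-up sample entry, not taken from the real file
sample = "Jane Doe (Smith) 123 Main Street, Tacoma, WA 98402 Air Force Brat"
print(find_addresses(sample))
```

On the DataFrame this could be applied with `df['PersonEntry'].apply(find_addresses)`, which gives a list per row just like the `findall` columns above.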

Python …. Publish that code.

So you have a cool .py program that does something useful… for you. That’s great. Nice job. But now you want that code to be used by someone else, someone unfamiliar with Python, or with any program for that matter besides maybe the web. You know it’s possible, but with silly, nonsensical names like Flask, Mongo, Blue Ocean, PythonAnywhere, and what seems to be a hundred others, who knows how?

This page is written to shine some light on that for you, and hopefully to be a reminder for me on how to do it too.  Most of what you read here has been gleaned from other people’s websites, YouTube videos, etc.  I will cite sources where I remember them.

So first let’s get on the same page about what you have and what you need. I like to think of it as the architecture, but really it’s all the bits and pieces.

  1. Some Python code to do something. In my case, my Python script reads in an employee payroll report, performs some cleaning, transforms the data, and outputs another Excel file with specific details for union reporting. This is all fine and dandy, but I’m not the one who needs to create these reports; my accountant is. So I need to create a webpage that uploads a file, runs my Python script on it, and saves the result to a location selected by the accountant.  At this time I don’t need a database connected to save any of the content, but I think it would be a nice-to-have to add later.
  2. Next we need a way to create the webpage that serves as the interface.  I’ve chosen Python as my programming language for this too, with Flask as the web framework.  Other choices are .NET, and ….
  3. Now we need to host it. That’s the part that lets the world, or at least those you give a username and password to, see it.

That’s it.  Okay, so that sounds simple, but let’s get into some details.
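
To make pieces 1 and 2 a bit more concrete, here is a minimal sketch of a Flask page that takes an uploaded file, runs a function on it, and sends the result back as a download. Everything in it is an assumption for illustration: `process_report` is a hypothetical stand-in for the real payroll script (here it just strips blank lines from a text file), and the form field name `report` and the output name `processed.txt` are made up. Hosting and logins (piece 3) are not covered.

```python
import io

from flask import Flask, Response, request

app = Flask(__name__)

# Bare-bones upload form; a real page would use a template
UPLOAD_FORM = """
<!doctype html>
<title>Upload a report</title>
<form method="post" enctype="multipart/form-data">
  <input type="file" name="report">
  <input type="submit" value="Process">
</form>
"""


def process_report(raw):
    """Hypothetical placeholder for the real script: drop blank lines and trim whitespace."""
    lines = raw.decode("utf-8").splitlines()
    cleaned = [line.strip() for line in lines if line.strip()]
    return "\n".join(cleaned).encode("utf-8")


@app.route("/", methods=["GET", "POST"])
def upload():
    if request.method == "GET":
        return UPLOAD_FORM
    uploaded = request.files.get("report")
    if uploaded is None or uploaded.filename == "":
        return "No file selected", 400
    result = process_report(uploaded.read())
    # The attachment header makes the browser offer a "Save as" dialog
    return Response(
        result,
        mimetype="text/plain",
        headers={"Content-Disposition": "attachment; filename=processed.txt"},
    )
```

Because the response is sent as an attachment, the browser’s own save dialog is what lets the accountant pick where the file goes. To try it locally, add `app.run()` at the bottom of the file or use the `flask run` command.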