Thursday, 23 June 2016

Sort/create year-quarter columns from a .csv and list the proportions of categories for each quarter


I am new to Python and pandas. I've been learning how to do things online and by asking a few experienced programmers. This is the first script I've had to write from scratch, so I apologize for any basics I may be missing. For this task I am using a Jupyter shell and Python 2.7.

I have a csv file organized in columns: index, date, title, text, and category. Only a few of the dates in each month are listed, and the data spans many years. Each text/date is labeled with a category, and the categories repeat.

My objective is to group the rows by year and quarter and, for each quarter, list the proportions of the categories, using pandas and Python. Ideally I should end up with year, quarter, category, and frequency.

So far I have managed to turn the date strings (YYYYMMDD) into date values (YYYY-MM-DD) with datetime, store them in a separate column sorted from oldest to newest, and label each row with its quarter (1, 2, 3, or 4) in a second new column. What I still need is the proportion of each category in each quarter, for each year.

The two new columns are not contained in any table, and I haven't been able to create one for them. I think I have to create a new .csv because I've read that I can't write to the existing one. I'm pretty sure I should have grouped the rows by year, but I don't know how to go back and fix it.

I've been trying to work this out all day, and I don't even know where to start fixing my problems. Any guidance would be immensely appreciated. Thanks for reading.
My code:

    import pandas as pd

    df = pd.read_csv('...somefile.csv', delimiter=',',
                     usecols=['Date', 'Title', 'Text', 'Category'],
                     encoding='utf-8')

    # Dates are stored as YYYYMMDD, so the format string has no dashes.
    df['DateTime'] = pd.to_datetime(df['Date'].astype(str), format='%Y%m%d')
    # Sort whole rows by date so each date stays with its own title, text and category.
    df = df.sort_values('DateTime')
    # Label each row with its calendar quarter (1 to 4).
    df['quarter'] = df['DateTime'].dt.quarter
    print(df.head())
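One possible way to get from the quarter labels to per-quarter category proportions is sketched below. It is only an illustration under a few assumptions: the column names 'Date' and 'Category' match the csv described above, the dates really are YYYYMMDD, and "proportion" means the share of rows in a given year and quarter that fall into each category. The function name category_proportions, the sample rows, and the output file name in the last comment are made up for the example.

```python
import pandas as pd


def category_proportions(df):
    """Return one row per (year, quarter, Category) with its share of that quarter."""
    out = df.copy()
    # Assumption: 'Date' holds YYYYMMDD values (integers or strings).
    out['DateTime'] = pd.to_datetime(out['Date'].astype(str), format='%Y%m%d')
    out = out.sort_values('DateTime')
    out['year'] = out['DateTime'].dt.year
    out['quarter'] = out['DateTime'].dt.quarter

    # Number of rows per category within each year and quarter.
    counts = out.groupby(['year', 'quarter', 'Category']).size()
    # Total number of rows in each year and quarter, aligned to `counts`.
    totals = counts.groupby(level=['year', 'quarter']).transform('sum')

    return (counts / totals).rename('proportion').reset_index()


# Made-up sample rows standing in for the real csv.
sample = pd.DataFrame({
    'Date': [20150105, 20150220, 20150310, 20150415, 20160101],
    'Category': ['a', 'a', 'b', 'b', 'a'],
})

result = category_proportions(sample)
print(result)

# The result could then be saved to a new file (hypothetical name):
# result.to_csv('proportions_by_quarter.csv', index=False)
```

Grouping on year and quarter together keeps, for example, Q1 of 2015 separate from Q1 of 2016, which a quarter column alone would not. Writing the result with to_csv to a new file leaves the original csv untouched.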
