# csvkit is a suite of utilities for converting to and working with CSV, the king of tabular file formats.
# csvkit is to tabular data what the standard Unix text processing suite
# (grep, sed, cut, sort) is to text. As such, csvkit adheres to the Unix
# philosophy.
#
# [Getting started](http://csvkit.readthedocs.org/en/latest/tutorial/getting_started.html)
# ================
# Install csvkit =
# ================
$ pip install csvkit
# ===================
# Get some CSV data =
# ===================
$ wget -O 2009.csv -U "Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.205 Safari/534.16" http://www.data.gov/download/4029/csv
$ wget -O 2010.csv -U "Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.205 Safari/534.16" http://www.data.gov/download/4509/csv
# Use head to check the first five lines:
$ head -n 5 2009.csv
$ head -n 5 2010.csv
# As you can see, `2010.csv` has multiple header lines.
# We need to modify this file to fix the issue, but as a matter of
# best practice let's back up our originals first:
$ cp 2009.csv 2009_original.csv && cp 2010.csv 2010_original.csv
# Select lines 1 and 2 of `2010_original.csv` and (d)elete them:
$ sed "1,2d" 2010_original.csv > 2010.csv
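# Verify that the stray header lines are gone:
$ head -n 1 2010.csv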
# ==================================
# Cutting up the data with csvcut =
# ==================================
# List columns. [-n] displays column numbers in output.
$ csvcut -n 2009.csv
> 1: State Name
> 2: State Abbreviate
> 3: Code
> 4: Montgomery GI Bill-Active Duty
> 5: Montgomery GI Bill- Selective Reserve
> 6: Dependents' Educational Assistance
> 7: Reserve Educational Assistance Program
> 8: Post-Vietnam Era Veteran's Educational Assistance Program
> 9: TOTAL
> 10:
# Grab columns 2 and 3. [-c] column indices or names to be extracted.
$ csvcut -c 2,3 2009.csv | head -n 5
> State Abbreviate,Code
> AL,01
> AK,02
> AZ,04
> AR,05
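# Columns can also be selected by name; this should be equivalent to the
# indices above:
$ csvcut -c "State Abbreviate,Code" 2009.csv | head -n 5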
# Flipping column order with csvcut
$ csvcut -c 9,1 2009.csv | head -n 5
> TOTAL,State Name
> 12426,ALABAMA
> 1158,ALASKA
> 33986,ARIZONA
> 5513,ARKANSAS
# ===================================
# Statistics on demand with csvstat =
# ===================================
$ csvcut -c 1,4,9,10 2009.csv | csvstat
> 1. State Name
> <type 'unicode'>
> Nulls: Yes
> Unique values: 53
> Max length: 17
> Samples: "FLORIDA", "IDAHO", "ARIZONA", "OHIO", "IOWA"
> 2. Montgomery GI Bill-Active Duty
> <type 'int'>
> Nulls: Yes
> Min: 435
> Max: 34942
> Mean: 6263
> Median: 3548.0
> Unique values: 53
> 3. TOTAL
> <type 'int'>
> Nulls: Yes
> Min: 768
> Max: 46897
> Mean: 9748
> Median: 6520.0
> Unique values: 53
> 4.
> Empty column
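# csvstat can also print a single statistic; the per-statistic flags
# (e.g. [--max]) may be missing from very old csvkit releases:
$ csvcut -c 9 2009.csv | csvstat --max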
# =================================
# Searching for rows with csvgrep =
# =================================
# Find states that begin w/ the letter "I"
$ csvcut -c 1,"TOTAL" 2009.csv | csvgrep -c 1 -r "^I"
> State Name,TOTAL
> [...]
> ILLINOIS,"21,964"
> [...]
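# [-m] matches an exact value instead of a regular expression:
$ csvcut -c 1,"TOTAL" 2009.csv | csvgrep -c 1 -m ILLINOIS
> State Name,TOTAL
> ILLINOIS,"21,964"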
# ======================
# Sorting with csvsort =
# ======================
# Sort the rows by the first column. [-r] sort in descending order.
$ csvcut -c 9,1 2009.csv | csvsort -r | head -n 5
> TOTAL,State Name
> 46897,CALIFORNIA
> 40402,TEXAS
> 36394,FLORIDA
> 33986,ARIZONA
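# The sort column can also be named explicitly with [-c]:
$ csvcut -c 9,1 2009.csv | csvsort -c TOTAL -r | head -n 5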
# ======================================
# Using line numbers as proxy for rank =
# ======================================
# The [-l] flag can be passed to any csvkit utility to add a column of
# line numbers to its output. Since this data is sorted, those line
# numbers serve as a proxy for rank:
$ csvcut -c 9,1 2009.csv | csvsort -r -l | head -n 11
> line_number,TOTAL,State Name
> 1,46897,CALIFORNIA
> 2,40402,TEXAS
> 3,36394,FLORIDA
> 4,33986,ARIZONA
> 5,21964,ILLINOIS
> 6,20541,VIRGINIA
> 7,18236,GEORGIA
> 8,15730,NORTH CAROLINA
> 9,13967,NEW YORK
> 10,13962,MISSOURI
# Missouri had the tenth largest population of individuals claiming veterans
# education benefits.
# ===================================
# Reading through data with csvlook =
# ===================================
# Display the data in a fixed-width table
$ csvcut -c 9,1 2009.csv | csvsort -r -l | csvlook
> ---------------------------------------------
> | line_number | TOTAL | State Name |
> ---------------------------------------------
> | 1 | 46897 | CALIFORNIA |
> | 2 | 40402 | TEXAS |
> | 3 | 36394 | FLORIDA |
> | 4 | 33986 | ARIZONA |
> | 5 | 21964 | ILLINOIS |
> | 6 | 20541 | VIRGINIA |
> | 7 | 18236 | GEORGIA |
> | 8 | 15730 | NORTH CAROLINA |
> | 9 | 13967 | NEW YORK |
> | 10 | 13962 | MISSOURI |
> [...]
# ==================
# Saving your work =
# ==================
# The complete ranking might be a useful thing to have around. Rather than
# computing it every time, let's use output redirection to save a copy of it:
$ csvcut -c 9,1 2009.csv | csvsort -r -l > 2009_ranking.csv
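# The saved ranking is an ordinary CSV, so it can feed any later pipeline,
# e.g. looking up a single state:
$ csvgrep -c "State Name" -m MISSOURI 2009_ranking.csv
> line_number,TOTAL,State Name
> 10,13962,MISSOURI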
# =======================================================
# in2csv: convert various tabular data formats into CSV =
# =======================================================
# Convert an Excel .xls file:
$ in2csv test.xls
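# .xlsx workbooks convert the same way; a specific worksheet can be
# selected with [--sheet] (the sheet name below is only an example):
$ in2csv --sheet "Sheet 1" test.xlsx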
# Standardize the formatting of a CSV file (quoting, line endings, etc.):
$ in2csv examples/realdata/FY09_EDU_Recipients_by_State.csv
# Fetch csvkit's open issues from the GitHub API, convert the JSON response to CSV, and write it to a file:
$ curl "https://api.github.com/repos/onyxfish/csvkit/issues?state=open" | in2csv -f json -v > issues.csv
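# If the objects sit under a top-level key, point in2csv at that key with
# [-k] (the sample JSON here is illustrative):
$ echo '{"results": [{"state": "AL", "total": 12426}]}' | in2csv -f json -k results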
# Convert a dBase DBF file to an equivalent CSV:
$ in2csv examples/testdbf.dbf > testdbf_converted.csv
# Fetch the ten most recent robberies in Oakland, convert the GeoJSON
# response into a CSV and write it to a file:
$ curl "http://oakland.crimespotting.org/crime-data?format=json&type=robbery&count=10" \
| in2csv -f geojson > robberies.csv
# ====================================================
# csvclean: clean a CSV file of common syntax errors =
# ====================================================
# [-n] dry run: report errors without creating output files
$ csvclean -n examples/bad.csv
> Line 3: Expected 3 columns, found 4 columns
> Line 4: Expected 3 columns, found 2 columns
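# Without [-n], csvclean writes the valid rows to bad_out.csv and the
# failing rows (plus an error description) to bad_err.csv; exact file
# naming may differ across csvkit versions:
$ csvclean examples/bad.csv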
# ==================================================
# csvjson: convert a CSV file into JSON or GeoJSON =
# ==================================================
# Convert the veterans' education dataset to JSON keyed by state
# abbreviation. [-k] selects the key column, [-i 4] indents the output:
$ csvjson -k "State Abbreviate" -i 4 examples/realdata/FY09_EDU_Recipients_by_State.csv
> {
> [...]
> "WA":
> {
> "": "",
> "Code": "53",
> "Reserve Educational Assistance Program": "549",
> "Dependents' Educational Assistance": "2,192",
> "Montgomery GI Bill-Active Duty": "7,969",
> "State Name": "WASHINGTON",
> "Montgomery GI Bill- Selective Reserve": "769",
> "State Abbreviate": "WA",
> "Post-Vietnam Era Veteran's Educational Assistance Program": "13",
> "TOTAL": "11,492"
> },
> [...]
> }
# Convert locations of public art into GeoJSON:
$ csvjson --lat latitude --lon longitude -k slug --crs EPSG:4269 -i 4 examples/test_geo.csv
> {
> "type": "FeatureCollection",
> "bbox": [
> -95.334619,
> 32.299076986939205,
> -95.250699,
> 32.351434
> ],
> "crs": {
> "type": "name",
> "properties": {
> "name": "EPSG:4269"
> }
> },
> "features": [
> {
> "geometry": {
> "type": "Point",
> "coordinates": [
> -95.30181,
> 32.35066
> ]
> },
> "type": "Feature",
> "id": "dcl",
> "properties": {
> "photo_credit": "",
> "description": "In addition to being the only coffee shop in downtown Tyler, DCL also features regular exhibitions of work by local artists.",
> "artist": "",
> "title": "Downtown Coffee Lounge",
> "install_date": "",
> "address": "200 West Erwin Street",
> "last_seen_date": "3/30/12",
> "type": "Gallery",
> "photo_url": ""
> }
> },
> [...]
> ]
> }
# ================================================
# csvsql: generate SQL statements for a CSV file =
# ================================================
# Generate SQL statements for a CSV file or execute those statements
# directly on a database. In the latter case csvsql supports both
# creating tables and inserting data. Supported dialects:
# - access
# - sybase
# - sqlite
# - informix
# - firebird
# - mysql
# - oracle
# - maxdb
# - postgresql
# - mssql
# Generate a statement in the PostgreSQL dialect ([-i] selects the dialect):
$ csvsql -i postgresql examples/realdata/FY09_EDU_Recipients_by_State.csv
# Create a table and import data from the CSV directly into Postgres:
$ createdb test
$ csvsql --db postgresql:///test --table fy09 --insert examples/realdata/FY09_EDU_Recipients_by_State.csv
# For large tables it may not be practical to process every row. One
# solution is to analyze a sample of the table; in that case it can be
# useful to turn off length limits and null checks with the
# `--no-constraints` option:
$ head -n 20 examples/realdata/FY09_EDU_Recipients_by_State.csv | csvsql --no-constraints --table fy09
# Create tables for an entire folder of CSVs and import data from those files directly into Postgres:
$ createdb test
$ csvsql --db postgresql:///test --insert examples/*.csv
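# csvsql can also run a SQL query directly against a CSV file (the data is
# loaded into an in-memory SQLite database, and the table name defaults to
# the file name):
$ csvsql --query "SELECT COUNT(*) FROM FY09_EDU_Recipients_by_State" examples/realdata/FY09_EDU_Recipients_by_State.csv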