I did this one analytically and by brute-force simulation, and got the same answer with both methods. It’s in a GitHub gist.
Tag: Jupyter
Tidying up ONS data for Pandas
I take an interest in what the UK’s Office for National Statistics puts out, especially around employment and the economy. I’m also learning Jupyter and the Python DS tools, so I’ve taken one of their data series and tidied it up to use in Pandas.
Applied Data Science with Python and Jupyter, Alex Galea, 2018 – Chapter 1
Jupyter's basic setup is a web front end to small cells of code that execute on a backend server; setup means getting that server running.
I assume this means that each notebook has its own kernel running on the server? Or, if not a running process, something more like a session.
Notebooks can be saved out as Python or HTML files.
DataFrames: created by the pandas DataFrame constructor. They have a describe() method that gives summary statistics for each variable. corr() gives a correlation matrix.
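A minimal sketch of those two methods, using a small made-up frame (the column names x, y and z are hypothetical, not from the book):

```python
import pandas as pd

# Made-up numbers, purely for illustration
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],   # exactly 2 * x
    "z": [4.0, 3.0, 2.0, 1.0],   # decreases as x increases
})

summary = df.describe()  # count, mean, std, min, quartiles and max per column
corr = df.corr()         # pairwise correlation matrix (Pearson by default)

print(summary)
print(corr)
```

Here x and y are perfectly correlated and x and z perfectly anti-correlated, so the matrix is easy to sanity-check by eye.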
Seaborn pairplot: exactly what I wanted when working on reports at VCS – pairwise plots of variables against each other.
ndarray.reshape: returns the array's data in a new shape with the same total number of elements; a value of -1 for one dimension means that its size is inferred from the array's length and the other dimensions.
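A small sketch of the -1 behaviour, using a throwaway six-element array (the variable names are mine):

```python
import numpy as np

a = np.arange(6)         # six elements, shape (6,)

col = a.reshape(-1, 1)   # one column; the row count is inferred as 6
grid = a.reshape(2, -1)  # two rows; the column count is inferred as 3

print(col.shape)
print(grid.shape)
```

The reshape(-1, 1) form is the one that turns a flat array into the single-column 2D input that scikit-learn estimators expect.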
sklearn.preprocessing.PolynomialFeatures: returns an object capable of transforming feature arrays or data frames, e.g. with a degree of 2 and a single input feature, the output would contain a row for each input value, holding that value to the powers zero, one and two (i.e. the number one, the input value, and the input value squared).
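As a sketch of that degree-2 case, with two made-up input values (2 and 3) in a single feature column:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One feature, two samples; scikit-learn expects a 2D array (samples x features)
x = np.array([[2.0], [3.0]])

poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)

# Each row holds the input to the powers 0, 1 and 2
print(x_poly)
```

So the input 2 becomes the row 1, 2, 4 and the input 3 becomes 1, 3, 9.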
sklearn.linear_model.LinearRegression: gives an object which can perform linear regression (multiple linear regression, with several input features, in the example).
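A minimal sketch of a two-feature fit. The data is invented so that the target is exactly 2 times the first feature plus 3 times the second plus 1, which makes the fitted values easy to check; none of it comes from the book's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Four made-up samples with two features each
X = np.array([
    [1.0, 1.0],
    [2.0, 1.0],
    [3.0, 2.0],
    [4.0, 3.0],
])
# Target built from a known linear rule, with no noise
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

model = LinearRegression().fit(X, y)

print(model.coef_)       # one coefficient per feature
print(model.intercept_)  # the constant term
prediction = model.predict(np.array([[5.0, 4.0]]))
print(prediction)
```

With noise-free data like this, the fit should recover the coefficients 2 and 3 and the intercept 1, up to floating-point error.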
There’s a bug in the last section, about categorical features. The cell that starts, “# Color-segmented pair plot” contains this:
sns.pairplot(df[cols], hue='AGE_category',
             hue_order=['Relatively New', 'Relatively Old',
                        'Very Old'], plot_kws={'alpha': 0.5},
             diag_kws={'bins': 30})
But this throws an AttributeError – 'Line2D' has no property 'bins'. Removing the parameter diag_kws={'bins': 30} lets the call run properly.