A collection of some code I find useful

I will find a dataset to show each example, but for now it's just the code. Seaborn is imported as sns, and the pandas dataframe is stored as d.

# Scatterplots for every pair of columns in cols, on a 400-row sample
pplot = sns.pairplot(d.sample(400)[cols], plot_kws={"marker": "+", "linewidth": 1})
# Overlay density contours on the lower triangle of the grid
pplot.map_lower(sns.kdeplot, levels=4, color=".2")
The pairplot function is useful to quickly create scatterplots for all of the variables in a table (the columns in the list cols). With the scatterplots it is easy to identify trends in the data, and mapping kdeplot (kernel density estimation) onto the lower triangle highlights where data points are clustered in each scatterplot. Sampling is used to prevent data points from overcrowding the plots in large datasets.
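For reference, everything here assumes roughly the following setup; the file name and column names are placeholders until I pick a real dataset:

import pandas as pd
import seaborn as sns

d = pd.read_csv("data.csv")  # placeholder: whatever table is being explored
cols = ["a", "b"]            # placeholder: the numeric columns to plot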
# Bivariate density for two columns, split by the "left" class column
sns.kdeplot(data=d, x="xcolname", y="ycolname", fill=True, hue="left")
After noticing something in the pairplot, kdeplot would be used to get a closer look.
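When the pattern only involves one variable, the one-dimensional version of the same call can be enough; a sketch with the same placeholder names (common_norm=False scales each hue class independently):

# Univariate density per class instead of a 2D contour plot
sns.kdeplot(data=d, x="xcolname", hue="left", fill=True, common_norm=False)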
= ["a","b"]
col for i in col:
= d[i].quantile(.75)
up = d[i].quantile(.25)
low = up-low
iqr = iqr*1.5
d_range = d[(d[i]<(d_range+up))&(d[i]>(low-d_range))] d
Just a loop to get rid of outliers, using the usual 1.5×IQR fences.
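The same logic as a reusable function, in case that's handy (my wrapper, not from the original notes; drop_iqr_outliers is a made-up name):

def drop_iqr_outliers(df, columns, k=1.5):
    # Keep rows inside [Q1 - k*IQR, Q3 + k*IQR] for each listed column
    for c in columns:
        up = df[c].quantile(.75)
        low = df[c].quantile(.25)
        d_range = (up - low) * k
        df = df[(df[c] < up + d_range) & (df[c] > low - d_range)]
    return df

d = drop_iqr_outliers(d, ["a", "b"])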
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBClassifier, plot_importance

# One-hot encode the categorical columns
d = pd.get_dummies(d, columns=["categoricalcol", "categoricalcol"], drop_first=False)
X = d.drop("ycol", axis=1)
y = d.ycol
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

xgb = XGBClassifier(objective="binary:logistic")
cv_params = {"max_depth": [8],
             "learning_rate": [0.01, 0.1],
             "min_child_weight": [4],
             "n_estimators": [300]}
scoring = {"f1", "recall", "accuracy", "precision"}
# 3-fold grid search over cv_params; refit the best model by f1
xgbm = GridSearchCV(xgb, cv_params, scoring=scoring, cv=3, refit="f1")
xgbm.fit(X_train, y_train)
plot_importance(xgbm.best_estimator_)
A boilerplate of sorts for XGBoost. As for what parameters to use: ¯\_(ツ)_/¯ (more scoring methods: https://scikit-learn.org/stable/modules/model_evaluation.html )
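Not part of the boilerplate above, but a sketch of how the refit model could then be checked against the held-out split:

from sklearn.metrics import classification_report

print(xgbm.best_params_)                       # the winning grid combination
y_pred = xgbm.best_estimator_.predict(X_test)  # predictions on the held-out 25%
print(classification_report(y_test, y_pred))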