Commit c385c7a

Merge branch 'main' into main
2 parents 07aa6f4 + 5da24bb commit c385c7a

File tree

6 files changed

+106
-59
lines changed

README.md

Lines changed: 29 additions & 12 deletions
Original file line number | Diff line number | Diff line change
@@ -15,10 +15,10 @@ $ pip install -i https://test.pypi.org/simple/ eda_utils_py
1515
## Functions
1616

1717
The four functions contained in this package are as follows:
18-
- `cor_map`: A function to plot a correlation matrix of numeric columns in the dataframe
18+
- `imputer`: A function to impute missing values
1919
- `outlier_identifier`: A function to identify and deal with outliers
20+
- `cor_map`: A function to plot a correlation matrix of numeric columns in the dataframe
2021
- `scale`: A function to scale numerical values in the dataset
21-
- `imputer`: A function to impute missing values
2222

2323

2424
## Our Place in the Python Ecosystem
@@ -33,9 +33,9 @@ While Python packages with similar functionalities exist, this package aims to s
3333
- Please see a list of dependencies [here](pyproject.toml).
3434

3535
## Usage
36-
The eda_utils_py package help you to build exploratory data analysis.
36+
The eda_utils_py package will help you with the exploratory data analysis portion of your work.
3737

38-
eda_utils_py includes multiple custom functions to perform initial exploratory analysis on any input data describing the structure and the relationships present in the data. The generated output can be obtained in both object and graphical form.
38+
eda_utils_py includes multiple custom functions to perform initial exploratory analysis on any input data describing the structure and the relationships present in the data. Depending on the function, the generated output can be obtained in object or graphical form.
3939

4040
```python
4141
import pandas as pd
@@ -59,39 +59,56 @@ data_with_outlier = pd.DataFrame({
5959
'SepalWidthCm':[1.4, 1.4, 1.3, 1.2, 1.2, 1.3, 1.6, 1.3],
6060
'PetalWidthCm':[0.2, 0.1, 30, 0.2, 0.3, 0.1, 0.4, 0.5]
6161
})
62+
63+
data_with_scale = pd.DataFrame({'SepalLengthCm':[1, 0, 0, 3, 4],
64+
'SepalWidthCm':[4, 1, 1, 0, 1],
65+
'PetalWidthCm':[2, 0, 0, 2, 1],
66+
'Species':['Iris-setosa','Iris-virginica', 'Iris-germanica', 'Iris-virginica','Iris-germanica']})
6267
```
6368

64-
The eda_utils_py will help you to:
65-
- Diagnose data quality: Resolve skewed data by identifing missing data and outlier and provide corresponding remedy.
69+
The eda_utils_py package contains functions that will help you to:
70+
- **Impute**: Identify missing values and replace them according to the chosen `strategy`.
6671

6772
```python
6873
imputer(data_with_NA)
6974
```
70-
Output:
75+
Output of `imputer()`:
7176

7277
![imputer_output](images/imputer_output.png)
7378

79+
- **Identify Outliers**: Identify and deal with outliers in the dataset.
80+
7481
```python
7582
outlier_identifier(data_with_outlier, method = "median")
7683
```
77-
Output:
84+
Output of `outlier_identifier()`:
7885

7986
![outlier_output](images/outlier_output.png)
8087

81-
- This package can help you easily plot a correlation matrix along with its values to help explore data.
88+
- **Correlation Heatmap Plotting**: Easily plot a correlation matrix along with its values to help explore data.
8289

8390
```python
8491
numerical_columns = ['SepalLengthCm','SepalWidthCm','PetalWidthCm']
8592

8693
cor_map(data, numerical_columns, col_scheme = 'purpleorange')
8794

8895
```
89-
Output:
96+
Output of `cor_map()`:
9097

9198
![cor_map_output](images/cor_map.output.png)
9299

93-
- Machine learning pereperation: Perform column transformations, derive scaler automatically to fulfill further machine learning need
94-
100+
- **Scaling**: Scale the data in preparation for future use in machine learning projects.
101+
102+
```python
103+
numerical_columns = ['SepalLengthCm','SepalWidthCm','PetalWidthCm']
104+
105+
scale(data, numerical_columns, scaler="minmax")
106+
107+
```
108+
Output of `scale()`:
109+
110+
![scale_output](images/scale_output.png)
111+
95112
## Documentation
96113

97114
The official documentation is hosted on Read the Docs: https://eda_utils_py.readthedocs.io/en/latest/

eda_utils_py/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -1 +1,2 @@
1-
__version__ = '0.1.9'
1+
__version__ = '0.1.12'
2+

eda_utils_py/eda_utils_py.py

Lines changed: 47 additions & 45 deletions
@@ -55,7 +55,7 @@ def imputer(df, strategy="mean", fill_value=None):
5555

5656
# Tests whether input fill_value is of type numbers or None
5757
if not isinstance(fill_value, type(None)) and not isinstance(
58-
fill_value, numbers.Number
58+
fill_value, numbers.Number
5959
):
6060
raise TypeError("fill_value must be of type None or numeric type")
6161

@@ -159,13 +159,17 @@ def cor_map(dataframe, num_col, col_scheme="purpleorange"):
159159

160160
plot = (
161161
alt.Chart(corr_matrix)
162-
.mark_rect()
163-
.encode(
162+
.mark_rect()
163+
.encode(
164164
x=alt.X("var1", title=None),
165165
y=alt.Y("var2", title=None),
166-
color=alt.Color("cor", title = 'Correlation', scale=alt.Scale(scheme=col_scheme, domain = (-1,1))),
166+
color=alt.Color(
167+
"cor",
168+
title="Correlation",
169+
scale=alt.Scale(scheme=col_scheme, domain=(-1, 1)),
170+
),
167171
)
168-
.properties(title="Correlation Matrix", width=400, height=400)
172+
.properties(title="Correlation Matrix", width=400, height=400)
169173
)
170174

171175
text = plot.mark_text(size=15).encode(
@@ -195,7 +199,7 @@ def outlier_identifier(dataframe, columns=None, method="trim"):
195199
- if "trim" : we completely remove data points that are outliers.
196200
- if "median" : we replace outliers with median values
197201
- if "mean" : we replace outliers with mean values
198-
202+
199203
200204
Returns
201205
-------
@@ -206,13 +210,15 @@ def outlier_identifier(dataframe, columns=None, method="trim"):
206210
--------
207211
>> import pandas as pd
208212
>> from eda_utils_py import outlier_identifier
213+
209214
210215
>> df = pd.DataFrame({
211216
>> 'SepalLengthCm' : [5.1, 4.9, 4.7, 5.5, 5.1, 50, 5.4, 5.0, 5.2, 5.3, 5.1],
212217
>> 'SepalWidthCm' : [1.4, 1.4, 20, 2.0, 0.7, 1.6, 1.2, 1.4, 1.8, 1.5, 2.1],
213218
>> 'PetalWidthCm' : [0.2, 0.2, 0.2, 0.3, 0.4, 0.5, 0.5, 0.6, 0.4, 0.2, 5]
214219
>>})
215220
221+
216222
>> outlier_identifier(df)
217223
>> SepalLengthCm SepalWidthCm PetalWidthCm
218224
>> 0 5.1 1.4 0.2
@@ -231,24 +237,30 @@ def outlier_identifier(dataframe, columns=None, method="trim"):
231237
if columns is None:
232238
for col in dataframe.columns:
233239
if not is_numeric_dtype(dataframe[col]):
234-
raise Exception("The given dataframe contains column that is not numeric column.")
240+
raise Exception(
241+
"The given dataframe contains column that is not numeric column."
242+
)
235243

236244
if columns is not None:
237245
if not isinstance(columns, list):
238246
raise TypeError("The argument @columns must be of type list")
239247

240248
for col in columns:
241249
if col not in list(dataframe.columns):
242-
raise Exception("The given column list contains column that is not exist in the given dataframe.")
250+
raise Exception(
251+
"The given column list contains column that is not exist in the given dataframe."
252+
)
243253
if not is_numeric_dtype(dataframe[col]):
244-
raise Exception("The given column list contains column that is not numeric column.")
254+
raise Exception(
255+
"The given column list contains column that is not numeric column."
256+
)
245257

246258
if method not in ("trim", "median", "mean"):
247259
raise Exception("The method must be -trim- or -median- or -mean-")
248260

249261
df = dataframe.copy()
250262
target_columns = []
251-
if (columns is None):
263+
if columns is None:
252264
target_columns = list(df.columns.values.tolist())
253265
else:
254266
target_columns = columns
@@ -264,14 +276,14 @@ def outlier_identifier(dataframe, columns=None, method="trim"):
264276
current_item = current_column[i]
265277
z = (current_item - mean) / std
266278
if z >= threshold:
267-
if (i not in outlier_index):
279+
if i not in outlier_index:
268280
outlier_index.append(i)
269-
if (method == "mean"):
281+
if method == "mean":
270282
df.at[i, column] = round(mean, 2)
271-
if (method == "median"):
283+
if method == "median":
272284
df.at[i, column] = np.median(current_column)
273285

274-
if (method == "trim"):
286+
if method == "trim":
275287
df = df.drop(outlier_index)
276288

277289
df.index = range(len(df))
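
The hunk above flags a value as an outlier from its z-score and then trims it or replaces it with the mean or median. Below is a minimal pure-Python sketch of that idea for a single column, not the package's implementation. The name `replace_outliers` and the `threshold=2.0` default are assumptions, since the diff does not show the threshold the package uses. The sketch also compares the absolute z-score, whereas the diff only shows `z >= threshold`.

```python
import statistics


def replace_outliers(values, method="trim", threshold=2.0):
    """Handle z-score outliers in a list of numbers (hypothetical helper).

    threshold=2.0 is an assumed default; the diff does not show the real one.
    """
    if method not in ("trim", "median", "mean"):
        raise ValueError("method must be 'trim', 'median' or 'mean'")

    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        # A constant column has no outliers.
        return list(values)

    # Indices whose absolute z-score reaches the threshold.
    outliers = {
        i for i, v in enumerate(values) if abs(v - mean) / std >= threshold
    }

    if method == "trim":
        return [v for i, v in enumerate(values) if i not in outliers]

    # Replacement value computed from the full column, outliers included.
    fill = round(mean, 2) if method == "mean" else statistics.median(values)
    return [fill if i in outliers else v for i, v in enumerate(values)]


# PetalWidthCm column from the README example data
petal_width = [0.2, 0.1, 30, 0.2, 0.3, 0.1, 0.4, 0.5]
trimmed = replace_outliers(petal_width, method="trim")
with_median = replace_outliers(petal_width, method="median")
```

With eight values and one extreme point, the largest possible population z-score is about 2.65, so a threshold of 3 would flag nothing in this column. That is why the sketch assumes a lower value.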
@@ -314,12 +326,12 @@ def scale(dataframe, columns, scaler="standard"):
314326
315327
>> scale(data, numerical_columns, scaler="minmax")
316328
317-
SepalLengthCm SepalWidthCm PetalWidthCm
318-
0 0.25 1.00 1.0
319-
1 0.00 0.25 0.0
320-
2 0.00 0.25 0.0
321-
3 0.75 0.00 1.0
322-
4 1.00 0.25 0.5
329+
>> SepalLengthCm SepalWidthCm PetalWidthCm
330+
>> 0 0.25 1.00 1.0
331+
>> 1 0.00 0.25 0.0
332+
>> 2 0.00 0.25 0.0
333+
>> 3 0.75 0.00 1.0
334+
>> 4 1.00 0.25 0.5
323335
"""
324336

325337
# Check if input data is of pd.DataFrame type
@@ -340,7 +352,7 @@ def scale(dataframe, columns, scaler="standard"):
340352
if col not in list(dataframe.columns):
341353
raise Exception("The given column names must exist in the given dataframe.")
342354

343-
# Check if all input columns in num_col are numeric columns
355+
# Check if all input columns in columns are numeric columns
344356
for col in columns:
345357
if not is_numeric_dtype(dataframe[col]):
346358
raise Exception("The given numerical columns must all be numeric.")
@@ -349,16 +361,6 @@ def scale(dataframe, columns, scaler="standard"):
349361
if not isinstance(scaler, str):
350362
raise TypeError("Scaler must be of type str")
351363

352-
# Check if all input columns exist in the input data
353-
for col in columns:
354-
if col not in list(dataframe.columns):
355-
raise Exception("The given column names must exist in the given dataframe.")
356-
357-
# Check if all input columns in num_col are numeric columns
358-
for col in columns:
359-
if not is_numeric_dtype(dataframe[col]):
360-
raise Exception("The given columns must all be numeric.")
361-
362364
scaled_df = None
363365
if scaler == "minmax":
364366
scaled_df = _minmax(dataframe[columns])
@@ -396,24 +398,24 @@ def _standardize(dataframe):
396398

397399
def _minmax(dataframe):
398400
"""Transform features by rescaling each feature to the range between 0 and 1.
399-
The transformation is given by:
401+
The transformation is given by:
400402
401-
scaled_value = (feature_value - min) / (mix - min)
403+
scaled_value = (feature_value - min) / (max - min)
402404
403-
where min, max = feature_range.
405+
where min, max = feature_range.
404406
405-
This transformation is often used as an alternative to zero mean,
406-
unit variance scaling.
407+
This transformation is often used as an alternative to zero mean,
408+
unit variance scaling.
407409
408-
Parameters
409-
----------
410-
dataframe : pandas.DataFrame
411-
The data frame to be used for EDA.
412-
Returns
413-
-------
414-
res : pandas.core.frame.DataFrame
415-
Scaled dataset
416-
"""
410+
Parameters
411+
----------
412+
dataframe : pandas.DataFrame
413+
The data frame to be used for EDA.
414+
Returns
415+
-------
416+
res : pandas.core.frame.DataFrame
417+
Scaled dataset
418+
"""
417419

418420
res = dataframe.copy()
419421
for feature_name in dataframe.columns:
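
The min-max formula in the `_minmax` docstring can be checked against the example output in the `scale` docstring. The following is a minimal pure-Python sketch for one column, not the package's implementation. `minmax_scale` is a hypothetical name, and mapping a constant column to 0.0 is an assumption, since the diff does not show how `_minmax` handles that case.

```python
def minmax_scale(values):
    """Rescale a list of numbers to the range [0, 1] (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column maps to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Columns from the `scale` docstring example
print(minmax_scale([1, 0, 0, 3, 4]))  # [0.25, 0.0, 0.0, 0.75, 1.0]
print(minmax_scale([4, 1, 1, 0, 1]))  # [1.0, 0.25, 0.25, 0.0, 0.25]
print(minmax_scale([2, 0, 0, 2, 1]))  # [1.0, 0.0, 0.0, 1.0, 0.5]
```

The printed values match the `SepalLengthCm`, `SepalWidthCm` and `PetalWidthCm` columns shown in the docstring output.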

images/scale_output.png

14 KB

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
[tool.poetry]
22
name = "eda_utils_py"
3-
version = "0.1.9"
3+
version = "0.1.12"
44
description = "Python package that contains util functions for eda process"
55
authors = ["Chuang Wang <chuangw.sde@gmail.com>"]
66
license = "MIT"

tests/test_eda_utils_py.py

Lines changed: 27 additions & 0 deletions
@@ -173,6 +173,14 @@ def test_cor_map():
173173

174174

175175
def test_scaler():
176+
data = pd.DataFrame(
177+
{
178+
"SepalLengthCm": [5.1, 4.9, 4.7],
179+
"SepalWidthCm": [1.4, 1.4, 1.3],
180+
"PetalWidthCm": [0.2, 0.1, 0.2],
181+
"Species": ["Iris-setosa", "Iris-virginica", "Iris-germanica"],
182+
}
183+
)
176184
mock_df_1 = pd.DataFrame(
177185
{"col1": [1, 0, 0, 3, 4], "col2": [4, 1, 1, 0, 1], "col3": [2, 0, 0, 2, 1]}
178186
)
@@ -225,6 +233,25 @@ def test_scaler():
225233
mock_df_2, ["col1", "col2"], scaler="minmax"
226234
)
227235

236+
# Tests if the input is not a DataFrame
237+
with raises(TypeError):
238+
eda_utils_py.scale("A string", ['one', 'two'])
239+
240+
# Tests if contents of columns is not of type str
241+
with raises(TypeError):
242+
eda_utils_py.scale(mock_df_1, [1, 2, 3, 4])
243+
244+
with raises(TypeError):
245+
eda_utils_py.scale(mock_df_1, [None])
246+
247+
# Tests if columns do not exist in the dataframe
248+
with raises(Exception):
249+
eda_utils_py.scale(mock_df_1, ['one', 'two'])
250+
251+
# Tests if not all columns in columns are numeric
252+
with raises(Exception):
253+
eda_utils_py.scale(data, ['Species'])
254+
228255
# Tests whether data is not of type pd.Dataframe raises TypeError
229256
with raises(TypeError):
230257
eda_utils_py.scale([14, None, 3, 27])
