Commit 98f7281

Merge pull request #4 from UBC-MDS/main
update
2 parents d10487a + 9470238 commit 98f7281

9 files changed: +864 -134 lines changed
.github/workflows/build.yml

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+name: build
+
+on:
+  # Trigger the workflow on push or pull request to main
+  push:
+    branches:
+      - main
+  pull_request:
+    branches:
+      - main
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v2
+    - name: Set up Python 3.8
+      uses: actions/setup-python@v1
+      with:
+        python-version: 3.8
+    - name: Install dependencies
+      run: |
+        pip install poetry
+        poetry install
+    # - name: Check style
+    #   run: poetry run flake8 --exclude=docs*
+    - name: Test with pytest
+      run: poetry run pytest --cov=./ --cov-report=xml
+    - name: Upload coverage to Codecov
+      uses: codecov/codecov-action@v1
+      with:
+        token: ${{ secrets.CODECOV_TOKEN }}
+        file: ./coverage.xml
+        flags: unittests
+        name: codecov-umbrella
+        fail_ci_if_error: true

.github/workflows/deploy.yml

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
+name: Deploy
+
+on:
+  # Trigger the workflow on push or pull request to main
+  push:
+    branches:
+      - main
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v2
+    - name: Set up Python 3.8
+      uses: actions/setup-python@v1
+      with:
+        python-version: 3.8
+    - name: Install dependencies
+      run: |
+        pip install poetry
+        poetry install
+    - name: Test with pytest
+      run: poetry run pytest --cov=./ --cov-report=xml
+    - name: Upload coverage to Codecov
+      uses: codecov/codecov-action@v1
+      with:
+        token: ${{ secrets.CODECOV_TOKEN }}
+        file: ./coverage.xml
+        flags: unittests
+        name: codecov-umbrella
+        fail_ci_if_error: true
+    - name: checkout
+      uses: actions/checkout@master
+      with:
+        ref: main
+    - name: Bump version and tagging and publish
+      run: |
+        git config --local user.email "action@github.com"
+        git config --local user.name "GitHub Action"
+        git pull origin main
+        poetry run semantic-release version
+        poetry version $(grep "version" */__init__.py | cut -d "'" -f 2 | cut -d '"' -f 2)
+        git commit -m "Bump versions" -a
+    - name: Push package version changes
+      uses: ad-m/github-push-action@master
+      with:
+        github_token: ${{ secrets.GITHUB_TOKEN }}
+    - name: Get release tag version from package version
+      run: |
+        echo ::set-output name=release_tag::$(grep "version" */__init__.py | cut -d "'" -f 2 | cut -d '"' -f 2)
+      id: release
+    - name: Create Release with new version
+      id: create_release
+      uses: actions/create-release@v1
+      env:
+        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+      with:
+        tag_name: ${{ steps.release.outputs.release_tag }}
+        release_name: ${{ steps.release.outputs.release_tag }}
+        draft: false
+        prerelease: false
+    - name: Build package and publish to test PyPI
+      env:
+        TEST_PYPI_USERNAME: ${{ secrets.TEST_PYPI_USERNAME }}
+        TEST_PYPI_PASSWORD: ${{ secrets.TEST_PYPI_PASSWORD }}
+      run: |
+        poetry config repositories.test-pypi https://test.pypi.org/legacy/
+        poetry build
+        poetry publish -r test-pypi -u $TEST_PYPI_USERNAME -p $TEST_PYPI_PASSWORD
+
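
The version-bump and release-tag steps in this workflow both read the package version by grepping `*/__init__.py` for "version" and cutting on quote characters. As a rough illustration of what that pipeline pulls out, here is a Python sketch. `read_version` is a hypothetical helper, not part of the workflow or the package, and it assumes the file holds a single assignment of the form `__version__ = '0.1.3'`.

```python
import re


def read_version(init_source: str) -> str:
    """Return the quoted value assigned to __version__ in the given source text.

    Hypothetical helper: mirrors the grep/cut pipeline, accepting either
    single or double quotes around the version string.
    """
    match = re.search(r"""__version__\s*=\s*['"]([^'"]+)['"]""", init_source)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


print(read_version("__version__ = '0.1.3'\n"))
```

Unlike the shell pipeline, this sketch fails loudly when no version line is present instead of passing an empty string on to the next command.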

README.md

Lines changed: 31 additions & 7 deletions
@@ -1,6 +1,6 @@
 # eda_utils_py
 
-![](https://github.com/chuangw46/eda_utils_py/workflows/build/badge.svg) [![codecov](https://codecov.io/gh/chuangw46/eda_utils_py/branch/main/graph/badge.svg)](https://codecov.io/gh/chuangw46/eda_utils_py) ![Release](https://github.com/chuangw46/eda_utils_py/workflows/Release/badge.svg) [![Documentation Status](https://readthedocs.org/projects/eda_utils_py/badge/?version=latest)](https://eda_utils_py.readthedocs.io/en/latest/?badge=latest)
+[![build](https://github.com/UBC-MDS/eda_utils_py/actions/workflows/build.yml/badge.svg)](https://github.com/UBC-MDS/eda_utils_py/actions/workflows/build.yml) ![](https://github.com/chuangw46/eda_utils_py/workflows/build/badge.svg) [![codecov](https://codecov.io/gh/UBC-MDS/eda_utils_py/branch/main/graph/badge.svg)](https://codecov.io/gh/UBC-MDS/eda_utils_py) [![Deploy](https://github.com/UBC-MDS/eda_utils_py/actions/workflows/deploy.yml/badge.svg)](https://github.com/UBC-MDS/eda_utils_py/actions/workflows/deploy.yml) [![Documentation Status](https://readthedocs.org/projects/eda_utils_py/badge/?version=latest)](https://eda_utils_py.readthedocs.io/en/latest/?badge=latest)
 
 ## Overview
 
@@ -30,19 +30,43 @@ While Python packages with similar functionalities exist, this package aims to s
 
 ## Dependencies
 
-- TBD
+- Please see a list of dependencies [here](pyproject.toml).
 
 ## Usage
 The eda_utils_py package helps you perform exploratory data analysis.
 
 eda_utils_py includes multiple custom functions to perform initial exploratory analysis on any input data describing the structure and the relationships present in the data. The generated output can be obtained in both object and graphical form.
 
-The eda_utils_py is capable of :
-- Diagnose data quality : Resolve skewed data by identifying missing data and outliers and providing a corresponding remedy.
-- Discover data: Plot correlation matrix to help explore data to understand the data and find scenarios for performing the analysis.
-- Machine learning preparation : Perform column transformations and derive a scaler automatically to meet downstream machine learning needs.
-
+```python
+import pandas as pd
+from eda_utils_py import eda_utils_py
+
+data = pd.DataFrame({
+    'SepalLengthCm':[5.1, 4.9, 4.7],
+    'SepalWidthCm':[1.4, 1.4, 1.3],
+    'PetalWidthCm':[0.2, 0.1, 0.2],
+    'Species':['Iris-setosa','Iris-virginica', 'Iris-germanica']
+})
+```
+
+The eda_utils_py will help you to:
+- Diagnose data quality: Resolve skewed data by identifying missing data and outliers and providing a corresponding remedy.
+
+
+- This package can help you easily plot a correlation matrix along with its values to help explore data.
 
+```python
+numerical_columns = ['SepalLengthCm','SepalWidthCm','PetalWidthCm']
+
+eda_utils_py.cor_map(data, numerical_columns, col_scheme = 'purpleorange')
+
+```
+Output:
+
+![cor_map_output](images/cor_map.output.png)
+
+- Machine learning preparation: Perform column transformations and derive a scaler automatically to meet downstream machine learning needs.
+
 ## Documentation
 
 The official documentation is hosted on Read the Docs: https://eda_utils_py.readthedocs.io/en/latest/

eda_utils_py/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = '0.1.0'
+__version__ = '0.1.3'

eda_utils_py/eda_utils_py.py

Lines changed: 41 additions & 43 deletions
@@ -5,7 +5,6 @@
 import numpy as np
 
 
-
 def imputer(df, strategy="mean", fill_value=None):
     """
     A function to implement imputation functionality for completing missing values.
@@ -68,7 +67,6 @@ def imputer(df, strategy="mean", fill_value=None):
     if isinstance(fill_value, type(None)) and strategy == "constant":
         raise Exception("fill_value should be a number when strategy is 'constant'")
 
-
     result = pd.DataFrame()
     if strategy == "mean":
         result = df.apply(lambda x: x.fillna(x.mean()), axis=0)
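
The `"mean"` branch shown in this hunk fills each column's missing values with that column's mean. A minimal standalone sketch of the same pandas idiom follows; the toy frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): one missing value per column.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 6.0]})

# Column-wise mean imputation: each NaN is replaced by the mean of the
# non-missing values in the same column (Series.mean skips NaN by default).
filled = df.apply(lambda col: col.fillna(col.mean()), axis=0)
print(filled)
```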
@@ -115,7 +113,7 @@ def cor_map(dataframe, num_col, col_scheme="purpleorange"):
     >> data = pd.DataFrame({
     >>     'SepalLengthCm':[5.1, 4.9, 4.7],
     >>     'SepalWidthCm':[1.4, 1.4, 1.3],
-    >>     'PetalWidthCm':[0.2, 0.2, 0.2],
+    >>     'PetalWidthCm':[0.2, 0.1, 0.2],
     >>     'Species':['Iris-setosa','Iris-virginica', 'Iris-germanica']
     >> })
 
@@ -165,7 +163,7 @@ def cor_map(dataframe, num_col, col_scheme="purpleorange"):
         .encode(
             x=alt.X("var1", title=None),
             y=alt.Y("var2", title=None),
-            color=alt.Color("cor", legend=None, scale=alt.Scale(scheme=col_scheme)),
+            color=alt.Color("cor", title = 'Correlation', scale=alt.Scale(scheme=col_scheme, domain = (-1,1))),
         )
         .properties(title="Correlation Matrix", width=400, height=400)
     )
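
The chart encoding in this hunk reads three fields, `var1`, `var2` and `cor`, which suggests the correlation matrix is reshaped into long form before plotting. The diff does not show that step, so the following is only a sketch of one way such a table could be built with pandas; the reshaping approach is an assumption, not the package's confirmed implementation.

```python
import pandas as pd

data = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 4.7],
    "SepalWidthCm": [1.4, 1.4, 1.3],
    "PetalWidthCm": [0.2, 0.1, 0.2],
})

# Pairwise Pearson correlations as a square matrix.
corr = data.corr()

# Reshape to one row per (var1, var2) pair, matching the field names
# used by the chart encoding in the diff.
corr.index.name = "var1"
long_form = corr.reset_index().melt(
    id_vars="var1", var_name="var2", value_name="cor"
)
print(long_form)
```

Fixing the colour scale to `domain = (-1,1)`, as the new line does, keeps colours comparable across datasets, since correlations always fall in that range.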
@@ -206,17 +204,17 @@ def outlier_identifier(dataframe, columns=None, method="trim"):
 
     Examples
     --------
-    >>> import pandas as pd
-    >>> from eda_utils_py import cor_map
+    >> import pandas as pd
+    >> from eda_utils_py import cor_map
 
-    >>> data = pd.DataFrame({
-    >>>     'SepalLengthCm':[5.1, 4.9, 4.7],
-    >>>     'SepalWidthCm':[1.4, 1.4, 99],
-    >>>     'PetalWidthCm':[0.2, 0.2, 0.2],
-    >>>     'Species':['Iris-setosa', 'Iris-virginica', 'Iris-germanica']
-    >>> })
+    >> data = pd.DataFrame({
+    >>     'SepalLengthCm':[5.1, 4.9, 4.7],
+    >>     'SepalWidthCm':[1.4, 1.4, 99],
+    >>     'PetalWidthCm':[0.2, 0.2, 0.2],
+    >>     'Species':['Iris-setosa', 'Iris-virginica', 'Iris-germanica']
+    >> })
 
-    >>> outlier_identifier(data)
+    >> outlier_identifier(data)
 
 
     """
@@ -226,59 +224,54 @@ def outlier_identifier(dataframe, columns=None, method="trim"):
     if columns is None:
         for col in dataframe.columns:
             if not is_numeric_dtype(dataframe[col]):
-                raise Exception("The given dataframe contains column that is not numeric column.")
-
+                raise Exception("The given dataframe contains column that is not numeric column.")
+
     if columns is not None:
         if not isinstance(columns, list):
             raise TypeError("The argument @columns must be of type list")
-
-
+
         for col in columns:
             if col not in list(dataframe.columns):
-                raise Exception("The given column list contains column that is not exist in the given dataframe.")
+                raise Exception("The given column list contains column that is not exist in the given dataframe.")
             if not is_numeric_dtype(dataframe[col]):
                 raise Exception("The given column list contains column that is not numeric column.")
-
+
     if method not in ("trim", "median", "mean"):
         raise Exception("The method must be -trim- or -median- or -mean-")
-
+
     df = dataframe.copy()
     target_columns = []
-    if(columns is None):
-        target_columns = list(df.columns.values.tolist())
+    if (columns is None):
+        target_columns = list(df.columns.values.tolist())
     else:
         target_columns = columns
-
+
     outlier_index = []
     for column in target_columns:
         current_column = df[column]
         mean = np.mean(current_column)
         std = np.std(current_column)
-        threshold = 3
-
-
+        threshold = 3
+
         for i in range(len(current_column)):
             current_item = current_column[i]
             z = (current_item - mean) / std
             if z >= threshold:
-                if(i not in outlier_index):
+                if (i not in outlier_index):
                     outlier_index.append(i)
-                if(method == "mean"):
+                if (method == "mean"):
                     df.at[i, column] = round(mean, 2)
-                if(method == "median"):
+                if (method == "median"):
                     df.at[i, column] = np.median(current_column)
-
-
-    if(method == "trim"):
+
+    if (method == "trim"):
         df = df.drop(outlier_index)
-
+
     df.index = range(len(df))
     return df
 
 
-
-
-def scale(dataframe, columns=None, scaler="standard"):
+def scale(dataframe, columns, scaler="standard"):
     """
     A function to scale features either by using standard scaler or minmax scaler method
 
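
The loop in `outlier_identifier` computes `z = (x - mean) / std` for each value and flags it when `z >= 3`, so only unusually large values are caught. The same rule can be sketched in vectorized form as below. `zscore_outlier_mask` is a hypothetical name, not a function in the package, and the sketch assumes the population standard deviation (`ddof=0`, NumPy's default for arrays).

```python
import numpy as np
import pandas as pd


def zscore_outlier_mask(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag entries whose z-score is at or above the threshold (upper tail only)."""
    arr = values.to_numpy(dtype=float)
    z = (arr - arr.mean()) / arr.std()  # population std (ddof=0), an assumption
    return pd.Series(z >= threshold, index=values.index)


# Twenty points with one extreme value: its z-score is about 4.36.
col = pd.Series([1.0] * 19 + [100.0])
mask = zscore_outlier_mask(col)

# Equivalent of method="trim": drop flagged rows and renumber the index.
trimmed = col[~mask].reset_index(drop=True)
print(len(trimmed))
```

With a population standard deviation, a z-score can never exceed sqrt(n - 1) for n data points, so at least ten rows are needed before any value can reach the threshold of 3. A three-row frame like the one in the docstring example tops out near 1.41 and would have nothing flagged under this rule.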
@@ -304,15 +297,22 @@ def scale(dataframe, columns=None, scaler="standard"):
     >> from eda_utils_py import scale
 
     >> data = pd.DataFrame({
-    >>     'SepalLengthCm':[5.1, 4.9, 4.7],
-    >>     'SepalWidthCm':[1.4, 1.4, 1.3],
-    >>     'PetalWidthCm':[0.2, 0.2, 0.2],
+    >>     'SepalLengthCm':[1, 0, 0, 3, 4],
+    >>     'SepalWidthCm':[4, 1, 1, 0, 1],
+    >>     'PetalWidthCm':[2, 0, 0, 2, 1],
     >>     'Species':['Iris-setosa','Iris-virginica', 'Iris-germanica']
     >> })
 
     >> numerical_columns = ['SepalLengthCm','SepalWidthCm','PetalWidthCm']
 
     >> scale(data, numerical_columns, scaler="minmax")
+
+       SepalLengthCm  SepalWidthCm  PetalWidthCm
+    0           0.25          1.00           1.0
+    1           0.00          0.25           0.0
+    2           0.00          0.25           0.0
+    3           0.75          0.00           1.0
+    4           1.00          0.25           0.5
     """
 
     # Check if input data is of pd.DataFrame type
@@ -375,7 +375,7 @@ def _standardize(dataframe):
         The data frame to be used for EDA.
     Returns
     -------
-    self : object
+    res : pandas.core.frame.DataFrame
         Scaled dataset
     """
 
@@ -404,7 +404,7 @@ def _minmax(dataframe):
         The data frame to be used for EDA.
     Returns
     -------
-    self : object
+    res : pandas.core.frame.DataFrame
         Scaled dataset
     """
 
@@ -415,5 +415,3 @@ def _minmax(dataframe):
         res[feature_name] = (dataframe[feature_name] - min) / (max - min)
 
     return res
-
-
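
The last visible line of `_minmax` rescales each column with `(x - min) / (max - min)`. The diff shows only the tail of that function, so the sketch below is a standalone re-implementation of the formula rather than the package's code; `minmax_scale` is a hypothetical name and the loop structure is assumed. It uses the data from the updated `scale` docstring.

```python
import pandas as pd


def minmax_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale every column to the [0, 1] range with (x - min) / (max - min)."""
    res = df.copy()
    for name in df.columns:
        lo, hi = df[name].min(), df[name].max()
        # Note: a constant column (hi == lo) would divide by zero here.
        res[name] = (df[name] - lo) / (hi - lo)
    return res


data = pd.DataFrame({
    "SepalLengthCm": [1, 0, 0, 3, 4],
    "SepalWidthCm": [4, 1, 1, 0, 1],
    "PetalWidthCm": [2, 0, 0, 2, 1],
})
scaled = minmax_scale(data)
print(scaled)
```

The printed values agree with the expected output added to the `scale` docstring in this commit.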

images/cor_map.output.png

34.3 KB