Commit c62eb80

Merge branch 'main' of https://github.com/wangjc640/dsci524-group18 into main
update test
2 parents 66f0f1d + 98f7281 commit c62eb80

9 files changed: +1141 -126 lines changed

.github/workflows/build.yml

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+name: build
+
+on:
+  # Trigger the workflow on push or pull request to main
+  push:
+    branches:
+      - main
+  pull_request:
+    branches:
+      - main
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+      - name: Set up Python 3.8
+        uses: actions/setup-python@v1
+        with:
+          python-version: 3.8
+      - name: Install dependencies
+        run: |
+          pip install poetry
+          poetry install
+      # - name: Check style
+      # run: poetry run flake8 --exclude=docs*
+      - name: Test with pytest
+        run: poetry run pytest --cov=./ --cov-report=xml
+      - name: Upload coverage to Codecov
+        uses: codecov/codecov-action@v1
+        with:
+          token: ${{ secrets.CODECOV_TOKEN }}
+          file: ./coverage.xml
+          flags: unittests
+          name: codecov-umbrella
+          fail_ci_if_error: true

.github/workflows/deploy.yml

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
+name: Deploy
+
+on:
+  # Trigger the workflow on push to main
+  push:
+    branches:
+      - main
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+      - name: Set up Python 3.8
+        uses: actions/setup-python@v1
+        with:
+          python-version: 3.8
+      - name: Install dependencies
+        run: |
+          pip install poetry
+          poetry install
+      - name: Test with pytest
+        run: poetry run pytest --cov=./ --cov-report=xml
+      - name: Upload coverage to Codecov
+        uses: codecov/codecov-action@v1
+        with:
+          token: ${{ secrets.CODECOV_TOKEN }}
+          file: ./coverage.xml
+          flags: unittests
+          name: codecov-umbrella
+          fail_ci_if_error: true
+      - name: checkout
+        uses: actions/checkout@master
+        with:
+          ref: main
+      - name: Bump version and tagging and publish
+        run: |
+          git config --local user.email "action@github.com"
+          git config --local user.name "GitHub Action"
+          git pull origin main
+          poetry run semantic-release version
+          poetry version $(grep "version" */__init__.py | cut -d "'" -f 2 | cut -d '"' -f 2)
+          git commit -m "Bump versions" -a
+      - name: Push package version changes
+        uses: ad-m/github-push-action@master
+        with:
+          github_token: ${{ secrets.GITHUB_TOKEN }}
+      - name: Get release tag version from package version
+        run: |
+          echo ::set-output name=release_tag::$(grep "version" */__init__.py | cut -d "'" -f 2 | cut -d '"' -f 2)
+        id: release
+      - name: Create Release with new version
+        id: create_release
+        uses: actions/create-release@v1
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        with:
+          tag_name: ${{ steps.release.outputs.release_tag }}
+          release_name: ${{ steps.release.outputs.release_tag }}
+          draft: false
+          prerelease: false
+      - name: Build package and publish to test PyPI
+        env:
+          TEST_PYPI_USERNAME: ${{ secrets.TEST_PYPI_USERNAME }}
+          TEST_PYPI_PASSWORD: ${{ secrets.TEST_PYPI_PASSWORD }}
+        run: |
+          poetry config repositories.test-pypi https://test.pypi.org/legacy/
+          poetry build
+          poetry publish -r test-pypi -u $TEST_PYPI_USERNAME -p $TEST_PYPI_PASSWORD
+
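
Note on the version lookup in deploy.yml: the `grep "version" */__init__.py | cut ...` pipeline matches any line that contains the word "version", then takes whatever sits between the first pair of quotes. As a minimal sketch of the same idea in Python, assuming the file holds a single `__version__ = '...'` assignment as in this commit, a regular expression anchored on `__version__` is narrower. The helper name `read_version` is hypothetical and not part of the commit.

```python
import re


def read_version(init_text: str) -> str:
    """Return the string assigned to __version__ in the given source text."""
    # Accept either single or double quotes around the version string.
    match = re.search(r"""__version__\s*=\s*['"]([^'"]+)['"]""", init_text)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


# Usage with the line added to eda_utils_py/__init__.py in this commit.
version = read_version("__version__ = '0.1.3'\n")
print(version)
```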

README.md

Lines changed: 31 additions & 7 deletions
@@ -1,6 +1,6 @@
 # eda_utils_py
 
-![](https://github.com/chuangw46/eda_utils_py/workflows/build/badge.svg) [![codecov](https://codecov.io/gh/chuangw46/eda_utils_py/branch/main/graph/badge.svg)](https://codecov.io/gh/chuangw46/eda_utils_py) ![Release](https://github.com/chuangw46/eda_utils_py/workflows/Release/badge.svg) [![Documentation Status](https://readthedocs.org/projects/eda_utils_py/badge/?version=latest)](https://eda_utils_py.readthedocs.io/en/latest/?badge=latest)
+[![build](https://github.com/UBC-MDS/eda_utils_py/actions/workflows/build.yml/badge.svg)](https://github.com/UBC-MDS/eda_utils_py/actions/workflows/build.yml) ![](https://github.com/chuangw46/eda_utils_py/workflows/build/badge.svg) [![codecov](https://codecov.io/gh/UBC-MDS/eda_utils_py/branch/main/graph/badge.svg)](https://codecov.io/gh/UBC-MDS/eda_utils_py) [![Deploy](https://github.com/UBC-MDS/eda_utils_py/actions/workflows/deploy.yml/badge.svg)](https://github.com/UBC-MDS/eda_utils_py/actions/workflows/deploy.yml) [![Documentation Status](https://readthedocs.org/projects/eda_utils_py/badge/?version=latest)](https://eda_utils_py.readthedocs.io/en/latest/?badge=latest)
 
 ## Overview
 
@@ -30,19 +30,43 @@ While Python packages with similar functionalities exist, this package aims to s
 
 ## Dependencies
 
-- TBD
+- Please see a list of dependencies [here](pyproject.toml).
 
 ## Usage
 The eda_utils_py package helps you build exploratory data analysis.
 
 eda_utils_py includes multiple custom functions to perform initial exploratory analysis on any input data describing the structure and the relationships present in the data. The generated output can be obtained in both object and graphical form.
 
-The eda_utils_py is capable of :
-- Diagnose data quality : Resolve skewed data by identifing missing data and outlier and provide corresponding remedy.
-- Discover data: Plot correlation mattrix to help explore data to understand the data and find scenarios for performing the analysis.
-- Machine learning pereperation : Perform column transformations, derive scaler automatically to fulfill further machine learning need
-
+```python
+import pandas as pd
+from eda_utils_py import eda_utils_py
+
+data = pd.DataFrame({
+    'SepalLengthCm':[5.1, 4.9, 4.7],
+    'SepalWidthCm':[1.4, 1.4, 1.3],
+    'PetalWidthCm':[0.2, 0.1, 0.2],
+    'Species':['Iris-setosa','Iris-virginica', 'Iris-germanica']
+})
+```
+
+The eda_utils_py will help you to:
+- Diagnose data quality: Resolve skewed data by identifying missing data and outliers and providing a corresponding remedy.
+
+
+- This package can help you easily plot a correlation matrix along with its values to help explore data.
 
+```python
+numerical_columns = ['SepalLengthCm','SepalWidthCm','PetalWidthCm']
+
+eda_utils_py.cor_map(data, numerical_columns, col_scheme = 'purpleorange')
+
+```
+Output:
+
+![cor_map_output](images/cor_map.output.png)
+
+- Machine learning preparation: Perform column transformations and derive a scaler automatically to meet further machine learning needs.
+
 ## Documentation
 
 The official documentation is hosted on Read the Docs: https://eda_utils_py.readthedocs.io/en/latest/
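
Note on the `cor_map` example in the README diff: the chart code later in this commit encodes fields named `var1`, `var2` and `cor`, which suggests a long-form correlation table. How `cor_map` actually builds that table is not shown in the diff, so the pandas steps below are only a sketch under that assumption, and `long_form` is an illustrative name. A constant column has zero variance and therefore an undefined (NaN) correlation, which may be why the example changed `PetalWidthCm` from `[0.2, 0.2, 0.2]` to `[0.2, 0.1, 0.2]`.

```python
import pandas as pd

data = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 4.7],
    "SepalWidthCm": [1.4, 1.4, 1.3],
    "PetalWidthCm": [0.2, 0.1, 0.2],
})

# Pairwise Pearson correlations as a square matrix.
corr = data.corr()

# Reshape to one row per (var1, var2) pair, the layout a heatmap encoding expects.
long_form = (
    corr.rename_axis("var1")
    .reset_index()
    .melt(id_vars="var1", var_name="var2", value_name="cor")
)
print(long_form)
```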

eda_utils_py/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = '0.1.0'
+__version__ = '0.1.3'

eda_utils_py/eda_utils_py.py

Lines changed: 48 additions & 41 deletions
@@ -113,7 +113,7 @@ def cor_map(dataframe, num_col, col_scheme="purpleorange"):
     >> data = pd.DataFrame({
     >>    'SepalLengthCm':[5.1, 4.9, 4.7],
     >>    'SepalWidthCm':[1.4, 1.4, 1.3],
-    >>    'PetalWidthCm':[0.2, 0.2, 0.2],
+    >>    'PetalWidthCm':[0.2, 0.1, 0.2],
     >>    'Species':['Iris-setosa','Iris-virginica', 'Iris-germanica']
     >> })
 
@@ -163,7 +163,7 @@ def cor_map(dataframe, num_col, col_scheme="purpleorange"):
         .encode(
             x=alt.X("var1", title=None),
             y=alt.Y("var2", title=None),
-            color=alt.Color("cor", legend=None, scale=alt.Scale(scheme=col_scheme)),
+            color=alt.Color("cor", title = 'Correlation', scale=alt.Scale(scheme=col_scheme, domain = (-1,1))),
         )
         .properties(title="Correlation Matrix", width=400, height=400)
     )
@@ -185,35 +185,37 @@ def outlier_identifier(dataframe, columns=None, method="trim"):
     A function that identifies outliers by z-score with a threshold of 3 and handles them with the method the user chooses
 
     Parameters
-    ----------
+    ----------
     dataframe : pandas.core.frame.DataFrame
         The target dataframe where the function is performed.
     columns : list, default=None
         The target columns where the function needs to be performed. Default is None, in which case the function checks all columns
     method : string
-        The method of dealing with outliers.
+        The method of dealing with outliers.
             - if "trim" : we completely remove data points that are outliers.
             - if "median" : we replace outliers with median values
             - if "mean" : we replace outliers with mean values
 
+
     Returns
     -------
     pandas.core.frame.DataFrame
         a dataframe in which the outliers have been processed by the chosen method
-
+
     Examples
     --------
-    >>> import pandas as pd
-    >>> from eda_utils_py import cor_map
+    >> import pandas as pd
+    >> from eda_utils_py import outlier_identifier
 
-    >>> data = pd.DataFrame({
-    >>>    'SepalLengthCm':[5.1, 4.9, 4.7],
-    >>>    'SepalWidthCm':[1.4, 1.4, 99],
-    >>>    'PetalWidthCm:[0.2, 0.2, 0.2],
-    >>>    'Species':['Iris-setosa', 'Iris-virginica', 'Iris-germanica']
-    >>> })
+    >> data = pd.DataFrame({
+    >>    'SepalLengthCm':[5.1, 4.9, 4.7],
+    >>    'SepalWidthCm':[1.4, 1.4, 99],
+    >>    'PetalWidthCm':[0.2, 0.2, 0.2],
+    >>    'Species':['Iris-setosa', 'Iris-virginica', 'Iris-germanica']
+    >> })
+
+    >> outlier_identifier(data)
 
-    >>> outlier_identifier(data)
 
     """
     if not isinstance(dataframe, pd.DataFrame):
@@ -222,58 +224,54 @@ def outlier_identifier(dataframe, columns=None, method="trim"):
     if columns is None:
         for col in dataframe.columns:
             if not is_numeric_dtype(dataframe[col]):
-                raise Exception("The given dataframe contains column that is not numeric column.")
-
+                raise Exception("The given dataframe contains column that is not numeric column.")
+
     if columns is not None:
         if not isinstance(columns, list):
             raise TypeError("The argument @columns must be of type list")
-
-
+
         for col in columns:
             if col not in list(dataframe.columns):
-                raise Exception("The given column list contains column that is not exist in the given dataframe.")
+                raise Exception("The given column list contains column that is not exist in the given dataframe.")
             if not is_numeric_dtype(dataframe[col]):
                 raise Exception("The given column list contains column that is not numeric column.")
-
+
     if method not in ("trim", "median", "mean"):
         raise Exception("The method must be -trim- or -median- or -mean-")
 
-
     df = dataframe.copy()
     target_columns = []
-    if(columns is None):
-        target_columns = list(df.columns.values.tolist())
+    if (columns is None):
+        target_columns = list(df.columns.values.tolist())
     else:
         target_columns = columns
-
+
     outlier_index = []
     for column in target_columns:
         current_column = df[column]
         mean = np.mean(current_column)
         std = np.std(current_column)
-        threshold = 3
-
-
+        threshold = 3
+
         for i in range(len(current_column)):
             current_item = current_column[i]
             z = (current_item - mean) / std
             if z >= threshold:
-                if(i not in outlier_index):
+                if (i not in outlier_index):
                     outlier_index.append(i)
-                if(method == "mean"):
+                if (method == "mean"):
                     df.at[i, column] = round(mean, 2)
-                if(method == "median"):
+                if (method == "median"):
                     df.at[i, column] = np.median(current_column)
-
-
-    if(method == "trim"):
+
+    if (method == "trim"):
         df = df.drop(outlier_index)
-
+
     df.index = range(len(df))
     return df
 
 
-def scale(dataframe, columns=None, scaler="standard"):
+def scale(dataframe, columns, scaler="standard"):
     """
     A function to scale features either by using standard scaler or minmax scaler method
 
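
Note on the z-score rule in `outlier_identifier`: the loop above flags a value when `(value - mean) / std >= 3`, using the population standard deviation (`np.std` defaults to `ddof=0`). The vectorised sketch below is a hypothetical helper, not code from this commit, and it deliberately differs in one respect: it compares the absolute z-score, so unusually low values are flagged as well. With a population standard deviation, a single point's z-score cannot exceed the square root of n - 1, so a three-row frame such as the docstring example cannot reach the threshold of 3. The sample data here therefore uses eleven values.

```python
import pandas as pd


def flag_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return a boolean mask that is True where |z-score| >= threshold."""
    mean = series.mean()
    std = series.std(ddof=0)  # population std, the same default np.std uses
    z = (series - mean) / std
    return z.abs() >= threshold


# Ten ordinary readings and one extreme value (illustrative data).
values = pd.Series([1.0] * 10 + [100.0])
mask = flag_outliers(values)

# "trim" style handling: drop flagged rows and renumber the index.
trimmed = values[~mask].reset_index(drop=True)
print(mask.sum(), len(trimmed))
```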
@@ -299,15 +297,22 @@ def scale(dataframe, columns=None, scaler="standard"):
     >> from eda_utils_py import scale
 
     >> data = pd.DataFrame({
-    >>    'SepalLengthCm':[5.1, 4.9, 4.7],
-    >>    'SepalWidthCm':[1.4, 1.4, 1.3],
-    >>    'PetalWidthCm:[0.2, 0.2, 0.2],
+    >>    'SepalLengthCm':[1, 0, 0, 3, 4],
+    >>    'SepalWidthCm':[4, 1, 1, 0, 1],
+    >>    'PetalWidthCm':[2, 0, 0, 2, 1],
     >>    'Species':['Iris-setosa','Iris-virginica', 'Iris-germanica']
     >> })
 
     >> numerical_columns = ['SepalLengthCm','SepalWidthCm','PetalWidthCm']
 
     >> scale(data, numerical_columns, scaler="minmax")
+
+       SepalLengthCm  SepalWidthCm  PetalWidthCm
+    0           0.25          1.00           1.0
+    1           0.00          0.25           0.0
+    2           0.00          0.25           0.0
+    3           0.75          0.00           1.0
+    4           1.00          0.25           0.5
     """
 
     # Check if input data is of pd.DataFrame type
@@ -370,9 +375,10 @@ def _standardize(dataframe):
         The data frame to be used for EDA.
     Returns
     -------
-    self : object
+    res : pandas.core.frame.DataFrame
         Scaled dataset
     """
+
     res = dataframe.copy()
     for feature_name in dataframe.columns:
         mean = dataframe[feature_name].mean()
@@ -398,7 +404,7 @@ def _minmax(dataframe):
         The data frame to be used for EDA.
     Returns
     -------
-    self : object
+    res : pandas.core.frame.DataFrame
         Scaled dataset
     """
 
@@ -407,4 +413,5 @@ def _minmax(dataframe):
         max = dataframe[feature_name].max()
         min = dataframe[feature_name].min()
         res[feature_name] = (dataframe[feature_name] - min) / (max - min)
-    return res
+
+    return res
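
Note on `_minmax`: the formula above maps each column to the range [0, 1] via `(x - min) / (max - min)`. The standalone sketch below applies the same formula to the numeric columns from the new `scale` docstring example, so the result can be checked against the table added there. The function name `minmax_scale` is illustrative only. A constant column would make `max - min` zero and yield NaN, a case neither this sketch nor the diff handles.

```python
import pandas as pd


def minmax_scale(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Rescale every column to [0, 1] with (x - min) / (max - min)."""
    res = dataframe.copy()
    for feature_name in dataframe.columns:
        col_min = dataframe[feature_name].min()
        col_max = dataframe[feature_name].max()
        res[feature_name] = (dataframe[feature_name] - col_min) / (col_max - col_min)
    return res


# Numeric columns from the docstring example in this commit.
data = pd.DataFrame({
    "SepalLengthCm": [1, 0, 0, 3, 4],
    "SepalWidthCm": [4, 1, 1, 0, 1],
    "PetalWidthCm": [2, 0, 0, 2, 1],
})
scaled = minmax_scale(data)
print(scaled)
```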

images/cor_map.output.png

34.3 KB