Setup your environment
Installed Microsoft VS Code
Get a Github education account
Installed extensions like Github Copilot
, Data Wrangler
on VS Code.
Use Data Diagram tools like SQLDraw
Visually scan datasets
Examine each dataset visually to identify patterns. E.g. What is EmpCode? Is it EmployeeCode?
Change easily readable variable names by using a widely accepted naming convention.
Using a tool like Excel or even NotePad
Rename Empcode
to employeeCode
or EmployeeCode
or employee_code
, so you understand it better.
Using DataFrames in Python
# Rename columns in the DataFrames for a better naming convention
df_employee.rename(columns={'EmpCode': 'employee_Code', 'Gender': 'gender', 'DateofJoin': 'joined_Date', 'Estate': 'estate_Name'}, inplace=True)
Check for data quality (null, wrongly input data, incomplete records)
Treat missing values
Here you can use a ML classifier like K-Nearest Neighbor
Identify relationships between datasets
Here you can run matching commands in python to identify unique values and match them in different dataframes.
Merge DataFrames
# merge the two dataframes on the basis of EmpCode and ID columns
merged_employee_fields_df = pd.merge(employee_df, fields_df, left_on='employee_Code', right_on='field_Id')