dar (دار, which means "house" in Arabic) is a simple semi-supervised approach for creating Hugging Face dataset loading scripts. Here is an example of creating a loading script for a simple dataset:
demo.mp4
The main interface can be run using the following command:
    streamlit run app.py
After entering the dataset name, the user will be prompted to enter the dataset link. The user can enter either one link or multiple links separated by commas. The following link types are supported:
GitHub: The user can enter a link from GitHub without using raw, as it will be converted automatically. The user can also enter a link to a repository in the format https://github.com/user/repo, which will download and extract the full directory, or https://github.com/user/repo/folder, which will download all the files from that folder as individual links.
Google Drive: The user can also enter a link from Google Drive in the form https://drive.google.com/file/d/id/view, which will directly download and extract the folder to the local disk. Google Sheets can also be used; the link can be provided in the same format, https://docs.google.com/spreadsheets/d/id/view. You can test with the following Google Sheet example.
Direct links: The user can enter direct links to files, i.e. https://domain/**.ext with any extension, and the file will be downloaded. Multiple links can be concatenated using commas: https://domain/file1.ext,https://domain/file2.ext,...,https://domain/filen.ext
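The link handling described above can be sketched roughly as follows. This is a minimal illustration and not dar's actual code: the function names `split_links`, `github_blob_to_raw` and `google_drive_file_id` are hypothetical, and the sketch assumes GitHub file links have the usual `user/repo/blob/branch/path` shape with a branch name that contains no slashes.

```python
from __future__ import annotations

from urllib.parse import urlparse


def split_links(text: str) -> list[str]:
    """Split a comma-separated string of links and drop empty entries."""
    return [part.strip() for part in text.split(",") if part.strip()]


def github_blob_to_raw(url: str) -> str:
    """Rewrite a github.com file link to its raw.githubusercontent.com form.

    Links that are not GitHub file links are returned unchanged.
    """
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    # Assumed shape: user/repo/blob/branch/path/to/file
    if parsed.netloc == "github.com" and len(parts) >= 5 and parts[2] == "blob":
        user, repo, _, branch, *file_path = parts
        return "https://raw.githubusercontent.com/" + "/".join(
            [user, repo, branch, *file_path]
        )
    return url


def google_drive_file_id(url: str) -> str | None:
    """Return the id from a link like https://drive.google.com/file/d/<id>/view."""
    parts = urlparse(url).path.strip("/").split("/")
    if "d" in parts and parts.index("d") + 1 < len(parts):
        return parts[parts.index("d") + 1]
    return None


links = split_links(
    "https://github.com/user/repo/blob/main/data/train.txt, https://domain/file2.ext"
)
normalized = [github_blob_to_raw(link) for link in links]
```

Here `normalized` keeps the direct link untouched and rewrites only the GitHub file link, which mirrors the idea that the user does not need to paste the raw URL.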
The user can use glob structures to filter which files are used in the dataset. For example, when prompted with Enter an input structure, the user can enter something like folder/**.txt, which will include the files with the extension .txt from the folder. The user will keep being prompted for glob structures until Enter is pressed on an empty input. Multiple glob structures are used for datasets that have inputs and one or more targets, such as machine translation, summarization, and speech transcription.
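To illustrate how several glob structures can select inputs and targets, here is a small sketch using Python's standard `glob` module on a temporary directory. The helper `match_structures` and the file names are made up for the example. How dar itself interprets a pattern such as folder/**.txt is not shown here, so the sketch sticks to simple single-star patterns.

```python
import glob
import os
import tempfile


def match_structures(root, patterns):
    """Return one sorted list of matching relative paths per glob pattern."""
    matches = []
    for pattern in patterns:
        found = glob.glob(os.path.join(root, pattern), recursive=True)
        matches.append(sorted(os.path.relpath(path, root) for path in found))
    return matches


with tempfile.TemporaryDirectory() as root:
    # Build a tiny translation-style dataset: one .en and one .fr file per item.
    os.makedirs(os.path.join(root, "folder"))
    for name in ("a.en", "a.fr", "b.en", "b.fr", "README.md"):
        with open(os.path.join(root, "folder", name), "w", encoding="utf-8") as f:
            f.write(name)

    # First structure selects the inputs, second one the targets.
    sources, targets = match_structures(root, ["folder/*.en", "folder/*.fr"])

    # Sorting both lists lets inputs and targets be paired by position.
    pairs = list(zip(sources, targets))
```

The README.md file matches neither pattern, so it is left out of the dataset.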
When asked for File Type, the user enters the file type. The supported file formats are the following:
txt: mainly for reading the file as a whole or separated by lines. To choose between these options, the user sets Set Lines: if y, the file will be split into multiple lines; if n, it will be read as a whole.
csv: used for files with a special separator; for example, .tsv and .txt files can be part of this family if a special separator is used. The program will try to guess the best separator, but the user can also choose the separator using CSV Separator. The user can enter separators such as tab, ",", ";", "|", etc.
json: can be used for dictionary-like files. The user can choose Set Lines as well, which decides whether to split the file by new lines or read the file as a whole. Some datasets also have a parent dictionary, for example {'data':{'col1': [...], 'col2': [...]}}; to support that, the user can set Json Key, which is data in this example.
xml: can be used for files that contain tags, for example html files. The user will be prompted to enter the column names. For example, given <s>this is good</s><l>positive</l> ....., upon getting the prompt XML Columns the user can enter s,l to use those tags as columns.
xlsx: used for Excel file formats.
wav: used for audio files such as mp3 and wav files. Upon choosing this, the program will automatically create the following feature as a column: {'audio':np.array(...)}
jpg: used for image files such as jpg and png files. Upon choosing this, the program will automatically create the following feature as a column: {'image':np.array(...)}
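The json and xml options can be pictured with the following sketch, which uses only the Python standard library. It is an illustration of the idea behind Set Lines, Json Key and XML Columns, not the program's own reader: the functions `read_json_columns` and `read_xml_columns` are hypothetical, and the regular-expression tag matching is a simplification that assumes flat, non-nested tags.

```python
import json
import re


def read_json_columns(text, lines=False, json_key=None):
    """Parse JSON text into a dict that maps column names to lists.

    lines=True  -> one JSON object per line; rows are merged into columns
                   (assumes every row has the same keys).
    lines=False -> the whole text is one JSON document.
    json_key    -> optional parent key to unwrap, e.g. "data" (whole-file mode).
    """
    if lines:
        columns = {}
        for line in text.splitlines():
            if not line.strip():
                continue  # skip blank lines
            for key, value in json.loads(line).items():
                columns.setdefault(key, []).append(value)
        return columns
    document = json.loads(text)
    if json_key is not None:
        document = document[json_key]
    return document


def read_xml_columns(text, columns):
    """Collect the text inside each requested tag as one column per tag."""
    result = {}
    for column in columns:
        tag = re.escape(column)
        result[column] = re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return result


# Whole-file JSON with a parent dictionary, unwrapped through the key "data".
whole = '{"data": {"col1": [1, 2], "col2": ["a", "b"]}}'
from_whole = read_json_columns(whole, json_key="data")

# The same table stored as one JSON object per line.
per_line = '{"col1": 1, "col2": "a"}\n{"col1": 2, "col2": "b"}\n'
from_lines = read_json_columns(per_line, lines=True)

# Tags s and l become the two columns.
xml_text = "<s>this is good</s><l>positive</l><s>this is bad</s><l>negative</l>"
from_xml = read_xml_columns(xml_text, ["s", "l"])
```

Both JSON layouts end up as the same column dictionary, which is the reason the choice of Set Lines and Json Key matters only for reading, not for the resulting dataset.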
All the files will be processed using pandas. The user can modify some of the contents when prompted:
Skipped rows: used to skip some lines from the beginning of all the files, mainly to remove metadata that is usually placed at the top of files. The user can enter 0 to indicate that no lines will be skipped.
Headers: used to deal with files that have no column names. The user can set this to False and enter the column names in the next step.
New Column Names: used to create different names for the columns, or to add column names if none exist.
Label Column Name: used to choose the column that contains the labels. For example, in sentiment analysis the contents of that column will be positive or negative. The user can enter the name of that column so that it is recognized as the label. datasets will convert it to an integer, which can be easily processed by NLP model pipelines.
push to hub: used to upload the dataset to the hub. The files will be uploaded to the directory hf/DATASET_NAME, where hf can be specified using the argument --hf.
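As a rough picture of what these options correspond to in pandas, consider the sketch below. It assumes pandas is installed, the sample text and column names are invented for the example, and the dictionary-based label mapping only imitates the string-to-integer conversion that datasets performs (for instance through its `ClassLabel` feature). It is not dar's generated loading script.

```python
import io

import pandas as pd

# Made-up sample file: two metadata lines, then tab-separated rows without a header.
raw = (
    "# sample sentiment file\n"
    "# columns: text, label\n"
    "great movie\tpositive\n"
    "boring plot\tnegative\n"
    "loved it\tpositive\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",                  # CSV Separator: tab
    skiprows=2,                # Skipped rows: drop the two metadata lines
    header=None,               # Headers: False, the file has no column names
    names=["text", "label"],   # New Column Names
)

# Label Column Name: map each distinct label string to an integer id.
label_names = sorted(df["label"].unique())
label_to_id = {name: index for index, name in enumerate(label_names)}
df["label"] = df["label"].map(label_to_id)
```

Sorting the label names before numbering them keeps the mapping stable across runs, so negative and positive always receive the same ids.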
