Update dependency datasets to v2.19.1
This MR contains the following updates:
| Package | Update | Change |
|---|---|---|
| datasets | minor | `==2.14.5` -> `==2.19.1` |
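As a rough illustration of why the table labels this bump `minor`, the sketch below classifies an update from two exact pins. The helper names `parse_pin` and `update_type` are hypothetical, and the logic assumes plain `==X.Y.Z` pins; it is not how Renovate itself computes the label.

```python
def parse_pin(pin: str) -> tuple[int, int, int]:
    """Parse an exact pin such as '==2.14.5' into (major, minor, patch)."""
    major, minor, patch = pin.removeprefix("==").split(".")
    return int(major), int(minor), int(patch)


def update_type(old_pin: str, new_pin: str) -> str:
    """Name the most significant version component that differs."""
    old, new = parse_pin(old_pin), parse_pin(new_pin)
    if new[0] != old[0]:
        return "major"
    if new[1] != old[1]:
        return "minor"
    if new[2] != old[2]:
        return "patch"
    return "none"


print(update_type("==2.14.5", "==2.19.1"))  # minor
```

Because the major version stays at 2, the release notes below are the main place to look for deprecations and behavior changes.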
Release Notes
huggingface/datasets (datasets)
v2.19.1
Bug fixes
- Fix download for dict of dicts of URLs by @albertvillanova in https://github.com/huggingface/datasets/pull/6871
Full Changelog: https://github.com/huggingface/datasets/compare/2.19.0...2.19.1
v2.19.0
Dataset Features
- Add Polars compatibility by @psmyth94 in https://github.com/huggingface/datasets/pull/6531
  - Convert to a Polars dataframe using `.to_polars()`:

        import polars as pl
        from datasets import load_dataset

        ds = load_dataset("DIBT/10k_prompts_ranked", split="train")
        ds.to_polars() \
            .group_by("topic") \
            .agg(pl.len(), pl.first()) \
            .sort("len", descending=True)

  - Use Polars formatting to return Polars objects when accessing a dataset:

        ds = ds.with_format("polars")
        ds[:10].group_by("kind").len()

- Add `fsspec` support for `to_json`, `to_csv`, and `to_parquet` by @alvarobartt in https://github.com/huggingface/datasets/pull/6096
  - Save on HF in any file format:

        ds.to_json("hf://datasets/username/my_json_dataset/data.jsonl")
        ds.to_csv("hf://datasets/username/my_csv_dataset/data.csv")
        ds.to_parquet("hf://datasets/username/my_parquet_dataset/data.parquet")

- Add `mode` parameter to `Image` feature by @mariosasko in https://github.com/huggingface/datasets/pull/6735
  - Set images to be read in a certain mode like "RGB":

        dataset = dataset.cast_column("image", Image(mode="RGB"))

- Add CLI function to convert script-dataset to Parquet by @albertvillanova in https://github.com/huggingface/datasets/pull/6795
  - Run this command to open a MR in a script-based dataset to convert it to Parquet: `datasets-cli convert_to_parquet <dataset_id>`
- Add Dataset.take and Dataset.skip by @lhoestq in https://github.com/huggingface/datasets/pull/6813
  - Same as IterableDataset.take and IterableDataset.skip:

        ds = ds.take(10)  # take only the first 10 examples

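The `take`/`skip` pair described above can be modelled in plain Python. This is only a sketch of the semantics on a list of rows, assuming `take(n)` keeps the first `n` examples and `skip(n)` drops them; the functions below are illustrative stand-ins, not the library's implementation.

```python
from itertools import islice


def take(rows, n):
    """Keep only the first n rows, as ds.take(n) is described as doing."""
    return list(islice(rows, n))


def skip(rows, n):
    """Drop the first n rows and keep the rest, as ds.skip(n) is described as doing."""
    return list(islice(rows, n, None))


rows = [{"id": i} for i in range(25)]  # stand-in for a small dataset
head = take(rows, 10)
tail = skip(rows, 10)
print(len(head), len(tail))  # 10 15
```

Under this reading, `take(n)` and `skip(n)` partition a dataset into a head and a tail, which can be handy for carving off a small evaluation slice.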
General improvements and bug fixes
- Bump huggingface-hub lower version to 0.21.2 by @albertvillanova in https://github.com/huggingface/datasets/pull/6713
- fix CastError pickling by @lhoestq in https://github.com/huggingface/datasets/pull/6712
- Expand no-code dataset info with datasets-server info by @mariosasko in https://github.com/huggingface/datasets/pull/6714
- Fix sliced ConcatenationTable pickling with mixed schemas vertically by @lhoestq in https://github.com/huggingface/datasets/pull/6715
- Fix concurrent script loading with force_redownload by @lhoestq in https://github.com/huggingface/datasets/pull/6718
- get_dataset_default_config_name docstring by @lhoestq in https://github.com/huggingface/datasets/pull/6723
- Deprecate Beam API and download from HF GCS bucket by @mariosasko in https://github.com/huggingface/datasets/pull/6474
- Deprecate Pandas builder by @mariosasko in https://github.com/huggingface/datasets/pull/6730
- Using a registry instead of calling globals for fetching feature types by @psmyth94 in https://github.com/huggingface/datasets/pull/6727
- Update torch_formatter.py by @VarunNSrivastava in https://github.com/huggingface/datasets/pull/6402
- Improve default patterns resolution by @mariosasko in https://github.com/huggingface/datasets/pull/6704
- Transpose images with EXIF Orientation tag by @mariosasko in https://github.com/huggingface/datasets/pull/6739
- Fix missing download_config in get_data_patterns by @lhoestq in https://github.com/huggingface/datasets/pull/6742
- Allow null values in dict columns by @mariosasko in https://github.com/huggingface/datasets/pull/6743
- Fix fsspec tqdm callback by @lhoestq in https://github.com/huggingface/datasets/pull/6749
- chore(deps): bump fsspec by @shcheklein in https://github.com/huggingface/datasets/pull/6747
- Fix offline mode with single config by @lhoestq in https://github.com/huggingface/datasets/pull/6741
- Remove deprecated code by @Wauplin in https://github.com/huggingface/datasets/pull/6761
- fixing the issue 6755(small typo) by @JINO-ROHIT in https://github.com/huggingface/datasets/pull/6767
- `remove_columns`/`rename_columns` doc fixes by @mariosasko in https://github.com/huggingface/datasets/pull/6772
- Fix CI by @mariosasko in https://github.com/huggingface/datasets/pull/6780
- rename datasets-server to dataset-viewer by @severo in https://github.com/huggingface/datasets/pull/6785
- Install dependencies with `uv` in CI by @mariosasko in https://github.com/huggingface/datasets/pull/6779
- Fix cache conflict in `_check_legacy_cache2` by @lhoestq in https://github.com/huggingface/datasets/pull/6792
- Fix typo in docs (upload CLI) by @Wauplin in https://github.com/huggingface/datasets/pull/6802
- fix `DatasetBuilder._split_generators` incomplete type annotation by @JonasLoos in https://github.com/huggingface/datasets/pull/6799
- #6791 Improve type checking around FAISS by @Dref360 in https://github.com/huggingface/datasets/pull/6803
- Fix --repo-type order in cli upload docs by @lhoestq in https://github.com/huggingface/datasets/pull/6804
- Fix hf-internal-testing/dataset_with_script commit SHA in CI test by @albertvillanova in https://github.com/huggingface/datasets/pull/6806
- Fix cache path to snakecase for `CachedDatasetModuleFactory` and `Cache` by @izhx in https://github.com/huggingface/datasets/pull/6754
- Multithreaded downloads by @lhoestq in https://github.com/huggingface/datasets/pull/6794
- Remove `os.path.relpath` in `resolve_patterns` by @mariosasko in https://github.com/huggingface/datasets/pull/6815
- Extract data on the fly in packaged builders by @mariosasko in https://github.com/huggingface/datasets/pull/6784
- add allow_primitive_to_str and allow_decimal_to_str instead of allow_number_to_str by @Modexus in https://github.com/huggingface/datasets/pull/6811
- Support indexable objects in `Dataset.__getitem__` by @mariosasko in https://github.com/huggingface/datasets/pull/6817
- Make convert_to_parquet CLI command create script branch by @albertvillanova in https://github.com/huggingface/datasets/pull/6809
- Fix parquet export infos by @lhoestq in https://github.com/huggingface/datasets/pull/6822
New Contributors
- @VarunNSrivastava made their first contribution in https://github.com/huggingface/datasets/pull/6402
- @shcheklein made their first contribution in https://github.com/huggingface/datasets/pull/6747
- @JINO-ROHIT made their first contribution in https://github.com/huggingface/datasets/pull/6767
- @JonasLoos made their first contribution in https://github.com/huggingface/datasets/pull/6799
- @izhx made their first contribution in https://github.com/huggingface/datasets/pull/6754
- @Modexus made their first contribution in https://github.com/huggingface/datasets/pull/6811
Full Changelog: https://github.com/huggingface/datasets/compare/2.18.0...2.19.0
v2.18.0
Dataset features
- Make JSON builder support an array of strings by @albertvillanova in https://github.com/huggingface/datasets/pull/6696
- Base parquet batch_size on parquet row group size by @lhoestq in https://github.com/huggingface/datasets/pull/6701
  - Faster cold start for streaming
- Change default compression argument for JsonDatasetWriter by @Rexhaif in https://github.com/huggingface/datasets/pull/6659
- Automatic Conversion for uint16/uint32 to Compatible PyTorch Dtypes by @mohalisad in https://github.com/huggingface/datasets/pull/6660
- fsspec: support fsspec>=2023.12.0 glob changes by @pmrowla in https://github.com/huggingface/datasets/pull/6687
  - Support latest fsspec up to 2024.2.0
General improvements and bug fixes
- Fix for Incorrect ex_iterable used with multi num_worker by @kq-chen in https://github.com/huggingface/datasets/pull/6582
  - Previously, using PyTorch DDP and `num_workers` could lead to incorrect shard assignments to workers and cause errors
- Fix imagefolder dataset url by @mariosasko in https://github.com/huggingface/datasets/pull/6683
- Improve error message for gated datasets on load by @lewtun in https://github.com/huggingface/datasets/pull/6684
- Updated Quickstart Notebook link by @Codeblockz in https://github.com/huggingface/datasets/pull/6685
- Update the print message for chunked_dataset in process.mdx by @gzbfgjf2 in https://github.com/huggingface/datasets/pull/6693
- Faster `xlistdir` by @mariosasko in https://github.com/huggingface/datasets/pull/6698
- Update GitHub Actions to Node 20 by @albertvillanova in https://github.com/huggingface/datasets/pull/6682
- Update release instructions by @albertvillanova in https://github.com/huggingface/datasets/pull/6681
- Pass through information about location of cache directory. by @stridge-cruxml in https://github.com/huggingface/datasets/pull/6677
- Allow SplitDict setitem to replace existing SplitInfo by @lhoestq in https://github.com/huggingface/datasets/pull/6665
- Update ruff by @lhoestq in https://github.com/huggingface/datasets/pull/6706
- Silence ruff deprecation messages by @mariosasko in https://github.com/huggingface/datasets/pull/6707
- fix: show correct package name to install biopython by @BioGeek in https://github.com/huggingface/datasets/pull/6662
- Fix data_files when passing data_dir by @lhoestq in https://github.com/huggingface/datasets/pull/6705
- Release: 2.18.0 by @lhoestq in https://github.com/huggingface/datasets/pull/6708
New Contributors
- @Codeblockz made their first contribution in https://github.com/huggingface/datasets/pull/6685
- @gzbfgjf2 made their first contribution in https://github.com/huggingface/datasets/pull/6693
- @stridge-cruxml made their first contribution in https://github.com/huggingface/datasets/pull/6677
- @pmrowla made their first contribution in https://github.com/huggingface/datasets/pull/6687
- @BioGeek made their first contribution in https://github.com/huggingface/datasets/pull/6662
- @Rexhaif made their first contribution in https://github.com/huggingface/datasets/pull/6659
- @mohalisad made their first contribution in https://github.com/huggingface/datasets/pull/6660
- @kq-chen made their first contribution in https://github.com/huggingface/datasets/pull/6582
Full Changelog: https://github.com/huggingface/datasets/compare/2.17.1...2.18.0
v2.17.1
Bug Fixes
- Revert the changes in `arrow_writer.py` from #6636 by @bryant1410 in https://github.com/huggingface/datasets/pull/6664
- Remove deprecated verbose parameter from CSV builder by @albertvillanova in https://github.com/huggingface/datasets/pull/6672
Full Changelog: https://github.com/huggingface/datasets/compare/2.17.0...2.17.1
v2.17.0
Dataset Features
- [WebDataset] Audio support and bug fixes by @lhoestq in https://github.com/huggingface/datasets/pull/6573
- Add concurrent loading of shards to datasets.load_from_disk by @kkoutini in https://github.com/huggingface/datasets/pull/6464
- Support data_dir parameter in push_to_hub by @albertvillanova in https://github.com/huggingface/datasets/pull/6634
- Support push_to_hub without org/user to default to logged-in user by @albertvillanova in https://github.com/huggingface/datasets/pull/6629
- Allow concatenation of datasets with mixed structs by @Dref360 in https://github.com/huggingface/datasets/pull/6587
General improvements and bug fixes
- Fix parallel downloads for datasets without scripts by @lhoestq in https://github.com/huggingface/datasets/pull/6551
- Fix imagefolder with one image by @lhoestq in https://github.com/huggingface/datasets/pull/6556
- Fix tests based on datasets that used to have scripts by @lhoestq in https://github.com/huggingface/datasets/pull/6574
- remove eli5 test by @lhoestq in https://github.com/huggingface/datasets/pull/6583
- [IterableDataset] Fix `drop_last_batch` in map after shuffling or sharding by @lhoestq in https://github.com/huggingface/datasets/pull/6575
- Support standalone yaml by @lhoestq in https://github.com/huggingface/datasets/pull/6557
- Drop redundant None guard. by @xkszltl in https://github.com/huggingface/datasets/pull/6596
- fix os.listdir return name is empty string by @d710055071 in https://github.com/huggingface/datasets/pull/6581
- Fix CI: pyarrow 15, pandas 2.2 and sqlachemy by @lhoestq in https://github.com/huggingface/datasets/pull/6617
- Dedicated RNG object for fingerprinting by @mariosasko in https://github.com/huggingface/datasets/pull/6606
- Migrate from `setup.cfg` to `pyproject.toml` by @mariosasko in https://github.com/huggingface/datasets/pull/6619
- keep more info in DatasetInfo.from_merge #6585 by @JochenSiegWork in https://github.com/huggingface/datasets/pull/6586
- Read GeoParquet files using parquet reader by @weiji14 in https://github.com/huggingface/datasets/pull/6508
- Use schema metadata only if it matches features by @lhoestq in https://github.com/huggingface/datasets/pull/6616
- Raise error on bad split name by @lhoestq in https://github.com/huggingface/datasets/pull/6626
- Disable `tqdm` bars in non-interactive environments by @mariosasko in https://github.com/huggingface/datasets/pull/6627
- Add `with_rank` param to `Dataset.filter` by @mariosasko in https://github.com/huggingface/datasets/pull/6608
- Bump max range of dill to 0.3.8 by @ringohoffman in https://github.com/huggingface/datasets/pull/6630
- Fix filelock: use current umask for filelock >= 3.10 by @lhoestq in https://github.com/huggingface/datasets/pull/6631
- Faster webdataset streaming by @lhoestq in https://github.com/huggingface/datasets/pull/6578
- Multi gpu docs by @lhoestq in https://github.com/huggingface/datasets/pull/6550
- dataset viewer requires no-script by @severo in https://github.com/huggingface/datasets/pull/6633
- Make split slicing consistent with list slicing by @mariosasko in https://github.com/huggingface/datasets/pull/5891
- Do not use Parquet exports if revision is passed by @albertvillanova in https://github.com/huggingface/datasets/pull/6555
- Make CLI test support multi-processing by @albertvillanova in https://github.com/huggingface/datasets/pull/6628
- Fix reload cache with data dir by @lhoestq in https://github.com/huggingface/datasets/pull/6632
- Fix array cast/embed with null values by @mariosasko in https://github.com/huggingface/datasets/pull/6283
- Faster column validation and reordering by @psmyth94 in https://github.com/huggingface/datasets/pull/6636
- Better multi-gpu example by @lhoestq in https://github.com/huggingface/datasets/pull/6646
- Fix missing info when loading some datasets from Parquet export by @lhoestq in https://github.com/huggingface/datasets/pull/6635
- Minor multi gpu doc improvement by @lhoestq in https://github.com/huggingface/datasets/pull/6649
- Document usage of hfh cli instead of git by @lhoestq in https://github.com/huggingface/datasets/pull/6648
New Contributors
- @xkszltl made their first contribution in https://github.com/huggingface/datasets/pull/6596
- @kkoutini made their first contribution in https://github.com/huggingface/datasets/pull/6464
- @JochenSiegWork made their first contribution in https://github.com/huggingface/datasets/pull/6586
- @weiji14 made their first contribution in https://github.com/huggingface/datasets/pull/6508
- @ringohoffman made their first contribution in https://github.com/huggingface/datasets/pull/6630
- @psmyth94 made their first contribution in https://github.com/huggingface/datasets/pull/6636
Full Changelog: https://github.com/huggingface/datasets/compare/2.16.1...2.17.0
v2.16.1
Bug fixes
- Fix dl_manager.extract returning FileNotFoundError by @lhoestq in https://github.com/huggingface/datasets/pull/6543
  - Fix bug causing FileNotFoundError when passing a relative directory as `cache_dir` to `load_dataset`
- Fix custom configs from script by @lhoestq in https://github.com/huggingface/datasets/pull/6544
  - Fix bug where loading a dataset with a loading script using custom arguments would fail
  - e.g. `load_dataset("ted_talks_iwslt", language_pair=("ja", "en"), year="2015")`
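For code that may still run against 2.16.0, one possible workaround for the relative `cache_dir` bug is to resolve the directory to an absolute path before passing it on. This is only a sketch under the assumption that the bug is limited to relative paths, as the note above suggests; `absolute_cache_dir` is a hypothetical helper, and the `load_dataset` call is left as a comment because the dataset name is a placeholder.

```python
from pathlib import Path


def absolute_cache_dir(cache_dir: str) -> str:
    """Expand '~' and resolve a possibly relative cache directory to an absolute path."""
    return str(Path(cache_dir).expanduser().resolve())


cache_dir = absolute_cache_dir("hf_cache")
print(cache_dir)

# Hypothetical usage (placeholder dataset name):
# from datasets import load_dataset
# ds = load_dataset("username/dataset_name", cache_dir=cache_dir)
```

After upgrading to 2.16.1 or later, passing the relative path directly should work again according to the release note.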
Full Changelog: https://github.com/huggingface/datasets/compare/2.16.0...2.16.1
v2.16.0
Security features
- Add trust_remote_code argument by @lhoestq in https://github.com/huggingface/datasets/pull/6429
  - Some Hugging Face datasets contain custom code which must be executed to correctly load the dataset. The code can be inspected in the repository content at `https://hf.co/datasets/<repo_id>`. A warning is shown to let the user know about the custom code, and they can avoid this message in future by passing the argument `trust_remote_code=True`.
  - Passing `trust_remote_code=True` will be mandatory to load these datasets from the next major release of `datasets`.
  - Using the environment variable `HF_DATASETS_TRUST_REMOTE_CODE=0` you can already disable custom code by default without waiting for the next release of `datasets`
- Use parquet export if possible by @lhoestq in https://github.com/huggingface/datasets/pull/6448
  - This allows loading most old datasets based on custom code by downloading the Parquet export provided by Hugging Face
  - You can see a dataset's Parquet export at `https://hf.co/datasets/<repo_id>/tree/refs%2Fconvert%2Fparquet`
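The two URLs and the opt-out variable mentioned above can be put together in a small sketch. The helper names and the example repo id are hypothetical; the URL templates and the variable name come from the notes. Setting the variable before `datasets` is imported is an assumption about when the library reads it.

```python
import os
from urllib.parse import quote

HUB_DATASETS = "https://hf.co/datasets"


def repo_url(repo_id: str) -> str:
    """Page where a dataset's custom loading code can be inspected."""
    return f"{HUB_DATASETS}/{repo_id}"


def parquet_export_url(repo_id: str) -> str:
    """Tree view of the Parquet export branch (refs/convert/parquet, URL-encoded)."""
    ref = quote("refs/convert/parquet", safe="")
    return f"{HUB_DATASETS}/{repo_id}/tree/{ref}"


# Disable custom dataset code by default, as described in the notes.
# Assumption: this should happen before `datasets` is imported.
os.environ["HF_DATASETS_TRUST_REMOTE_CODE"] = "0"

print(repo_url("username/my_dataset"))
print(parquet_export_url("username/my_dataset"))
```

With the variable set to `0`, datasets that need custom code would have to be loaded with an explicit `trust_remote_code=True`, or through their Parquet export where one exists.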
Features
- Webdataset dataset builder by @lhoestq in https://github.com/huggingface/datasets/pull/6391
- Implement get dataset default config name by @albertvillanova in https://github.com/huggingface/datasets/pull/6511
- Lazy data files resolution and offline cache reload by @lhoestq in https://github.com/huggingface/datasets/pull/6493
  - This speeds up the `load_dataset` step that lists the data files of big repositories (up to x100) but requires `huggingface_hub` 0.20 or newer
  - Fix `load_dataset` that used to reload data from cache even if the dataset was updated on Hugging Face
  - Reload a dataset from your cache even if you don't have internet connection
  - New cache directory scheme for no-script datasets: `~/.cache/huggingface/datasets/username___dataset_name/config_name/version/commit_sha`
  - Backward compatibility: cached datasets from `datasets` 2.15 (using the old scheme) are still reloaded from cache
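To make the new cache layout concrete, the sketch below assembles a path following the scheme quoted above. It assumes the `username___dataset_name` component is the repo id with `/` replaced by `___`; the function name and the example values (config, version, commit SHA) are placeholders, and this is not the library's own path-building code.

```python
from pathlib import Path

DEFAULT_CACHE_ROOT = Path("~/.cache/huggingface/datasets")


def no_script_cache_dir(repo_id, config_name, version, commit_sha,
                        cache_root=DEFAULT_CACHE_ROOT):
    """Build <root>/<user>___<name>/<config>/<version>/<commit_sha>."""
    dataset_dir = repo_id.replace("/", "___")  # assumed mapping from repo id
    return Path(cache_root).expanduser() / dataset_dir / config_name / version / commit_sha


# Placeholder values for illustration only.
path = no_script_cache_dir("username/dataset_name", "default", "0.0.0", "0123abc")
print(path)
```

Keying the directory on the commit SHA is what would let an updated dataset on the Hub land in a fresh directory instead of silently reusing stale cached files.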
General improvements and bug fixes
- Remove unused argument in `_get_data_files_patterns` by @lhoestq in https://github.com/huggingface/datasets/pull/6343
- Set `usedforsecurity=False` in hashlib methods (FIPS compliance) by @Wauplin in https://github.com/huggingface/datasets/pull/6414
- Use `ruff` for formatting by @mariosasko in https://github.com/huggingface/datasets/pull/6434
- Create DatasetNotFoundError and DataFilesNotFoundError by @albertvillanova in https://github.com/huggingface/datasets/pull/6431
- Fix multi gpu map example by @lhoestq in https://github.com/huggingface/datasets/pull/6415
- Better `tqdm` wrapper by @mariosasko in https://github.com/huggingface/datasets/pull/6433
- Remove `Table.__getstate__` and `Table.__setstate__` by @LZHgrla in https://github.com/huggingface/datasets/pull/6444
- Use `filelock` package for file locking by @mariosasko in https://github.com/huggingface/datasets/pull/6445
- Fix metadata file resolution when inferred pattern is `**` by @mariosasko in https://github.com/huggingface/datasets/pull/6449
- Update hub-docs reference by @mishig25 in https://github.com/huggingface/datasets/pull/6453
- Refactor `dill` logic by @mariosasko in https://github.com/huggingface/datasets/pull/6454
- Don't require trust_remote_code in inspect_dataset by @lhoestq in https://github.com/huggingface/datasets/pull/6456
- [docs] troubleshooting guide by @MKhalusova in https://github.com/huggingface/datasets/pull/6424
- Missing DatasetNotFoundError by @lhoestq in https://github.com/huggingface/datasets/pull/6462
- Disable benchmarks in MRs by @lhoestq in https://github.com/huggingface/datasets/pull/6463
- More robust temporary directory deletion by @mariosasko in https://github.com/huggingface/datasets/pull/6426
- Fix shard retry mechanism in `push_to_hub` by @mariosasko in https://github.com/huggingface/datasets/pull/6461
- Use auth to get parquet export by @lhoestq in https://github.com/huggingface/datasets/pull/6468
- Remove delete doc CI by @lhoestq in https://github.com/huggingface/datasets/pull/6471
- Fix CI quality by @albertvillanova in https://github.com/huggingface/datasets/pull/6473
- Fix PermissionError on Windows CI by @albertvillanova in https://github.com/huggingface/datasets/pull/6477
- More robust preupload retry mechanism by @mariosasko in https://github.com/huggingface/datasets/pull/6479
- Add IterableDataset `__repr__` by @lhoestq in https://github.com/huggingface/datasets/pull/6480
- Fix max lock length on unix by @lhoestq in https://github.com/huggingface/datasets/pull/6482
- Fix ArrayXD YAML conversion by @mariosasko in https://github.com/huggingface/datasets/pull/6168
- Fix docs phrasing about supported formats when sharing a dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/6486
- Fix deprecation warning when building conda package by @albertvillanova in https://github.com/huggingface/datasets/pull/6425
- Make push_to_hub return CommitInfo by @albertvillanova in https://github.com/huggingface/datasets/pull/6492
- docs: add reference Git over SSH by @severo in https://github.com/huggingface/datasets/pull/6499
- Fallback on dataset script if user wants to load default config by @lhoestq in https://github.com/huggingface/datasets/pull/6498
- Don't expand_info in HF glob by @lhoestq in https://github.com/huggingface/datasets/pull/6469
- Fix streaming xnli by @lhoestq in https://github.com/huggingface/datasets/pull/6503
- Pickle support for `torch.Generator` objects by @mariosasko in https://github.com/huggingface/datasets/pull/6502
- Enable setting config as default when push_to_hub by @albertvillanova in https://github.com/huggingface/datasets/pull/6500
- Better cast error when generating dataset by @lhoestq in https://github.com/huggingface/datasets/pull/6509
- Replace `list_files_info` with `list_repo_tree` in `push_to_hub` by @mariosasko in https://github.com/huggingface/datasets/pull/6510
- Remove deprecated HfFolder by @lhoestq in https://github.com/huggingface/datasets/pull/6512
- Support huggingface-hub pre-releases by @albertvillanova in https://github.com/huggingface/datasets/pull/6516
- Support push_to_hub canonical datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/6519
- Support commit_description parameter in push_to_hub by @albertvillanova in https://github.com/huggingface/datasets/pull/6520
- fix get_metadata_patterns function args error by @d710055071 in https://github.com/huggingface/datasets/pull/6518
- Fix metrics dead link by @qgallouedec in https://github.com/huggingface/datasets/pull/6491
- fix tests by @lhoestq in https://github.com/huggingface/datasets/pull/6523
- Cache backward compatibility with 2.15.0 by @lhoestq in https://github.com/huggingface/datasets/pull/6514
- Preserve order of configs and splits when using Parquet exports by @albertvillanova in https://github.com/huggingface/datasets/pull/6526
New Contributors
- @LZHgrla made their first contribution in https://github.com/huggingface/datasets/pull/6444
- @d710055071 made their first contribution in https://github.com/huggingface/datasets/pull/6518
Full Changelog: https://github.com/huggingface/datasets/compare/2.15.0...2.16.0
v2.15.0
What's Changed
- Fix typo in Audio dataset documentation by @prassanna-ravishankar in https://github.com/huggingface/datasets/pull/6222
- Add push_to_hub with multiple configs docs by @lhoestq in https://github.com/huggingface/datasets/pull/6226
- Remove RGB -> BGR image conversion in Object Detection tutorial by @mariosasko in https://github.com/huggingface/datasets/pull/6228
- Update README.md by @NinoRisteski in https://github.com/huggingface/datasets/pull/6233
- Don't skip hidden files in `dl_manager.iter_files` when they are given as input by @mariosasko in https://github.com/huggingface/datasets/pull/6230
- Update README.md by @NinoRisteski in https://github.com/huggingface/datasets/pull/6223
- Remove unused global variables in `audio.py` by @mariosasko in https://github.com/huggingface/datasets/pull/6241
- Improve error message for missing function parameters by @suavemint in https://github.com/huggingface/datasets/pull/6232
- Fix cast from fixed size list to variable size list by @mariosasko in https://github.com/huggingface/datasets/pull/6243
- Update create_dataset.mdx by @EswarDivi in https://github.com/huggingface/datasets/pull/6247
- [DOCS] Fix typo: Elasticsearch by @leemthompo in https://github.com/huggingface/datasets/pull/6258
- Support streaming datasets with pyarrow.parquet.read_table by @albertvillanova in https://github.com/huggingface/datasets/pull/6251
- Temporarily pin tensorflow < 2.14.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6264
- Fix CI 404 errors by @albertvillanova in https://github.com/huggingface/datasets/pull/6262
- Remove `apache_beam` import in `BeamBasedBuilder._save_info` by @mariosasko in https://github.com/huggingface/datasets/pull/6265
- Improve documentation of dataset.from_generator by @hartmans in https://github.com/huggingface/datasets/pull/6281
- Fix parquet columns argument in streaming mode by @lhoestq in https://github.com/huggingface/datasets/pull/6295
- Doc readme improvements by @mariosasko in https://github.com/huggingface/datasets/pull/6298
- Unpin `tensorflow` maximum version by @mariosasko in https://github.com/huggingface/datasets/pull/6301
- Unpin `jax` maximum version by @mariosasko in https://github.com/huggingface/datasets/pull/6300
- Fix ArrayXD cast by @mariosasko in https://github.com/huggingface/datasets/pull/6297
- Reduce the number of commits in `push_to_hub` by @mariosasko in https://github.com/huggingface/datasets/pull/6269
- Fix typo in code example in docs by @bryant1410 in https://github.com/huggingface/datasets/pull/6307
- Update README.md by @smty2018 in https://github.com/huggingface/datasets/pull/6304
- Deterministic set hash by @lhoestq in https://github.com/huggingface/datasets/pull/6318
- docs: resolving namespace conflict, refactored variable by @smty2018 in https://github.com/huggingface/datasets/pull/6312
- Fix typos by @python273 in https://github.com/huggingface/datasets/pull/6321
- Fix commit message formatting in multi-commit uploads by @qgallouedec in https://github.com/huggingface/datasets/pull/6313
- Temporarily pin fsspec < 2023.10.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6331
- Unpin fsspec by @lhoestq in https://github.com/huggingface/datasets/pull/6336
- Fix use_dataset.mdx by @angel-luis in https://github.com/huggingface/datasets/pull/6351
- Add `fsspec` version to the `datasets-cli env` command output by @mariosasko in https://github.com/huggingface/datasets/pull/6356
- Expanduser in save_to_disk() by @Unknown3141592 in https://github.com/huggingface/datasets/pull/6098
- Fix time measuring snippet in docs by @mariosasko in https://github.com/huggingface/datasets/pull/6367
- Temporarily pin pyarrow < 14.0.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6375
- Fix typo in `Dataset.map` docstring by @bryant1410 in https://github.com/huggingface/datasets/pull/6373
- Avoid redundant warning when encoding NumPy array as `Image` by @mariosasko in https://github.com/huggingface/datasets/pull/6379
- Replace deprecated license_file in setup.cfg by @albertvillanova in https://github.com/huggingface/datasets/pull/6332
- Minor release step improvement by @lhoestq in https://github.com/huggingface/datasets/pull/6339
- Fix dependency conflict within CI build documentation by @albertvillanova in https://github.com/huggingface/datasets/pull/6411
- Remove redundant condition in builders by @albertvillanova in https://github.com/huggingface/datasets/pull/6398
- Handle future deprecation argument by @winglian in https://github.com/huggingface/datasets/pull/6390
- Remove token value from warnings by @mariosasko in https://github.com/huggingface/datasets/pull/6418
- Rename audio_classificiation.py to audio_classification.py by @carlthome in https://github.com/huggingface/datasets/pull/6416
- Add pyarrow-hotfix to release docs by @albertvillanova in https://github.com/huggingface/datasets/pull/6421
- Simplify filesystem logic by @mariosasko in https://github.com/huggingface/datasets/pull/6362
- Fix conda release by adding pyarrow-hotfix dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/6423
New Contributors
- @prassanna-ravishankar made their first contribution in https://github.com/huggingface/datasets/pull/6222
- @NinoRisteski made their first contribution in https://github.com/huggingface/datasets/pull/6233
- @suavemint made their first contribution in https://github.com/huggingface/datasets/pull/6232
- @EswarDivi made their first contribution in https://github.com/huggingface/datasets/pull/6247
- @leemthompo made their first contribution in https://github.com/huggingface/datasets/pull/6258
- @hartmans made their first contribution in https://github.com/huggingface/datasets/pull/6281
- @smty2018 made their first contribution in https://github.com/huggingface/datasets/pull/6304
- @python273 made their first contribution in https://github.com/huggingface/datasets/pull/6321
- @angel-luis made their first contribution in https://github.com/huggingface/datasets/pull/6351
- @Unknown3141592 made their first contribution in https://github.com/huggingface/datasets/pull/6098
- @winglian made their first contribution in https://github.com/huggingface/datasets/pull/6390
- @carlthome made their first contribution in https://github.com/huggingface/datasets/pull/6416
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.7...2.15.0
v2.14.7
Bug Fixes
- Fix UnboundLocalError if preprocessing returns an empty list by @cwallenwein in https://github.com/huggingface/datasets/pull/6346
- Fix python formatting for complex types in format_table by @mariosasko in https://github.com/huggingface/datasets/pull/6368
- Support pyarrow 14.0.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6378
- Do not try to download from HF GCS for generator by @yundai424 in https://github.com/huggingface/datasets/pull/6372
- Support pyarrow 14.0.1 and fix vulnerability CVE-2023-47248 by @albertvillanova in https://github.com/huggingface/datasets/pull/6404
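Since this release ties into CVE-2023-47248, a quick check of the installed pyarrow version against the affected range may be useful when reviewing the bump. The range used below (0.14.0 through 14.0.0, fixed in 14.0.1) is taken from the public advisory as recalled here, not from the release notes, so treat it as an assumption; the helper names are hypothetical and only plain numeric versions are handled.

```python
def parse_version(version: str) -> tuple:
    """Turn '14.0.1' into (14, 0, 1); assumes purely numeric components."""
    return tuple(int(part) for part in version.split(".")[:3])


def is_affected_by_cve_2023_47248(pyarrow_version: str) -> bool:
    """True if the version falls in the assumed affected range 0.14.0 to 14.0.0."""
    return (0, 14, 0) <= parse_version(pyarrow_version) <= (14, 0, 0)


for candidate in ("13.0.0", "14.0.0", "14.0.1"):
    print(candidate, is_affected_by_cve_2023_47248(candidate))
```

In a real environment the version string would come from `pyarrow.__version__`; it is passed in explicitly here so the sketch runs without pyarrow installed.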
New Contributors
- @cwallenwein made their first contribution in https://github.com/huggingface/datasets/pull/6346
- @yundai424 made their first contribution in https://github.com/huggingface/datasets/pull/6372
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.6...2.14.7
v2.14.6
What's Changed
- Ignore dataset_info.json in data files resolution by @mariosasko in https://github.com/huggingface/datasets/pull/6224
- Check builder cls default config name in inspect by @lhoestq in https://github.com/huggingface/datasets/pull/6253
- Add support for fsspec>=2023.9.0 by @mariosasko in https://github.com/huggingface/datasets/pull/6244
- Create DefunctDatasetError by @albertvillanova in https://github.com/huggingface/datasets/pull/6286
- Fix get_data_patterns for directories with the word data twice by @albertvillanova in https://github.com/huggingface/datasets/pull/6309
- Fix loading Hub datasets with CSV metadata file by @albertvillanova in https://github.com/huggingface/datasets/pull/6316
- datasets.filesystems: fix is_remote_filesystems by @ap-- in https://github.com/huggingface/datasets/pull/6334
- Pin upper version of fsspec by @albertvillanova in https://github.com/huggingface/datasets/pull/6337
- Fix regex get_data_files formatting for base paths by @ZachNagengast in https://github.com/huggingface/datasets/pull/6322
New Contributors
- @ap-- made their first contribution in https://github.com/huggingface/datasets/pull/6334
- @ZachNagengast made their first contribution in https://github.com/huggingface/datasets/pull/6322
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.5...2.14.6
Configuration
- [ ] If you want to rebase/retry this MR, check this box
This MR has been generated by Renovate Bot.