The rapid advancement of Artificial Intelligence (AI) has driven remarkable progress in fields such as image editing, audio generation, and video manipulation. However, it has also introduced new security threats, including deepfake speech and voice spoofing. This paper proposes a multi-feature, deep-learning-based method for detecting AI-synthesized speech that addresses these threats with high accuracy. The Logical Access (LA) and Deepfake (DF) partitions of the ASVspoof 2021 dataset were used for training and testing. The proposed system uses two audio features, the Mel-spectrogram and Mel-Frequency Cepstral Coefficients (MFCC), to represent audio in visual and sequential forms for training and inference. To demonstrate the effectiveness of the proposed method, a comparative analysis was conducted against CNN, BiLSTM, Transformer, and ensemble models. Experimental results showed that the multi-feature fusion model outperformed both single and ensemble models. The proposed fusion model, which combines ConvNeXt-base and BiLSTM through a Late Fusion approach, achieved the highest accuracy of 98.44%. The method proposed in this paper is expected to serve as a key technology in future systems for detecting AI-generated deepfake speech.
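The Late Fusion strategy mentioned above combines the two branches at the decision level rather than concatenating their features. A minimal sketch of this idea is shown below; the probability values and the equal weighting are illustrative assumptions, not numbers or hyperparameters from the paper.

```python
import numpy as np

# Hypothetical per-clip class probabilities (bona fide vs. spoof) from the two
# branches: a ConvNeXt-style model on Mel-spectrograms and a BiLSTM on MFCC
# sequences. All values are made up for illustration.
p_convnext = np.array([[0.90, 0.10],   # clip 1: likely bona fide
                       [0.30, 0.70]])  # clip 2: likely spoof
p_bilstm   = np.array([[0.80, 0.20],
                       [0.40, 0.60]])

def late_fusion(p_a, p_b, w=0.5):
    """Combine two branch outputs at the decision level by weighted averaging."""
    fused = w * p_a + (1.0 - w) * p_b
    return fused / fused.sum(axis=1, keepdims=True)  # renormalize rows

fused = late_fusion(p_convnext, p_bilstm)
preds = fused.argmax(axis=1)  # 0 = bona fide, 1 = spoof
```

Because each branch sees a different representation of the same audio (a 2-D spectrogram image versus an MFCC time series), averaging their decisions lets the errors of one model be compensated by the other.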